Limitations and Workarounds
Overlapping write operations or simultaneous read and write operations require proper synchronization when using UnifyFS. This includes ensuring updates to a file are visible to other processes as well as inter-process communication to enforce ordering of conflicting I/O operations. Refer to the section on commit consistency semantics in UnifyFS for more detail.
In short, for a process to read data written by another process, the reader must wait for the writer to first flush any data it has written to the UnifyFS servers. After the writer flushes its data, there must be a synchronization operation between the writer and the reader processes, such that the reader does not attempt to read newly written data until the writer has completed its flush operation.
UnifyFS can be configured to flush data to servers at various points.
A common mechanism to flush data is for the writer process to call fsync().
Also, by default, data is flushed when a file is closed.
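As a rough illustration of the writer side, the following sketch writes a buffer and calls fsync() before closing the file. The helper name write_and_flush and the path passed to it are hypothetical, and the sketch uses only POSIX calls; notifying the reader that the flush has completed (for example, with a barrier or a message) is still left to the caller.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes to `path` and flush them
 * before returning. Returns 0 on success, -1 on failure. */
static int write_and_flush(const char *path, const char *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may accept fewer bytes than requested, so loop until done */
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }

    /* flush before telling any reader that the data is ready */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }

    return close(fd);
}
```

Only after this function returns should the writer synchronize with the reader, so that the reader never attempts to read data that has not yet been flushed.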
UnifyFS can be configured to behave more “POSIX like” by
flushing newly written data to the server during every write operation.
To do this, one can set UNIFYFS_CLIENT_WRITE_SYNC=ON.
Note that setting UNIFYFS_CLIENT_WRITE_SYNC=ON can decrease write performance,
as the number of data flush operations may be more than necessary.
UnifyFS does not support file locking,
and calls to
flock() are not intercepted by UnifyFS.
Any calls fall through to the underlying operating system,
which should report the corresponding file descriptor as invalid.
If not detected, an application will encounter data corruption
if it depends on file locking semantics for correctness.
Tracing application I/O calls with VerifyIO can
help determine whether any file locking calls are used.
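Independent of tracing, an application can at least avoid silently ignoring a failed lock request. The sketch below is a minimal example under that assumption: the helper name lock_or_report and its label argument are hypothetical, and the code simply checks the return value of flock() instead of proceeding as if the lock were held.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>

/* Hypothetical helper: request an exclusive lock and report a failure
 * instead of continuing as if the lock had been acquired.
 * Returns 0 if the lock was acquired, -1 otherwise. */
static int lock_or_report(int fd, const char *label)
{
    assert(label != NULL);

    if (flock(fd, LOCK_EX) != 0) {
        /* an invalid descriptor is typically reported as EBADF */
        fprintf(stderr, "%s: flock(%d) failed: %s\n",
                label, fd, strerror(errno));
        return -1;
    }
    return 0;
}
```

A caller that receives -1 should stop or fall back to another coordination mechanism rather than write to the file unprotected.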
UnifyFS does not support directory operations.
When using MPI-I/O without atomic file consistency,
the MPI standard requires the application to manage
data consistency by calling MPI_File_sync().
After data has been written, the writer must call MPI_File_sync().
There must then be a synchronization operation between
the writer and reader processes.
Finally, the reader must call MPI_File_sync()
after its synchronization operation with the writer.
A common approach is for the application to execute a
“sync-barrier-sync” construct as shown below:
    MPI_File_sync()  // flush newly written bytes from MPI library to file system
    MPI_Barrier()    // ensure all ranks have finished the previous sync
    MPI_File_sync()  // invalidate read cache in MPI library
The “barrier” in “sync-barrier-sync” can be replaced by a send-recv or certain collectives that are guaranteed to be synchronized. The synchronization operation does not even need to be an MPI call. See the “Note on the third step” in the VerifyIO README for more information.
Proper data consistency synchronization is also required
between MPI-I/O calls that imply write or read operations.
For example, MPI_File_set_size() and MPI_File_preallocate()
act as write operations, and
MPI_File_get_size() acts as a read operation.
There may be other MPI-I/O calls that imply write or read operations.
Relaxed MPI_File_sync semantics
Data consistency in UnifyFS is designed to be compatible
with MPI-I/O application-managed file consistency semantics.
An application that follows proper MPI-I/O file consistency
semantics by calling MPI_File_sync() should run correctly on UnifyFS,
provided that the
MPI_File_sync() implementation flushes
newly written data to UnifyFS.
On POSIX-compliant parallel file systems like Lustre,
many applications can run correctly
even when they are missing sufficient file consistency synchronization.
In contrast, to run correctly on UnifyFS, an application should make
MPI_File_sync() calls as required by the MPI standard.
It may be labor intensive to identify and correct all places within an application where file synchronization calls are required. The VerifyIO tool can assist developers in this effort.
In the current UnifyFS implementation,
it is actually sufficient to make a single call to
MPI_File_sync() followed by
a synchronizing call like MPI_Barrier().
Provided that MPI_File_sync() calls fsync(),
information about any newly written data
will be transferred to the UnifyFS servers.
MPI_Barrier() then ensures that
fsync() will have been called
by all clients that may have written data.
After the MPI_Barrier(), a process may read data from UnifyFS
that was written by any other process before that other process
called MPI_File_sync().
A second call to
MPI_File_sync() is not (currently) required in UnifyFS.
Furthermore, if MPI_File_sync() is known to be a synchronizing collective,
then a separate synchronization operation like
MPI_Barrier() is not required.
In this case, an application might simplify to just a single call to MPI_File_sync().
Having stated those exceptions, it is best practice to adhere to the MPI standard and execute a full sync-barrier-sync construct. There exist potential optimizations such that future implementations of UnifyFS may require the full sequence of calls.
In ROMIO, MPI_File_sync() calls fsync() and MPI_File_close() calls close(),
each of which flushes information about newly
written data to the UnifyFS servers.
When using ROMIO, an application having appropriate
“sync-barrier-sync” constructs as required by the
MPI standard will run correctly on UnifyFS.
ROMIO Synchronizing Flush Hint
Although MPI_File_sync() is an MPI collective,
it is not required to be synchronizing.
One can configure ROMIO such that MPI_File_sync()
is also a synchronizing collective.
To enable this behavior, one can set the
romio_synchronizing_flush hint to true,
either through an MPI_Info object or within
a ROMIO hints file.
This configuration can be useful to applications that
only call MPI_File_sync() once rather than execute
a full sync-barrier-sync construct.
This hint was added starting with the ROMIO version available in the MPICH v4.0 release.
ROMIO Data Visibility Hint
Starting with the ROMIO version available in the MPICH v4.1 release,
the read-only hint
romio_visibility_immediate was added to inform
the caller as to whether it is necessary to call
MPI_File_sync() to manage data consistency.
One can query the hint through the
MPI_Info object associated with a file.
If this hint is defined and if its value is true,
then the underlying file system does not require the sync-barrier-sync
construct in order for a process to read data written by another process.
Newly written data is visible to other processes as soon as the writer
process returns from its write call.
If the value of the hint is
false, or if the hint is not defined
in the MPI_Info object, then a sync-barrier-sync construct is required.
When using UnifyFS, an application must call MPI_File_sync()
in all situations where the MPI standard requires it.
However, since a sync-barrier-sync construct is costly on some file systems,
and because POSIX-compliant file systems may not require it for correctness,
one can use this hint to conditionally call
MPI_File_sync() only when
required by the underlying file system.
ROMIO requires file locking with
fcntl() to implement various functionality.
Since fcntl() is not supported in UnifyFS,
one must avoid any ROMIO features that require file locking.
MPI-I/O Atomic File Consistency
ROMIO uses fcntl() to implement atomic file consistency.
One cannot use atomic mode when using UnifyFS.
Provided an application still executes correctly without atomic mode,
one can disable it by calling MPI_File_set_atomicity() with a flag value of 0.
Atomic mode is often disabled by default in ROMIO.
ROMIO also uses fcntl() to support its data sieving optimization.
One must disable ROMIO data sieving when using UnifyFS.
To disable data sieving, one can set the following ROMIO hints:
    romio_ds_read disable
    romio_ds_write disable
These hints can be set in the
MPI_Info object when opening a file, e.g.:

    MPI_Info info;
    MPI_Info_create(&info);
    /* disable ROMIO data sieving for both reads and writes */
    MPI_Info_set(info, "romio_ds_read", "disable");
    MPI_Info_set(info, "romio_ds_write", "disable");
    MPI_File_open(comm, filename, amode, info, &fh);
    MPI_Info_free(&info);

or the hints may be listed in a ROMIO hints file, e.g.:
    >>: cat romio_hints.txt
    romio_ds_read disable
    romio_ds_write disable

    >>: export ROMIO_HINTS="romio_hints.txt"
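Where editing the job script is inconvenient, the same hints file could be produced from within a program. The following sketch is only one possible approach: the helper name write_romio_hints is hypothetical, and it assumes the function runs early enough (for example, before MPI is initialized, on every process) for ROMIO to see the ROMIO_HINTS environment variable.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write a ROMIO hints file that disables data
 * sieving and point ROMIO_HINTS at it.
 * Returns 0 on success, -1 on failure. */
static int write_romio_hints(const char *path)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    /* same two hints as listed in the hints file above */
    int ok = fputs("romio_ds_read disable\n"
                   "romio_ds_write disable\n", fp) >= 0;
    if (fclose(fp) != 0)
        ok = 0;
    if (!ok)
        return -1;

    /* overwrite any existing ROMIO_HINTS setting */
    return setenv("ROMIO_HINTS", path, 1);
}
```

Setting the hints through an MPI_Info object remains the more direct route when the application source can be changed at the point where files are opened.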
HDF5 uses MPI-I/O. In addition to restrictions that are specific to HDF5, one must follow any restrictions associated with the underlying MPI-I/O implementation. In particular, if the MPI library uses ROMIO for its MPI-I/O implementation, one should adhere to any limitations noted above for both ROMIO and MPI-I/O in general.
When running HDF5 on ROMIO or on other MPI-I/O implementations
where MPI_File_sync() and MPI_File_close() flush newly written data to UnifyFS,
one must invoke the corresponding HDF5 functions (H5Fflush() and H5Fclose())
to properly manage data consistency.
When using HDF5 with the MPI-I/O driver,
for a process to read data written by another
process without closing the HDF file,
the writer must call
H5Fflush() after writing its data.
There must then be a synchronization operation between
the writer and reader processes.
Finally, the reader must call H5Fflush()
after the synchronization operation with the writer.
This executes the sync-barrier-sync construct as required by MPI.
    H5Fflush(...)
    MPI_Barrier(...)
    H5Fflush(...)
If MPI_File_sync() is a synchronizing collective, as with
ROMIO when enabling the
romio_synchronizing_flush MPI-I/O hint,
then a single call to
H5Fflush() suffices to accomplish
the sync-barrier-sync construct.
Starting with the HDF5 v1.13.2 release,
HDF can be configured to call MPI_File_sync()
after every HDF collective write operation.
This configuration is enabled automatically if MPI-I/O
reports the romio_visibility_immediate hint as false.
One can also enable this option manually by setting the
Enabling this option can decrease write performance
since it may induce more file flush operations than necessary.
PnetCDF applications can utilize UnifyFS, and the semantics of the PnetCDF API align well with UnifyFS constraints.
PnetCDF uses MPI-IO to read and write files. In addition to any restrictions required when using UnifyFS with PnetCDF, one must follow any recommendations regarding UnifyFS and the underlying MPI-IO implementation.
PnetCDF parallelizes access to NetCDF files using MPI. An MPI communicator is passed as an argument when opening a file. Any collective call in PnetCDF is global across the process group associated with the communicator used to open the file.
PnetCDF follows the data consistency model defined by MPI-IO. Specifically, from its documentation about PnetCDF data consistency:
PnetCDF follows the same parallel I/O data consistency as MPI-IO standard.
If users would like PnetCDF to enforce a stronger consistency,
they should add
NC_SHARE flag when open/create the file.
By doing so, PnetCDF adds
MPI_File_sync() after each MPI I/O calls.
If NC_SHARE is not set, then users are responsible for their
desired data consistency. To enforce a stronger consistency,
users can explicitly call ncmpi_sync(). In ncmpi_sync(),
MPI_File_sync() and MPI_Barrier() are called.
Upon inspection of the implementation of the PnetCDF v1.12.3 release, the following PnetCDF functions include the following calls:
    ncmpio_file_sync
      - calls MPI_File_sync(ncp->independent_fh)
      - calls MPI_File_sync(ncp->collective_fh)
      - calls MPI_Barrier

    ncmpio_sync
      - calls ncmpio_file_sync

    ncmpi__enddef
      - calls ncmpio_file_sync if NC_doFsync (NC_SHARE)

    ncmpio_enddef
      - calls ncmpi__enddef

    ncmpio_end_indep_data
      - calls MPI_File_sync if NC_doFsync (NC_SHARE)

    ncmpio_redef
      - does *NOT* call ncmpio_file_sync

    ncmpio_close
      - calls ncmpio_file_sync if NC_doFsync (NC_SHARE)
      - calls MPI_File_close (MPI_File_close calls MPI_File_sync by MPI standard)
If a program must read data written by another process, PnetCDF users must do one of the following when using UnifyFS:
- Add explicit calls to ncmpi_sync() after writing and before reading.
- Set UNIFYFS_CLIENT_WRITE_SYNC=1, in which case each POSIX write operation invokes a flush.
- Use NC_SHARE when opening files so that the PnetCDF library invokes MPI_File_sync() and MPI_Barrier() calls after its MPI-IO operations.
Of these options,
it is recommended that one add
ncmpi_sync() calls where necessary.
Setting UNIFYFS_CLIENT_WRITE_SYNC=1 is convenient since one does not
need to change the application, but it may have a larger impact on performance.
Opening or creating a file with
NC_SHARE may work for some applications,
but it depends on whether the PnetCDF implementation
internally calls MPI_File_sync() at all appropriate places,
which is not guaranteed.
A number of PnetCDF calls invoke write operations on the underlying file.
In addition to the
ncmpi_put_* collection of calls
that write data to variables or attributes,
ncmpi_enddef updates variable definitions,
and it can fill variables with default values.
Users may also explicitly fill variables by calling
One must ensure necessary
ncmpi_sync() calls are placed between
any fill and write operations in case
they happen to write to overlapping regions of a file.
ncmpi_sync() calls MPI_File_sync() and MPI_Barrier(),
but it does not call
MPI_File_sync() again after calling MPI_Barrier().
To execute a full sync-barrier-sync construct,
one technically must call ncmpi_sync() twice:
    // to accomplish sync-barrier-sync
    ncmpi_sync(...)  // call MPI_File_sync and MPI_Barrier
    ncmpi_sync(...)  // call MPI_File_sync again
When using UnifyFS,
a single call to
ncmpi_sync() should suffice since UnifyFS
does not (currently) require the second call to
MPI_File_sync(), as noted above.