Limitations and Workarounds
General Limitations
Data Consistency
Overlapping write operations or simultaneous read and write operations require proper synchronization when using UnifyFS. This includes ensuring updates to a file are visible to other processes as well as inter-process communication to enforce ordering of conflicting I/O operations. Refer to the section on commit consistency semantics in UnifyFS for more detail.
In short, for a process to read data written by another process, the reader must wait for the writer to first flush any data it has written to the UnifyFS servers. After the writer flushes its data, there must be a synchronization operation between the writer and the reader processes, such that the reader does not attempt to read newly written data until the writer has completed its flush operation.
UnifyFS can be configured to flush data to servers at various points. A common mechanism to flush data is for the writer process to call fsync() or fflush(). Also, by default, data is flushed when a file is closed with close() or fclose().
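For example, a minimal sketch of this pattern, assuming the file has already been opened on all ranks and using an MPI barrier as the synchronization operation (the rank roles and variable names are illustrative):

if (rank == 0) {
    /* writer: write new data, then flush it to the UnifyFS servers */
    write(fd, buf, count);
    fsync(fd);
}

/* synchronization between writer and reader processes */
MPI_Barrier(MPI_COMM_WORLD);

if (rank == 1) {
    /* reader: the newly written data is now safe to read */
    pread(fd, buf, count, 0);
}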
UnifyFS can be configured to behave more “POSIX like” by flushing newly written data to the server during every write operation. To do this, one can set UNIFYFS_CLIENT_WRITE_SYNC=ON. UNIFYFS_CLIENT_WRITE_SYNC=ON can decrease write performance, as the number of data flush operations may be more than necessary.
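For example, one might export this setting in the job environment before launching the application:

>>: export UNIFYFS_CLIENT_WRITE_SYNC=ON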
File Locking
UnifyFS does not support file locking, and calls to fcntl() and flock() are not intercepted by UnifyFS. Any calls fall through to the underlying operating system, which should report the corresponding file descriptor as invalid. If not detected, an application will encounter data corruption if it depends on file locking semantics for correctness. Tracing application I/O calls with VerifyIO can help determine whether any file locking calls are used.
Directory Operations
UnifyFS does not support directory operations.
MPI-IO Limitations
Data Consistency
When using MPI-I/O without atomic file consistency, the MPI standard requires the application to manage data consistency by calling MPI_File_sync(). After data has been written, the writer must call MPI_File_sync(). There must then be a synchronization operation between the writer and reader processes. Finally, the reader must call MPI_File_sync() after its synchronization operation with the writer.
A common approach is for the application to execute a
“sync-barrier-sync” construct as shown below:
MPI_File_sync() //flush newly written bytes from MPI library to file system
MPI_Barrier() //ensure all ranks have finished the previous sync
MPI_File_sync() //invalidate read cache in MPI library
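For example, placed between a write phase and a read phase, the construct might look like the following sketch, where fh, comm, and the particular write and read calls are illustrative:

MPI_File_write_at_all(fh, offset, wbuf, count, MPI_BYTE, &status); /* write phase */

MPI_File_sync(fh);   /* flush newly written bytes */
MPI_Barrier(comm);   /* synchronize writer and reader processes */
MPI_File_sync(fh);   /* invalidate read cache */

MPI_File_read_at_all(fh, offset, rbuf, count, MPI_BYTE, &status);  /* read phase */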
Note
The “barrier” in “sync-barrier-sync” can be replaced by a send-recv or certain collectives that are guaranteed to be synchronized. The synchronization operation does not even need to be an MPI call. See the “Note on the third step” in the VerifyIO README for more information.
Proper data consistency synchronization is also required between MPI-I/O calls that imply write or read operations. For example, MPI_File_set_size() and MPI_File_preallocate() act as write operations, and MPI_File_get_size() acts as a read operation. There may be other MPI-I/O calls that imply write or read operations. Both MPI_File_open() and MPI_File_close() implicitly call MPI_File_sync().
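For instance, if one process resizes a file and another process then queries the new size, the same sync-barrier-sync construct is needed between the two calls (a sketch, with fh and comm as illustrative names):

MPI_File_set_size(fh, new_size);   /* acts as a write operation */

MPI_File_sync(fh);
MPI_Barrier(comm);
MPI_File_sync(fh);

MPI_File_get_size(fh, &size);      /* acts as a read operation */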
Relaxed MPI_File_sync semantics
Data consistency in UnifyFS is designed to be compatible with MPI-I/O application-managed file consistency semantics. An application that follows proper MPI-I/O file consistency semantics using MPI_File_sync() should run correctly on UnifyFS, provided that the MPI_File_sync() implementation flushes newly written data to UnifyFS. On POSIX-compliant parallel file systems like Lustre, many applications can run correctly even when they are missing sufficient file consistency synchronization. In contrast, to run correctly on UnifyFS, an application should make all MPI_File_sync() calls as required by the MPI standard.
Note
It may be labor intensive to identify and correct all places within an application where file synchronization calls are required. The VerifyIO tool can assist developers in this effort.
In the current UnifyFS implementation, it is actually sufficient to make a single call to MPI_File_sync() followed by a synchronizing call like MPI_Barrier(), e.g.:
MPI_File_sync()
MPI_Barrier()
Assuming that MPI_File_sync() calls fsync(), then information about any newly written data will be transferred to the UnifyFS servers. The MPI_Barrier() then ensures that fsync() will have been called by all clients that may have written data. After the MPI_Barrier(), a process may read data from UnifyFS that was written by any other process before that other process called MPI_File_sync(). A second call to MPI_File_sync() is not (currently) required in UnifyFS.

Furthermore, if MPI_File_sync() is known to be a synchronizing collective, then a separate synchronization operation like MPI_Barrier() is not required.
In this case, an application might simplify to just the following:
MPI_File_sync()
Having stated those exceptions, it is best practice to adhere to the MPI standard and execute a full sync-barrier-sync construct. There exist potential optimizations such that future implementations of UnifyFS may require the full sequence of calls.
ROMIO Limitations
Data Consistency
In ROMIO, MPI_File_sync() calls fsync() and MPI_File_close() calls close(), each of which flushes information about newly written data to the UnifyFS servers. When using ROMIO, an application having appropriate “sync-barrier-sync” constructs as required by the MPI standard will run correctly on UnifyFS.
ROMIO Synchronizing Flush Hint
Although MPI_File_sync() is an MPI collective, it is not required to be synchronizing. One can configure ROMIO such that MPI_File_sync() is also a synchronizing collective. To enable this behavior, one can set the following ROMIO hint through an MPI_Info object or within a ROMIO hints file:
romio_synchronizing_flush true
This configuration can be useful to applications that only call MPI_File_sync() once rather than execute a full sync-barrier-sync construct. This hint was added starting with the ROMIO version available in the MPICH v4.0 release.
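For example, one might set the hint in an MPI_Info object when opening the file, in the same way as the data sieving hints shown below (a sketch; comm, filename, and amode are illustrative):

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "romio_synchronizing_flush", "true");
MPI_File_open(comm, filename, amode, info, &fh);
MPI_Info_free(&info);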
ROMIO Data Visibility Hint
Starting with the ROMIO version available in the MPICH v4.1 release, the read-only hint romio_visibility_immediate was added to inform the caller as to whether it is necessary to call MPI_File_sync() to manage data consistency. One can query the MPI_Info object associated with a file. If this hint is defined and its value is true, then the underlying file system does not require the sync-barrier-sync construct in order for a process to read data written by another process. Newly written data is visible to other processes as soon as the writer process returns from its write call. If the value of the hint is false, or if the hint is not defined in the MPI_Info object, then a sync-barrier-sync construct is required.
When using UnifyFS, an application must call MPI_File_sync() in all situations where the MPI standard requires it. However, since a sync-barrier-sync construct is costly on some file systems, and because POSIX-compliant file systems may not require it for correctness, one can use this hint to conditionally call MPI_File_sync() only when required by the underlying file system.
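For example, one might query the hint from an open file handle and fall back to the full construct when the hint is missing or false (a sketch; MPI_Info values are strings, so the comparison against "true" is an assumption about how the hint value is reported):

MPI_Info info;
char value[MPI_MAX_INFO_VAL + 1];
int flag = 0;

MPI_File_get_info(fh, &info);
MPI_Info_get(info, "romio_visibility_immediate", MPI_MAX_INFO_VAL, value, &flag);
MPI_Info_free(&info);

if (!flag || strcmp(value, "true") != 0) {
    /* hint missing or false: the sync-barrier-sync construct is required */
    MPI_File_sync(fh);
    MPI_Barrier(comm);
    MPI_File_sync(fh);
}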
File Locking
ROMIO requires file locking with fcntl() to implement various functionality. Since fcntl() is not supported in UnifyFS, one must avoid any ROMIO features that require file locking.
MPI-I/O Atomic File Consistency
ROMIO uses fcntl() to implement atomic file consistency. One cannot use atomic mode when using UnifyFS. Provided an application still executes correctly without atomic mode, one can disable it by calling:
MPI_File_set_atomicity(fh, 0)
Atomic mode is often disabled by default in ROMIO.
Data Sieving
ROMIO uses fcntl() to support its data sieving optimization. One must disable ROMIO data sieving when using UnifyFS. To disable data sieving, one can set the following ROMIO hints:
romio_ds_read disable
romio_ds_write disable
These hints can be set in the MPI_Info object when opening a file, e.g.:
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "romio_ds_read", "disable");
MPI_Info_set(info, "romio_ds_write", "disable");
MPI_File_open(comm, filename, amode, info, &fh);
MPI_Info_free(&info);
or the hints may be listed in a ROMIO hints file, e.g.:
>>: cat romio_hints.txt
romio_ds_read disable
romio_ds_write disable
>>: export ROMIO_HINTS="romio_hints.txt"
HDF5 Limitations
HDF5 uses MPI-I/O. In addition to restrictions that are specific to HDF5, one must follow any restrictions associated with the underlying MPI-I/O implementation. In particular, if the MPI library uses ROMIO for its MPI-I/O implementation, one should adhere to any limitations noted above for both ROMIO and MPI-I/O in general.
Data Consistency
In HDF5, H5Fflush() calls MPI_File_sync() and H5Fclose() calls MPI_File_close(). When running HDF5 on ROMIO or on other MPI-I/O implementations where these MPI routines flush newly written data to UnifyFS, one must invoke these HDF5 functions to properly manage data consistency.
When using HDF5 with the MPI-I/O driver, for a process to read data written by another process without closing the HDF5 file, the writer must call H5Fflush() after writing its data. There must then be a synchronization operation between the writer and reader processes. Finally, the reader must call H5Fflush() after the synchronization operation with the writer. This executes the sync-barrier-sync construct as required by MPI. For example:
H5Fflush(...)
MPI_Barrier(...)
H5Fflush(...)
If MPI_File_sync() is a synchronizing collective, as when the romio_synchronizing_flush MPI-I/O hint is enabled, then a single call to H5Fflush() suffices to accomplish the sync-barrier-sync construct:
H5Fflush(...)
HDF5 FILE_SYNC
Starting with the HDF5 v1.13.2 release, HDF5 can be configured to call MPI_File_sync() after every HDF5 collective write operation. This configuration is enabled automatically if the MPI-I/O implementation defines the romio_visibility_immediate hint as false. One can also enable this option manually by setting the environment variable HDF5_DO_MPI_FILE_SYNC=1.
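For example, in the job environment:

>>: export HDF5_DO_MPI_FILE_SYNC=1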
Enabling this option can decrease write performance
since it may induce more file flush operations than necessary.
PnetCDF Limitations
PnetCDF applications can utilize UnifyFS, and the semantics of the PnetCDF API align well with UnifyFS constraints.
PnetCDF uses MPI-IO to read and write files. In addition to any restrictions required when using UnifyFS with PnetCDF, one must follow any recommendations regarding UnifyFS and the underlying MPI-IO implementation.
Data Consistency
PnetCDF parallelizes access to NetCDF files using MPI. An MPI communicator is passed as an argument when opening a file. Any collective call in PnetCDF is global across the process group associated with the communicator used to open the file.
PnetCDF follows the data consistency model defined by MPI-IO. Specifically, from its documentation about PnetCDF data consistency:
Note
PnetCDF follows the same parallel I/O data consistency as MPI-IO standard. If users would like PnetCDF to enforce a stronger consistency, they should add NC_SHARE flag when open/create the file. By doing so, PnetCDF adds MPI_File_sync() after each MPI I/O calls. If NC_SHARE is not set, then users are responsible for their desired data consistency. To enforce a stronger consistency, users can explicitly call ncmpi_sync(). In ncmpi_sync(), MPI_File_sync() and MPI_Barrier() are called.
Upon inspection of the PnetCDF v1.12.3 implementation, the following PnetCDF functions make these internal calls:
ncmpio_file_sync
- calls MPI_File_sync(ncp->independent_fh)
- calls MPI_File_sync(ncp->collective_fh)
- calls MPI_Barrier
ncmpio_sync
- calls ncmpio_file_sync
ncmpi__enddef
- calls ncmpio_file_sync if NC_doFsync (NC_SHARE)
ncmpio_enddef
- calls ncmpi__enddef
ncmpio_end_indep_data
- calls MPI_File_sync if NC_doFsync (NC_SHARE)
ncmpio_redef
- does *NOT* call ncmpio_file_sync
ncmpio_close
- calls ncmpio_file_sync if NC_doFsync (NC_SHARE)
- calls MPI_File_close (MPI_File_close calls MPI_File_sync by MPI standard)
If a program must read data written by another process, PnetCDF users must do one of the following when using UnifyFS:
- Add explicit calls to ncmpi_sync() after writing and before reading.
- Set UNIFYFS_CLIENT_WRITE_SYNC=1, in which case each POSIX write operation invokes a flush.
- Use NC_SHARE when opening files so that the PnetCDF library invokes MPI_File_sync() and MPI_Barrier() calls after its MPI-IO operations.
Of these options, it is recommended that one add ncmpi_sync() calls where necessary. Setting UNIFYFS_CLIENT_WRITE_SYNC=1 is convenient since one does not need to change the application, but it may have a larger impact on performance. Opening or creating a file with NC_SHARE may work for some applications, but it depends on whether the PnetCDF implementation internally calls MPI_File_sync() at all appropriate places, which is not guaranteed.
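For example, a sketch of the recommended approach, where ncid and varid refer to an open file and variable, and the particular put/get calls are illustrative:

/* write phase: each rank writes its portion of a variable */
ncmpi_put_vara_float_all(ncid, varid, start, count, buf);

/* flush newly written data and synchronize (calls MPI_File_sync and MPI_Barrier) */
ncmpi_sync(ncid);

/* read phase: data written by other ranks is now visible */
ncmpi_get_vara_float_all(ncid, varid, rstart, rcount, rbuf);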
A number of PnetCDF calls invoke write operations on the underlying file. In addition to the ncmpi_put_* collection of calls that write data to variables or attributes, ncmpi_enddef updates variable definitions, and it can fill variables with default values. Users may also explicitly fill variables by calling ncmpi_fill_var_rec(). One must ensure necessary ncmpi_sync() calls are placed between any fill and write operations in case they happen to write to overlapping regions of a file.
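For example, a sketch, assuming the file was created with fill mode enabled so that ncmpi_enddef may fill variables before the application writes its own data:

ncmpi_enddef(ncid);  /* may fill variables with default values (a write operation) */

ncmpi_sync(ncid);    /* flush the fill data and synchronize before subsequent writes */

ncmpi_put_vara_float_all(ncid, varid, start, count, buf);  /* now write real data */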
Note that ncmpi_sync() calls MPI_File_sync() and MPI_Barrier(), but it does not call MPI_File_sync() again after calling MPI_Barrier(). To execute a full sync-barrier-sync construct, one technically must call ncmpi_sync() twice:
// to accomplish sync-barrier-sync
ncmpi_sync(...) // call MPI_File_sync and MPI_Barrier
ncmpi_sync(...) // call MPI_File_sync again
When using UnifyFS, a single call to ncmpi_sync() should suffice since UnifyFS does not (currently) require the second call to MPI_File_sync(), as noted above.