Comment on page
In addition to using parallelism internally, TileDB is designed having parallel programming in mind. Specifically, scientific computing users may be accustomed to using multi-processing (e.g., via MPI or Dask or Spark), or writing multi-threaded programs to speed up performance. TileDB enables concurrency using a multiple writer / multiple reader model that is entirely lock-free.
Concurrent writes are achieved by having each thread or process create one or more separate fragments for each write operation. No synchronization is needed across processes and no internal state is shared across threads among the write operations and, thus, no locking is necessary. Regarding the concurrent creation of the fragments, thread- and process-safety is achieved because each thread/process creates a fragment with a unique name (as it incorporates a UUID). Therefore, there are no conflicts even at the storage backend level.
TileDB supports lock-free concurrent writes of array metadata as well. Each write creates a separate array metadata file with a unique name (also incorporating a UUID), and thus name collisions are prevented.
During opening the array, TileDB loads the array schema and fragment metadata to main memory once, and shares them across all array objects referring to the same array. Therefore, for the multi-threading case, it is highly recommended that you open the array once outside the atomic block and have all threads create the query on the same array object. This is to prevent the scenario where a thread opens the array, then closes it before another thread opens the array again, and so on. TileDB internally employs a reference-count system, discarding the array schema and fragment metadata each time the array is closed and the reference count reaches zero (the schema and metadata are typically cached, but they still need to be deserialized in the above scenario). Having all concurrent queries use the same array object eliminates the above problem.
Reads in the multi-processing setting are completely independent and no locking is required. In the multi-threading scenario, locking is employed (through mutexes) only when the queries access the tile cache, which incurs a very small overhead.
Concurrent reads and writes can be arbitrarily mixed. Fragments are not visible unless the write query has been completed (and the
.okfile appeared). Fragment-based writes make it so that reads simply see the logical view of the array without the new (incomplete) fragment. This multiple writers / multiple readers concurrency model of TileDB is more powerful than competing approaches, such as HDF5’s single writer / multiple readers (SWMR) model. This feature comes with a more relaxed consistency model, which is described in the Consistency section.
Consolidation can be performed in the background in parallel with and independently of other reads and writes. The new fragment that is being created is not visible to reads before consolidation is completed.
Vacuuming deletes fragments that have been consolidated. Although it can never lead to a corrupted array state, it may lead to issues if there is a read operation that accesses a fragment that is being vacuumed. This is possible when the array is opened at a timestamp before some consolidation operation took place, therefore considering the fragment to be vacuumed. Most likely, that will lead to a segfault or some unexpected behavior.
TileDB locks the array upon vacuuming to prevent the above. This is achieved via mutexes in multi-threading, and file locking in multi-processing (for those storage backends that support it).
All POSIX-compliant filesystems and Windows filesystems support file locking. Note that Lustre supports POSIX file locking semantics and exposes local- (mount with
-o localflock) and cluster- (mount with
-o flock) level locking. Currently, TileDB does not use file locking on HDFS and S3 (these storage backends do not provide such functionality, but rather resource locking must be implemented as an external feature). For filesystems that do not support filelocking, the multi-processing programs are responsible for synchronizing the concurrent writes.
Particular care must be taken when vacuuming arrays on AWS S3 and HDFS. Without filelocking TileDB has no way to prevent vacuuming from deleting the old consolidated fragments. If another process is reading those fragments while consolidation is deleting them, the reading process is likely to error out or crash.
In general, avoid executing vacuuming when time traveling upon reading in cloud object stores. It is generally safe to vacuum if you are reading the array at the current timestamp.
Array creation (i.e., storing the array schema on persistent storage) is not thread-/process-safe. We do not expect a practical scenario where multiple threads/processes attempt to create the same array in parallel. We suggest that only one thread/process creates the array, before multiple threads/processes start working concurrently for writes and reads.