TileDB enables concurrent writes and reads that can be arbitrarily mixed, without affecting the normal execution of a parallel program. This comes with a more relaxed consistency model, called eventual consistency. Informally, this guarantees that, if no new updates are made to an array, eventually all accesses to the array will “see” the last collective global view of the array (i.e., one that incorporates all the updates). Everything discussed in this section about array fragments is also applicable to array metadata.
We illustrate the concept of eventual consistency in the figure below (which is the same for both dense and sparse arrays). Suppose we perform two writes in parallel (by different threads or processes), producing two separate fragments. Assume also that there is a read at some point in time, which is also performed by a third thread/process (potentially in parallel with the writes). There are five possible scenarios regarding the logical view of the array at the time of the read (i.e., five different possible read query results). First, no write may have completed yet, therefore the read sees an empty array. Second, only the first write got completed. Third, only the second write got completed. Fourth, both writes got completed, but the first write was the one to create a fragment with an earlier timestamp than the second. Fifth, both writes got completed, but the second write was the one to create a fragment with an earlier timestamp than the first.
Illustration of eventual consistency
The concept of eventual consistency essentially tells you that, eventually (i.e., after all writes have completed), you will see the view of the array with all updates in. The order of the fragment creation will determine which cells are overwritten by others and, hence, greatly affects the final logical view of the array.
Eventual consistency allows high availability and concurrency. This model is followed by the AWS S3 object store and, thus, TileDB is ideal for integrating with such distributed storage backends. If strict consistency is required for some application (e.g., similar to that in transactional databases), then an extra layer must be built on top of TileDB Embedded to enforce additional synchronization.
But how does TileDB deal internally with consistency? This is where opening an array becomes important. When you open an array (at the current time or a time in the past), TileDB takes a snapshot of the already completed fragments. This the view of the array for all queries that will be using that opened array object. If writes happen (or get completed) after the array got opened, the queries will not see the new fragments. If you wish to see the new fragments, you will need to either open a new array object and use that one for the new queries, or reopen the array (reopening the array bypasses closing it first, permitting some performance optimizations).
We illustrate with the figure below. The first array depicts the logical view when opening the array. Next suppose a write occurs (after opening the array) that creates the fragment shown as the second array in the figure. If we attempt to read from the opened array, even after the new fragment creation, we will see the view of the third array in the figure. In other words, we will not see the updates that occurred between opening and reading from the array. If we'd like to read from the most up-to-date array view (fourth array in the figure), we will need to reopen the array after the creation of the fragment.
Different views when opening the array in the presence of concurrent writes
When you write to TileDB with multiple processes, if your application is the one to be synchronizing the writes across machines, make sure that the machine clocks are synchronized as well. This is because TileDB sorts the fragments based on the timestamp in their names, which is calculated based on the machine clock.
Here is how TileDB reads achieve eventual consistency on AWS S3:
- 1.Upon opening the array, list the fragments in the array folder
- 2.Consider only the fragments that have an associated
.okfile (the ones that do not have one are either in progress or not visible due to S3’s eventual consistency)
.okfile is PUT after all the fragment data and metadata files have been PUT in the fragment folder.
- 4.Any access inside the fragment folder is performed with a byte range GET request, never with LIST. Due to S3’s read-after-write consistency model, those GET requests are guaranteed to succeed.
The above practically tells you that a read operation will always succeed and never be corrupted (i.e., it will never have results from partially written fragments), but it will consider only the fragments that S3 makes visible (in their entirety) at the timestamp of opening the array.