Writing
Last updated
TileDB is architected to support parallel batch writes, i.e., writing collections of cells with multiple processes or threads. Each write operation creates one or more dense or sparse fragments. Updating an array is equivalent to initiating a new write operation, which could either insert cells in unpopulated areas of the domain or overwrite existing cells (or a combination of the two). TileDB handles each write separately and without any locking. Each fragment is immutable, i.e., write operations always create new fragments, without altering any other fragment.
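The fragment model described above can be pictured with a small toy sketch. This is not TileDB's implementation or API: the names `Fragment`, `ToyArray`, `write`, and `read` are hypothetical, and the rule that a read returns the value from the most recent fragment containing the cell is an assumption made for illustration.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class Fragment:
    """One immutable batch of written cells (hypothetical toy model)."""
    timestamp: int
    cells: Mapping  # read-only mapping: (row, col) -> value


class ToyArray:
    """Append-only list of fragments; no fragment is modified after creation."""

    def __init__(self):
        self.fragments = []
        self._clock = 0

    def write(self, cells):
        # Every write creates a new fragment and leaves older ones untouched.
        self._clock += 1
        fragment = Fragment(self._clock, MappingProxyType(dict(cells)))
        self.fragments.append(fragment)
        return fragment

    def read(self, coord):
        # Assumption for this sketch: the newest fragment holding the cell wins.
        for fragment in reversed(self.fragments):
            if coord in fragment.cells:
                return fragment.cells[coord]
        return None


array = ToyArray()
first = array.write({(1, 1): 10, (1, 2): 20})
second = array.write({(1, 2): 99, (2, 2): 30})  # overwrites (1, 2), inserts (2, 2)
```

After the second write, the first fragment still holds its original value for cell (1, 2); the overwrite exists only in the newer fragment.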
A dense write is applicable to dense arrays and creates one or more dense fragments. In a dense write, the user provides:
The subarray to write into (it must be single-range).
The buffers that contain the attribute values of the cells that are being written.
The cell order within the subarray (which must be common across all attributes), so that TileDB knows which values correspond to which cells in the array domain. The cell order may be row-major, column-major, or global.
The example below illustrates writing into a subarray of an array with a single attribute. The figure depicts the order of the attribute values in the user buffers for the case of row- and column-major cell order. TileDB knows how to appropriately re-organize the user-provided values so that they obey the global cell order before storing them to disk. Moreover, note that TileDB always writes integral space tiles to disk. Therefore, it will inject special empty values (depicted in grey below) into the user data to create full data tiles for each space tile.
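As a rough illustration of the re-organization described above, the sketch below converts a row-major user buffer into global order and pads each touched space tile with empty values. It is a conceptual model only, not TileDB code. It assumes a 2D array with row-major tile order and row-major cell order within each tile; the function name `row_major_to_global` and the `EMPTY` marker (standing in for the fill value) are hypothetical.

```python
EMPTY = None  # stand-in for the fill value injected into partially written tiles


def row_major_to_global(values, subarray, domain_start, tile_extents):
    """Reorder a row-major 2D subarray buffer into global order.

    Assumes row-major tile order and row-major cell order inside each tile.
    `subarray` is ((row_lo, row_hi), (col_lo, col_hi)), both ends inclusive.
    """
    (r0, r1), (c0, c1) = subarray
    n_cols = c1 - c0 + 1
    if len(values) != (r1 - r0 + 1) * n_cols:
        raise ValueError("buffer size does not match the subarray")

    # Map each buffer position to its cell coordinates (row-major layout).
    cells = {}
    for i, value in enumerate(values):
        cells[(r0 + i // n_cols, c0 + i % n_cols)] = value

    tile_rows, tile_cols = tile_extents
    start_row, start_col = domain_start
    # Index ranges of the space tiles that intersect the subarray.
    first_tr, last_tr = (r0 - start_row) // tile_rows, (r1 - start_row) // tile_rows
    first_tc, last_tc = (c0 - start_col) // tile_cols, (c1 - start_col) // tile_cols

    out = []
    for tr in range(first_tr, last_tr + 1):          # tile order: row-major
        for tc in range(first_tc, last_tc + 1):
            row_begin = start_row + tr * tile_rows
            col_begin = start_col + tc * tile_cols
            for r in range(row_begin, row_begin + tile_rows):    # cell order: row-major
                for c in range(col_begin, col_begin + tile_cols):
                    out.append(cells.get((r, c), EMPTY))
    return out


# 4x4 domain starting at (1, 1), 2x2 space tiles; write rows [1, 2], columns [2, 3].
global_buffer = row_major_to_global([1, 2, 3, 4], ((1, 2), (2, 3)), (1, 1), (2, 2))
# global_buffer == [None, 1, None, 3, 2, None, 4, None]
```

The four written values straddle two space tiles, so the result contains two full tiles of four cells each, with the unwritten cells marked as empty.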
Writing in the array global order requires more care. The subarray must be specified such that it coincides with space tile boundaries, even if the user wishes to write in a smaller area within that subarray. The user is responsible for manually adding any necessary empty cell values to the buffers. This is illustrated in the figure below, where the user wishes to write in the blue cells, but has to expand the subarray to coincide with the two space tiles and provide the empty values for the grey cells as well. The user must provide all cell values in the global order, i.e., following the tile order of the space tiles and the cell order within each space tile.
Writing in global order requires knowledge of the space tiling and cell/tile order, and is rather cumbersome to use. However, this write mode leads to the best performance, because TileDB does not need to internally re-organize the cells along the global order. It is recommended for use cases where the data arrive already grouped according to the space tiling and global order (e.g., in geospatial applications).
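The tile-alignment requirement for global-order writes can be checked with a few lines of arithmetic. The sketch below is not part of any TileDB API; the helper names `expand_to_space_tiles` and `is_tile_aligned` are hypothetical, and it assumes integer dimensions whose domain length is a multiple of the tile extent (so no clamping to the domain end is needed).

```python
def expand_to_space_tiles(subarray, domain_start, tile_extents):
    """Grow each (lo, hi) range outward to the enclosing space tile boundaries.

    Assumes integer dimensions and a domain that is a whole number of tiles.
    """
    expanded = []
    for (lo, hi), start, extent in zip(subarray, domain_start, tile_extents):
        new_lo = start + ((lo - start) // extent) * extent
        new_hi = start + ((hi - start) // extent + 1) * extent - 1
        expanded.append((new_lo, new_hi))
    return expanded


def is_tile_aligned(subarray, domain_start, tile_extents):
    """True if the subarray already coincides with space tile boundaries."""
    return expand_to_space_tiles(subarray, domain_start, tile_extents) == list(subarray)


# 2x2 space tiles on a domain starting at (1, 1).
wanted = [(1, 2), (2, 3)]  # rows 1-2, columns 2-3: cuts through two tiles
aligned = expand_to_space_tiles(wanted, (1, 1), (2, 2))
# aligned == [(1, 2), (1, 4)]
```

Every cell in the expanded range that lies outside the originally wanted range is one for which the user has to supply an empty value in the buffers.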
TileDB uses the following default fill values for empty cells in dense writes. The user can specify a different fill value upon array creation.
| Datatype | Default fill value |
| --- | --- |
| TILEDB_CHAR | Minimum char value |
| TILEDB_INT8 | Minimum int8 value |
| TILEDB_UINT8 | Maximum uint8 value |
| TILEDB_INT16 | Minimum int16 value |
| TILEDB_UINT16 | Maximum uint16 value |
| TILEDB_INT32 | Minimum int32 value |
| TILEDB_UINT32 | Maximum uint32 value |
| TILEDB_INT64 | Minimum int64 value |
| TILEDB_UINT64 | Maximum uint64 value |
| TILEDB_FLOAT32 | NaN |
| TILEDB_FLOAT64 | NaN |
| TILEDB_ASCII | 0 |
| TILEDB_UTF8 | 0 |
| TILEDB_UTF16 | 0 |
| TILEDB_UCS2 | 0 |
| TILEDB_UCS4 | 0 |
| TILEDB_ANY | 0 |
| TILEDB_DATETIME_* | Minimum int64 value |
If a fixed-sized attribute stores more than one value per cell, every value in the cell is assigned the corresponding default fill value shown above.
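The default fill rules can be expressed as a small lookup. The sketch below is illustrative only and covers a subset of the datatypes listed above; the function `default_fill` and its `cell_val_num` parameter are hypothetical names, and treating TILEDB_CHAR as a signed 8-bit value is an assumption, since the signedness of `char` depends on the platform.

```python
import math

SIGNED_BITS = {"TILEDB_INT8": 8, "TILEDB_INT16": 16, "TILEDB_INT32": 32, "TILEDB_INT64": 64}
UNSIGNED_BITS = {"TILEDB_UINT8": 8, "TILEDB_UINT16": 16, "TILEDB_UINT32": 32, "TILEDB_UINT64": 64}
ZERO_FILLED = {"TILEDB_ASCII", "TILEDB_UTF8", "TILEDB_UTF16", "TILEDB_ANY"}


def default_fill(datatype, cell_val_num=1):
    """Return the default fill for one cell as a list of `cell_val_num` values."""
    if datatype in SIGNED_BITS:
        value = -(2 ** (SIGNED_BITS[datatype] - 1))   # minimum signed value
    elif datatype in UNSIGNED_BITS:
        value = 2 ** UNSIGNED_BITS[datatype] - 1      # maximum unsigned value
    elif datatype == "TILEDB_CHAR":
        value = -128                                  # assumption: signed 8-bit char
    elif datatype in ("TILEDB_FLOAT32", "TILEDB_FLOAT64"):
        value = math.nan
    elif datatype.startswith("TILEDB_DATETIME_"):
        value = -(2 ** 63)                            # minimum int64 value
    elif datatype in ZERO_FILLED:
        value = 0
    else:
        raise ValueError(f"datatype not covered by this sketch: {datatype}")
    # A cell holding several values repeats the same fill for each of them.
    return [value] * cell_val_num


triple_int32 = default_fill("TILEDB_INT32", cell_val_num=3)
# triple_int32 == [-2147483648, -2147483648, -2147483648]
```

The last call shows the multi-value case: a fixed-sized attribute with three int32 values per cell gets the minimum int32 value in all three positions.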
Sparse writes are applicable to sparse arrays and create one or more sparse fragments. The user must provide:
The attribute values to be written.
The coordinates of the cells to be written.
The cell layout of the attribute and coordinate values to be written (must be the same across attributes and dimensions). The cell layout may be unordered or global.
Note that sparse writes do not need to be constrained to a subarray, since they contain the explicit coordinates of the cells to write into. The figure below shows a sparse write example with the two cell layouts. The unordered layout is the easiest and most typical. TileDB knows how to appropriately re-organize the cells along the global order internally before writing the values to disk. The global layout is once again more efficient but also more cumbersome, since the user must know the space tiling and the tile/cell order of the array, and manually sort the values before providing them to TileDB.
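The manual sort required by the global layout might look like the sketch below. It is a conceptual example, not TileDB code: the helpers `global_order_key` and `sort_to_global_order` are hypothetical, and the sketch assumes integer coordinates with row-major tile order and row-major cell order. Other tile or cell orders would need a different sort key.

```python
def global_order_key(coord, domain_start, tile_extents):
    """Sort key: space tile index first, then the coordinates within the tile.

    Assumes row-major tile order and row-major cell order.
    """
    tile_index = tuple((c - s) // e for c, s, e in zip(coord, domain_start, tile_extents))
    return (tile_index, tuple(coord))


def sort_to_global_order(coords, values, domain_start, tile_extents):
    """Reorder parallel coordinate and value buffers into global order."""
    order = sorted(
        range(len(coords)),
        key=lambda i: global_order_key(coords[i], domain_start, tile_extents),
    )
    return [coords[i] for i in order], [values[i] for i in order]


# Unordered cells on a domain starting at (1, 1) with 2x2 space tiles.
coords = [(3, 1), (1, 3), (2, 2), (1, 1), (4, 4)]
values = ["a", "b", "c", "d", "e"]
sorted_coords, sorted_values = sort_to_global_order(coords, values, (1, 1), (2, 2))
# sorted_coords == [(1, 1), (2, 2), (1, 3), (3, 1), (4, 4)]
# sorted_values == ["d", "c", "b", "a", "e"]
```

Cells (1, 1) and (2, 2) share the first space tile and therefore come first, followed by the cells of the tiles to the right and below, in tile order.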