One of the fundamental questions when designing the array schema is: "what are my dimensions and what are my attributes?" The answer largely depends on whether your array is dense or sparse. Two good rules of thumb apply to both dense and sparse arrays:
If you frequently perform range slicing over a field/column of your dataset, you should consider making it a dimension.
The order of the dimensions in the array schema matters. More selective dimensions (i.e., with greater pruning power) should be defined before less selective ones.
In dense arrays, distinguishing the dimensions from the attributes is potentially more straightforward. If you can model your data in a space where every cell has a value (e.g., an image or a video), then your array is dense and the dimensions are easy to discern (e.g., width and height in an image, or width, height and time in a video).
It is important to remember that dense arrays in TileDB do not explicitly materialize the coordinates of the cells (in dense fragments), which may result in significant storage savings if you are able to model your array as dense. Moreover, reads in dense arrays may be faster than in sparse arrays, as dense arrays use implicit spatial indexing and, therefore, the internal structures and state are much more lightweight.
It is always good to think of a sparse dataset as a dataframe before you start designing your array, i.e., as a set of "flattened" columns with no notion of dimensionality. Then follow the two rules of thumb above.
Recall that TileDB explicitly materializes the coordinates of the sparse cells. Therefore, make sure that your array is sufficiently sparse. If it is rather dense, consider defining it as dense instead, filling the empty cells with user-defined "dummy" values that you can recognize (so that you can filter them out manually after receiving the results from TileDB).
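To make the rules of thumb concrete, here is a minimal TileDB-Py sketch (the dataset, field names and URI are hypothetical): a dataframe-like dataset whose frequently range-sliced columns become dimensions, with the most selective one first, while everything else remains an attribute.

```python
import numpy as np
import tiledb

# Hypothetical dataframe-like dataset with columns
# (timestamp, sensor_id, temperature, humidity).
# We frequently range-slice on timestamp and sensor_id, so they become
# dimensions (most selective first); the other columns stay attributes.
dom = tiledb.Domain(
    tiledb.Dim(name="timestamp", domain=(0, 2**40), tile=86_400, dtype=np.int64),
    tiledb.Dim(name="sensor_id", domain=(0, 10_000), tile=100, dtype=np.int64),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,  # dataframe-like data with gaps -> sparse array
    attrs=[
        tiledb.Attr(name="temperature", dtype=np.float64),
        tiledb.Attr(name="humidity", dtype=np.float64),
    ],
)
tiledb.Array.create("sensor_readings", schema)  # hypothetical array URI
```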
Recall that in dense arrays (and, more precisely, dense fragments), there is a one-to-one correspondence between a space tile and a data tile (for each attribute), and the data tile is the atomic unit of IO. TileDB will fetch all data tiles corresponding to every space tile that overlaps with the query subarray. Therefore, the tile extents along each dimension must be set in a way that the resulting space tiles follow more or less the shape of the most typical subarray queries.
Furthermore, the size of the space tile affects performance. A larger tile may lead to fetching a lot of data that is irrelevant to the query, but it can also result in better compression ratios and more parallelism (for both filtering and IO). It is recommended to define the space tile such that the corresponding data tiles for each attribute are at least 64KB (a typical size for the L1 cache).
Recall also that, in addition to the space tile, the tile and cell order determine the global cell order, i.e., the way the data tile values are laid out in physical storage. In general, both the tile and cell order should follow the layout in which you are expecting the read results. This maximizes your chances that the relevant data are concentrated in smaller byte regions in the physical files, which TileDB can exploit to fetch the data faster.
In sparse arrays (and, more precisely, sparse fragments), there is no one-to-one correspondence between space tiles and data tiles. In contrast, the space tiles help determine the global cell order and, therefore, can be used to preserve the spatial locality of the values in physical storage. By determining the global cell order, space tiles effectively determine the shape of the MBRs of the data tiles. Recall that TileDB fetches only the data tiles whose MBR intersects with the subarray query. Therefore, the tile extents must once again be set in a way that the resulting space tiles have a similar shape to the subarray query.
Recall also that the size of the data tile in sparse arrays is not determined by the tile extents, but rather by the tile capacity. Similar to the case of dense arrays, it must be set appropriately to balance fetching irrelevant data and maximizing compression ratio and parallelism.
Finally, for similar reasons to the dense case, it is recommended that the tile and cell order be chosen according to the typical result layout.
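The following hedged sketch (TileDB-Py, with illustrative names, domains and sizes) shows how the above choices appear in a schema: dense space tiles shaped like a typical 500 x 100 read, and a sparse schema where the capacity and tile/cell order are set explicitly.

```python
import numpy as np
import tiledb

# Dense case: typical reads fetch ~500 rows x all 100 columns, so the space
# tiles are shaped 500 x 100. Each data tile then holds 50,000 float64 cells
# (~400 KB per attribute), comfortably above the ~64KB guideline.
dense_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="row", domain=(0, 999_999), tile=500, dtype=np.int64),
        tiledb.Dim(name="col", domain=(0, 99), tile=100, dtype=np.int64),
    ),
    sparse=False,
    tile_order="row-major",   # match the layout in which results are consumed
    cell_order="row-major",
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],
)

# Sparse case: space tiles shape the MBRs of the data tiles, while the
# capacity (cells per data tile) balances wasted I/O against compression
# ratio and parallelism.
sparse_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="x", domain=(0, 999_999), tile=10_000, dtype=np.int64),
        tiledb.Dim(name="y", domain=(0, 999_999), tile=10_000, dtype=np.int64),
    ),
    sparse=True,
    capacity=100_000,
    tile_order="row-major",
    cell_order="row-major",
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],
)
```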
The following is a high-level overview of the read operations performed during a TileDB array query:
Array open:
List all array schema files in the __schemas prefix.
Download all array schema files from the __schemas prefix.
List all fragments in the __fragments prefix.
Download the __fragment_metadata files for each fragment within the timestamp bounds of the active array.
Cell selection:
(Sparse arrays only): Download all dimension (d*.tdb) tiles within the ranges requested by the query, based on the minimum bounding rectangles for each fragment.
(Dense and sparse arrays):
If the query includes a query condition, download all attribute tiles necessary to evaluate the query condition.
If the array has been consolidated with timestamps, download all timestamp tiles necessary for timestamp filtering.
Dense read:
Download all tiles in each requested attribute (a*.tdb files) intersecting the query ranges. Tiles will be filtered based on the query condition selector, if provided.
Sparse read:
Download all tiles in each requested attribute (a*.tdb files) with cell coordinates matching the query ranges. Tiles will be filtered based on the query condition selector, if provided.
The subarray shape should follow the space tiling of the array as much as possible. See Choosing Tiling and Cell Layout, since read performance is highly dependent on the array tiling.
The most efficient layout to read in is global order, as TileDB will avoid sorting and re-organizing the result cells internally as much as possible. Row-major and col-major layouts require sorting and thus more work for TileDB. In general, the read layout must coincide as much as possible with the global order.
The larger the read, the better for performance. This is because TileDB can take advantage of parallelism and perform better parallel tile filtering and IO.
Section Read Parallelism describes in detail how TileDB internally parallelizes reads. You should consider fine-tuning the parameters explained therein.
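For example, a read that subsets attributes and requests results in global order might look like the following TileDB-Py sketch (the array URI and attribute name are hypothetical, and the exact query() keywords may vary slightly across TileDB-Py versions):

```python
import tiledb

with tiledb.open("my_array") as A:  # hypothetical array URI
    # Request only the attribute we need and ask for global order ("G"),
    # letting TileDB skip the internal sort required by row-/col-major
    # result layouts.
    data = A.query(attrs=["value"], order="G")[0:100_000]
```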
TileDB is a powerful engine that provides you with great flexibility to adapt it to your application. If tuned properly, TileDB can offer extreme performance even in the most challenging settings. Unfortunately, performance tuning in general is a complex topic, and there are many factors that influence the performance of TileDB. There is no unified way to improve performance for all TileDB programs, as the space of tradeoffs is quite large, and each use case can potentially benefit from very different choices for these tradeoffs.
To facilitate performance tuning, in this page group we provide helpful tips for tuning TileDB for your application. TileDB also has functionality for inspecting performance statistics, which may reveal some obvious performance deficiencies and further help in optimizing performance.
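For instance, with TileDB-Py you can collect and print the internal statistics around a query (a minimal sketch; the array URI is hypothetical):

```python
import tiledb

tiledb.stats_enable()               # start collecting internal statistics
with tiledb.open("my_array") as A:  # hypothetical array URI
    _ = A[0:1_000]
tiledb.stats_dump()                 # print I/O, filtering and timer counters
tiledb.stats_disable()
```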
We are always happy to help in case you do not get the desired performance out of TileDB in your use case. Please post a comment on the TileDB Forum or drop us a line at hello@tiledb.com, sharing with us the way you create/write/read your TileDB arrays, and we will quickly get back to you with suggestions.
The most efficient layout to write in is global order, as TileDB will not need to sort and re-organize the written cells internally. Row-major, col-major and unordered layouts require sorting and thus more work for TileDB. In general, the write layout and the sub-domain in which the write occurs must coincide as much as possible with the space tiles and the global order.
The larger the writes, the better for performance. This is because TileDB can take advantage of parallelism and perform better parallel tile filtering and IO. Moreover, larger (and more infrequent) writes generally lead to fewer fragments, which improves read performance.
What affects read performance with respect to fragments is:
The number of fragments.
The overlap of the fragments' non-empty domains.
The fewer and more "spatially disjoint" the fragments are, the better TileDB can prune fragments during reads and the better the performance. Make sure that each write operation focuses on a separate sub-domain.
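You can check how many fragments an array has accumulated, and how their non-empty domains overlap, with the fragment info API in TileDB-Py (a sketch; the array URI is hypothetical):

```python
import tiledb

frags = tiledb.array_fragments("my_array")  # hypothetical array URI
print("number of fragments:", len(frags))
for frag in frags:
    # Heavily overlapping non-empty domains reduce TileDB's ability to
    # prune fragments at read time.
    print(frag.uri, frag.nonempty_domain)
```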
Section Write Parallelism describes in detail how TileDB internally parallelizes writes. You should consider fine-tuning the parameters explained therein.
The best scenario for maximizing read performance is to have a single fragment. You can end up with a single fragment by (i) performing a single write (which may not be possible in applications where the data is much larger than RAM), (ii) writing in global order, i.e., appending data to your fragments (which may not be possible in applications where the data do not arrive in global order), or (iii) frequently consolidating your fragments, which is the most reasonable choice for most applications. However, properly tuning consolidation for an application may be challenging.
We provide a few tips for maximizing the consolidation performance.
Perform dense writes in subarrays that align with the space tile boundaries. This prevents TileDB from padding the written fragments with empty cell values, which may make finding consolidatable fragments more difficult.
Update the (dense) array by trying to rewrite the same dense subarrays. This helps the pre-processing clean-up step, which will try to rapidly delete older, fully overwritten fragments without performing consolidation.
For sparse arrays (or sparse writes in dense arrays), perform writes of approximately equal sizes. This will lead to balanced consolidation.
It may be a good idea to invoke consolidation after every write, tuning sm.consolidation.step_min_frags, sm.consolidation.step_max_frags and sm.consolidation.steps to emulate the way LSM-trees work. Specifically, choose a reasonable value for sm.consolidation.step_min_frags and sm.consolidation.step_max_frags, e.g., 2-20. This ensures that only a small number of fragments gets consolidated per step. Then you can set the number of steps (sm.consolidation.steps) to something large, so that consolidation proceeds recursively until there is a single fragment (or very few fragments). If consolidation is invoked after each write, the consolidation cost is amortized over all ingestion processes in the lifetime of your system. Note that in that case the consolidation times will be quite variable: sometimes no consolidation will be needed at all, sometimes a few fast consolidation steps may be performed (involving a few small fragments), and sometimes (although much less frequently) consolidation may take much longer, as it may be consolidating very large fragments. Nevertheless, this approach leads to a great amortized overall ingestion time, resulting in very few fragments and, hence, fast reads.
Increase the buffer size used internally during consolidation. This is controlled by the config parameter sm.consolidation.buffer_size, which determines the buffer size used per attribute when reading from the old fragments and before writing to the new consolidated fragment. A larger buffer size increases the overall performance.
Very large fragments will not perform optimally when reading arrays. For that reason, you can change the sm.consolidation.max_fragment_size parameter to split the result of the consolidation into multiple fragments that are each smaller in size than the set value in bytes.
See Configuration for information on how to set the above-mentioned configuration parameters.
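Putting the tips above together, the following hedged sketch (TileDB-Py; the URI and parameter values are illustrative) invokes consolidation after a write with LSM-tree-like settings:

```python
import tiledb

consolidation_cfg = tiledb.Config({
    # Consolidate between 2 and 10 fragments per step, recursively, until
    # only one (or very few) fragments remain.
    "sm.consolidation.step_min_frags": "2",
    "sm.consolidation.step_max_frags": "10",
    "sm.consolidation.steps": "100",
    # Larger per-attribute buffers generally speed up consolidation.
    "sm.consolidation.buffer_size": str(500_000_000),
    # Cap the size (in bytes) of each resulting consolidated fragment.
    "sm.consolidation.max_fragment_size": str(5_000_000_000),
})

tiledb.consolidate("my_array", config=consolidation_cfg)  # hypothetical URI
tiledb.vacuum("my_array")  # remove the fragments that were consolidated away
```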
TileDB's Python integration works well with Python's multiprocessing, ThreadPoolExecutor and ProcessPoolExecutor. We have a large usage example demonstrating parallel CSV ingestion here, which may be run in either threadpool or processpool mode.
Caution: the default multiprocessing execution method for ProcessPoolExecutor on Linux is not compatible with TileDB (nor with most other multi-threaded applications) due to complications of global process state after fork. ProcessPoolExecutor must be used with multiprocessing.set_start_method("spawn") to avoid unexpected behavior (such as hangs and crashes).
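A minimal sketch of the safe pattern (the worker function, chunk logic and array URI are hypothetical):

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

import tiledb


def write_chunk(chunk_id: int) -> int:
    # Hypothetical worker: each process opens the array independently and
    # writes its own chunk.
    with tiledb.open("my_array", "w") as A:  # hypothetical array URI
        ...  # write the data for `chunk_id` here
    return chunk_id


if __name__ == "__main__":
    # Use "spawn" instead of the Linux default "fork", so child processes
    # do not inherit TileDB's global state and thread pools.
    multiprocessing.set_start_method("spawn")
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(write_chunk, range(8))))
```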
Here you can find a summary of the most important performance factors of TileDB.
The following factors must be considered when designing the array schema. In general, designing an effective schema that results in high performance depends on the type of data that the array will store, and the type of queries that will be subsequently issued to the array.
Choosing a dense or sparse array representation depends on the particular data and use case. Sparse arrays in TileDB have additional overhead versus dense arrays due to the need to explicitly store cell coordinates of the non-empty cells. For a particular use case, this extra overhead may be outweighed by the storage savings of not storing dummy values for empty cells in the dense alternative, in which case a sparse array would be a good choice. Sparse array reads also have additional runtime overhead due to the potential need of sorting the cells based on their coordinates.
In dense arrays, the space tile shape and size determine the atomic unit of I/O and compression. All space tiles that intersect a read query’s subarray are read from disk whole, even if they only partially intersect with the query. Matching space tile shape and size closely with the typical read query can greatly improve performance. In sparse arrays, space tiles affect the order of the non-empty cells and, hence, the minimum bounding rectangle (MBR) shape and size. All tiles in a sparse array whose MBRs intersect a read query’s subarray are read from disk. Choosing the space tiles in a way that the shape of the resulting MBRs is similar to the read access patterns can improve read performance, since fewer MBRs may then overlap with the typical read query.
The tile capacity determines the number of cells in a data tile of a sparse fragment, which is the atomic unit of I/O and compression. If the capacity is too small, the data tiles can be too small, requiring many small, independent read operations to fulfill a particular read query. Moreover, very small tiles could lead to poor compressibility. If the capacity is too large, I/O operations would become fewer and larger (a good thing in general), but the amount of wasted work may increase due to reading more data than necessary to fulfill the typical read query.
If read queries typically access the entire domain of a particular dimension, consider making that dimension an attribute instead. Dimensions should be used when read queries issue ranges over that dimension to select cells, which allows TileDB to internally prune cell ranges or whole tiles from the data that must be read.
Tile filtering (such as compression) can reduce persistent storage consumption, but at the expense of increased time to read/write tiles from/to the array. However, tiles on disk that are smaller due to compression can be read from disk in less time, which can improve performance. Additionally, many compressors offer different levels of compression, which can be used to fine-tune the tradeoff.
For sparse arrays, if the data is known not to have any entries with the same coordinates, or if read queries do not mind getting multiple results for a coordinate, allowing duplicates can lead to significant performance improvements. For queries requesting results in the unordered layout, TileDB can process fragments in any order (and fewer of them at once) and skip deduplication of the results, so they will run much faster and require less memory overhead.
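Duplicates are enabled with a flag on the sparse array schema, e.g. in TileDB-Py (a short sketch with hypothetical names):

```python
import numpy as np
import tiledb

schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="id", domain=(0, 10**9), tile=10_000, dtype=np.int64)
    ),
    sparse=True,
    allows_duplicates=True,  # skip deduplication work during reads
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],
)
```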
The following factors impact query performance, i.e., when reading/writing data from/to the array.
Large numbers of fragments can slow down performance during read queries, as TileDB must fetch a lot of fragment metadata, and internally sort and determine potential overlap of cell data across all fragments. Therefore, consolidating arrays with a large number of fragments improves performance; consolidation collapses a set of fragments into a single one.
A benefit of the column-oriented nature of attribute storage in TileDB is that if a read query does not require data for some of the attributes in the array, those attributes will not be touched or read from disk at all. Therefore, specifying only the attributes actually needed for each read query will improve performance. In the current version of TileDB, write queries must still specify values for all attributes.
When an ordered layout (row- or column-major) is used for a read query, if that layout differs from the array physical layout, TileDB must internally sort and reorganize the cell data that is read from the array into the requested order. If you do not care about the particular ordering of cells in the result buffers, you can specify a global order query layout, allowing TileDB to skip the sorting and reorganization steps.
Similarly, if your specified write layout differs from the physical layout of your array (i.e., the global cell order), then TileDB needs to internally sort and reorganize the cell data into the global order before writing to the array. If your cells are already laid out in global order, issuing a write query in global order allows TileDB to skip the sorting and reorganization step. Additionally, multiple consecutive global-order writes to an open array will append to a single fragment, rather than creating a new fragment per write, which can improve subsequent read performance. Care must be taken with global-order writes, as it is an error to write data to a subarray that is not aligned with the space tiling of the array.
Before issuing queries, an array must be “opened.” When an array is opened, TileDB reads and decodes the array and fragment data, which may require disk operations (and potentially decompression). An array can be opened once, and many queries issued to it before closing. Minimizing the number of array open and close operations can improve performance.
When reading from an array, metadata from relevant fragments is fetched from the backend and cached in memory. The more queries you perform, the more metadata is fetched over time. To save memory space, you may need to occasionally close and reopen the array.
The following runtime configuration parameters can tune the performance of several internal TileDB subsystems.
When a read query causes a tile to be read from disk, TileDB places the uncompressed tile in an in-memory LRU cache associated with the query’s context. When subsequent read queries in the same context request that tile, it may be copied directly from the cache instead of re-fetched from disk. The sm.tile_cache_size config parameter determines the overall size in bytes of the in-memory tile cache. Increasing it can lead to a higher cache hit ratio and better performance.
During sparse writes, setting sm.dedup_coords to true will cause TileDB to internally deduplicate cells being written, so that multiple cells with the same coordinate tuples are not written to the array (which is an error). A lighter-weight check is enabled by default with sm.check_coord_dups, meaning TileDB will simply perform the check for duplicates and return an error if any are found. Disabling these checks can lead to better performance on writes.
During sparse writes, setting sm.check_coord_oob to true (default) will cause TileDB to internally check whether the given coordinates fall outside the domain or not. If you are certain that this is not possible in your application, you can set this param to false, avoiding the check and thus boosting performance.
During sparse writes in global order, setting sm.check_global_order to true (default) will cause TileDB to internally check whether the given coordinates obey the global order. If you are certain that your coordinates always obey the global order, you can set this param to false, avoiding the check and thus boosting performance.
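If your application already guarantees unique, in-bounds and correctly ordered coordinates, these write-time checks can be disabled together through the configuration (a sketch in TileDB-Py; the array URI is hypothetical):

```python
import tiledb

write_cfg = tiledb.Config({
    "sm.dedup_coords": "false",        # do not deduplicate cells on write
    "sm.check_coord_dups": "false",    # skip the duplicate-coordinate check
    "sm.check_coord_oob": "false",     # skip the out-of-bounds check
    "sm.check_global_order": "false",  # skip the global-order check
})

with tiledb.open("my_array", "w", ctx=tiledb.Ctx(write_cfg)) as A:  # hypothetical URI
    ...  # perform the sparse write here
```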
The effect of the sm.consolidation.* parameters is explained in Consolidation.
Controlled by the sm.memory_budget and sm.memory_budget_var parameters, this caps the total number of bytes that can be fetched for each fixed- or var-sized attribute during reads. This can prevent OOM issues when a read query overlaps with a huge number of tiles that must be fetched and decompressed in memory. For large subarrays, this may lead to incomplete queries.
Controlled by the sm.compute_concurrency_level and sm.io_concurrency_level parameters. These parameters define an upper limit on the number of operating system threads to allocate for compute-bound and IO-bound tasks, respectively. They default to the number of logical cores on the system.
During read queries, the VFS system will batch reads for distinct tiles that are physically close together (not necessarily adjacent) in the same file. The vfs.min_batch_size parameter sets the minimum size in bytes that a single batched read operation can be. VFS will use this parameter to group “close by” tiles into the same batch, if the new batch size is smaller than or equal to vfs.min_batch_size. This can help minimize the I/O latency that can come with numerous very small VFS read operations. Similarly, vfs.min_batch_gap defines the minimum number of bytes between the end of one batch and the start of its subsequent one. If two batches are fewer bytes apart than vfs.min_batch_gap, they get stitched into a single batch read operation. This can help better group and parallelize over “adjacent” batch read operations.
This is controlled by vfs.s3.max_parallel_ops. This determines the maximum number of parallel operations for s3:// URIs independently of the VFS thread pool size, allowing you to over- or under-subscribe VFS threads. Oversubscription can be helpful in some cases with S3, to help hide I/O latency.
Replacing vfs.min_parallel_size for S3 objects, vfs.s3.multipart_part_size controls the minimum part size of S3 multipart writes. Note that vfs.s3.multipart_part_size * vfs.s3.max_parallel_ops bytes will be buffered in memory by TileDB before actually submitting an S3 write request, at which point all of the parts of the multipart write are issued in parallel.
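All of the parameters above are set through a config object attached to the context used to open the array; a hedged sketch with purely illustrative values (tune them for your workload and hardware):

```python
import tiledb

cfg = tiledb.Config({
    "sm.tile_cache_size": str(512 * 1024**2),    # 512 MB in-memory tile cache
    "sm.memory_budget": str(2 * 1024**3),        # per fixed-sized attribute
    "sm.memory_budget_var": str(4 * 1024**3),    # per var-sized attribute
    "sm.compute_concurrency_level": "16",
    "sm.io_concurrency_level": "16",
    "vfs.min_batch_size": str(20 * 1024**2),
    "vfs.min_batch_gap": str(512 * 1024),
    "vfs.s3.max_parallel_ops": "16",
    "vfs.s3.multipart_part_size": str(50 * 1024**2),
    # With these values, up to 16 * 50 MB = 800 MB may be buffered in memory
    # before the S3 multipart write requests are issued in parallel.
})

with tiledb.open("s3://my-bucket/my_array", ctx=tiledb.Ctx(cfg)) as A:  # hypothetical URI
    ...  # issue read queries against this configured context
```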
The number of cores and hardware threads of the machine impacts the amount of parallelism TileDB can use internally to accelerate reads, writes and compression/decompression.
The different types of storage backends (S3, local disk, etc.) have different throughput and latency characteristics, which can impact query time.