Choosing Dimensions

One of the fundamental questions when designing the array schema is: "What are my dimensions and what are my attributes?" The answer depends on whether your array is dense or sparse. Two good rules of thumb apply to both dense and sparse arrays:

If you frequently perform range slicing over a field/column of your dataset, you should consider making it a dimension.

The order of the dimensions in the array schema matters. More selective dimensions (i.e., with greater pruning power) should be defined before less selective ones.
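The two rules can be sketched as a small ranking helper. This is only an illustration in plain Python, not part of the TileDB API: the `ColumnStats` type, the `propose_dimensions` function, the `min_share` threshold, and the example numbers are all hypothetical, and the statistics are assumed to come from your own knowledge of the query workload.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    name: str
    range_query_share: float  # fraction of queries that range-slice this column
    avg_fraction_kept: float  # average fraction of rows such a slice keeps


def propose_dimensions(stats, min_share=0.25):
    """Return candidate dimension names, most selective first.

    min_share is an arbitrary illustrative threshold, not a TileDB setting.
    """
    # Rule 1: only columns that are range-sliced often are candidates.
    candidates = [s for s in stats if s.range_query_share >= min_share]
    # Rule 2: a smaller kept fraction means more pruning power, so it goes first.
    candidates.sort(key=lambda s: s.avg_fraction_kept)
    return [s.name for s in candidates]


# Hypothetical workload statistics for a sensor dataset.
stats = [
    ColumnStats("sensor_id", range_query_share=0.60, avg_fraction_kept=0.10),
    ColumnStats("timestamp", range_query_share=0.90, avg_fraction_kept=0.01),
    ColumnStats("temperature", range_query_share=0.05, avg_fraction_kept=0.50),
]
dims = propose_dimensions(stats)
print(dims)
```

Columns that do not make the list (here `temperature`) would be modeled as attributes.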

Dense Arrays

In dense arrays, telling the dimensions from the attributes is usually more straightforward. If you can model your data in a space where every cell has a value (e.g., image, video), then your array is dense and the dimensions will be easy to discern (e.g., width and height in an image, or width, height and time in video).

It is important to remember that dense arrays in TileDB do not explicitly materialize the coordinates of the cells (in dense fragments), which may result in significant storage savings if you are able to model your array as dense. Moreover, reads in dense arrays may be faster than in sparse arrays, as dense arrays use implicit spatial indexing and, therefore, the internal structures and state are much more lightweight.
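To get a feel for the coordinate overhead, here is a rough back-of-the-envelope estimate in plain Python. It is a simplification under stated assumptions: it ignores compression, filters, and metadata, and the helper names `dense_bytes` and `sparse_bytes` as well as the 8-byte coordinate size are illustrative choices, not TileDB functions or defaults.

```python
def dense_bytes(domain_shape, attr_bytes):
    """Uncompressed size of a dense layout: one attribute value per cell in the domain."""
    cells = 1
    for extent in domain_shape:
        cells *= extent
    return cells * attr_bytes


def sparse_bytes(non_empty_cells, num_dims, coord_bytes, attr_bytes):
    """Uncompressed size of a sparse layout: coordinates plus value per non-empty cell."""
    return non_empty_cells * (num_dims * coord_bytes + attr_bytes)


# A fully populated 1000 x 1000 image with a single 4-byte attribute.
shape = (1000, 1000)
as_dense = dense_bytes(shape, attr_bytes=4)
as_sparse = sparse_bytes(1000 * 1000, num_dims=2, coord_bytes=8, attr_bytes=4)
print(as_dense, as_sparse)
```

Under these assumptions, storing every cell with explicit coordinates takes five times the space of the dense layout before compression.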

Sparse Arrays

It is always good to think of a sparse dataset as a dataframe before you start designing your array, i.e., as a set of "flattened" columns with no notion of dimensionality. Then follow the two rules of thumb above.

Recall that TileDB explicitly materializes the coordinates of the sparse cells. Therefore, make sure that the array is sparse enough to justify that overhead. If the array is rather dense, then you may consider defining it as dense, filling the empty cells with some user-defined "dummy" values that you can recognize (so that you can filter them out manually after receiving the results from TileDB).
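The dummy-value approach can be sketched in plain Python as follows. The sentinel `FILL`, the tiny grid, and the `break_even_occupancy` helper are hypothetical and only illustrate the idea; the break-even figure ignores compression and metadata, so treat it as a rough guide rather than a TileDB rule.

```python
FILL = -1  # hypothetical sentinel; pick a value that can never occur in real data


def break_even_occupancy(num_dims, coord_bytes, attr_bytes):
    """Fraction of non-empty cells below which a sparse layout is smaller (uncompressed)."""
    return attr_bytes / (num_dims * coord_bytes + attr_bytes)


# A 3 x 4 dense grid in which only two cells hold real values.
rows, cols = 3, 4
grid = [[FILL] * cols for _ in range(rows)]
grid[0][1] = 7
grid[2][3] = 42

# After reading a slice back, drop the dummy cells and keep (row, col, value) triples.
real_cells = [
    (r, c, v)
    for r, row in enumerate(grid)
    for c, v in enumerate(row)
    if v != FILL
]
occupancy = len(real_cells) / (rows * cols)
threshold = break_even_occupancy(num_dims=2, coord_bytes=8, attr_bytes=4)
print(real_cells, occupancy, threshold)
```

When the measured occupancy is well above the threshold, a dense array with fill values is likely the more compact choice; when it is well below, the sparse layout tends to win.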
