Glossary

Last updated 12 months ago

Array metadata

Array metadata consists of simple key-value pairs that the user can attach to an array. The key is a string and the value can be of any datatype. Array metadata is typically small. Time traveling applies to array metadata as well, i.e., opening an array at a timestamp will fetch only the array metadata created at or before the given timestamp.

Both the array schema and the array metadata store information about the array, and the user is responsible for setting and configuring them. The easiest way to remember the difference between the array metadata and the array schema is the following:

The array metadata stores user-specific data about the array in the form of arbitrary key-value pairs.
The array schema stores system-specific data about the array that has a fixed structure (e.g., a dimension name, domain and datatype).
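
The time-traveling behavior of array metadata described above can be sketched in a few lines of plain Python. This is only a conceptual model, not the TileDB API: the `MetadataWrite` record, the integer timestamps and the `metadata_at` helper are hypothetical names introduced for illustration, and the sketch assumes that a later write to the same key replaces the earlier value.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class MetadataWrite:
    """One metadata write (hypothetical record, not a TileDB type)."""
    timestamp: int  # assumed integer write time
    key: str        # metadata keys are strings
    value: Any      # metadata values can be of any datatype


def metadata_at(writes, timestamp):
    """Return the key-value view visible when opening at `timestamp`."""
    view = {}
    # Apply writes oldest first so that, by assumption, later writes win.
    for w in sorted(writes, key=lambda w: w.timestamp):
        if w.timestamp <= timestamp:
            view[w.key] = w.value
    return view


writes = [
    MetadataWrite(10, "units", "celsius"),
    MetadataWrite(20, "owner", "alice"),
    MetadataWrite(30, "units", "kelvin"),
]
print(metadata_at(writes, 20))
```

Opening at timestamp 20 ignores the write made at timestamp 30, so the `units` key still holds its older value.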

Array schema

The array schema stores all the details about the array definition. Some of the data it holds are:

Attributes (name, datatype, filters)
Dimensions (name, datatype, domain, filters)
Tile extent and capacity
Tile and cell order

See the for more details.

TileDB Cloud offers a very simple way of sharing arrays with anyone on the planet. It effectively provides a way to define access control on arrays and to log every single action for auditability purposes.

Attribute

A non-empty cell (in either a dense or sparse array) is not limited to storing a single value. Each cell stores a tuple with a structure that is common to all cells. Each tuple element corresponds to a value on a named attribute of a certain type. An attribute can be:

Fixed-sized: an attribute value in a cell may consist of one or a fixed number of values of the same datatype
Variable-sized: an attribute value in a cell may consist of a variable number of values of the same datatype, i.e., different cells may store a different number of values on this attribute.

The figure below shows an example of an array with 3 attributes; a1 of type int32, a2 of type char: var and a3 of type float32: 2. Every non-empty cell must store 1 int32 value on a1, any number of char values on a2 and exactly 2 float32 values on a3.
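
To make the fixed-sized versus variable-sized distinction concrete, the following sketch checks a cell against the three attributes of the example above. It is a plain-Python illustration rather than TileDB code: `AttributeSpec`, `SCHEMA` and `is_valid_cell` are hypothetical names, and Python's `int`, `str` and `float` stand in for int32, char and float32.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttributeSpec:
    """Hypothetical attribute description used only for this sketch."""
    name: str
    py_type: type                   # Python stand-in for the datatype
    values_per_cell: Optional[int]  # None means variable-sized


SCHEMA = [
    AttributeSpec("a1", int, 1),     # int32, exactly 1 value
    AttributeSpec("a2", str, None),  # char: var, any number of values
    AttributeSpec("a3", float, 2),   # float32: 2, exactly 2 values
]


def is_valid_cell(cell, schema=SCHEMA):
    """Check that a cell stores a value on every attribute with the right count."""
    if set(cell) != {spec.name for spec in schema}:
        return False
    for spec in schema:
        values = cell[spec.name]
        # Fixed-sized attributes require an exact number of values.
        if spec.values_per_cell is not None and len(values) != spec.values_per_cell:
            return False
        if not all(isinstance(v, spec.py_type) for v in values):
            return False
    return True


print(is_valid_cell({"a1": [7], "a2": "hello", "a3": [1.5, 2.5]}))
print(is_valid_cell({"a1": [7], "a2": "hi", "a3": [1.5]}))
```

The second cell is rejected because it stores only one value on `a3`, which requires exactly two.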

Cell

An ordered tuple of dimension domain values, called coordinates, identifies an array cell. The order of the coordinates must follow the order in which the array dimensions were specified. The figure below depicts an example of cell (3, 4) assuming that the dimension order is d1, d2.

Consolidation

Coordinates

The coordinates of an array cell form an ordered tuple of dimension domain values that identifies it. In dense arrays, the coordinates of each cell are unique. In sparse arrays, the same coordinates may appear more than once.

Data tile

TileDB adopts the so-called columnar format and stores the (non-empty) cell values for each attribute separately. A data tile is a subset of cell values on a particular attribute. We explain the data tile separately for dense and sparse fragments, and its relationship to the space tile. The data tile is the atomic unit of compression and IO.

Contrary to dense fragments, there is no correspondence between space tiles and data tiles in sparse fragments. Consider the 8x8 fragment with 4x4 space tiles in the figure below. Assume for simplicity that the array stores a single int32 attribute. The non-empty cells are depicted in blue. If we followed the data tiling technique of dense fragments, we would have to create 4 data tiles, one for each space tile. TileDB does not materialize empty cells, i.e., it stores only the values of the non-empty cells in the data files. Therefore, the space tiles would produce 4 data tiles with 3 (upper left), 12 (upper right), 1 (lower left) and 2 (lower right) non-empty cells.

The physical tile size imbalance that may result from space tiling can lead to ineffective compression (if numerous data tiles contain only a handful of values) and inefficient reads (if the subarray you wish to read only partially intersects with a huge tile, which needs to be fetched in its entirety and potentially decompressed). Ideally, we want every data tile to store the same number of non-empty cells. Recall that this is achieved in the dense case by definition, since each space tile has the same shape (equal number of cells) and all cells in each space tile are non-empty. Finally, since the distribution of the non-empty cells in the array may be arbitrary, it is very difficult, and sometimes impossible, to fine-tune the space tiling so that it produces load-balanced data tiles that store an acceptable number of non-empty cells.

In other words, the space tiles in sparse fragments are used to determine the global cell order that will dictate which cell values will be grouped together in the same data tile. Another difference from dense fragments is that sparse fragments create extra data tiles for the coordinates of the non-empty cells, which is important in reading.
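
The idea that space tiles only fix the ordering, while data tiles are cut from that ordering, can be sketched as follows. This is a simplified model under stated assumptions, not TileDB's implementation: it assumes 0-based integer coordinates on an 8x8 domain, 4x4 space tiles, row-major tile and cell order, and that each data tile holds a fixed number of non-empty cells given by a `capacity` parameter (the schema entry above lists a capacity, but its exact semantics are assumed here). The coordinates are made up for illustration.

```python
def global_order_key(coord, tile_extent=4):
    """Sort key: space tile first (row-major), then position inside the tile."""
    row, col = coord
    tile = (row // tile_extent, col // tile_extent)
    within = (row % tile_extent, col % tile_extent)
    return tile + within


def sparse_data_tiles(coords, capacity):
    """Sort non-empty cells in global order, then cut fixed-size data tiles."""
    ordered = sorted(coords, key=global_order_key)
    return [ordered[i:i + capacity] for i in range(0, len(ordered), capacity)]


# Hypothetical non-empty cells, unevenly spread over the four space tiles.
coords = [(0, 1), (5, 6), (1, 5), (2, 2), (0, 6), (6, 1), (3, 7)]
for tile in sparse_data_tiles(coords, capacity=3):
    print(tile)
```

Every data tile except possibly the last holds exactly `capacity` cells, regardless of how the non-empty cells are distributed over the space tiles.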

Dense vs. sparse array

The main differences between a dense and a sparse array are the following:

A dense array is used when the majority of the cells are non-empty (within any hyper-rectangular subdomain), whereas a sparse array is used when the majority of the cells are empty.
The dimensions of a dense array must have the same datatype, whereas the dimensions of a sparse array may have different datatypes.
The dimensions of a dense array can only be of integer data type, whereas the dimensions of a sparse array may be of any data type (even real or string).
Every cell in a dense array is uniquely identified by its coordinates, whereas a sparse array can permit multiplicities, i.e., cells with the same coordinates but potentially different attribute values, as well as real (float32, float64) and string domains.

TileDB provides a unified API for both dense and sparse arrays.

Dimension

A multi-dimensional array consists of a set of ordered dimensions. A dimension has a name, a datatype and a domain. The figure below shows an example of two int32 dimensions, d1 with domain [1,4] and d2 with domain [2,6].

Domain

The array domain (or simply domain) is the hyperspace defined by the domains of the array dimensions. In a dense array, all dimensions must have the same type (homogeneous dimensions) and can only be integers. In a sparse array, the dimensions may have different types (heterogeneous dimensions) and can be of any data type (even real and string).

The non-empty domain is the tightest hyper-rectangle that contains all non-empty cells. An example is shown in the figure below.

The dimension domains can have negative, real and string values. An array cell is still identified by its coordinates, which take any value from the corresponding dimension domain.

In our examples, the orientation of each dimension domain is rather arbitrary and does not affect the array definition. It is just a matter of convention. For example, the lower values may be at the top or bottom of the vertical dimension.

Empty cell

Not every array cell necessarily contains values. A cell that contains values is called a non-empty cell; otherwise it is called an empty cell.

Fill values

Filters

Fragment

A fragment is a timestamped snapshot of a portion of the array, which is produced during writes. A fragment may be dense or sparse as shown in the figure below. In a dense fragment, the non-empty cells are contained in a full hyper-rectangle in the domain. This hyper-rectangle may cover the full domain or any subdomain. In a sparse fragment, the non-empty cells may be located arbitrarily, i.e., they do not necessarily form a full hyper-rectangle.

An array may consist of multiple fragments. Those fragments are completely transparent to the user, who only sees the combined logical view of the array upon reading. This is produced by superimposing the more recent fragments on top of the older ones, with the more recently written cells overwriting the older ones. A dense array may consist of both dense and sparse fragments, but a sparse array may consist only of sparse fragments.
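
The superimposition of fragments described above can be modeled with a short plain-Python sketch. It is not TileDB code: each fragment is represented as a hypothetical `(timestamp, cells)` pair, where `cells` maps coordinates to a single attribute value, and the helper name `logical_view` is made up for illustration.

```python
def logical_view(fragments):
    """Combine fragments into the logical array view seen by a reader."""
    view = {}
    # Apply fragments from oldest to newest so recent cells overwrite old ones.
    for _timestamp, cells in sorted(fragments, key=lambda f: f[0]):
        view.update(cells)
    return view


fragments = [
    (2, {(1, 1): 10, (1, 2): 20}),  # newer fragment
    (1, {(1, 1): 1, (2, 2): 4}),    # older fragment
]
print(logical_view(fragments))
```

Cell (1, 1) appears in both fragments, and the reader sees only the value from the more recent one.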

Fragment metadata

The fragment metadata is system-specific information about a fragment. Some of the information this metadata includes is:

Dense or sparse
Non-empty domain
Tile offsets
Tile sizes
R-Tree (for the sparse case)

Global cell order

The tile and cell order collectively determine the global cell order. The global cell order is essentially a mapping from the multi-dimensional cell space to the 1-dimensional physical storage space for the non-empty cells, i.e., it is the order in which TileDB stores the cell values on disk. The figure below shows the 4 possible global cell orders resulting from all combinations of tile/cell orders. The numbers indicate the relative positions of the non-empty cells along the global order.
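
As a rough illustration of how a tile order and a cell order combine into one global order, the sketch below enumerates the cells of a small 2D domain. It assumes 0-based integer coordinates, a hypothetical 4x4 domain with 2x2 space tiles, and the order names `"row-major"` and `"col-major"`. The function names are illustrative and not part of any TileDB API.

```python
from itertools import product


def order_key(index, order):
    """Row-major compares the leftmost index first; column-major the rightmost."""
    return tuple(index) if order == "row-major" else tuple(reversed(index))


def global_cell_order(shape, tile_extent, tile_order, cell_order):
    """List all cells of the domain in global order."""
    def key(coord):
        tile = tuple(c // e for c, e in zip(coord, tile_extent))
        within = tuple(c % e for c, e in zip(coord, tile_extent))
        # Tiles are ordered first; cells are ordered inside each tile.
        return (order_key(tile, tile_order), order_key(within, cell_order))

    cells = product(*(range(n) for n in shape))
    return sorted(cells, key=key)


for tile_order, cell_order in product(["row-major", "col-major"], repeat=2):
    cells = global_cell_order((4, 4), (2, 2), tile_order, cell_order)
    print(tile_order, cell_order, cells[:6])
```

The loop prints the first few cells for each of the four tile/cell order combinations, which makes it easy to see where the orders diverge.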

Groups

Groups allow organizing arrays and other groups hierarchically.

Incomplete query

Non-empty domain

The non-empty domain of an array is the minimum bounding hyper-rectangle that tightly encompasses all non-empty cells in the array.
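
Computing a minimum bounding hyper-rectangle amounts to taking the minimum and maximum coordinate along each dimension. A minimal sketch, assuming the non-empty cells are given as a list of coordinate tuples (the helper name `non_empty_domain` and the sample coordinates are illustrative only):

```python
def non_empty_domain(coords):
    """Return [(low, high), ...] per dimension, or None if there are no cells."""
    coords = list(coords)
    if not coords:
        return None
    # zip(*coords) groups the coordinate values of each dimension together.
    return [(min(axis), max(axis)) for axis in zip(*coords)]


print(non_empty_domain([(2, 3), (4, 1), (3, 6)]))
```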

Nullable attribute

R-tree

Space tile & tile extent

A space tile is defined by specifying a tile extent along each dimension. The domain of each dimension is partitioned into segments equal to the tile extent, and hyper-rectangular tiles are formed in the multi-dimensional array space. The space tile concept applies to both dense and sparse arrays (as well as real dimensions) and is independent of the actual data stored in the array.
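
For integer dimensions, the partitioning described above reduces to integer division by the tile extent. The sketch below assumes inclusive integer domains whose size is a multiple of the tile extent. The domain, extents and function names are hypothetical and chosen only for illustration.

```python
def space_tile_index(coord, domain, tile_extent):
    """Return the index of the space tile that contains `coord`."""
    return tuple((c - lo) // e for c, (lo, _hi), e in zip(coord, domain, tile_extent))


def space_tile_bounds(tile, domain, tile_extent):
    """Return the inclusive (low, high) range a space tile covers per dimension."""
    return [
        (lo + t * e, lo + (t + 1) * e - 1)
        for t, (lo, _hi), e in zip(tile, domain, tile_extent)
    ]


domain = [(1, 4), (1, 4)]  # two dimensions, each with domain [1,4]
extent = (2, 2)            # 2x2 space tiles

tile = space_tile_index((3, 4), domain, extent)
print(tile, space_tile_bounds(tile, domain, extent))
```

Note that the result depends only on the domain and the tile extent, never on which cells actually hold data.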

Subarray

A subarray is an array slice. A single-range subarray is defined by a domain range along each dimension. A multi-range subarray is defined by multiple ranges per dimension. The resulting slice of a multi-range subarray is formed by the cross-product of the ranges along all dimensions. Multi-range subarrays are applicable only to reads. Multi-range subarrays are applicable to both dense and sparse arrays.
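
The cross-product of ranges can be illustrated by expanding a multi-range subarray into the single-range subarrays it covers. This is a conceptual sketch rather than a TileDB call: ranges are assumed to be inclusive `(low, high)` tuples, and `expand_multi_range` is a hypothetical helper.

```python
from itertools import product


def expand_multi_range(ranges_per_dim):
    """Return every single-range subarray covered by a multi-range subarray."""
    # Picking one range per dimension in every possible way is the cross-product.
    return list(product(*ranges_per_dim))


ranges_per_dim = [
    [(1, 2), (4, 4)],  # two ranges on the first dimension
    [(1, 1), (3, 4)],  # two ranges on the second dimension
]
for subarray in expand_multi_range(ranges_per_dim):
    print(subarray)
```

Two ranges on each of two dimensions yield four single-range subarrays.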

Tile & cell order

Row-major: Assuming each tile or cell can be identified by a set of coordinates in the multi-dimensional space, row-major means that the rightmost coordinate index “varies the fastest”.
Column-major: Assuming each tile or cell can be identified by a set of coordinates in the multi-dimensional space, column-major means that the leftmost coordinate index “varies the fastest”.
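
The phrase "varies the fastest" can be made precise by computing the linear position of a coordinate under each order. The sketch below assumes 0-based indices and a known shape. `linear_position` and the `"col-major"` label are illustrative names, not TileDB functions.

```python
def linear_position(index, shape, order="row-major"):
    """Map a multi-dimensional index to its position in a 1-D layout."""
    dims = range(len(shape))
    if order == "row-major":
        dims = reversed(dims)  # the rightmost index gets stride 1
    position, stride = 0, 1
    for d in dims:
        position += index[d] * stride
        stride *= shape[d]
    return position


shape = (2, 3)
for index in [(0, 1), (1, 0)]:
    print(index,
          linear_position(index, shape, "row-major"),
          linear_position(index, shape, "col-major"))
```

In row-major order, stepping the rightmost index moves to the adjacent position. In column-major order, stepping the leftmost index does.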