TileDB allows the user to specify an ordered list of data transformations, such as compression, that can be applied to data tiles before they are written to disk. The user may define different filters for each attribute, the coordinates and the data tile offsets.
TileDB supports a number of filters, and more will continue to be added in the future.
There are several filters performing generic compression, which are the following:
GZIP: Compresses with Gzip
ZSTD: Compresses with Zstandard
LZ4: Compresses with LZ4
RLE: Compresses with run-length encoding
BZIP2: Compresses with Bzip2
DOUBLE_DELTA: Compresses with double-delta encoding
The byte shuffling process rearranges the bytes of the input attribute cell values in a deterministic and reversible manner designed to result in long runs of similar bytes that can be compressed more effectively by a generic compressor than the original unshuffled elements. Typically this filter is not used on its own, but rather immediately followed by a compression filter in a filter list.
For example, consider three 32-bit unsigned integer values
1, 2, 3, which have the following little-endian representation when stored adjacent in memory:
0x01 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x03 0x00 0x00 0x00
The byte shuffle operation will rearrange the bytes of these integer elements in memory such that the resulting array of bytes will contain each element’s first byte, followed by each element’s second byte, etc. After shuffling the bytes would therefore be:
0x01 0x02 0x03 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Observe the longer run of zero-valued bytes, which will compress more efficiently.
Bitshuffling is conceptually very similar to byteshuffling, but operates on the bit granularity rather than the byte granularity. Shuffling at the bit level can increase compression ratios even further than the byteshuffle filter, at the cost of increased computation to perform the shuffle.
Typically this filter is not used on its own, but rather it is immediately followed by a compression filter in a filter list.
This filter performs positive-delta encoding. Positive-delta encoding is a form of delta encoding that only works when the delta value is positive. Positive-delta encoding can result in better compression ratios on the encoded data. Typically this filter is not used on its own, but rather immediately followed by a compression filter in a filter list.
For example, if the data being filtered was the sequence of integers
100, 104, 108, 112, ..., then the resulting positive-encoded data would be
0, 4, 4, 4, .... This encoding is advantageous in that producing long runs of repeated values can result in better compression ratios, if a compression filter is added after positive-delta encoding.
The filter operates on a “window” of values at a time, which can help in some cases to produce longer runs of repeated delta values.
This filter performs bit-width reduction, which is a form of compression.
Bit-width reduction examines a window of attribute values, and determines if all of the values in the window can be represented by a datatype of smaller byte size. If so, the values in the window are rewritten as values of the smaller datatype, potentially saving several bytes per cell.
For example, consider an attribute with datatype
uint64. Initially, each cell of data for that attribute requires 8 bytes of storage. However, if you know that the actual value of the attribute is often 255 or less, those cells can be stored using just a single byte in the form of a
uint8, saving 7 bytes of storage per cell. The bit-width reduction filter performs this analysis and compression automatically over windows of attribute data.
Additionally, each cell value in a window is treated relative to the minimum value in that window. For example, if the window size was 3 cells, which had the values
300, 350, 400, the bit-width reduction filter would first determine that the minimum value in the window was
300, and the relative cell values were
0, 50, 100. These relative values are now less than 255 and can be represented by a
If possible, it can be a good idea to apply positive-delta encoding before bit-width reduction, as the positive-delta encoding may further increase the opportunities to store windows of data with a narrower datatype.
Before filtering each data tile of an attribute, TileDB internally divides the tile into disjoint chunks. These chunks are then filtered individually.
Chunking tiles before filtering allows for better cache behavior in terms of temporal locality, as the chunk size can be chosen to fit within the L1 cache of your processor cores. This helps especially with multi-stage filter lists, as the output from the previous filter is likely to still be in L1 when used as input for the current filter.
Chunking tiles also increases the amount of parallel compute that TileDB can make effective use of. By breaking a tile into individual chunks, each chunk can then be filtered in parallel, which can result in excellent CPU utilization when combined with the cache-friendly size of the chunks.
The default chunk size used by TileDB is 64KB, which is the size of many common processor L1 caches.