Below you can find links to various tutorials for dataframes. Currently, all these tutorials are built in Python (using TileDB-Py), but soon we will add tutorials that use the other TileDB APIs as well.
Dataframe Basics: Learn how to ingest a CSV file as a dense (with row id indexing) or sparse (with multi-column indexing) array, inspect the array schema, slice the ingested dataframe, subselect on columns (array attributes), read into the Apache Arrow format, apply conditions on columns, and run SQL queries.
Below we provide links to various tutorials for dense and sparse arrays. Currently, all these tutorials are built in Python (using TileDB-Py), but soon we will add tutorials that use the other TileDB APIs as well.
Dense Array Basics: Learn how to create a dense array, inspect the array schema, write to and read from the array, write and read array metadata, create arrays with multiple attributes and var-sized attributes, treat dense arrays as dataframes and even run SQL queries.
Tile Filters: Learn how to set compression and other filters to attributes in dense arrays, and use encryption in dense arrays.
Sparse Array Basics: In this tutorial you will learn how to create a sparse array, inspect the array schema, write to and read from the array, write and read array metadata, create arrays with multiple attributes and var-sized attributes, create arrays with string dimensions, create arrays with heterogeneous dimensions, treat sparse arrays as dataframes and even run SQL queries.
Designing a universal data model
These are exciting times for anyone working on data problems, as the data industry is as hot and as hyped as ever. Numerous databases, data warehouses, data lakes, lakehouses, feature stores, metadata stores, file managers, etc. have been hitting the market in the past few years. At TileDB we are trying to answer a simple question: instead of building a new data system every time our data needs change, can we build a single database that can store, govern, and process all data — tables, images, video, genomics, LiDAR, features, metadata, flat files and any other data type that may pop up in the future?
This question was born from the simple observation that all database systems (and variations) share significant similarities, including laying data out on the storage medium of choice, and fetching it for processing based on certain query workloads. Therefore, to answer the above question, we had to ask a slightly different one: is there a data model that can efficiently capture all data from all applications? Because if such a universal data model exists, it can serve as the foundation for building a universal database with all the subsystems common to all databases (query planner, executor, authenticator, transaction manager, APIs, etc.). We discovered that such a model does exist, and it is based on multi-dimensional arrays.
Before elaborating on why arrays are universal by describing the data model and their use cases, we need to answer yet another question: why should you care about a universal data model and a universal database? Here are a few important reasons:
Data diversity. You may think that it’s all about tabular data for which a traditional data warehouse (or data lake, or lakehouse) can do the trick, but in reality organizations possess a ton of other very valuable data, such as images, video, audio, genomics, point clouds, flat files and many more. And they wish to perform a variety of operations on these data collections, from analytics, to data science and machine learning.
Vendor optimization. To manage their diverse data, organizations resort to buying numerous different data systems, e.g., a data warehouse, plus an ML platform, plus a metadata store, plus a file manager. That costs money and time: money because some vendors have overlapping functionality that gets paid for twice (e.g., authentication, access control, etc.), and time because teams have to learn to operate numerous different systems and wrangle data whenever insights require combining disparate data sources.
Holistic governance. Even if organizations are happy with their numerous vendors, each different data system has its own access controls and logging capabilities. Therefore, if an organization needs to enforce centralized governance over all its data, it needs to build it in-house. That costs more money and time.
Even if you are already convinced of the importance of a universal database, we need to make one last vital remark. A universal database is unusable if it does not offer excellent performance for all the data types it is serving. In other words, a universal database must perform as efficiently as purpose-built ones; otherwise, adoption will be met with skepticism. This is where the difficulty of universal databases lies, and why no one had built such a system before TileDB.
In these docs you will learn why multi-dimensional arrays are the right bet, not only for their universality but also for their performance. We describe many of the critical decisions we made at TileDB when designing the array data model and an efficient on-disk format, as well as when developing a powerful storage engine to support it.
For further reading on why we chose arrays as first-class citizens in TileDB, see our blog post Why Arrays as a Universal Data Model.
Our vision is to facilitate fast, large-scale genomics research at a fraction of the cost by providing infrastructure that is easy to setup and designed for extreme scale.
Storing genomic variant call data in a collection of VCF files can severely impact the performance of analyses at population scale. Motivated by these issues, we developed TileDB-VCF.
In a nutshell: Storing vast quantities of genomic variant samples in legacy file formats leads to slow access, especially on the cloud. Merging the files into multi-sample files is not scalable or maintainable, resulting in the so-called N+1 problem.
The de facto file format for storing genomic variant call data is VCF. This comes in two flavors: single-sample and multi-sample VCF (other names include combined VCF, project VCF, etc.). Below we explain the problems with each of those approaches, as well as the data engineering effort involved when storing genomic data in a legacy, non-interoperable file format.
Genomic analyses performed on collections of single-sample VCF files typically involve retrieving variant information from particular genomic ranges, for specific subsets of samples, along with any of the provided VCF fields/attributes. Such queries are often repeated millions of times, so it is imperative that data retrieval is performed in a fast, scalable, and cost-effective manner.
However, accessing random locations in thousands—or hundreds of thousands—of different files is prohibitively expensive, both computationally and monetarily. This is especially true on cloud object stores, where each byte range in a file is a separate request that goes over the network. Not only does this introduce non-negligible latency, it can also incur significant costs as cloud object stores charge for every such request. For a typical analysis involving millions of requests on large collections of VCF files, this quickly becomes unsustainable.
The problems with single-sample VCF collections were the motivation behind multi-sample VCF, or project/population VCF (pVCF), files, in which the entire collection of single-sample VCFs is merged into a single file. When indexed, specific records can be located within a pVCF file very quickly, as data retrieval is reduced to a simple, super-fast linear scan, minimizing latency and significantly boosting I/O performance. However, this approach comes with significant costs.
First, the size of multi-sample pVCF files can scale superlinearly with the number of included samples. The problem is that individual VCF files contain very sparse data, which the pVCF file densifies by adding dummy information to denote variants missing from the original single-sample VCF file. This means that the combined pVCF solution is not scalable because it can lead to an explosion of storage overhead and high merging cost for large population studies.
Another problem is that a multi-sample pVCF file cannot be updated. Because the sample-related information is listed in the last columns of the file, a new sample insertion needs to create a new column and inject values at the end of every line. This effectively means that a brand new pVCF file needs to be generated with every update. If the pVCF is large (typically in the order of many GBs or even TBs), the insertion process will be extremely slow. This is often referred to as the N+1 problem.
Regardless of whether you deal with single- or multi-sample VCF, the typical way of accessing these files is via custom CLI tools, such as bcftools. Those tools are extremely useful for analysis that can be performed on a local machine (e.g., a laptop). However, they become unwieldy when scalable analysis necessitates the use of numerous machines working in parallel. Spinning up new nodes, deploying the software and orchestrating the parallel analysis consumes the majority of the time of researchers and hinders progress.
Furthermore, domain-specific tools either re-invent a lot of the work that has been done in general-purpose data science tools, or they miss out on it. This leads researchers to write custom code for converting the VCF data into a format that a programming language (e.g., Python or R) or a data science tool (e.g., pandas or Spark) understands. This data access and conversion cost often becomes the bottleneck, instead of the actual analysis.
Population variant data can be efficiently represented using a 3D sparse array. For each sample, imagine a 2D plane where the vertical axis is the contig and the horizontal axis is the genomic position. Every variant can be represented as a range within this plane; it can be unary (i.e., a SNP) or it can be a longer range (e.g., INDEL or CNV). Each sample is then indexed by a third dimension, which is unbounded to accommodate populations of any size. The figure below shows an example for one sample, with several variants distributed across contigs chr1, chr2 and chr3.
In TileDB-VCF, we represent the start position of each range as a non-empty cell in a sparse array (black squares in the above figure). In each of those array cells, we store the end position of each cell (to create a range) along with all other fields in the corresponding single-sample VCF files for each variant (e.g., REF, ALT, etc.). Therefore, for every sample, we map variants to 2D non-empty sparse array cells.
To facilitate rapid retrieval of interval intersections (explained in the next section), we also inject anchors (green square in the above figure) to break up long ranges. Specifically, we create a new non-empty cell every anchor_gap bases from the start of the range (where anchor_gap is a user-defined parameter), which is identical to the range cell, except that (1) it has a new start coordinate and (2) it stores the real start position in an attribute.
Note that regardless of the number of samples, we do not inject any additional information other than that of the anchors, which is user configurable and turns out to be negligible for real datasets. In other words, this solution leads to linear storage in the number of samples, thus being scalable.
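To make this layout concrete, here is a minimal sketch in TileDB-Py of a 3D sparse schema along these lines. It is a simplified illustration only, not the actual TileDB-VCF schema (which contains more fields and tuning parameters); the array name, domain bounds and attribute names are placeholders.

```python
import numpy as np
import tiledb

# Simplified sketch of a 3D sparse layout for variant ranges: string sample
# and contig dimensions plus an integer start-position dimension.
dom = tiledb.Domain(
    tiledb.Dim(name="sample", domain=(None, None), tile=None, dtype="ascii"),
    tiledb.Dim(name="contig", domain=(None, None), tile=None, dtype="ascii"),
    tiledb.Dim(name="start_pos", domain=(1, 3_000_000_000), tile=100_000, dtype=np.uint64),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,
    attrs=[
        tiledb.Attr(name="end_pos", dtype=np.uint64),     # turns each cell into a range
        tiledb.Attr(name="real_start", dtype=np.uint64),  # real start, used by anchor cells
        # additional attributes would hold REF, ALT and the other VCF fields
    ],
)
tiledb.Array.create("variants_sketch", schema)
```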
The typical access pattern used for variant data involves one or more rectangles covering a set of genomic ranges across one or more samples. In the figure below, let the black rectangle be the user's query. Observe that the results are highlighted in blue (v1, v2, v4, v7). However, the rectangle misses v1, i.e., the case where an Indel/CNV range intersects the query rectangle, but the start position is outside the rectangle.
This is the motivation behind anchors. TileDB-VCF expands the user's query range on the left by anchor_gap. It then reports as results the cells that are included in the expanded query if their end position (stored in an attribute) comes after the query range start endpoint. In the example above, TileDB-VCF retrieves anchor a1 and Indel/CNV v3. It reports v1 as a result (as it can be extracted directly from anchor a1), but filters out v3.
Quite often, the analyses require data retrieval based on numerous query ranges (up to the order of millions), which must be submitted simultaneously. TileDB-VCF leverages the highly optimized multi-range subarray reads of TileDB for this purpose.
But what about updates? That's the topic of the next subsection.
TileDB-VCF is built in C++ for performance, and comes with an efficient Python API, a command-line tool (CLI) and integrations with Spark and Dask. We are working hard to expand on the APIs through which our users can access the variant data stored in the open-spec TileDB format, and we are happy to accommodate user requests (which can be filed here).
Being able to access the data in various programming languages and computational frameworks unlocks a vast opportunity for utilizing the numerous existing open-source data science and machine learning tools.
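For example, exporting a few genomic regions for a handful of samples into a pandas DataFrame with the Python API looks roughly like the sketch below; the dataset URI, sample names, regions and attribute list are placeholders.

```python
import tiledbvcf

# Open an existing TileDB-VCF dataset for reading (URI is a placeholder)
ds = tiledbvcf.Dataset("s3://my-bucket/my-vcf-dataset", mode="r")

# Slice a few genomic regions for specific samples and VCF fields
df = ds.read(
    attrs=["sample_name", "contig", "pos_start", "pos_end", "fmt_GT"],
    regions=["chr1:1000000-2000000", "chr2:500000-600000"],
    samples=["sampleA", "sampleB"],
)
print(df.head())
```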
You can find more detailed documentation about TileDB-VCF (including the installation instructions, API reference and How To guides) in the following dedicated section:
TileDB Open Source is a universal storage engine that stores any kind of data (beyond tables) in a powerful unified format, offering extreme interoperability via many APIs and tool integrations.
TileDB Open Source is a powerful engine architected around multi-dimensional arrays that enables storing and accessing:
Dense arrays (e.g., images, video and more)
Sparse arrays (e.g., LiDAR, genomics and more)
Dataframes (any tabular data, as either dense or sparse arrays)
Any data that can be modeled as arrays (e.g., graphs, key-values, ML models, etc.)
You can use TileDB to store data in a variety of applications, such as Genomics, Geospatial, Biomedical Imaging, Finance, Machine Learning, and more. The power of TileDB stems from the fact that any data can be modeled efficiently as either a dense or a sparse multi-dimensional array, which is the format used internally by most data science tooling. By storing your data and metadata in TileDB arrays, you abstract all the data storage and management pains, while efficiently accessing the data with your favorite programming language or data science tool via our numerous APIs and integrations.
TileDB Open Source is a fast embeddable C++ library with the following main features:
Open-source under the MIT license
Fast multi-dimensional slicing via tiling (i.e., chunking)
Multiple compression, encryption and checksum filters
Fast, lock-free ingestion
Parallel IO for both reads and writes
Cloud storage (AWS S3, Google Cloud Storage, Azure Blob Storage)
A fully multi-threaded implementation
Query condition execution push-down
Schema evolution
Data versioning and time traveling
Metadata stored alongside the array data
Groups for hierarchical organization of array data
A growing set of APIs (C, C++, C#, Python, Java, R, Go)
Numerous integrations (Spark, Dask, MariaDB, GDAL, and more)
The TileDB Open Source engine is built in C++ and exposes a C and a C++ API:
We maintain a growing set of language APIs built on top of the C and C++ APIs:
We extended TileDB Open Source to capture domain-specific aspects of important use cases:
There is a constantly growing set of tutorials in the TUTORIALS page group found in the left navigation menu of these docs.
If you'd like to take a deeper dive into the TileDB Open Source internals, you can check BACKGROUND in the left navigation menu. You can also always consult the HOW TO guides and the API REFERENCE.
Finally, detailed information about the various TileDB Open Source tool integrations and extensions can be found under the INTEGRATIONS & EXTENSIONS page group in the left navigation menu.
To make it easy to understand where to find what you are looking for, the documentation is structured in the following sections:
Tutorials: A series of examples for learning how to use TileDB in various use cases
Background: Explanation of key topics and concepts
How To: Short how-to guides for all different features of TileDB
API Reference: Technical reference to the APIs
Extensions & integrations: Detailed documentation on the TileDB Open Source extensions and integrations
The company maintains two offerings:
This tutorial section is under heavy development. Numerous new tutorials across a wide range of use cases are coming up soon, so stay tuned!
In this section we will be providing links to Jupyter notebooks hosted on TileDB Cloud. From there you can download and run them locally; no TileDB Cloud account is needed. Alternatively, you can launch them directly in TileDB Cloud. For that you will need to sign up, and contact us to tell us a bit about your use case if you'd like free credits for your trial (no credit card information is needed).
The pages listed below split the various tutorials by category. You can also navigate through all tutorials directly from the TileDB-Inc/Tutorials group on TileDB Cloud.
The basic array model we follow at TileDB is depicted below. We make an important distinction between a dense and a sparse array. A dense array can have any number of dimensions. Each dimension must have a domain with integer values, and all dimensions have the same data type. An array element is defined by a unique set of dimension coordinates and it is called a cell. In a dense array, all cells must store a value. A logical cell can store more than one value of potentially different types (which can be integers, floats, strings, etc.). An attribute defines a collection of values that have the same type across all cells. A dense array may be associated with a set of arbitrary key-value pairs, called array metadata.
A sparse array is very similar to a dense array, but it has three important differences:
Cells in a sparse array can be empty.
The dimensions can have heterogeneous types, including floats and strings (i.e., the domain can be “infinite”).
Cell multiplicities (i.e., cells with the same dimension coordinates) are allowed.
The decision on whether to model your data with a dense or sparse array depends on the application and it can greatly affect performance. Also extra care should be taken when choosing to model a data field as a dimension or an attribute. These decisions are covered in detail in other sections of the docs, but for now, you should know this: array systems are optimized for rapidly performing range conditions on dimension coordinates. Arrays can also support efficient conditions on attributes, but by design the most optimized selection performance will come from querying on dimensions, and the reason will become clear soon.
Range conditions on dimensions are often called “slicing” and the results constitute a “slice” or “subarray”. Some examples are shown in the figure below. In numpy notation, A[0:2, 1:3] is a slice that consists of the values of cells with coordinates 0 and 1 on the first dimension, and coordinates 1 and 2 on the second dimension (assuming a single attribute). Alternatively, this can be written in SQL as SELECT attr FROM A WHERE d1>=0 AND d1<=1 AND d2>=1 AND d2<=2, where attr is an attribute and d1 and d2 are the two dimensions of array A. Note also that slicing may contain more than one range per dimension (a multi-range slice/subarray).
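As a quick illustration of the A[0:2, 1:3] example, the following TileDB-Py sketch creates a small 4x4 dense array, writes some values and slices it; the array URI is a placeholder.

```python
import numpy as np
import tiledb

uri = "dense_example"

# 4x4 dense array, 2x2 space tiles, a single int32 attribute "attr"
dom = tiledb.Domain(
    tiledb.Dim(name="d1", domain=(0, 3), tile=2, dtype=np.int32),
    tiledb.Dim(name="d2", domain=(0, 3), tile=2, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom, sparse=False, attrs=[tiledb.Attr(name="attr", dtype=np.int32)]
)
tiledb.Array.create(uri, schema)

with tiledb.open(uri, "w") as A:
    A[:] = np.arange(16, dtype=np.int32).reshape(4, 4)

with tiledb.open(uri) as A:
    result = A[0:2, 1:3]      # numpy-style slice on the two dimensions
    print(result["attr"])     # values of cells (0..1, 1..2)
```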
The above model can be extended to include “dimension labels”. This extension can be applied to both dense and sparse arrays, but labels are particularly useful in dense arrays. Briefly stated, a dimension can accept a label vector which maps values of arbitrary data types to integer dimension coordinates. An example is demonstrated below. This is very powerful in applications where the data is quite dense (i.e., there are not too many empty cells), but the dimension fields are not integers or do not appear contiguously in the integer domain. In such cases, multi-dimensional slicing is performed by first efficiently looking up the integer coordinates in the label vectors, and then applying the slicing as explained above, which in the dense array case can be truly rapid.
Labeled dimensions are currently under development in TileDB. They will appear in a future release soon.
In many applications, it is useful to hierarchically organize different arrays into groups. TileDB incorporates the concept and functionality of groups into its Data Format.
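For instance, with TileDB-Py you could group related arrays as sketched below; the URIs are placeholders and the sketch assumes the member arrays already exist.

```python
import tiledb

# Create a group and register two existing arrays under it
tiledb.group_create("my_project")

g = tiledb.Group("my_project", mode="w")
g.add("my_project/images", name="images")
g.add("my_project/annotations", name="annotations")
g.close()
```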
We currently focus on the following use cases in Genomics:
Multi-dimensional arrays have been around for a long time. However, there have been two misconceptions about arrays:
Arrays are used solely in scientific applications. This is mainly due to their massive use in Python, Matlab, R, machine learning and other scientific applications. There is absolutely nothing wrong with arrays capturing scientific use cases. On the contrary, such applications are important and challenging, and there is no relational database that can efficiently accommodate them.
Arrays are only dense. Most array systems (i.e., storage engines or databases) built before TileDB focused solely on dense arrays. Despite their suitability for a wide spectrum of use cases, dense arrays are inadequate for sparse problems, such as genomics, LiDAR and tables. Sparse arrays have been ignored and, therefore, no array system was able to claim universality.
The sky is the limit in terms of applicability for a system that supports both dense and sparse arrays. An image is a 2D dense array, where each cell is a pixel that can store the RGBA color values. Similarly a video is a 3D dense array, two dimensions for the frame images and a third one for the time. LiDAR is a 3D sparse array with float coordinates. Genomic variants can be modeled by a 3D array where the dimensions are the sample name (string), the chromosome (string) and the position (integer). Time series tick data can be modeled by a 2D array, with time and tick symbol as labeled dimensions (this can of course be extended arbitrarily to a ND dense or sparse array). Similarly, weather data can be modeled with a 2D dense array with float labels (the lat/lon real coordinates). Graphs can be modeled as (sparse 2D) adjacency matrices. Finally, a flat file can be stored as a simple 1D dense array where each cell stores a byte.
But what about tabular data? Arrays have a lot of flexibility here. In the most contrived scenario, we can store a table as a set of 1D arrays, one per column (similar to Parquet for those familiar with it). This is useful if we want to slice a range of rows at a time. Alternatively, we can store a table as a ND sparse array, using a subset of columns as the dimensions. That would allow rapid slicing on the dimension columns. Finally, we can use labeled dense arrays as explained above for the time series tick data.
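As an example of the sparse-array approach for tables, the following TileDB-Py sketch ingests a hypothetical CSV of tick data, using the symbol and time columns as the dimensions; the file name, array URI and column names are assumptions made for illustration.

```python
import tiledb

# Hypothetical CSV with columns "symbol", "time", "price", "volume":
# ingest it as a 2D sparse array with symbol and time as the dimensions,
# so that slicing on those columns is fast.
tiledb.from_csv(
    "ticks_array",
    "ticks.csv",
    sparse=True,
    index_col=["symbol", "time"],   # these columns become the dimensions
)

with tiledb.open("ticks_array") as A:
    df = A.df[:]   # read the array back as a pandas DataFrame
    print(df.head())
```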
You may wonder how we can make all these decisions about dimensions vs. attributes and dense vs. sparse for each application. To answer that, we need to understand how dense and sparse arrays lay data out on the storage medium, and what factors affect performance when slicing, which is the focus of the key concepts and data format section:
In addition, check out the various TileDB use cases in more detail:
TileDB supports fast and parallel subarray reads, with the option to time travel, i.e., to read the array at a particular time in the past. The read algorithm is architected to handle multiple fragments efficiently and completely transparently to the user. To read an array, TileDB first "opens" the array and brings some lightweight fragment metadata into main memory. Using this metadata, TileDB knows which fragments to ignore and which to focus on, e.g., based on whether they overlap with the query subarray, or whether the fragment was created at or before the time of interest. Moreover, in case consolidation has occurred, TileDB is smart enough to ignore fragments that have been consolidated, by considering only the merged fragment that encompasses them.
When reading an array, the user provides:
a (single- or multi-range) subarray
the attributes to slice on (it can be any subset of the attributes, including the coordinates)
the layout with respect to the subarray to return the result cells in
The read algorithm is quite involved. It leverages spatial indexing to locate only the data tiles relevant to the slice, it makes sure it does not fetch a data tile twice in the case of multi-range queries, it performs selective decompression of tile chunks after a tile has been fetched from the backend, and it employs parallelism pretty much everywhere (in IO, decompression, sorting, etc.).
The figure below shows how to read the values of a single attribute from a dense array. The ideas extend to multi-attribute arrays and slicing on any subset of the attributes, including even retrieving the explicit coordinates of the results. The figure shows retrieving the results in 3 different layouts, all with respect to the subarray query. This means that you can ask TileDB to return the results in an order that is different than the actual physical order (which, recall, is always the global order), depending on the needs of your application.
You can also submit multi-range subarrays, as shown in the figure below. The supported orders here are row-major, column-major and unordered. The latter gives no guarantees about the order; TileDB will attempt to process the query in the fastest possible way and return the results in an arbitrary order. It is recommended to use this layout if you target performance and do not care about the order of the results. Also, you can ask TileDB to return the explicit coordinates of the returned values if you wish to know which value corresponds to which cell.
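The sketch below issues such a multi-range read with TileDB-Py against the small dense array created earlier (the URI is a placeholder), requesting an unordered layout and the explicit coordinates.

```python
import tiledb

with tiledb.open("dense_example") as A:
    # Unordered layout ('U') and explicit coordinates for the results
    q = A.query(attrs=["attr"], order="U", coords=True)

    # Multi-range subarray: rows {0..1, 3} (multi_index ranges are inclusive),
    # all columns
    result = q.multi_index[[slice(0, 1), 3], :]
    print(result["attr"])
    print(result["d1"], result["d2"])
```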
Recall that all cells in a dense fragment must have a value, which TileDB materializes on disk. This characteristic of dense fragments is important as it considerably simplifies spatial indexing, which becomes almost implicit. Consider the example in the figure below. Knowing the space tile extent along each dimension and the tile order, we can easily identify which space tiles intersect with a subarray query without maintaining any complicated index. Then, using lightweight bookkeeping (such as offsets of data tiles on disk, compressed data tile size, etc.), TileDB can fetch the tiles containing results from storage to main memory. Finally, knowing the cell order, it can locate each slab of contiguous cell results in constant time (again without extra indexing) and minimize the number of memory copy operations.
Note that the above ideas apply also to dense fragments that populate only a subset of the array domain; knowing the non-empty domain, TileDB can use similar arithmetic calculations to locate the overlapping tiles and cell results.
The figure below shows an example subarray query on a sparse array with a single attribute, where the query requests also the coordinates of the result cells. Similar to the case of dense arrays, the user can request the results in layouts that may be different from the physical layout of the cells in the array (global order).
Sparse arrays accept multi-range subarray queries as well. Similar to the dense case, global order is not applicable here, but instead an unordered layout is supported that returns the results in an arbitrary order (again, TileDB will try its best to return the results as fast as possible in this read mode).
A sparse fragment differs from a dense fragment in the following aspects:
A sparse fragment stores only non-empty cells that might appear in any position in the domain (i.e., they may not be concentrated in dense hyper-rectangles)
In sparse fragments there is no correspondence between space and data tiles. The data tiles are created by first sorting the cells on the global order, and then grouping adjacent cell values based on the tile capacity.
There is no way to know a priori the position of the non-empty cells, unless we maintain extra indexing information.
A sparse fragment materializes the coordinates of the non-empty cells in data tiles.
Given a subarray query, the R-Tree (which is small enough to fit in main memory) is used to identify the intersecting data tile MBRs. Then, the qualifying coordinate data tiles are fetched and the materialized coordinates therein are used to determine the actual results.
Recall that writing to TileDB arrays produces a number of timestamped fragments. TileDB supports reading an array at an arbitrary instance in time, by providing a timestamp upon opening the array for reading. Any fragment created after that timestamp will be ignored and the read will produce results as if only the fragments created at or before the given timestamp existed in the array. Time traveling applies to both dense and sparse arrays. The figure below shows an example of a dense array with 3 fragments, along with the results of a subarray depending on the timestamp the array gets opened with.
If the user opens the array at a timestamp that is larger than or equal to the second timestamp of a fragment name, then that fragment will be considered in the read.
If the user opens the array at a timestamp that is smaller than the second timestamp of a fragment, then that fragment will be ignored.
If a fragment that qualifies for reading has been consolidated into another fragment that is considered for reading, then it will be ignored.
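With TileDB-Py, time traveling boils down to passing a timestamp when opening the array; the URI and timestamp values below are placeholders.

```python
import tiledb

# Open the array as of a given time (milliseconds since the Unix epoch);
# fragments written after this point are ignored.
with tiledb.open("dense_example", mode="r", timestamp=1672531200000) as A:
    print(A[:]["attr"])

# A (start, end) tuple restricts reads to fragments created within that window.
with tiledb.open("dense_example", mode="r", timestamp=(0, 1672531200000)) as A:
    print(A[:]["attr"])
```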
But what portion of the query is served in each iteration? TileDB implements the incomplete query functionality via result estimation and subarray partitioning. Specifically, if TileDB assesses (via estimation heuristics) that the query subarray leads to a larger result size than the allocated buffers, it splits (i.e., partitions) it appropriately, such that a smaller subarray (single- or multi-range) can be served. The challenge is in partitioning the subarray in a way that the result cell order (defined by the user) is respected across the incomplete query iterations. TileDB efficiently and correctly performs this partitioning process transparently from the user.
TileDB caches data and metadata upon read queries. More specifically, it caches:
Fragment metadata for those fragments that are relevant to a submitted query. There are no restrictions on how large that cache space is. The user can flush the fragment metadata cache by simply closing all open instances of an array.
The use of caching can be quite beneficial especially if the data resides on cloud object stores like AWS S3.
This group of pages describes the internal mechanics of the TileDB Open Source storage engine. It is meant for more advanced users who would like to better understand how we implement the array model to achieve excellent performance and provide features such as atomicity, concurrency, eventual consistency, data versioning, time traveling, and consolidation.
Python:
R:
Java:
Go:
C#:
TileDB-VCF: An extension for storing and accessing genomic variant (VCF) data
: Integrations with , , and
: Integration with and
: Integration with , and
Our blog post Why Arrays as a Universal Data Model is a good starting point for understanding why we chose arrays as first-class citizens in TileDB Open Source.
TileDB started at MIT and Intel Labs as a research project in late 2014 that led to a research paper. In May 2017 it was spun out into TileDB, Inc., a company that has since raised over $20M to further develop and maintain the project.
The open-source storage engine, which is covered in this documentation.
The commercial data management platform called TileDB Cloud, which builds upon TileDB Embedded and offers data governance, scalable serverless compute and more.
TileDB Embedded along with its APIs and integrations are open-source projects and welcome all forms of contributions. Contributors to the project should read the contribution guidelines for more information.
We'd love to hear from you. Drop us a line, join our community channels, or follow us to stay informed of updates and news.
You can also check out the TileDB blog and events (webinars and workshops) to learn more about the TileDB vision, value proposition and use cases, as well as meet the team behind all this amazing work.
Note that reading dense arrays always returns dense results. This means that, if your subarray overlaps with empty (non-materialized) cells in the dense array, TileDB will return fill values for those cells. The figure below shows an example.
TileDB indexes sparse non-empty cells with R-Trees. Specifically, for every coordinate data tile it constructs the minimum bounding rectangle (MBR) using the coordinates in the tile. Then, it uses the MBRs of the data tiles as leaves and constructs an R-Tree bottom up by recursively grouping MBRs into larger MBRs using a fanout parameter. The figure below shows an example of a sparse fragment and its corresponding R-Tree.
In the case of consolidated fragments, time traveling works as follows:
There are situations where the memory allocated by the user to hold the result is not enough for a given query. Instead of erroring out, TileDB gracefully handles these cases by attempting to serve a portion of the query and reporting back with an "incomplete" query status. The user should then consume the returned result and resubmit the query, until the query returns a "complete" status. TileDB maintains all the necessary internal state inside the query object.
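A minimal sketch of this loop in TileDB-Py, assuming the return_incomplete option; the array URI and buffer size are placeholders.

```python
import tiledb

# Use deliberately small result buffers so the query completes in several passes
cfg = tiledb.Config({"py.init_buffer_bytes": str(16 * 1024 * 1024)})
ctx = tiledb.Ctx(cfg)

with tiledb.open("sparse_example", ctx=ctx) as A:
    # Each iteration returns the portion of the result that fit in the buffers
    for chunk in A.query(return_incomplete=True).df[:]:
        print(len(chunk), "cells in this batch")
```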
Data tiles that overlap a subarray (across fragments). This cache space is configurable (see Configuration Parameters). The data tiles are currently cached in their raw "disk" form (i.e., with all potential filters applied, as they are stored in the data files).
TileDB implements additional optimizations that improve decompression times and the overall memory footprint of a read query. Recall that each data tile is further decomposed into chunks. After fetching a data tile that contains result candidates from storage, the TileDB read algorithm knows exactly which chunks of the tile are relevant to the query and decompresses (unfilters) only those chunks.
In addition to using parallelism internally, TileDB is designed with parallel programming in mind. Specifically, scientific computing users may be accustomed to using multi-processing (e.g., via MPI, Dask or Spark), or writing multi-threaded programs to speed up performance. TileDB enables concurrency using a multiple writer / multiple reader model that is entirely lock-free.
Concurrent writes are achieved by having each thread or process create one or more separate fragments for each write operation. No synchronization is needed across processes and no internal state is shared across threads among the write operations and, thus, no locking is necessary. Regarding the concurrent creation of the fragments, thread- and process-safety is achieved because each thread/process creates a fragment with a unique name (as it incorporates a UUID). Therefore, there are no conflicts even at the storage backend level.
TileDB supports lock-free concurrent writes of array metadata as well. Each write creates a separate array metadata file with a unique name (also incorporating a UUID), and thus name collisions are prevented.
When opening the array, TileDB loads the array schema and fragment metadata into main memory once, and shares them across all array objects referring to the same array. Therefore, for the multi-threading case, it is highly recommended that you open the array once outside the atomic block and have all threads create the query on the same array object. This is to prevent the scenario where a thread opens the array, then closes it before another thread opens the array again, and so on. TileDB internally employs a reference-count system, discarding the array schema and fragment metadata each time the array is closed and the reference count reaches zero (the schema and metadata are typically cached, but they still need to be deserialized in the above scenario). Having all concurrent queries use the same array object eliminates the above problem.
Reads in the multi-processing setting are completely independent and no locking is required. In the multi-threading scenario, locking is employed (through mutexes) only when the queries access the tile cache, which incurs a very small overhead.
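Following the recommendation above, the sketch below opens an array once and shares it across a thread pool, with each thread issuing its own query; the URI is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor
import tiledb

with tiledb.open("dense_example") as A:  # open once, share across threads

    def read_row(i):
        # Each thread creates and submits its own query on the shared array object
        return A[i : i + 1, :]["attr"]

    with ThreadPoolExecutor(max_workers=4) as pool:
        rows = list(pool.map(read_row, range(4)))
```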
Concurrent reads and writes can be arbitrarily mixed. Fragments are not visible unless the write query has been completed (and the .ok file appeared). Fragment-based writes make it so that reads simply see the logical view of the array without the new (incomplete) fragment. This multiple writers / multiple readers concurrency model of TileDB is more powerful than competing approaches, such as HDF5's single writer / multiple readers (SWMR) model. This feature comes with a more relaxed consistency model, which is described in the Consistency section.
Consolidation can be performed in the background in parallel with and independently of other reads and writes. The new fragment that is being created is not visible to reads before consolidation is completed.
Vacuuming deletes fragments that have been consolidated. Although it can never lead to a corrupted array state, it may lead to issues if there is a read operation that accesses a fragment that is being vacuumed. This is possible when the array is opened at a timestamp before some consolidation operation took place, therefore considering the fragment to be vacuumed. Most likely, that will lead to a segfault or some unexpected behavior.
TileDB locks the array upon vacuuming to prevent the above. This is achieved via mutexes in multi-threading, and file locking in multi-processing (for those storage backends that support it).
All POSIX-compliant filesystems and Windows filesystems support file locking. Note that Lustre supports POSIX file locking semantics and exposes local- (mount with -o localflock) and cluster- (mount with -o flock) level locking. Currently, TileDB does not use file locking on HDFS and S3 (these storage backends do not provide such functionality, but rather resource locking must be implemented as an external feature). For filesystems that do not support file locking, the multi-processing programs are responsible for synchronizing the concurrent writes.
Particular care must be taken when vacuuming arrays on AWS S3 and HDFS. Without file locking, TileDB has no way to prevent vacuuming from deleting the old consolidated fragments. If another process is reading those fragments while vacuuming is deleting them, the reading process is likely to error out or crash.
In general, avoid vacuuming while time traveling (i.e., reading at past timestamps) on cloud object stores. It is generally safe to vacuum if you are reading the array at the current timestamp.
Array creation (i.e., storing the array schema on persistent storage) is not thread-/process-safe. We do not expect a practical scenario where multiple threads/processes attempt to create the same array in parallel. We suggest that only one thread/process creates the array, before multiple threads/processes start working concurrently for writes and reads.
TileDB is fully parallelized internally, i.e., it uses multiple threads to process in parallel the most heavyweight tasks.
We explain how TileDB parallelizes the read and write queries, outlining the configuration parameters that you can use to control the amount of parallelization. Note that here we cover only the most important areas, as TileDB parallelizes numerous other internal tasks. See Configuration Parameters and Configuration for a summary of the parameters and the way to set them respectively.
A read query mainly involves the following steps in this order:
1. Identifying the physical attribute data tiles that are relevant to the query (pruning the rest).
2. Performing parallel IO to retrieve those tiles from the storage backend.
3. Unfiltering the data tiles in parallel to get the raw cell values and coordinates.
4. Performing a refining step to get the actual results and organize them in the query layout.
TileDB parallelizes all steps, but here we discuss mainly steps (2) and (3) that are the most heavyweight.
TileDB reads the relevant tiles from all attributes to be read in parallel as follows:
TileDB computes the byte ranges required to be fetched from each attribute file. Those byte ranges might be disconnected and could be numerous, especially in the case of multi-range subarrays. In order to reduce the latency of the IO requests (especially on S3), TileDB attempts to merge byte ranges that are close to each other and dispatch fewer larger IO requests instead of numerous smaller ones. More specifically, TileDB merges two byte ranges if their gap size is not bigger than vfs.min_batch_gap and their resulting size is not bigger than vfs.min_batch_size. Then, each byte range (always corresponding to the same attribute file) becomes an IO task. These IO tasks are dispatched for concurrent execution, where the maximum level of concurrency is controlled by the sm.io_concurrency_level parameter.
TileDB may further partition each byte range to be fetched based on the parameters vfs.file.max_parallel_ops (for POSIX and Windows), vfs.s3.max_parallel_ops (for S3) and vfs.min_parallel_size. Those partitions are then read in parallel. Currently, the maximum number of parallel operations for HDFS is set to 1, i.e., this task parallelization step does not apply to HDFS.
Once the relevant data tiles are in main memory, TileDB "unfilters" them (i.e., runs the filters applied during writes in reverse) in parallel in a nested manner as follows:
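The nested scheme can be pictured with the following schematic sketch (placeholder names, not the actual C++ implementation); both loops are executed in parallel by TileDB's internal thread pools.

```python
# Schematic only: both loops run in parallel, bounded by sm.compute_concurrency_level.
for tile in relevant_data_tiles:      # parallel across attributes and their data tiles
    for chunk in tile.chunks():       # parallel across the ~64KB chunks of each tile
        unfilter(chunk)               # reverse compression/encryption/checksum filters
```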
The “chunks” of a tile are controlled by a TileDB filter list parameter that defaults to 64KB.
The sm.compute_concurrency_level parameter impacts the for loops above, although it is not recommended to modify this configuration parameter from its default setting. The nested parallelism in reads allows for maximum utilization of the available cores for filtering (e.g., decompression), in either the case where the query intersects few large tiles or many small tiles.
A write query mainly involves the following steps in this order:
1. Re-organizing the cells in the global cell order and into attribute data tiles.
2. Filtering the attribute data tiles to be written.
3. Performing parallel IO to write those tiles to the storage backend.
TileDB parallelizes all steps, but here we discuss mainly steps (2) and (3) that are the most heavyweight.
For writes TileDB uses a similar strategy as for reads:
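Schematically (again with placeholder names, not the actual C++ implementation), the filtering stage of a write is parallelized as:

```python
# Schematic only: both loops run in parallel, bounded by sm.compute_concurrency_level.
for tile in attribute_data_tiles:     # parallel across attributes and their data tiles
    for chunk in tile.chunks():       # parallel across the chunks of each tile
        apply_filters(chunk)          # compression, checksums, encryption, etc.
```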
Similar to reads, the sm.compute_concurrency_level parameter impacts the for loops above, although it is not recommended to modify this configuration parameter from its default setting.
Similar to reads, IO tasks are created for each tile of every attribute. These IO tasks are dispatched for concurrent execution, where the maximum level of concurrency is controlled by the sm.io_concurrency_level parameter. For HDFS, this is the only parallelization TileDB provides for writes. For the other backends, TileDB parallelizes the writes further.
For POSIX and Windows, if a data tile is large enough, the VFS layer partitions the tile based on the configuration parameters vfs.file.max_parallel_ops and vfs.min_parallel_size. Those partitions are then written in parallel using the VFS thread pool, whose size is controlled by vfs.io_concurrency.
For S3, TileDB buffers potentially several tiles and issues parallel multipart upload requests to S3. The size of the buffer is equal to vfs.s3.max_parallel_ops * vfs.s3.multipart_part_size. When the buffer is filled, TileDB issues vfs.s3.max_parallel_ops parallel multipart upload requests to S3.
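The knobs mentioned above are plain configuration parameters; a TileDB-Py sketch of setting a few of them (the values and array URI are illustrative only) looks like this:

```python
import tiledb

cfg = tiledb.Config({
    "sm.compute_concurrency_level": "8",             # threads for filtering/unfiltering
    "sm.io_concurrency_level": "8",                  # threads for IO tasks
    "vfs.min_batch_size": str(20 * 1024 * 1024),     # merge nearby byte ranges
    "vfs.min_batch_gap": str(500 * 1024),
    "vfs.s3.max_parallel_ops": "8",
    "vfs.s3.multipart_part_size": str(5 * 1024 * 1024),
})
ctx = tiledb.Ctx(cfg)

with tiledb.open("my_array", ctx=ctx) as A:          # URI is a placeholder
    data = A[:]
```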
In TileDB, reads, writes, consolidation and vacuuming are all atomic and will never lead to array corruption.
A read operation is the process of (i) creating a read query object and (ii) submitting the query (potentially multiple times in the case of incomplete queries) until the query is completed. Each such read operation is atomic and can never corrupt the state of an array.
A write operation is the process of (i) creating a write query object, (ii) submitting the query (potentially multiple times in the case of global writes) and (iii) finalizing the query object (important only in global writes). Each such write operation is atomic, i.e., it is a set of functions (which depends on the API) that must be treated atomically by each thread. For example, multiple threads should not submit the query for the same query object. Instead, you can have multiple threads create separate query objects for the same array (even sharing the same context or array object) and prepare and submit them in parallel.
A write operation either succeeds and creates a fragment that is visible to future reads, or it fails and any folder and file relevant to the failed fragment is entirely ignored by future reads. A fragment creation is successful if a file <fragment_name>.ok appears in the array folder for the created fragment <fragment_name>. There will never be the case that a fragment will be partially written and still accessible by the reader. The user just needs to eventually delete the partially written folder to save space (i.e., a fragment folder without an associated .ok file). Furthermore, each fragment is immutable, so there is no way for a write operation to corrupt another fragment created by another operation.
Consolidation entails a read and a write and, therefore, it is atomic in the same sense as for writing. There is no way for consolidation to lead to a corrupted array state.
Vacuuming simply deletes fragment folders and array/fragment metadata files. Vacuuming always deletes the .ok files before proceeding to erasing the corresponding folders. It is atomic in the sense that it cannot lead to array corruption and if the vacuuming process is interrupted, it can be restarted without issues.
TileDB supports fast and parallel aggregation of results. Currently, the results can only be aggregated over the whole returned dataset, which this page calls the default channel. To add aggregates to a query, the first thing to do is to get the default channel. For count (a nullary aggregate), no operation needs to be created. For the other aggregates, an operation needs to be created on the desired column. That operation can then be applied to the default channel, while defining the output field name for the result (for count, a constant operation is provided that can be applied directly). Finally, buffers to receive the aggregate results can be specified using the regular buffer APIs on the query (see Basic Reading).
Note that ranges and query conditions can still be used to limit the rows to aggregate. Also note that TileDB allows getting the data and computing aggregates simultaneously. To do so, it is only required to specify buffers for the desired columns at the same time as the aggregated results. Here, the result of the aggregation will be available once the query is in a completed state (see Incomplete Queries).
Finally, here is a list of supported operations and information about the supported input field data types and the output data type.

Operation   | Input field types     | Output type
Count       | N/A                   | UINT64
Sum         | Numeric fields        | INT64 (signed), UINT64 (unsigned), FLOAT64 (floating point)
Min/Max     | Numeric/string fields | Same as input type
Null count  | Nullable fields       | UINT64
Mean        | Numeric fields        | FLOAT64
The operations are referred to by the following names:

Operation   | Name
Count       | "count"
Null count  | "null_count"
Sum         | "sum"
Min/Max     | "min", "max"
Mean        | "mean"
The presence of numerous fragments may impact the TileDB read performance. This is because many fragments would lead to fragment metadata being loaded to main memory from numerous different files in storage. Moreover, the locality of the result cells of a subarray query may be destroyed in case those cells appear in multiple different fragment files, instead of concentrated byte regions within the same fragment files.
To mitigate this problem, TileDB has a consolidation feature, which allows you to merge
Lightweight fragment metadata footers into a single file.
A subset of fragments into a single fragment.
A subset of array metadata files into a single one.
Consolidation is thread-/process-safe and can be done in the background while you continue reading from the array without being blocked. Moreover, consolidation does not hinder the ability to do time traveling at a fine granularity, as it does not delete fragments that participated in consolidation (and, therefore, they are still queryable). The user is responsible for vacuuming fragments, fragment metadata and array metadata that got consolidated to save space, at the cost of not being able to time travel across the old (finer) fragments.
Each fragment metadata file (located in a fragment folder) contains some lightweight information in its footer. This is mostly the non-empty domain and offsets for other metadata included in other parts of the file. If there are numerous fragments, reading the array may be slow on cloud object stores due to the numerous REST requests to fetch the fragment metadata footers. TileDB offers a consolidation process (with mode fragment_meta), which merges the fragment metadata footers of a subset of fragments into a single file that has suffix .meta, stored in the array folder. This file is named similarly to fragments, i.e., it carries a timestamp range that helps with time traveling. It also contains all the URIs of the fragments whose metadata footers are consolidated in that file. Upon reading an array, only this file is efficiently fetched from the backend, since it is typically very small in size (even for hundreds of thousands of fragments).
If mode fragments is passed to the consolidation function, then the fragment consolidation algorithm is executed, which is explained in detail below.
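Each consolidation mode is selected through the sm.consolidation.mode configuration parameter; a TileDB-Py sketch (the array URI is a placeholder):

```python
import tiledb

uri = "my_array"

# Merge fragment metadata footers into a single .meta file
tiledb.consolidate(uri, config=tiledb.Config({"sm.consolidation.mode": "fragment_meta"}))

# Merge a subset of fragments into a single fragment
tiledb.consolidate(uri, config=tiledb.Config({"sm.consolidation.mode": "fragments"}))

# Merge array metadata files into a single one
tiledb.consolidate(uri, config=tiledb.Config({"sm.consolidation.mode": "array_meta"}))
```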
There are two important points to stress regarding fragment consolidation:
Consolidating dense fragments produces a dense fragment, and may induce fill values.
Consolidating fragments where all fragments are sparse produces a sparse fragment.
The figure below shows consolidation of two dense fragments, the first containing only full tiles, and the second containing two tiles with a single cell written to each. Note that this can occur only in dense arrays, since sparse arrays can have only sparse fragments. The array in the figure has a 2x2 space tiling. Recall that a dense fragment consists of a dense hyper-rectangle and that it stores only integral tiles. Due to the partial cell in the second fragment that is located in the lower left space tile, the dense hyper-rectangle of the produced consolidated dense fragment must cover all four space tiles. Therefore, TileDB must fill the empty cells in this hyper-rectangle with empty values, illustrated in grey color in the figure below.
Consolidating only sparse fragments is simpler. The figure below illustrates consolidation of two sparse fragments, where the resulting consolidated fragment is also sparse and there is no injection of empty values.
Recall that each fragment is associated with its creation timestamp upon writing. A consolidated fragment instead is associated with the timestamp range that includes the timestamps of the fragments that produced it (see Consolidated Fragments). This is particularly important for time traveling, since opening an array at a timestamp will consider all the consolidated fragments whose end timestamp is at or before the query timestamp. In other words, although consolidation generally leads to better performance, it affects the granularity of time traveling.
Before the consolidation algorithm begins, TileDB applies a simple optimization in a pre-processing step, which may lead to great performance benefits depending on the “shape” of the existing fragments. Specifically, TileDB identifies dense fragments whose non-empty domain completely covers older adjacent (dense or sparse) fragments, and directly deletes the old fragment directories without performing any actual consolidation.
This clean-up process is illustrated with an example in the figure below. Suppose the first fragment is dense and covers the entire array, i.e., [1,4], [1,4], the second is dense and covers [1,2], [1,2], the third is sparse as shown in the figure, and the fourth one is dense covering [1,2], [1,4]. Observe that, if those four fragments were to be consolidated, the cells of the second and third fragments would be completely overwritten by the cells of the fourth fragment. Therefore, the existence of those two fragments would make no difference to the consolidation result. Deleting them altogether before the consolidation algorithm commences will boost the algorithm's performance (since fewer cells will be read and checked for overwrites).
The consolidation algorithm is performed in steps. In each step, a subset of adjacent (in the timeline) fragments is selected for consolidation. The algorithm proceeds until a configured number of steps has been executed, or until the algorithm determines that no further fragments should be consolidated. The choice of the next fragment subset for consolidation is based on certain rules and user-defined parameters, explained below. The number of steps is controlled by sm.consolidation.steps.
Let us focus on a single step, during which the algorithm must select and consolidate a subset of fragments based on certain criteria:
The first criterion is whether a subset of fragments is “consolidatable”, i.e., eligible for consolidation in a way that does not violate correctness. Any subset consisting solely of sparse fragments is always consolidatable. However, if a fragment subset contains one or more dense fragments, TileDB performs an important check: if the union of the non-empty domains of the fragments (which is equal to the non-empty domain of the resulting consolidated fragment) overlaps with any fragment created prior to this subset, then the subset is marked as non-consolidatable. Recall that the fragment resulting from consolidating a subset of fragments containing at least one dense fragment is always a dense fragment. Therefore, empty regions in the non-empty domain of the consolidated fragment will be filled with special values. Those values may erroneously overwrite older valid cell values. Such a scenario is illustrated in the figure below. The second and third fragments are not consolidatable, since their non-empty domain contains empty regions that overlap with the first (older) fragment. Consequently, consolidating the second and third fragments would result in a logical view that is not identical to the one before consolidation, violating correctness. This criterion detects and prevents such cases.
The second criterion is the comparative fragment size. Ideally, we must consolidate fragments of approximately equal size. Otherwise, we may end up in a situation where, for example, a 100GB fragment gets consolidated with a 1MB one, which would unnecessarily waste consolidation time. This is controlled by the parameter sm.consolidation.step_size_ratio; if the size ratio of two adjacent fragments is smaller than this parameter, then no fragment subset that contains those two fragments will be considered for consolidation.
The third criterion is the fragment amplification factor, applicable to the case where the fragment subset to be consolidated contains at least one dense fragment. If the non-empty domain of the resulting fragment has too many empty cells, its size may become considerably larger than the sum of sizes of the original fragments to be consolidated. This is because the consolidated fragment is dense and inserts special fill values for all empty cells in its non-empty domain (see figure below). The amplification factor is the ratio between the consolidated fragment size and the sum of sizes of the original fragments. This is controlled by sm.consolidation.amplification, which must not be exceeded for a fragment subset to be eligible for consolidation. The default value 1.0 means that the fragments will be consolidated only if there is no amplification at all, i.e., if the size of the resulting consolidated fragment is smaller than or equal to the sum of sizes of the original fragments. As an example, this happens when the non-empty domain of the consolidated fragment does not contain any empty cells.
The fourth criterion is the collective fragment size. Among all eligible fragment subsets for consolidation, we must first select to consolidate the ones that have the smallest sum of fragment sizes. This will quickly reduce the number of fragments (hence boosting read performance), without resorting to costly consolidation of larger fragments.
The final criterion is the number of fragments to consolidate in each step. This is controlled by sm.consolidation.step_min_frags and sm.consolidation.step_max_frags; the algorithm will select the subset of fragments (complying with all the above criteria) that has the maximum cardinality smaller than or equal to sm.consolidation.step_max_frags and larger than or equal to sm.consolidation.step_min_frags. If no fragment subset is eligible with cardinality at least sm.consolidation.step_min_frags, then the consolidation algorithm terminates.
The algorithm is based on dynamic programming and runs in time O(max_frags * total_frags), where total_frags is the total number of fragments considered in a given step, and max_frags is equal to the sm.consolidation.step_max_frags config parameter.
When computing the union of the non-empty domains of the fragments to be consolidated, if there is at least one dense fragment, the union is always expanded to coincide with the space tile extents. This affects criterion 1 (since the expanded domain union may now overlap with some older fragments) and criterion 3 (since the expanded union may amplify the resulting consolidated fragment size).
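As a minimal sketch in Python (TileDB-Py), the parameters above can be set through a configuration object that is passed to the consolidation call; the array URI and the parameter values are hypothetical examples, not recommendations:

```python
import tiledb

array_uri = "my_array"  # hypothetical array URI

config = tiledb.Config({
    "sm.consolidation.steps": "10",             # maximum number of consolidation steps
    "sm.consolidation.step_size_ratio": "0.5",  # adjacent fragments must have a size ratio >= 0.5
    "sm.consolidation.amplification": "1.0",    # skip subsets whose result would be amplified
    "sm.consolidation.step_min_frags": "2",     # consolidate at least 2 fragments per step
    "sm.consolidation.step_max_frags": "8",     # and at most 8 fragments per step
})

tiledb.consolidate(array_uri, config=config)
```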
Similar to array fragments, array metadata can also be consolidated (with mode array_meta). Since the array metadata is typically small and can fit in main memory, consolidating it is rather simple. TileDB simply reads all the array metadata (from all the existing array metadata fragments) into main memory, creates an up-to-date view of the metadata, and then flushes it to a new array metadata file that carries in its name the timestamp range determined by the first timestamp of the first and the second timestamp of the last array metadata files that got consolidated.
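A hedged TileDB-Py sketch of array metadata consolidation, assuming an existing array at a hypothetical URI:

```python
import tiledb

array_uri = "my_array"  # hypothetical array URI

# Consolidate only the array metadata by selecting the array_meta consolidation mode.
tiledb.consolidate(array_uri, config=tiledb.Config({"sm.consolidation.mode": "array_meta"}))
```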
Vacuuming applies to consolidated fragments, consolidated array metadata and consolidated fragment metadata as follows:
Fragments: During consolidation, a .vac file is produced with all the fragment URIs that participated in consolidation. When the vacuuming function is called with mode "fragments", all the fragment folders whose URI is in the .vac file get deleted.
Array metadata: During consolidation, a .vac file is produced with all the array metadata URIs that participated in consolidation. When the vacuuming function is called with mode "array_meta", all the array metadata files whose URI is in the .vac file get deleted.
Fragment metadata: Vacuuming simply deletes all .meta files except for the last one.
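The sketch below shows how these vacuuming modes can be invoked from Python (TileDB-Py); the array URI is hypothetical, and vacuuming should only be run once time traveling to the vacuumed fragments is no longer needed:

```python
import tiledb

array_uri = "my_array"  # hypothetical array URI

# Vacuum consolidated fragments, array metadata and fragment metadata in turn.
for mode in ("fragments", "array_meta", "fragment_meta"):
    tiledb.vacuum(array_uri, config=tiledb.Config({"sm.vacuum.mode": mode}))
```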
TileDB is architected to support parallel batch writes, i.e., writing collections of cells with multiple processes or threads. Each write operation creates one or more dense or sparse fragments. Updating an array is equivalent to initiating a new write operation, which could either insert cells in unpopulated areas of the domain or overwrite existing cells (or a combination of the two). TileDB handles each write separately and without any locking. Each fragment is immutable, i.e., write operations always create new fragments, without altering any other fragment.
A dense write is applicable to dense arrays and creates one or more dense fragments. In a dense write, the user provides:
The subarray to write into (it must be single-range).
The buffers that contain the attribute values of the cells that are being written.
The cell order within the subarray (which must be common across all attributes), so that TileDB knows which values correspond to which cells in the array domain. The cell order may be row-major, column-major, or global.
The example below illustrates writing into a subarray of an array with a single attribute. The figure depicts the order of the attribute values in the user buffers for the case of row- and column-major cell order. TileDB knows how to appropriately re-organize the user-provided values so that they obey the global cell order before storing them to disk. Moreover, note that TileDB always writes integral space tiles to disk. Therefore, it will inject special empty values (depicted in grey below) into the user data to create full data tiles for each space tile.
Writing in the array global order needs a little bit more care. The subarray must be specified such that it coincides with space tile boundaries, even if the user wishes to write in a smaller area within that subarray. The user is responsible for manually adding any necessary empty cell values in her buffers. This is illustrated in the figure below, where the user wishes to write in the blue cells, but has to expand the subarray to coincide with the two space tiles and provide the empty values for the grey cells as well. The user must provide all cell values in the global order, i.e., following the tile order of the space tiles and the cell order within each space tile.
Writing in global order requires knowledge of the space tiling and cell/tile order, and is rather cumbersome to use. However, this write mode leads to the best performance, because TileDB does not need to internally re-organize the cells along the global order. It is recommended for use cases where the data arrive already grouped according to the space tiling and global order (e.g., in geospatial applications).
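As a minimal Python (TileDB-Py) sketch of a dense write in row-major layout, assuming a hypothetical 2D dense array with a 2x3 domain and a single int32 attribute named "a":

```python
import numpy as np
import tiledb

array_uri = "dense_array"  # hypothetical 2D dense array (2x3 domain, int32 attribute "a")

# Row-major values covering the full domain; slicing a smaller region writes into a subarray instead.
data = np.array([[1, 2, 3],
                 [4, 5, 6]], dtype=np.int32)

with tiledb.open(array_uri, mode="w") as A:
    A[:] = data  # TileDB re-organizes the values along the global order before writing to disk
```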
TileDB uses the following default fill values for empty cells in dense writes, noting that the user can specify any other fill value upon array creation:
| Datatype | Default fill value |
| --- | --- |
| TILEDB_CHAR | Minimum char value |
| TILEDB_INT8 | Minimum int8 value |
| TILEDB_UINT8 | Maximum uint8 value |
| TILEDB_INT16 | Minimum int16 value |
| TILEDB_UINT16 | Maximum uint16 value |
| TILEDB_INT32 | Minimum int32 value |
| TILEDB_UINT32 | Maximum uint32 value |
| TILEDB_INT64 | Minimum int64 value |
| TILEDB_UINT64 | Maximum uint64 value |
| TILEDB_FLOAT32 | NaN |
| TILEDB_FLOAT64 | NaN |
| TILEDB_ASCII | 0 |
| TILEDB_UTF8 | 0 |
| TILEDB_UTF16 | 0 |
| TILEDB_UCS2 | 0 |
| TILEDB_UCS4 | 0 |
| TILEDB_ANY | 0 |
| TILEDB_DATETIME_* | Minimum int64 value |
In case a fixed-sized attribute stores more than one value per cell, all of the cell's values will be assigned the corresponding default fill value shown above.
Sparse writes are applicable to sparse arrays and create one or more sparse fragments. The user must provide:
The attribute values to be written.
The coordinates of the cells to be written.
The cell layout of the attribute and coordinate values to be written (must be the same across attributes and dimensions). The cell layout may be unordered or global.
Note that sparse writes do not need to be constrained in a subarray, since they contain the explicit coordinates of the cells to write into. The figure below shows a sparse write example with the two cell layouts. The unordered layout is the easiest and most typical; TileDB knows how to appropriately re-organize the cells along the global order internally before writing the values to disk. The global layout is once again more efficient but also more cumbersome, since the user must know the space tiling and the tile/cell order of the array, and manually sort the values before providing them to TileDB.
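A minimal Python (TileDB-Py) sketch of an unordered sparse write, assuming a hypothetical 2D sparse array with integer dimensions and a single int32 attribute named "a":

```python
import numpy as np
import tiledb

array_uri = "sparse_array"  # hypothetical 2D sparse array with int32 attribute "a"

# Explicit coordinates of the non-empty cells, plus the corresponding attribute values.
d1 = np.array([1, 2, 4], dtype=np.int64)
d2 = np.array([1, 3, 2], dtype=np.int64)
values = np.array([10, 20, 30], dtype=np.int32)

with tiledb.open(array_uri, mode="w") as A:
    A[d1, d2] = {"a": values}  # unordered layout; TileDB sorts along the global order internally
```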
TileDB enables concurrent writes and reads that can be arbitrarily mixed, without affecting the normal execution of a parallel program. This comes with a more relaxed consistency model, called eventual consistency. Informally, this guarantees that, if no new updates are made to an array, eventually all accesses to the array will “see” the last collective global view of the array (i.e., one that incorporates all the updates). Everything discussed in this section about array fragments is also applicable to array metadata.
We illustrate the concept of eventual consistency in the figure below (which is the same for both dense and sparse arrays). Suppose we perform two writes in parallel (by different threads or processes), producing two separate fragments. Assume also that there is a read at some point in time, which is also performed by a third thread/process (potentially in parallel with the writes). There are five possible scenarios regarding the logical view of the array at the time of the read (i.e., five different possible read query results). First, no write may have completed yet, therefore the read sees an empty array. Second, only the first write got completed. Third, only the second write got completed. Fourth, both writes got completed, but the first write was the one to create a fragment with an earlier timestamp than the second. Fifth, both writes got completed, but the second write was the one to create a fragment with an earlier timestamp than the first.
The concept of eventual consistency essentially tells you that, eventually (i.e., after all writes have completed), you will see the view of the array with all updates in. The order of the fragment creation will determine which cells are overwritten by others and, hence, greatly affects the final logical view of the array.
Eventual consistency allows high availability and concurrency. This model is followed by the AWS S3 object store and, thus, TileDB is ideal for integrating with such distributed storage backends. If strict consistency is required for some application (e.g., similar to that in transactional databases), then an extra layer must be built on top of TileDB Open Source to enforce additional synchronization.
But how does TileDB deal internally with consistency? This is where opening an array becomes important. When you open an array (at the current time or at a time in the past), TileDB takes a snapshot of the already completed fragments. This is the view of the array for all queries that will be using that opened array object. If writes happen (or get completed) after the array got opened, the queries will not see the new fragments. If you wish to see the new fragments, you will need to either open a new array object and use that one for the new queries, or reopen the array (reopening the array bypasses closing it first, permitting some performance optimizations).
We illustrate with the figure below. The first array depicts the logical view when opening the array. Next suppose a write occurs (after opening the array) that creates the fragment shown as the second array in the figure. If we attempt to read from the opened array, even after the new fragment creation, we will see the view of the third array in the figure. In other words, we will not see the updates that occurred between opening and reading from the array. If we'd like to read from the most up-to-date array view (fourth array in the figure), we will need to reopen the array after the creation of the fragment.
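The following Python (TileDB-Py) sketch illustrates the behavior described above; the array URI and timestamp are hypothetical:

```python
import tiledb

array_uri = "my_array"  # hypothetical array URI

# Opening takes a snapshot of the fragments that are complete at this moment.
A = tiledb.open(array_uri, mode="r")

# ... a concurrent write may create a new fragment here; queries on A will not see it ...

# Reopen to refresh the snapshot (cheaper than close + open), or open a new array object.
A.reopen()

# Time traveling: open the array at an earlier timestamp (milliseconds since the Unix epoch)
# to see only the fragments created at or before that time.
with tiledb.open(array_uri, mode="r", timestamp=1577836800000) as A_old:
    pass

A.close()
```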
When you write to TileDB with multiple processes, if your application is the one to be synchronizing the writes across machines, make sure that the machine clocks are synchronized as well. This is because TileDB sorts the fragments based on the timestamp in their names, which is calculated based on the machine clock.
Here is how TileDB reads achieve eventual consistency on AWS S3:
Upon opening the array, list the fragments in the array folder
Consider only the fragments that have an associated .ok file (the ones that do not have one are either in progress or not yet visible due to S3's eventual consistency)
The .ok file is PUT after all the fragment data and metadata files have been PUT in the fragment folder.
Any access inside the fragment folder is performed with a byte range GET request, never with LIST. Due to S3’s read-after-write consistency model, those GET requests are guaranteed to succeed.
The above practically tells you that a read operation will always succeed and never be corrupted (i.e., it will never have results from partially written fragments), but it will consider only the fragments that S3 makes visible (in their entirety) at the timestamp of opening the array.
TileDB allows you to encrypt your arrays at rest. It currently supports a single type of encryption, AES-256 in the GCM mode, which is a symmetric, authenticated encryption algorithm. When creating, reading or writing arrays you must provide the same 256-bit encryption key. The authenticated nature of the encryption scheme means that a message authentication code (MAC) is stored together with the encrypted data, allowing verification that the persisted ciphertext was not modified.
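As a hedged Python (TileDB-Py) sketch, the encryption key can be supplied through the configuration when opening an encrypted array; the URI and key below are placeholders:

```python
import tiledb

array_uri = "encrypted_array"             # hypothetical encrypted array URI
key = "0123456789abcdeF0123456789abcdeF"  # example 32-byte (256-bit) key; do not hard-code real keys

ctx = tiledb.Ctx(tiledb.Config({
    "sm.encryption_type": "AES_256_GCM",
    "sm.encryption_key": key,
}))

with tiledb.open(array_uri, mode="r", ctx=ctx) as A:
    print(A.schema)  # reads succeed only if the same key used at creation is provided
```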
Encryption libraries used:
macOS and Linux: OpenSSL
Windows: the native Windows cryptography API (CNG)
By default, TileDB caches array data and metadata in main memory after opening and reading from arrays. These caches will store decrypted (plaintext) array data in the case of encrypted arrays. For a bit of extra in-flight security (at the cost of performance), you can disable the TileDB caches via the relevant configuration parameters.
TileDB never persists the encryption key, but TileDB does store a copy of the encryption key in main memory while an encrypted array is open. When the array is closed, TileDB will zero out the memory used to store its copy of the key, and free the associated memory.
Due to the extra processing required to encrypt and decrypt array metadata and attribute data, you may experience lower performance on opening, reading and writing for encrypted arrays.
To mitigate this, TileDB internally parallelizes encryption and decryption using a chunking strategy. Additionally, when compression or other filtering is configured on array metadata or attribute data, encryption occurs last, meaning the compressed (or, in general, filtered) data is what gets encrypted.
Finally, newer generations of some Intel and AMD processors offer instructions for hardware acceleration of encryption and decryption. The encryption libraries that TileDB employs are configured to use hardware acceleration if it is available.
The array metadata are simple key-value pairs that the user can attach to an array. The key is a string and the value can be of any datatype. The array metadata is typically small. Time traveling applies to array metadata as well, i.e., opening an array at a timestamp will fetch only the array metadata created at or before the given timestamp.
Both the array schema and the array metadata store information about the array, and the user is responsible for setting and configuring them. The easiest way to remember the difference between the array metadata and the array schema is the following:
The array metadata stores user-specific data about the array in the form of arbitrary key-value pairs.
The array schema stores system-specific data about the array that has a fixed structure (e.g., a dimension name, domain and datatype).
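A minimal Python (TileDB-Py) sketch of writing and reading array metadata, assuming an existing array at a hypothetical URI:

```python
import tiledb

array_uri = "my_array"  # hypothetical array URI

# Write key-value metadata; keys are strings, values can be numbers, strings, etc.
with tiledb.open(array_uri, mode="w") as A:
    A.meta["description"] = "daily temperature readings"
    A.meta["version"] = 3

# Read the metadata back.
with tiledb.open(array_uri, mode="r") as A:
    print(A.meta["description"], A.meta["version"])
```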
The array schema stores all the details about the array definition. Some of the data it holds are:
Attributes (name, datatype, filters)
Dimensions (name, datatype, domain, filters)
Tile extent and capacity
Tile and cell order
See the array schema documentation for more details.
TileDB Cloud offers a very simple way of sharing arrays with anyone on the planet. It effectively provides a way to define access control on arrays, and to log every single action for auditing purposes.
A non-empty cell (in either a dense or sparse array) is not limited to storing a single value. Each cell stores a tuple with a structure that is common to all cells. Each tuple element corresponds to a value on a named attribute of a certain type. An attribute can be:
Fixed-sized: an attribute value in a cell may consist of one or a fixed number of values of the same datatype
Variable-sized: an attribute value in a cell may consist of a variable number of values of the same datatype, i.e., different cells may store a different number of values on this attribute.
The figure below shows an example of an array with 3 attributes: a1 of type int32, a2 of type char:var and a3 of type float32:2. Every non-empty cell must store 1 int32 value on a1, any number of char values on a2, and exactly 2 float32 values on a3.
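The three attributes of the figure could be declared in Python (TileDB-Py) roughly as sketched below; the names come from the figure, and the compound NumPy dtype used for the two-value float32 attribute is an assumption that may vary across TileDB-Py versions:

```python
import numpy as np
import tiledb

a1 = tiledb.Attr(name="a1", dtype=np.int32)                      # exactly 1 int32 value per cell
a2 = tiledb.Attr(name="a2", dtype=str)                           # variable number of characters per cell
a3 = tiledb.Attr(name="a3", dtype=np.dtype("float32, float32"))  # assumed encoding of 2 float32 values per cell
```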
An ordered tuple of dimension domain values, called coordinates, identifies an array cell. The order of the coordinates must follow the order in which the array dimensions were specified. The figure below depicts an example of cell (3, 4), assuming that the dimension order is d1, d2.
The coordinates of an array cell form an ordered tuple of dimension domain values that identifies the cell. In dense arrays, the coordinates of each cell are unique. In sparse arrays, the same coordinates may appear more than once.
TileDB adopts the so-called columnar format and stores the (non-empty) cell values for each attribute separately. A data tile is a subset of cell values on a particular attribute. We explain the data tile separately for dense and sparse fragments, and its relationship to the space tile. The data tile is the atomic unit of compression and IO.
Contrary to dense fragments, there is no correspondence between space tiles and data tiles in sparse fragments. Consider the 8x8 fragment with 4x4 space tiles in the figure below, and assume for simplicity that the array stores a single int32 attribute. The non-empty cells are depicted in blue. If we followed the data tiling technique of dense fragments, we would have to create 4 data tiles, one for each space tile. However, TileDB does not materialize empty cells, i.e., it stores only the values of the non-empty cells in the data files. Therefore, the space tiles would produce 4 data tiles with 3 (upper left), 12 (upper right), 1 (lower left) and 2 (lower right) non-empty cells.
The physical tile size imbalance that may result from space tiling can lead to ineffective compression (if numerous data tiles contain only a handful of values) and inefficient reads (if the subarray you wish to read only partially intersects with a huge tile, which needs to be fetched in its entirety and potentially decompressed). Ideally, we want every data tile to store the same number of non-empty cells. Recall that this is achieved in the dense case by definition, since each space tile has the same shape (equal number of cells) and all cells in each space tile are non-empty. Finally, since the distribution of the non-empty cells in the array may be arbitrary, it is extremely difficult, or even impossible, to fine-tune the space tiling in a way that leads to load-balanced data tiles storing an acceptable number of non-empty cells.
In other words, the space tiles in sparse fragments are used to determine the global cell order that will dictate which cell values will be grouped together in the same data tile. Another difference to dense fragments is that sparse fragments create extra data tiles for the coordinates of the non-empty cells, which is important in reading.
The main differences between a dense and a sparse array are the following:
A dense array is used when the majority of the cells are non-empty (within any hyper-rectangular sub domain), whereas a sparse array when the majority of the cells are empty.
The dimensions of a dense array must have the same datatype, whereas the dimensions of a sparse array may have different datatypes.
The dimensions of a dense array can only be of integer data type, whereas the dimensions of a sparse array may be of any data type (even real or string).
Every cell in a dense array is uniquely identified by its coordinates, whereas a sparse array can permit multiplicities, i.e., cells with the same coordinates but potentially different attribute values, as well as real (float32, float64) and string domains.
TileDB provides a unified API for both dense and sparse arrays.
A multi-dimensional array consists of a set of ordered dimensions. A dimension has a name, a datatype and a domain. The figure below shows an example of two int32 dimensions, d1 with domain [1,4] and d2 with domain [2,6].
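In Python (TileDB-Py), the two dimensions of the figure could be created as sketched below; the tile extents are arbitrary assumptions, since the figure does not specify them:

```python
import numpy as np
import tiledb

d1 = tiledb.Dim(name="d1", domain=(1, 4), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(2, 6), tile=2, dtype=np.int32)
dom = tiledb.Domain(d1, d2)  # the array domain is the hyperspace [1,4] x [2,6]
```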
The array domain (or simply domain) is the hyperspace defined by the domains of the array dimensions. In a dense array, all dimensions must have the same datatype (homogeneous dimensions) and can only be integers. In a sparse array, the dimensions may have different datatypes (heterogeneous dimensions) and can be of any datatype (even real or string).
The non-empty domain is the tightest hyper-rectangle that contains all non-empty cells. An example is shown in the figure below.
The dimension domains can have negative, real and string values. An array cell is still identified by its coordinates, which take any value from the corresponding dimension domain.
In our examples, the orientation of each dimension domain is rather arbitrary and does not affect the array definition. It is just a matter of convention. For example, the lower values may be at the top or bottom of the vertical dimension.
Not all array cells may contain values. A cell that contains values is called non-empty; otherwise, it is called empty.
A fragment is a timestamped snapshot of a portion of the array, which is produced during writes. A fragment may be dense or sparse as shown in the figure below. In a dense fragment, the non-empty cells are contained in a full hyper-rectangle in the domain. This hyper-rectangle may cover the full domain or any subdomain. In a sparse fragment, the non-empty cells may be arbitrary, i.e., not necessarily comprise a full hyper-rectangle.
An array may consist of multiple fragments. Those fragments are completely transparent to the user, who only sees the combined logical view of the array upon reading. This is produced by superimposing the more recent fragments on top of the older ones, with the more recently written cells overwriting the older ones. A dense array may consist of both dense and sparse fragments, but a sparse array may consist only of sparse fragments.
The fragment metadata is system-specific information about a fragment. Some of the information this metadata includes is:
Dense or sparse
Non-empty domain
Tile offsets
Tile sizes
R-Tree (for the sparse case)
The tile and cell order collectively determine the global cell order. The global cell order is essentially a mapping from the multi-dimensional cell space to the 1-dimensional physical storage space for the non-empty cells, i.e., it is the order in which TileDB stores the cell values on disk. The figure below shows the 4 possible global cell orders resulting from all combinations of tile/cell orders. The numbers indicate the relative positions of the non-empty cells along the global order.
Groups allow hierarchically organizing arrays and other groups.
The non-empty domain of an array is the minimum bounding hyper-rectangle that tightly encompasses all non-empty cells in the array.
A space tile is defined by specifying a tile extent along each dimension. The domain of each dimension is partitioned into segments equal to the tile extent, and hyper-rectangular tiles are formed in the multi-dimensional array space. The space tile concept applies to both dense and sparse arrays (as well as real dimensions) and is independent of the actual data stored in the array.
A subarray is an array slice. A single-range subarray is defined by a single domain range along each dimension. A multi-range subarray is defined by multiple ranges per dimension; the resulting slice is formed by the cross-product of the ranges along all dimensions. Multi-range subarrays are applicable only to reads, and to both dense and sparse arrays.
Row-major: Assuming each tile or cell can be identified by a set of coordinates in the multi-dimensional space, row-major means that the rightmost coordinate index “varies the fastest”.
Column-major: Assuming each tile or cell can be identified by a set of coordinates in the multi-dimensional space, column-major means that the leftmost coordinate index “varies the fastest”.
Once you install TileDB, visit the Usage page to see how to use TileDB in your programs.
Conda will install pre-built TileDB-Py and TileDB core binaries for Windows, macOS, or Linux. Pip currently provides binary wheels for Linux, and will build all dependencies from source on other platforms (see the TileDB-Py documentation for more information).
TileDB needs to be installed beforehand (from a package or from source) for the TileDB-R package to build and link correctly. See the installation instructions for methods of installing TileDB.
If the TileDB library is installed in a custom location, you need to pass the explicit path:
To fix this, simply export the TAR environment variable before starting R.
After running the commands above, our TileDB-Project/ConsoleApp/ConsoleApp.csproj will have the following configuration, providing access to the TileDB-CSharp API in our project.
Here is a sample go.mod file:
TileDB needs to be installed beforehand (from a package or from source) for the TileDB-Go library to build and link correctly.
Default read/write queries in TileDB are synchronous or blocking. This means that the user function that submits the query has to block and wait until TileDB is done processing it. There are scenarios in which you may want to submit the query in an asynchronous or non-blocking fashion, i.e., submit the query but tell TileDB to process it in the background, while you proceed with the execution of your function and perform other tasks in parallel. TileDB supports asynchronous queries and enables you to check the query status (e.g., whether it is still in progress). It also allows you to pass a callback upon submission, i.e., to specify a function that you wish TileDB to execute upon finishing processing the query. This applies to both dense and sparse arrays, as well as to both write and read queries.
The figure below shows the difference between synchronous and asynchronous query execution.
TileDB allocates a separate thread pool for asynchronous queries, whose size is controlled by the configuration parameter sm.num_async_threads (defaulting to 1).
Note that the above is the logical representation of the attribute values in the cells. TileDB physically stores the values in a "columnar" manner, i.e., all the values along each of the attributes are stored in separate files. See the physical storage description for more details.
TileDB performs writes in immutable fragments, which are self-contained directories inside the array directory. If the number of fragments becomes excessive or if several fragments are small in size, it is beneficial to consolidate them into fewer, larger ones. TileDB offers various ways to consolidate fragments. It also enables consolidating only the fragment metadata footers in order to boost performance on cloud object stores.
In a dense fragment, each space tile corresponds to N data tiles, where N is the number of attributes.
TileDB solves this seemingly very challenging problem in a surprisingly simple manner. It first sorts the non-empty cells along the global cell order (defined by specifying the space tiles, tile and cell order as in the dense case), then separates the cell attribute values (as in the dense case), and then creates data tiles on each attribute by grouping adjacent cells based on a user-defined parameter called the capacity.
In case the user slices some empty space from a dense array, the selected attributes are assigned special fill values (to represent "empty"). TileDB uses default fill values but also provides a way to set custom fill values during the creation of an attribute and the array schema.
TileDB offers a variety of filters, which are applied to data tiles of the attributes and dimensions when writing to disk (called filtering) or when reading from disk (called unfiltering).
See the format specification for the detailed description of the fragment metadata.
An incomplete query occurs when the result size of a subarray is larger than the allocated buffers that will hold the result. TileDB handles this case via result estimation and subarray partitioning.
Each data tile of a nullable attribute is always accompanied by a validity tile, which stores the validity of each corresponding cell. Non-zero validity values represent non-null cells, while zero values represent null cells. This is applicable to both fixed-sized and var-sized attributes.
TileDB uses R-trees for multi-dimensional indexing in sparse fragments. This allows for fast pruning of irrelevant data tiles during reading.
TileDB allows the user to specify an order for the space tiles, as well as for the cells inside each space tile. The order can be:
TileDB performs writes in immutable fragments, each of which is timestamped. It then offers a way to open the array at user-specified timestamps and, therefore, read views of the array at different points in time.
TileDB Cloud allows for the execution of user-defined functions (UDFs) written in various languages. Those are essentially arbitrary computations that are dispatched to the TileDB Cloud platform, avoiding the need to manually spin up compute instances and clusters in the cloud. TileDB Cloud allows the execution of UDFs in parallel, enabling massive scalability.
TileDB retains the original fragments after a consolidation process in order to continue to offer fine-grained time traveling. TileDB also offers a vacuuming process that deletes the fragments that took part in consolidation in order to save space.
Each data tile of a variable-sized attribute is always accompanied by an offset tile, which stores the starting byte position of every variable-sized value in the data tile. This allows TileDB to locate the i-th cell value in a data tile in constant time.
If you are using R inside conda and want to install TileDB-R, you might run into a build error, shown below:
Use the NuGet package to create a .NET console application. The TileDB NuGet package is currently compatible with .NET 5 and above.
Your project must declare a runtime identifier (RID); otherwise the native TileDB Open Source binaries will not be imported and the app might fail at runtime.
To test our project, we can edit TileDB-Project/ConsoleApp/Program.cs and obtain the version of TileDB core currently in use by TileDB.CSharp. For more examples using the C# API, see the TileDB-CSharp repository on GitHub.
The TileDB-R package is available on CRAN, which provides binaries for Windows and macOS that can be installed via install.packages("tiledb"). On Linux, this results in installation from source. For all operating systems, one can also clone the repository and create a compressed tarfile to check and install as described in the R manual, or install directly from GitHub. We also describe installing releases from GitHub below.
If the TileDB library is installed in a custom location, you need to pass the explicit path:
To build the latest development version of TileDB-R:
install_github will delete all temporary files upon failure. To debug build failures, clone this repository locally and build a compressed tarfile to check and install, or run the command devtools::install("/path/to/TileDB-R").
If you are using the TileDB Conda package, you may need to explicitly add the conda path after activating the environment with conda activate tiledb.
Instructions for setting up a RStudio development environment, building, and testing the TileDB-R package are located in the developer documentation wiki.
If you experience issues when installing devtools, see these instructions. If the problem persists, you can install devtools with conda by running:
Once you install TileDB, visit the Usage page to see how to use TileDB in your programs.
The core TileDB library can be installed easily using the Homebrew package manager for macOS. Install instructions for Homebrew are provided on the package manager’s website.
To install the latest stable version of TileDB:
HDFS and S3 backends are enabled by default. To disable one or more backends, use the corresponding --without- switch:
A full list of build options can be viewed with the info command:
Other helpful brew commands:
The Homebrew Tap is located at https://github.com/TileDB-Inc/homebrew.
TileDB is available as a pre-built Docker image. For the latest version, run:
For a specific TileDB version, run:
A package for TileDB is available for the Conda package manager. Conda makes it easy to install software into separate, distinct environments on Windows, Linux, and macOS.
If you are compiling or linking against the TileDB conda package, you may need to explicitly add the conda path after activating the environment with conda activate tiledb, since conda activate sets the CONDA_PREFIX environment variable:
Instead of exporting those environment variables, you can pass them as command line flags during compilation:
You can download pre-built Windows binaries in the .zip file from the latest TileDB release. You can then simply configure your project (if you are using Visual Studio) according to the Windows usage instructions.
TileDB binaries for each release are available on GitHub Releases for the following operating system and architecture combination:
Windows on x64
macOS on x64
macOS on Apple Silicon (arm64)
glibc-based Linux on x64 (with and without AVX2 support)
These binaries are also available on NuGet, for use by the C# API.
Begin by downloading a release tarball, or by cloning the TileDB GitHub repo and checking out a release tag (where <version> is the version you wish to use, e.g., 1.7.4):
To configure TileDB, use the bootstrap script:
The flags for the bootstrap script and the CMake equivalents are as follows:
| Flag | Description | CMake Equivalent |
| --- | --- | --- |
| --help | Prints command line flag options | N/A |
| --prefix=PREFIX | Install files in tree rooted at PREFIX (defaults to TileDB/dist) | CMAKE_INSTALL_PREFIX=<PREFIX> |
| --dependency=DIRs | Colon-separated list of paths to binary dependencies | CMAKE_PREFIX_PATH=<DIRs> |
| --enable-debug | Enable debug build | CMAKE_BUILD_TYPE=Debug |
| --enable-coverage | Enable build with code coverage support | CMAKE_BUILD_TYPE=Coverage |
| --enable-verbose | Enable verbose status messages | TILEDB_VERBOSE=ON |
| --enable-hdfs | Enables building with HDFS storage backend support | TILEDB_HDFS=ON |
| --enable-s3 | Enables building with S3 storage backend support | TILEDB_S3=ON |
| --enable-azure | Enables building with Azure Blob Storage backend support | TILEDB_AZURE=ON |
| --enable-gcs | Enables building with Google Cloud Storage backend support | TILEDB_GCS=ON |
| --enable-serialization | Enables building with serialization and TileDB Cloud support | TILEDB_SERIALIZATION=ON |
| --enable-static-tiledb | Enables building TileDB as a static library | TILEDB_STATIC=ON |
| --disable-werror | Disables building with the -Werror flag | TILEDB_WERROR=OFF |
| --disable-cpp-api | Disables building the TileDB C++ API | TILEDB_CPP_API=OFF |
| --disable-stats | Disables internal TileDB statistics | TILEDB_STATS=OFF |
| --disable-tests | Disables building the TileDB test suite | TILEDB_TESTS=OFF |
To build after configuration, run the generated make script
To install to the configured prefix
Note that building against the installed shared library requires setting the library search path at build or run time, as documented in Usage. (System-wide installations requiring sudo permissions may avoid this step by running sudo ldconfig after installation.)
Other helpful makefile targets:
Building TileDB on Windows has been tested to work with Microsoft Visual Studio 2019 and later. You can install the free Community Edition if you’d like the full IDE, or the Build Tools if you don’t need or want the IDE installed.
During the Visual Studio setup process, make sure the Git for Windows component is selected if you do not already have a working Git installation. Also be sure to select the CMake component if you do not have a working CMake installation.
In addition, you will need to install PowerShell (free).
To build and install TileDB, first open PowerShell, clone the TileDB repository, and check out a release tag (where <version> is the version you wish to use, e.g., 1.7.4):
Next, ensure the CMake binaries are in your path. If you installed Visual Studio, execute
Create a build directory and configure TileDB
The flags for the bootstrap script and the CMake equivalents are as follows:
| Flag | Description | CMake Equivalent |
| --- | --- | --- |
| -? | Display a usage message | n/a |
| -Prefix | Install files in tree rooted at PREFIX (defaults to TileDB\dist) | CMAKE_INSTALL_PREFIX=<PREFIX> |
| -Dependency | Semicolon-separated list of paths to binary dependencies | CMAKE_PREFIX_PATH=<DIRs> |
| -CMakeGenerator | Optionally specify the CMake generator string, e.g. "Visual Studio 15 2017". Check 'cmake --help' for a list of supported generators. | -G <generator> |
| -EnableDebug | Enable debug build | CMAKE_BUILD_TYPE=Debug |
| -EnableVerbose | Enable verbose status messages | TILEDB_VERBOSE=ON |
| -EnableS3 | Enables building with the S3 storage backend | TILEDB_S3=ON |
| -EnableGcs | Enables building with the Google Cloud Storage backend | TILEDB_GCS=ON |
| -EnableSerialization | Enables serialization and TileDB Cloud support | TILEDB_SERIALIZATION=ON |
| -EnableStaticTileDB | Enables building TileDB as a static library | TILEDB_STATIC=ON |
| -DisableWerror | Disables building with the /WX flag | TILEDB_WERROR=OFF |
| -DisableCppApi | Disables building the TileDB C++ API | TILEDB_CPP_API=OFF |
| -DisableTBB | Disables use of TBB for parallelization | TILEDB_TBB=OFF |
| -DisableStats | Disables internal TileDB statistics | TILEDB_STATS=OFF |
| -DisableTests | Disables building the TileDB test suite | TILEDB_TESTS=OFF |
To build after configuration
To install
Other helpful build targets:
If you build libtiledb in Release mode (resp. Debug), make sure to build check and examples in Release mode as well (resp. Debug); otherwise the test and example executables will not run properly.
Should you experience any problem with the build, it is always a good idea to delete the build and dist directories in your TileDB repo path and restart the process, as cmake's cached state could present some unexpected problems.
Cygwin is a Unix-like environment and command line interface for Microsoft Windows that provides a large collection of GNU/open-source tools (including the gcc toolchain) and supporting libraries offering substantial POSIX API functionality. TileDB can be compiled from source in the Cygwin environment if Intel TBB is disabled and some TileDB dependencies are installed as Cygwin packages.
The following Cygwin packages need to be installed:
gcc / g++
git
cmake
make
lz4-devel
zlib-devel
libzstd-devel (+src)
bzip2 (+src)
openssl-devel
You can then clone and build TileDB using git / cmake / make:
To build the JNI extension you need to install:
CMake (>= 3.3)
JDK (>=1.8)
To build the library with the native library bundled in, run:
This will create the TileDB JNI library build/tiledb_jni/libtiledbjni.dylib. This will also download and build the TileDB core library if it is not found installed in a global system path, and place it in build/externals/install/lib/libtiledb.dylib.
If you wish to build with a custom version of the TileDB core library, you can define the TILEDB_HOME environment variable, e.g.:
env TILEDB_HOME=/path/to/TileDB/dist ./gradlew assemble
Note that if you build with a custom native TileDB library, it will only be bundled into the jar if the native static library was produced.
If TileDB is not globally installed in the system where the JNI library is being compiled, the TileDB core library will be compiled. There are multiple properties that can be configured, including S3 and HDFS support.
See gradle.properties for all properties which can be set for building.
The properties can be set via the -P option to gradlew:
To run the tests use:
TileDB has been tested on Ubuntu Linux (v.20.04+), CentOS Linux (v.7+, with updated devtoolset compiler), macOS (v.11) and Windows (7+), but TileDB should work with any reasonably recent version of Ubuntu, CentOS, macOS or Windows with an installed compiler supporting C++20 (minimum tested version: GCC 10).
Once you build TileDB, visit the Usage page to see how to use TileDB in your programs.
TileDB requires a recent version of the CMake build system (which vcpkg will update if needed) and a compiler supporting C++20. For compression, TileDB relies on the following libraries:
When building from source, TileDB will locate these dependencies if already installed on your system, and locally install (not system-wide) any of them that are missing.
Backend support for S3 stores requires the AWS C++ SDK. Similarly to the required dependencies, the TileDB build system will install the SDK locally if it is not already present on your system (when the S3 build option is enabled).
TileDB also integrates well with the S3-compliant minio object store.
Backend support for the Hadoop File System HDFS is optional. TileDB relies on the C interface to HDFS provided by libhdfs to interact with the distributed filesystem.
During the build process the following environment variables must be set:
JAVA_HOME: Path to the location of the Java installation.
HADOOP_HOME: Path to the location of the HDFS installation.
CLASSPATH: The Hadoop jars must be added to the CLASSPATH before interacting with libhdfs.
Consult the HDFS user guide for installing, setting up, and using the distributed Hadoop file system.
HDFS is not currently supported on Windows.
If any dependencies are not found pre-installed on your system, the TileDB build process will download and build them automatically. Preferentially, any dependencies built by this process will be built as static libraries, which are statically linked against the TileDB shared library during the build. This simplifies usage of TileDB, as it results in a single binary object, e.g., libtiledb.so, that contains all of the dependencies. When installing TileDB, only the TileDB include files and the dynamic object libtiledb.so will be copied into the installation prefix.
If TileDB is itself built as a static library (using the TILEDB_STATIC=ON CMake variable or corresponding bootstrap flag), the dependency static libraries must be installed alongside the resulting static libtiledb.a object. This is because static libraries cannot be statically linked together into a single object (at least not in a portable way). Therefore, when installing TileDB, all static dependency libraries will be copied into the installation prefix alongside libtiledb.a.
Build dependencies:
NumPy
Cython
pybind11
scikit-build-core
C++20 compiler
CMake
Runtime Dependencies
NumPy
Simply execute the following commands:
If you wish to modify the install process, you can use these environment variables:
TILEDB_PATH: Path to the TileDB core library. If this variable is set and the library is found in the specified folder, it is not copied inside the wheel.
TILEDB_VERSION: Version of the TileDB core library that you wish to download. This version must be present in the GitHub releases.
TILEDB_HASH: SHA256 sum of the desired TileDB core library release. Only used when TILEDB_VERSION is set.
To build against libtiledb installed with conda, run:
To test your local installation, install the optional dependencies, and then use pytest:
If TileDB is installed in a non-standard location, you also need to make the dynamic linker aware of libtiledb's location. Otherwise, when importing the tiledb module you will get an error that the built extension module cannot find libtiledb's symbols:
For macOS the linker environment variable is DYLD_LIBRARY_PATH.
If you are building the extension on Windows, first install a Python distribution such as Miniconda. You can then either build TileDB from source, or download the pre-built binaries.
Once you've installed Miniconda and TileDB, execute:
Note that if you built TileDB locally from source, then replace set TILEDB_PATH=C:/path/to/TileDB with TILEDB_PATH=C:/path/to/TileDB/dist.
The repository that contains the Docker files for TileDB:
Note that this contains only the TileDB core (C/C++) library and the Python bindings. The reason we exclude all the other bindings (e.g., Java) is to keep the Docker image size relatively small.
Download the prebuilt Docker images from Dockerhub:
Install the Docker daemon from the Docker website.
Clone the TileDB-Docker repo and build the images:
There is also a tiledb:dev image if you'd like the latest and greatest (but potentially unstable) TileDB version.
To run:
If you'd like to build TileDB with optional components such as HDFS support, use the enable build argument when building the images, e.g.:
This package requires the TileDB shared library to be installed and on the system path.
Currently the following platforms are supported:
Linux
macOS
To install the Go bindings:
To install package test dependencies:
Package tests can be run with:
TileDB-Go follows semantic versioning. TileDB-Go version 0.X.Y is compatible with TileDB core version 1.X.Y.
The following TileDB core library features are missing from the Go API:
TileDB object management
To use the TileDB C API in a program, just add #include <tiledb/tiledb.h> and specify -ltiledb when compiling, e.g.:
To use the C++ API, add #include <tiledb/tiledb> to your C++ project instead. The TileDB C++ API requires a compiler with C++17 support, so your project must be compiled using the C++17 standard, e.g.:
If TileDB was installed in a non-default location on your system, use the -I and -L options:
At runtime, if TileDB is installed in a non-default location, you must make the linker aware of where the shared library resides by exporting an environment variable:
You can avoid the use of these environment variables by installing TileDB in a global (standard) location on your system, or by hard-coding the path to the TileDB library at build time by configuring the rpath, e.g.:
Building your program this way will result in a binary that will run without having to configure the LD_LIBRARY_PATH or DYLD_LIBRARY_PATH environment variables.
Alternatively, when installing to system-wide paths known to ldconfig (typically in /etc/ld.so.conf.d/* or /etc/ld.so.conf), run sudo ldconfig after installation to update the search cache.
To use TileDB from a Visual Studio C++ project, you need to add project properties telling the compiler and linker where to find the headers and libraries.
Open your project's Property Pages. Under the General options for C/C++, edit the "Additional Include Directories" property. Add a new entry pointing to your TileDB installation (either built from source or extracted from the binary release .zip file), e.g., C:\path\to\TileDB\dist\include.
Under the General options for the Linker, edit the "Additional Library Directories" property. Add a new entry pointing to your TileDB installation, e.g., C:\path\to\TileDB\dist\lib. Under the Input options for the Linker, edit "Additional Dependencies" and add tiledb.lib.
You should now be able to add #include <tiledb/tiledb.h> (C API) or #include <tiledb/tiledb> (C++ API) in your project.
When building your project in Visual Studio, ensure that the x64 build configuration is selected. Because TileDB is currently only available as a 64-bit library, applications that link with TileDB must also be 64-bit.
At runtime, the directory containing the DLLs must be in your PATH environment variable, or you will see error messages at startup that the TileDB library or its dependencies could not be located. You can do this in Visual Studio by adding PATH=C:\path\to\TileDB\dist\bin to the "Environment" setting under "Debugging" in the Property Pages. You can also do this from the Windows Control Panel, or at the command prompt like so:
Should you experience any problem with the usage (e.g., getting errors about missing .dll files when running a program), it is always a good idea to delete the build and dist directories in your TileDB repo path and restart the build from scratch, as cmake's cached state could present some unexpected problems.
TileDB includes support for CMake's find_package(). To use it, TileDB must be installed globally or CMAKE_PREFIX_PATH must be set to the TileDB installation directory.
For example, if TileDB was built with ../bootstrap and no prefix was given, then the </path/to/TileDB>/dist/lib/cmake/TileDB directory will contain the TileDBConfig.cmake file used for find_package(TileDB). In your project, you would set CMAKE_PREFIX_PATH like so:
You can also pass this like any other CMake variable on the command line when configuring your project, e.g.:
To link the executable MyExe in your project with the TileDB shared library, you would then use:
While disabled by default, TileDB can also be built as a static library. To do this, use the --enable-static-tiledb (macOS/Linux) or -EnableStaticTileDB (Windows) bootstrap flag when configuring TileDB, or use the CMake equivalent flag -DTILEDB_STATIC=ON. Then in your project simply link against the tiledb_static target instead:
Build dependencies:
.NET 7 SDK
.NET 7 is needed only to build from source; the compiled binaries support at minimum .NET 5.
Building TileDB-CSharp from source can be done using the following commands in a terminal.
As a final build step, we can verify our installation by running the unit tests:
To help get started using TileDB-CSharp, we can run the provided example project:
After the TileDB.CSharp project is built and the tests are passing, we can make a new .NET project and add a reference to TileDB.CSharp, granting us access to the TileDB-CSharp API. This results in TileDB-Project/ConsoleApp/ConsoleApp.csproj generating the following configuration:
After installing TileDB, from an R shell:
You can look at the examples in the TileDB source repository for an example project structure that links against TileDB.
The TileDB.CSharp project depends on prebuilt native TileDB binaries. During development you can provide your own native library for purposes like testing. To do that, go to the Directory.Packages.props file of your repository and set the LocalLibraryFile property to the path of your local native binary. This will bypass the standard acquisition mechanism and simply copy the library to your project's output directory.
The shipped TileDB.CSharp NuGet package supports only the official native binaries at the moment. Please reach out if you want to use TileDB from C# with custom native binaries.
After building TileDB-Java, you can run the examples located in path/to/TileDB-Java/src/main/java/examples using your IDE or from a terminal.
To run an example from the terminal, use:
You may need to explicitly define the Java library path if not using the bundled jar:
After creating some dimensions, you can create the array domain as follows:
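A minimal Python (TileDB-Py) sketch of assembling a domain from dimensions; the dimension names, domains, and tile extents are hypothetical:

```python
import numpy as np
import tiledb

d1 = tiledb.Dim(name="d1", domain=(1, 100), tile=10, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(1, 100), tile=10, dtype=np.int32)

# The dimension order passed here is the order used later when slicing subarrays.
dom = tiledb.Domain(d1, d2)
```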
The order of the dimensions as added to the domain is important later when slicing subarrays. Remember to give priority to more selective dimensions, in order to maximize the pruning power during slicing.
When creating the domain, the dimension names must be unique.
After installing the Go bindings, download any of the TileDB-Go examples, and then run with:
After installing TileDB and the Python bindings, from a Python shell:
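As a quick sanity check (a hedged sketch; the original snippet may differ), you can print the installed versions:

```python
import tiledb

print(tiledb.__version__)          # TileDB-Py version
print(tiledb.libtiledb.version())  # TileDB core library version
```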
In order to create an encrypted array, you simply need to pass your secret key upon the array creation:
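A hedged Python (TileDB-Py) sketch that passes the key through the configuration used at creation time; the URI, schema, and key are placeholders:

```python
import numpy as np
import tiledb

key = "0123456789abcdeF0123456789abcdeF"  # example 32-byte (256-bit) key; do not hard-code real keys
ctx = tiledb.Ctx(tiledb.Config({
    "sm.encryption_type": "AES_256_GCM",
    "sm.encryption_key": key,
}))

dom = tiledb.Domain(tiledb.Dim(name="d", domain=(1, 4), tile=2, dtype=np.int32))
schema = tiledb.ArraySchema(domain=dom, attrs=[tiledb.Attr(name="a", dtype=np.int32)])

tiledb.Array.create("encrypted_array", schema, ctx=ctx)  # hypothetical URI
```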
Creating an attribute requires specifying a datatype and, optionally, an attribute name (which must be unique; attribute names starting with __ are reserved). In the example below we create an int32 attribute called attr.
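A minimal sketch in Python (TileDB-Py):

```python
import numpy as np
import tiledb

attr = tiledb.Attr(name="attr", dtype=np.int32)  # int32 attribute named "attr"
```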
An attribute can also store a fixed number of values (of the same datatype) in a single cell, or a variable number of values. You can specify this as follows:
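A hedged Python (TileDB-Py) sketch; the attribute names are hypothetical, and the compound NumPy dtype used for the fixed multi-value case is an assumption that may vary across versions:

```python
import numpy as np
import tiledb

# Fixed number of values per cell (here: 2 float32 values), assumed to be expressed via a compound dtype.
a_fixed = tiledb.Attr(name="a_fixed", dtype=np.dtype("float32, float32"))

# Variable number of values per cell (var-sized string attribute).
a_var = tiledb.Attr(name="a_var", dtype=str)
```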
An attribute may also be nullable, which allows designating each cell as valid or null. This applies to both fixed-sized and var-sized attributes.
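A minimal Python (TileDB-Py) sketch of a nullable attribute; the name is hypothetical:

```python
import numpy as np
import tiledb

a_null = tiledb.Attr(name="a_null", dtype=np.float64, nullable=True)  # cells may be marked as null
```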
Note: nullable Python attributes should be used with the from_pandas API or Pandas series with a Pandas extension dtype (e.g., StringDtype).
Supported Attribute Datatypes:
Crossed-out datatypes are deprecated.
For fixed-sized attributes, the input fill value size should be equal to the cell size.
After creating the dimensions, the domain, and the attributes, you can create the array schema as follows:
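A minimal Python (TileDB-Py) sketch; the dimension, attribute, and array names are hypothetical:

```python
import numpy as np
import tiledb

dom = tiledb.Domain(
    tiledb.Dim(name="d1", domain=(1, 100), tile=10, dtype=np.int32),
    tiledb.Dim(name="d2", domain=(1, 100), tile=10, dtype=np.int32),
)
attrs = [tiledb.Attr(name="a", dtype=np.int32)]

# sparse=False creates a dense array schema; use sparse=True for a sparse array.
schema = tiledb.ArraySchema(domain=dom, attrs=attrs, sparse=False)
tiledb.Array.create("my_array", schema)  # hypothetical array URI
```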
When creating the array schema, the dimension and attribute names must be unique.
You can set the data tile capacity (applicable to sparse fragments), as follows:
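A hedged Python (TileDB-Py) sketch, with hypothetical names and an example capacity value:

```python
import numpy as np
import tiledb

dom = tiledb.Domain(tiledb.Dim(name="d", domain=(1, 1000), tile=100, dtype=np.int64))
schema = tiledb.ArraySchema(
    domain=dom,
    attrs=[tiledb.Attr(name="a", dtype=np.int32)],
    sparse=True,
    capacity=10000,  # each sparse data tile groups up to 10,000 non-empty cells
)
```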
Sparse arrays may allow multiple cells with the same coordinates to exist (dense arrays do not allow duplicates). By default, duplicates are not allowed. You can specify that a sparse array allows duplicates as follows:
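A hedged Python (TileDB-Py) sketch, with hypothetical names:

```python
import numpy as np
import tiledb

dom = tiledb.Domain(tiledb.Dim(name="d", domain=(1, 1000), tile=100, dtype=np.int64))
schema = tiledb.ArraySchema(
    domain=dom,
    attrs=[tiledb.Attr(name="a", dtype=np.int32)],
    sparse=True,
    allows_duplicates=True,  # permit multiple cells with the same coordinates
)
```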
When duplicates are allowed, checking for duplicates and deduplication are disabled.
You can check if the array schema is set properly as follows:
Attributes accept filters such as compressors. This is described in detail in the filters documentation.
There are situations where you may read "empty spaces" from TileDB arrays. For those empty spaces, a read query will return special fill values for the selected attributes. You can set your own fill values for these cases as follows:
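A minimal Python (TileDB-Py) sketch setting a custom fill value on an attribute; the name and value are hypothetical:

```python
import numpy as np
import tiledb

# For fixed-sized attributes, the fill value size must equal the cell size.
a = tiledb.Attr(name="a", dtype=np.int32, fill=-1)
```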
A call that sets the number of values per cell for an attribute (see above) resets the fill value of the attribute to its default. Therefore, make sure you set the fill values after deciding on the number of values this attribute will hold in each cell.
You can set the tile and cell order as follows. The tile order may be set to row-major or column-major; the cell order may be set to row-major, column-major, or Hilbert (the latter for sparse arrays only).
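A minimal Python (TileDB-Py) sketch, with hypothetical names:

```python
import numpy as np
import tiledb

dom = tiledb.Domain(
    tiledb.Dim(name="d1", domain=(1, 100), tile=10, dtype=np.int32),
    tiledb.Dim(name="d2", domain=(1, 100), tile=10, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom,
    attrs=[tiledb.Attr(name="a", dtype=np.int32)],
    tile_order="row-major",
    cell_order="col-major",
)
```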