The Problem

The need to manage data has existed for decades and there are thousands of different data management solutions available today. So what is the problem that TileDB is tackling and how could we innovate in such a crowded space?

To answer this question, we first outline a few observations on the evolution of data management systems and the factors that drove the various architectural changes over the years. We then stress that as old architectures adapted to new user requirements, they created new challenges. We argue that there is a need to rearchitect data management from the ground up, in a universal way that is flexible and extendible to future user needs and technological advancements. This is the motivation behind building a universal storage engine (TileDB Embedded) and a universal data engine (TileDB Cloud). We explain these novel notions in detail, along with their compelling benefits for any individual or organization that deals with storing, processing and sharing large quantities of data.

Key takeaways:

  • Current data management systems focus on a single data type, mostly tables

  • Users need to analyze diverse data with a variety of tools and languages, beyond SQL

  • There is a growing need for planet-scale sharing and serverless compute

  • All the above can be addressed by a novel concept, the universal data engine

  • TileDB is the first and only universal data engine

What is wrong with data management today?

Monolithic databases

When we ask people what comes to their mind when they hear the term "database", they consistently agree on two things: tables and SQL. This is probably because the first databases were relational, dealing with tabular data, while SQL became the prevalent data language due to its great expressiveness. Databases have evolved a lot, and today we have numerous different terms like "document databases", "graph databases", "NoSQL databases", "data warehouses", "time series databases", "in-memory databases", "key-value databases" and many more. In general, a "database" is quite a sophisticated piece of software that manages data, from storage to computation to access control to logging and more.

Database systems were originally architected as monolithic, i.e., as a set of software layers that are not exposed to the third-party user. For example, a typical database system consists of a SQL parser, a transactional manager, an authenticator, a query planner, a query optimizer, an execution engine, a storage engine, and potentially many other layers. The user has no control over how the data is stored or processed, other than defining table schemas and potentially declaring indexes to boost performance.

Here are some problems with monolithic databases:

  • A single data type. Database systems revolve around a specific data type, e.g., only tables or only documents or only graphs. That makes a database difficult to use in applications where the data cannot or should not be modeled as one of those data types (e.g., imaging or video), or in applications that deal with more than one data type.

  • Data access from other tools. Although SQL can be powerful, there are applications that require advanced computations not supported by a SQL engine (e.g., linear algebra, statistical models, machine learning). In those cases, the user needs to slice data from the database by issuing a SQL query via an ODBC or JDBC connector and convert the returned results to a format that another tool can consume (a sketch of this pattern follows this list). In addition to the extra copy and conversion cost, the ODBC/JDBC connector itself adds significant overhead and may lead to serious scalability issues when there are thousands of workers concurrently slicing data from the database (as is typical in certain scientific and analytics workloads).

  • Cost. Many monolithic databases charge based on the amount of data stored in the database. This can become extremely expensive for applications with huge volumes of data (e.g., weather, satellite imaging, genomics), where storage needs substantially surpass computational needs. In other words, users end up paying more for storing the data than for managing and computing on it.

  • Cloud object storage. Most databases are not architected to work well on cloud object stores. Such storage solutions can significantly reduce costs, but they also introduce new challenges (such as eventual consistency) that databases need to tackle by adapting their persistent data formats and storage engines.

  • Administration. Even if the database is open-source and free, organizations still need to hire full-time teams to administer the database, as well as provision storage and compute on-premises or in the cloud.

  • Sharing. Databases implement advanced access control mechanisms. However, they typically target organization-level authentication and are constrained by the scale of the database cluster. That prevents organizations and individuals from sharing data with collaborators at planet scale.
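To make the second point above concrete, here is a minimal sketch of the ODBC/JDBC-style hop: data is sliced out of the database with SQL, materialized in the client, and then converted yet again for a tool the database cannot run itself. The connection string, table and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# 1. Connect through a generic SQL connector (hypothetical credentials/host).
engine = create_engine("postgresql://user:password@db-host:5432/analytics")

# 2. Slice the data with SQL; the full result set is serialized over the wire
#    and materialized as a DataFrame (first copy).
df = pd.read_sql(
    "SELECT feature_1, feature_2, label FROM samples WHERE label IS NOT NULL",
    engine,
)

# 3. Convert to a dense matrix for a linear-algebra / ML library (second copy).
X = df[["feature_1", "feature_2"]].to_numpy(dtype=np.float64)
y = df["label"].to_numpy()

# Thousands of workers repeating steps 1-3 multiply both the copies and the
# load on the database's connector layer.
```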

Compute engines with pluggable storage

Due to some of the above problems, existing databases evolved and new solutions were introduced. One of the most important shifts was the separation of storage from compute. Some database systems introduced pluggable storage, i.e., mechanisms that allow users to add external data sources to the database system. That further enables users to store their data on cheap cloud object stores (e.g., AWS S3) and run SQL queries only when they need to, thus significantly reducing their total cost of ownership (TCO). Furthermore, if the data is stored in an open-source data format (e.g., Parquet), then users can efficiently access the data from many other tools, bypassing the database. In such cases, databases act more like compute engines on pluggable storage accessible by any tool.
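As a minimal sketch of what this looks like from the consumer side, the same Parquet data that a SQL engine queries can be read directly from object storage by any tool that speaks the format (the bucket, path and column names below are made up, and reading from S3 with pandas assumes pyarrow and s3fs are installed):

```python
import pandas as pd

# Read straight from cheap object storage, with no database in the path,
# pruning to just the columns of interest at read time.
df = pd.read_parquet(
    "s3://example-bucket/warehouse/trips.parquet",   # hypothetical location
    columns=["pickup_time", "fare"],
)
```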

In addition, distributed computational frameworks (like Spark and Dask) have recently gained popularity. Those systems have pluggable storage by default (as they were designed for generic compute on any data), and allow you to easily run user-defined functions (written in Java, Scala, Python and R for Spark, and Python for Dask) on any data in a scalable way. Spark took this several steps further, offering an advanced SQL engine and optimizer with performance competitive with other distributed databases, while efforts on SQL execution are underway in the Dask world as well.
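Here is a similarly hedged sketch of the same pattern with Dask: pluggable storage (Parquet on S3) plus an arbitrary Python UDF applied in parallel across partitions. The paths and column names are again illustrative.

```python
import dask.dataframe as dd

# Lazily read a partitioned Parquet dataset from object storage.
ddf = dd.read_parquet("s3://example-bucket/warehouse/trips/")

def add_tip_ratio(pdf):
    # An arbitrary Python UDF applied to each partition in parallel.
    pdf = pdf.copy()
    pdf["tip_ratio"] = pdf["tip"] / pdf["fare"]
    return pdf

result = ddf.map_partitions(add_tip_ratio).compute()
```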

The shift to compute engines with pluggable storage was meaningful, but it introduced new problems:

  • Relinquished data control. Data used to be a first-class citizen in monolithic databases, but it has become a second-class citizen in compute engines with pluggable storage. A database can still offer advanced data management when the data is accessed via the database, but once the data is accessed directly through a language API or another tool that bypasses the database, there is no layer for access control, logging, etc. This logic is either lost or pushed to the application layer, which leads to re-inventing the wheel and wasting organizational resources.

  • Coming full circle. Users have identified the above problem and started adopting solutions like Delta Lake, adding an extra management layer on top of pluggable storage. However, Delta Lake (and similar solutions) apply only to tabular data (e.g., Parquet) and certain systems like Spark and Presto. Does this ring a bell? We are essentially returning to a monolithic approach working on a single data type and only with specific computational engines, practically inheriting all the problems described in the previous section.

A sea of files and data formats

In the previous section we explained that databases introduced pluggable storage and applications started keeping data in open-source formats like Parquet, which can be stored on cheap cloud storage and accessed by any tool that understands the file format. This trend created a new problem, particularly exacerbated on cloud object stores where all objects are immutable (and, thus, each update typically creates a new object): applications started generating an excessive number of files.

The scientific world has been handling a sea of files for a very long time. And while database applications seem to be converging on Parquet as the de facto tabular data format, each scientific domain introduces a different data format (or numerous formats). For instance, genomic variants are stored in VCF, LiDAR points in LAZ, satellite imaging in COG (among others), weather data in NetCDF4 (among others), and the list is very long.

The excessive number of files and numerous different data formats created several problems:

  • Lack of management. Organizations either delegate the management of those files to the cloud object store, which is limited and cumbersome (e.g., defining access roles, trying to make sense of file-based logs), or spend enormous effort building custom solutions (often involving a relational database) on top of those files for access control, data versioning, etc.

  • Missing out on interoperability. Each data format typically comes with a library that understands that format. Most of those libraries offer a limited set of language APIs and little efficient integration with databases and data science tools. As a result, scientific applications are often confined to custom domain-specific tooling, missing out on the growing tooling ecosystem, or practitioners spend enormous effort wrangling the data from each file format into one that higher-level tooling can process.

  • Missing out on tech advancements. Many legacy file formats and associated libraries are old and may not have been architected to work with modern storage (e.g., cloud object stores, new compressors) and compute (e.g., GPU, FPGA) technologies. Therefore, applications that continue to use them become less performant or more expensive than they could be as technology advances. Refactoring every single library to keep up with progress is a very tall order.

  • Re-invention of the storage engine. Even if every library behind each legacy format gets upgraded to support cloud storage, new compressors, new hardware, etc., essentially an excessive number of human hours is spent on re-inventing common IO or processing components that every storage engine must have. This is what motivated us to build a universal storage engine, explained in detail in the next section.

The need for a universal data engine

The challenges we discussed above call for a bold, groundbreaking and holistic approach. We are the first to make the observation that data management is actually not very different across application domains, despite how different the data and terminology may look. Therefore, we decided to invent one data model, one data format and one storage engine to provide a foundational solution for data management across all verticals. On top of those, we built extra management layers for access control and logging, in order to support data sharing and collaboration at planet scale. Our efforts led to the creation of a powerful serverless platform for designing and executing any distributed algorithm, which offers ease of use, performance, and low cost. We explain these contributions below.

A universal data format and storage engine

Regardless of the application domain and data type, the data storage and access requirements are common:

  • Data compression and, thus, data chunking for efficient selective data retrieval

  • Support for multiple storage backends and extendibility to new ones

  • Parallel IO

  • Parallelism for (de)compression and other data filtering

  • Minimization of IO requests, which further requires:

    • A lightweight protocol between the client and the storage backend

    • Collocation of data that are frequently fetched together in the files

  • An embeddable storage library for easy use by higher level applications

  • Data versioning and time traveling

  • Atomicity, concurrency and consistency for read and write operations

  • A growing set of efficient language APIs for flexible data access

  • A growing set of integrations with SQL engines and other data science tools

It does not really matter if the data is an image, or a cohort of genomic variants, or 3D LiDAR points, or dataframes, or flat binary objects. Anyone who attempts to build a performant storage engine, and wishes it to be usable from diverse programming languages and tools, will have to address the above bullet points.

Another important observation is that there indeed exists a single data structure that can model any data type: (dense or sparse) multi-dimensional arrays. Arrays are more flexible and versatile than dataframes and can help in designing and implementing highly performant storage engines. Multi-dimensional (ND) slicing is arguably the most frequently used operator in any computational workload, e.g., for fast initial data filtering or for implementing out-of-core (to scale beyond RAM) or distributed (to scale beyond a single machine) algorithms.
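As a toy illustration of this modeling claim (made-up shapes and column names), an RGB image is already a dense 3D array and a region of interest is just an ND slice, while a dataframe can be viewed as a sparse 2D array whose dimensions are the columns you most often slice on:

```python
import numpy as np
import pandas as pd

# Dense case: an image is a (height x width x channel) array and a
# region-of-interest read is an ND slice.
image = np.random.randint(0, 256, size=(1080, 1920, 3), dtype=np.uint8)
roi = image[100:200, 500:700, :]   # spatial slice, all channels

# Sparse case: a dataframe indexed on (timestamp, sensor_id) behaves like a
# sparse 2D array; a time-range query is a slice on the first dimension.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-05"]),
    "sensor_id": [17, 42, 17],
    "reading": [0.3, 1.7, 2.2],
})
cells = df.set_index(["timestamp", "sensor_id"]).sort_index()
window = cells.loc[pd.Timestamp("2023-01-01"):pd.Timestamp("2023-01-03")]
```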

Here is how we innovate:

  • We introduce a novel universal data model that can represent any data as dense or sparse multi-dimensional arrays

  • We design a novel open-spec data format around arrays that enables building a powerful universal storage engine.

  • We implement the first universal storage engine around the array format, called TileDB Embedded. It is built in C++ and open-sourced under the MIT License, and it comes with numerous efficient language APIs (a minimal sketch of the Python API follows this list). It supports several storage backends and is extendible to future ones.

  • We integrate TileDB Embedded into a growing set of SQL engines and data science tools, all open-source.
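Below is a minimal sketch of what working with TileDB Embedded looks like through its Python API, assuming a toy 2D dense array (the array name, domain and attribute are illustrative): define a schema with tile extents (the unit of chunking and compression), write a block, and read back only a slice.

```python
import numpy as np
import tiledb

uri = "my_dense_array"   # a local path here; it could equally be an s3:// URI

# Define a 2D dense array: two dimensions with tile extents and one attribute.
dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(0, 999), tile=100, dtype=np.int32),
    tiledb.Dim(name="cols", domain=(0, 999), tile=100, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],
)
tiledb.Array.create(uri, schema)

# Write the full array.
with tiledb.open(uri, mode="w") as A:
    A[:] = np.random.rand(1000, 1000)

# Read back only the region we care about; TileDB fetches just the tiles
# that intersect the requested slice.
with tiledb.open(uri, mode="r") as A:
    block = A[100:200, 400:500]["value"]
```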

The benefits:

  • You can store any data type in a single format processed by a single powerful engine.

  • You can store your data on any backend.

  • You can efficiently access your data via numerous languages and processing tools.

  • You can experience extreme parallelism and performance.

  • You can enjoy fast updates, data versioning and time traveling at the storage level.

Planet-scale sharing

Through our interactions with users and customers across a variety of verticals, it is evident that people are struggling to securely share data and code with others, even beyond their organizations. This can be for collaboration and scientific reproducibility, or because companies wish to monetize their proprietary data and work with third parties. Either way, data sharing today is starting to break organizational boundaries and become planet-scale: any individual or organization seeks the ability to easily share data and code with literally anyone else in the world, while enforcing access policies and logging every single action.

But how would you typically share data today?

  • Store data in a database. Databases offer advanced access control and logging features, but (1) you inherit all the problems described in the previous sections and (2) you are limited to the organization that runs the database and to the cluster resources you have allocated.

  • Store files on a cloud object store. You can always store your data on a cloud object store and use its features to manage data access. However, an object store is designed to store files and, therefore, it offers file-level semantics for access control and logging. If your application requires the creation of numerous files and fine-grained access policies (at the byte-range level), then sharing selected data with others becomes extremely cumbersome. In addition, cloud object stores were not designed to support millions of user policies per object and, therefore, cannot be considered planet-scale solutions.

The fact that we built a powerful storage engine on a universal data format allows us to address the above problems and provide planet-scale sharing in a holistic way by building a unified platform on top of TileDB Embedded.

Here is how we innovate:

  • We built a novel platform called TileDB Cloud for planet-scale sharing.

  • TileDB Cloud is totally serverless and has a pay-as-you-go model.

  • TileDB Cloud uses TileDB Embedded for data storage and access (a sketch of accessing a shared array follows this list).

  • TileDB Cloud operates solely using the open-spec array format.
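To give a feel for what access looks like, here is a hedged sketch based on the TileDB Python API: arrays registered on TileDB Cloud are addressed with tiledb:// URIs and the client authenticates with an API token via the REST configuration parameters. The namespace, array name and token below are placeholders.

```python
import tiledb

# Point TileDB Embedded at TileDB Cloud with an API token (placeholder value).
cfg = tiledb.Config({"rest.token": "<your-api-token>"})
ctx = tiledb.Ctx(cfg)

# Slicing a shared array goes through the same API as any local or S3 array.
with tiledb.open("tiledb://some_namespace/shared_array", ctx=ctx) as A:
    block = A[0:100, 0:100]
```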

The benefits:

  • You can easily share with anyone on the planet or discover new data.

  • No need to spin up clusters for sharing your data or accessing others' data.

  • No idle compute cost.

  • TileDB Cloud scales to any number of users accessing the same array concurrently.

  • TileDB Cloud logs every access and allows you to easily audit the logs.

  • You own your data; TileDB Cloud only securely enforces the access policies.

  • There is no vendor lock-in, your data remains in TileDB's open-spec array format and is accessible via the open-source TileDB Embedded outside TileDB Cloud.

Serverless compute

We feel quite confident that we are on a great track with data storage and access control. But what about computations? Through interactions with users and our own experience, we made some further observations:

  • Data scientists really like Jupyter notebooks, as they provide a great way to run and share code, combining the code with comprehensive documentation in text form. Moreover, it is faster and cheaper to have a hosted notebook on the cloud close to the data one wishes to access (e.g., on EC2 instances in the same region as the S3 bucket storing the data). Manually spinning up such hosted notebooks is rather cumbersome.

  • There are certain computations, such as SQL queries or arbitrary user-defined functions (UDFs) in a variety of languages, that users wish to perform on cloud-stored or shared data in a totally serverless manner, avoiding transferring and copying data across multiple hops. This is because it is easier (no need to spin up cloud instances and clusters), cheaper (no idle compute costs) and faster (compute is sent to the data).

  • Any complex out-of-core or distributed algorithm working on large quantities of data can be expressed as a directed acyclic graph (DAG) of simple tasks, where each task slices and operates on a small block of data. For example, Dask relies on this task execution model (a toy sketch follows this list).
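As a toy sketch of that task model (using Dask, which the last bullet cites; the block contents are made up), a blocked reduction becomes a DAG of small slice-and-reduce tasks followed by a merge:

```python
import dask

@dask.delayed
def load_block(i):
    # Stand-in for slicing one small block out of a large dataset.
    return list(range(i * 10, (i + 1) * 10))

@dask.delayed
def partial_sum(block):
    return sum(block)

@dask.delayed
def combine(partials):
    return sum(partials)

# Build the DAG: many independent slice+reduce tasks, then a final merge.
partials = [partial_sum(load_block(i)) for i in range(8)]
total = combine(partials)

print(total.compute())   # executes the DAG (locally here; a scheduler can distribute it)
```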

In order to support planet-scale sharing as explained in the previous section, we had to build a powerful, scalable and entirely serverless infrastructure for TileDB Cloud. This infrastructure enabled us to provide further generic serverless compute capabilities.

Here is how we innovate:

  • TileDB Cloud allows you to spin up Jupyter notebooks in your browser.

  • TileDB Cloud allows you to share notebooks with other users.

  • TileDB Cloud allows you to register, share and execute any UDF in a serverless manner (a sketch follows this list).

    • TileDB Cloud currently supports only Python UDFs, but it is architected to be language agnostic. Support for other languages (e.g., R) is coming soon.

  • TileDB Cloud allows you to define arbitrary task DAGs and execute them in a serverless manner.
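As a hedged sketch of what a serverless call looks like from the TileDB Cloud Python client (the exact client functions are as we recall them, and the token, namespace and array name are placeholders):

```python
import tiledb.cloud
import tiledb.cloud.udf
import tiledb.cloud.sql

# Authenticate against TileDB Cloud (placeholder token).
tiledb.cloud.login(token="<your-api-token>")

# Ship a Python function to run server-side, next to the data.
def double(x):
    return 2 * x

print(tiledb.cloud.udf.exec(double, 21))   # executed serverlessly, returns 42

# Run serverless SQL against a registered array (hypothetical namespace/array).
df = tiledb.cloud.sql.exec(
    "SELECT AVG(value) FROM `tiledb://some_namespace/shared_array`"
)
```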

The benefits:

  • Great ease of use for exploring and analyzing anyone's data.

  • Extreme scalability as you can spin up any number of tasks and arbitrarily large DAGs.

  • No engineering hassles for spinning up clusters.

  • Lower TCO by avoiding cluster provisioning and eliminating idle compute cost.

  • You can build any complex distributed algorithm using the TileDB Cloud serverless infrastructure, and share it with any other user in the world.

The era of pluggable compute

So from monolithic databases, data management evolved to compute engines with pluggable storage, and now we take it one step further with TileDB: we introduce the concept of the universal data engine with pluggable compute. The main insight is that data management looks remarkably similar across different applications; this includes data modeling, storage, slicing, access control, logging and interoperability. These capabilities can be built once and shared with every higher-level application that needs to perform more specialized computations. Abstracting data management and pushing common primitives down to a common storage engine eliminates obstacles and saves incredible amounts of time for data scientists and analysts who wish to focus on the science instead of the engineering, and it also helps developers build brilliant new computational engines and algorithms on top of the universal data engine.

We look forward to hearing your feedback on the TileDB vision and software ecosystem. The ultimate goal of the TileDB project (and the TileDB, Inc. company) is to accelerate science and technology. And you can contribute to that goal!