The need to manage data has existed for decades and there are thousands of different data management solutions available today. So what is the problem that TileDB is tackling and how could we innovate in such a crowded space?
To answer this question, we first outline a few observations in the evolution of data management systems and the factors that drove the various architectural changes over the years. We then stress that as old architectures adapted to new user requirements, they created new challenges. We argue that there is a need for rearchitecting data management from the ground up, in a universal way that is flexible and extendible to future user needs and technological advancements. This is the motivation behind building a universal storage engine (TileDB Embedded) and a universal data engine (TileDB Cloud). We explain in detail these novel notions and their compelling benefits for any individual or organization that deals with the storing, processing and sharing large quantities of data.
When we ask people about what comes to their mind when hearing the term "database", they consistently agree on two things: tables and SQL. This is probably because the first databases were relational dealing with tabular data, whereas SQL became a prevalent data language due to its great expressivity. Databases have evolved a lot, and today we have numerous different terms like "document databases", "graph databases", "NoSQL databases", "data warehouses", "time series databases", "in-memory databases", "key-value databases" and many more. In general, the "database" is a quite sophisticated piece of software that manages data, from storage to computation to access control to logging and more.
Database systems were originally architected as monolithic, i.e., as a set of software layers that are not exposed to the third-party user. For example, a typical database system consists of a SQL parser, a transactional manager, an authenticator, a query planner, a query optimizer, an execution engine, a storage engine, and potentially many other layers. The user has no control over how the data is stored or processed, other than defining table schemas and potentially declaring indexes to boost performance.
Due to some of the above problems, existing databases evolved and new solutions were introduced. One of the most important shifts was the separation of storage from compute. Some database systems introduced pluggable storage, i.e., mechanisms that allowed users to add external data sources to the database system. That further enables the users to store their data on cheap cloud object stores (e.g., AWS S3) and run SQL queries only when they need so, thus significantly reducing their total cost of operation (TCO). Furthermore, if the data is stored in an open-source data format (e.g., Parquet), then the users can efficiently access the data from many other tools, bypassing the database. In such cases, databases act more like compute engines on pluggable storage accessible by any tool.
In addition, distributed computational frameworks (like Spark and Dask) recently gained popularity. Those systems have pluggable storage by default (as they were designed for generic compute on any data), and allow you to easily run user-defined functions (written in Java, Scala, Python and R for Spark, and Python for Dask) on any data in a scalable way. Spark took it many steps further and started offering an advanced SQL engine and optimizer, providing competitive SQL performance to other distributed databases, whereas efforts on SQL execution are starting in the Dask world as well.
In the previous section we explained that databases introduced pluggable storage and applications started storing the data in open-source data formats like Parquet, which can then be stored on cheap cloud storage and accessed by any tool that understands the file format. This trend created a new problem, particularly exacerbated on cloud object stores where all objects are immutable (and, thus, each update typically creates a new object). Applications started generating an excessive number of files.
The scientific world has been familiar with handling a sea of files for a very long time. And all the while database applications seem to converge to using Parquet as the de facto tabular data format, each scientific domain introduces a different data format (or numerous formats). For instance, genomic variants are stored in VCF, LiDAR points in LAZ, satellite imaging in COG (among others), weather data in NetCDF4 (among others), and the list is very long.
The challenges we discussed above call for a bold, groundbreaking and holistic approach. We are the first to make the observation that data management is actually not very different across application domains, despite how different the data and terminology looks like. Therefore, we decided to invent one data model, one data format and one storage engine to provide a foundational solution for data management for all verticals. On top of those, we built extra management layers for access control and logging, in order to support data sharing and collaboration at planet scale. Our efforts led to the creation of a powerful serverless platform for designing and executing any distributed algorithm, which offers ease of use, performance, and low cost. We explain these contributions below.
Regardless of the application domain and data types, the data storage and access requirements that are common:
Data compression and, thus, data chunking for efficient selective data retrieval
Support for multiple storage backends and extendibility to new ones
Parallelism for (de)compression and other data filtering
Minimization of IO requests, which further requires:
A lightweight protocol between the client and the storage backend
Collocation of data that are frequently fetched together in the files
An embeddable storage library for easy use by higher level applications
Data versioning and time traveling
Atomicity, concurrency and consistency for reads and write operations
A growing set of efficient language APIs for flexible data access
A growing set of integrations with SQL engines and other data science tools
It does not really matter if the data is an image, or a cohort of genomic variants, or 3D LiDAR points, or dataframes, or flat binary objects. Anyone that attempts to build a performant storage engine and wishes to be used by diverse programing languages and tools will have to consider the above bullet points.
Another important observation is that there indeed exists a single data structure that can model any data type: (dense or sparse) multi-dimensional arrays. Arrays are more flexible and diverse than dataframes and can help in designing and implementing highly performant storage engines. Multi-dimensional (ND) slicing is arguably the most frequently used operator in any computational workload, e.g., for fast initial data filtering or for implementing out-of-core (to scale beyond RAM) or distributed (to scale beyond a single machine) algorithms.
Through our interactions with users and customers across a variety of verticals, it is evident that people are struggling to securely share data and code with others, even beyond their organizations. This can be for collaboration and scientific reproducibility, or because companies wish to monetize their proprietary data and work with third parties. Either way, data sharing today starts to break organizational boundaries and become planet-scale: any individual or organization seeks the ability to easily share data and code with literally anyone else in the world, enforcing access policies and logging every single action.
But how would you typically share data today?
Store data in a database. Databases offer advanced access control and logging features, but (1) you inherit all the problems described in previous sections and (2) you are limited within the organization that runs the database and the database cluster resources you have allocated.
Store files on a cloud object store. You can always store your data on a cloud object store and use its features to manage data access. However, an object store is designed to store files and, therefore, it offers file semantics when it comes to access control and logging. If your application requires the creation of numerous files and fine-grained access policies (on a byte range level), then sharing selected data with others becomes extremely cumbersome. In addition, cloud object stores were not designed to support user policies in the order of millions for each object and, therefore, cannot be considered as planet-scale solutions.
The fact that we built a powerful storage engine on a universal data format allows us to address the above problems and provide planet-scale sharing in a holistic way by building a unified platform on top of TileDB Embedded.
We feel quite confident that we are on a great track with data storage and access control. But how about computations? Through interactions with users and personal experiences, we made some further observations:
Data scientists really like Jupyter notebooks, as they provide a great way to run and share code, while combining comprehensive documentation in text form along with the code. Moreover, it is faster and cheaper to have a hosted notebook on the cloud close to the data one wishes to access (e.g., on EC2 instances in the same region as the S3 bucket storing the data). Manually spinning up such hosted notebooks is rather cumbersome.
There are certain computations, such as SQL queries or arbitrary user-defined functions (UDFs) in a variety of languages, that the users would wish to perform on cloud-stored or shared data, in a totally serverless manner and avoiding transferring and copying data across multiple hops. This is because it is easier (no need to spin up cloud instances and clusters), cheaper (no idle compute costs) and faster (compute is sent to the data).
Any complex out-of-core or distributed algorithm working on large quantities of data can be expressed as a direct acyclic graph (DAG) of simple tasks, where each task slices and operates on a small block of data. For example, Dask relies on this task execution model.
In order to support planet-scale sharing as explained in the previous section, we were required to build a powerful, scalable and entire serverless infrastructure for TileDB Cloud. This infrastructure enabled us to provide further generic serverless compute capabilities.
From monolithic databases, data management evolved to compute engines with pluggable storage, and now we take it one step further with TileDB: we introduce the concept of universal data engines with pluggable compute. The main benefit is that data management features great similarity across all different applications. That includes data modeling, storage, slicing, access control, logging and interoperability. These features can be built once and be shared with every higher level application that needs to perform some more specialized computations. Abstracting data management and pushing common primitives down to a common storage engine eliminates obstacles and saves incredible amounts of time for data scientists and analysts that wish to focus on the science instead of the engineering, but also helps developers build brilliant new computational engines and algorithms on top of the universal data engine.