The challenges we discussed in The Problem section call for a bold, groundbreaking and holistic approach. We are the first to observe that data management is actually not very different across application domains, despite how different the data and terminology look. Therefore, we decided to build one data model, one data format and one storage engine as a foundational solution for data management across all verticals. On top of those, we built additional management layers for access control and logging, in order to support data sharing and collaboration at planet scale. These efforts led to the creation of a powerful serverless platform for designing and executing any distributed algorithm, which offers ease of use, performance and low cost. We explain these contributions below.
Regardless of the application domain and data types, the data storage and access requirements are common:
Data compression and, thus, data chunking for efficient selective data retrieval
Support for multiple storage backends and extendibility to new ones
Parallelism for (de)compression and other data filtering
Minimization of IO requests, which further requires:
A lightweight protocol between the client and the storage backend
Collocation, within the files, of data that are frequently fetched together
An embeddable storage library for easy use by higher level applications
Data versioning and time traveling
Atomicity, concurrency and consistency for read and write operations
A growing set of efficient language APIs for flexible data access
A growing set of integrations with SQL engines and other data science tools
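Several of the bullets above interact: chunking is what makes compression compatible with selective retrieval, and tile-granular reads are what minimize IO. As a rough illustration, here is a minimal Python sketch of a chunked, compressed 2D store that decompresses only the tiles a slice overlaps. The tile size, function names and use of zlib are illustrative assumptions, not TileDB internals:

```python
import zlib
import numpy as np

TILE = 100  # tile (chunk) extent along each dimension; a hypothetical choice

def chunk_and_compress(arr):
    """Split a 2D array into fixed-size tiles and compress each independently,
    so that a later slice only needs to decompress the tiles it overlaps."""
    tiles = {}
    for i in range(0, arr.shape[0], TILE):
        for j in range(0, arr.shape[1], TILE):
            block = np.ascontiguousarray(arr[i:i + TILE, j:j + TILE])
            tiles[(i, j)] = (block.shape, zlib.compress(block.tobytes()))
    return tiles

def read_slice(tiles, r0, r1, c0, c1, dtype):
    """Materialize arr[r0:r1, c0:c1] by decompressing only the tiles that
    intersect the requested slice -- everything else stays untouched."""
    out = np.empty((r1 - r0, c1 - c0), dtype=dtype)
    for (i, j), (tshape, blob) in tiles.items():
        if i >= r1 or i + tshape[0] <= r0 or j >= c1 or j + tshape[1] <= c0:
            continue  # tile does not intersect the requested slice
        block = np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(tshape)
        rs, re = max(i, r0), min(i + tshape[0], r1)
        cs, ce = max(j, c0), min(j + tshape[1], c1)
        out[rs - r0:re - r0, cs - c0:ce - c0] = block[rs - i:re - i, cs - j:ce - j]
    return out
```

A real engine would additionally parallelize the per-tile decompression and batch tile fetches into few IO requests, but the core idea is the same: the tile is the unit of compression, layout and retrieval.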
It does not really matter if the data is an image, a cohort of genomic variants, 3D LiDAR points, dataframes, or flat binary objects. Anyone who attempts to build a performant storage engine that can be used from diverse programming languages and tools will have to consider the above bullet points.
Another important observation is that there indeed exists a single data structure that can model any data type: (dense or sparse) multi-dimensional arrays. Arrays are more flexible and diverse than dataframes and can help in designing and implementing highly performant storage engines. Multi-dimensional (ND) slicing is arguably the most frequently used operator in any computational workload, e.g., for fast initial data filtering or for implementing out-of-core (to scale beyond RAM) or distributed (to scale beyond a single machine) algorithms.
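To illustrate why ND slicing is the key primitive for out-of-core algorithms, here is a toy Python sketch that computes an aggregate by visiting one block at a time. The block size and function name are hypothetical; in a real out-of-core setting each slice would be fetched from a storage engine rather than an in-memory array:

```python
import numpy as np

def blockwise_mean(arr, block=256):
    """Out-of-core style mean: visit the array one slice (block) at a time,
    so only a block-sized chunk needs to be resident in memory at once."""
    total, count = 0.0, 0
    for i in range(0, arr.shape[0], block):
        chunk = arr[i:i + block]  # ND slicing: the core access primitive
        total += float(chunk.sum())
        count += chunk.size
    return total / count
```

The same decomposition generalizes to distributed execution: each block-level task can run on a different machine, with a final task combining the partial results.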
Through our interactions with users and customers across a variety of verticals, it is evident that people are struggling to securely share data and code with others, even beyond their organizations. This can be for collaboration and scientific reproducibility, or because companies wish to monetize their proprietary data and work with third parties. Either way, data sharing today is starting to break organizational boundaries and become planet-scale: any individual or organization seeks the ability to easily share data and code with literally anyone else in the world, while enforcing access policies and logging every single action.
But how would you typically share data today?
Store data in a database. Databases offer advanced access control and logging features, but (1) you inherit all the problems described in previous sections, and (2) you are limited to the organization that runs the database and to the database cluster resources you have allocated.
Store files on a cloud object store. You can always store your data on a cloud object store and use its features to manage data access. However, an object store is designed to store files and, therefore, offers file semantics when it comes to access control and logging. If your application requires the creation of numerous files and fine-grained access policies (at the byte-range level), then sharing selected data with others becomes extremely cumbersome. In addition, cloud object stores were not designed to support user policies in the order of millions for each object and, therefore, cannot be considered planet-scale solutions.
The fact that we built a powerful storage engine on a universal data format allows us to address the above problems and provide planet-scale sharing in a holistic way by building a unified platform on top of TileDB Embedded.
We feel quite confident that we are on the right track with data storage and access control. But what about computations? Through interactions with users and our own experience, we made some further observations:
Data scientists really like Jupyter notebooks, as they provide a great way to run and share code, combining comprehensive documentation in text form with the code itself. Moreover, it is faster and cheaper to have a hosted notebook on the cloud close to the data one wishes to access (e.g., on EC2 instances in the same region as the S3 bucket storing the data). Manually spinning up such hosted notebooks is rather cumbersome.
There are certain computations, such as SQL queries or arbitrary user-defined functions (UDFs) in a variety of languages, that users wish to perform on cloud-stored or shared data in a totally serverless manner, avoiding transferring and copying data across multiple hops. This is because it is easier (no need to spin up cloud instances and clusters), cheaper (no idle compute costs) and faster (compute is sent to the data).
Any complex out-of-core or distributed algorithm working on large quantities of data can be expressed as a directed acyclic graph (DAG) of simple tasks, where each task slices and operates on a small block of data. For example, Dask relies on this task execution model.
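The task-DAG model can be illustrated in a few lines of Python. This is a sequential toy scheduler with hypothetical names, not TileDB Cloud's or Dask's implementation; a real engine dispatches independent tasks in parallel across machines:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def run_dag(tasks, deps):
    """Execute a DAG of tasks in dependency order. `tasks` maps a task name
    to a function taking its dependencies' results; `deps` maps a task name
    to the list of task names it depends on."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[d] for d in deps.get(name, [])]
        results[name] = tasks[name](*inputs)
    return results

# Map-reduce style example: sum two blocks of data, then combine.
data = list(range(10))
tasks = {
    "left":  lambda: sum(data[:5]),          # operates on one block
    "right": lambda: sum(data[5:]),          # operates on another block
    "total": lambda a, b: a + b,             # combines partial results
}
deps = {"left": [], "right": [], "total": ["left", "right"]}
results = run_dag(tasks, deps)
```

Because "left" and "right" have no dependency on each other, a parallel scheduler is free to run them concurrently; only "total" must wait for both.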
In order to support planet-scale sharing as explained in the previous section, we were required to build a powerful, scalable and entirely serverless infrastructure for TileDB Cloud. This infrastructure enabled us to offer further generic serverless compute capabilities.
From monolithic databases, data management evolved to compute engines with pluggable storage, and now we take it one step further with TileDB: we introduce the concept of universal data engines with pluggable compute. The main benefit is that data management features great similarity across all different applications, including data modeling, storage, slicing, access control, logging and interoperability. These features can be built once and shared with every higher-level application that needs to perform more specialized computations. Abstracting data management and pushing common primitives down to a common storage engine eliminates obstacles and saves incredible amounts of time for data scientists and analysts who wish to focus on the science instead of the engineering, while also helping developers build brilliant new computational engines and algorithms on top of the universal data engine.