Universal Data Management
The need to manage data has existed for decades and there are hundreds of different data management solutions available today. What is TileDB and how does it innovate in such a crowded space?
In this section we explain what we define as "universal data management" and the practical problem it solves. We start with the main motivation behind TileDB.
TileDB started as a research project at Intel Labs and MIT in 2014, where we made some observations.
- Most database systems deal with one type of data, mostly tables.
- Most data out there is not really tabular. Look at satellite imaging, biomedical imaging, video, weather, point cloud, and many others.
- A lot of organizations maintain a lot of diverse data. For example, hospitals have clinical records, but also genomics, MRI scans, etc. Insurance companies may have weather data coupled with satellite imaging. Telecommunications companies may have customer records, in addition to LiDAR and location data.
- Teams within an organization may be using a variety of programming languages and tools. Some may use SQL, but others may prefer Python or R. Some may wish to perform data analytics, but others may want to run Machine Learning tasks or simply visualize the data.
- Collaboration and governance within and across organizations are very challenging when the data (and code) do not live within a centralized database that would ordinarily manage access policies and maintain logs for auditing.
Since there is no database that can tackle the above challenges, organizations often resort to building specialized data solutions in-house. This takes a lot of time, costs a lot of money, and often involves combining a lot of disparate data software, which makes data management even more challenging.
Having those observations in mind, we asked ourselves a few questions.
- Data storage
  - Is there a single data structure that can capture all data? In our view, tables are constrained, whereas key-value stores, documents and graphs do not seem to be efficient models for data like images, video, weather, point clouds, genomics, and more.
  - Is there a way to abstract all storage so that the system can work on any backend (memory, cloud object store, or other)?
- Common layers
  - What are the common components of a "data management" solution, regardless of the application domain (e.g., a storage layer, an authentication layer, APIs, access control, logging, etc.)? Could these be common in Genomics, Earth Observation, Time Series, etc.?
- Data access
  - Can we abstract all access so that the system can efficiently work with any API or tool?
  - Can we scale access control to any number of users, anywhere in the world and beyond the limits of a data center?
  - Can we share code in a similar way to sharing data? In other words, can we treat code as data?
  - Can we enable users to sell data and code they share with others?
  - Can we take multi-tenancy to the extreme, without being constrained by clusters? Can we scale easily and elastically?
  - Does a scalable "data infrastructure" also imply a scalable "compute infrastructure"? In other words, if we build the former, can we also gain the foundation for the latter?
- Future-proofing
  - Is there a way to make the system future-proof? That is, can we build it so that in the future we can easily extend it with any storage backend, any language API, any data analytics and visualization tool, any new hardware, and practically any technological advancement?
While contemplating the answers to these questions, we developed TileDB and introduced the concept of universal data management.
There is a need to rearchitect data management from the ground up, in a universal way that is flexible and extensible to future user needs and technological advancements. Below we describe the most important aspects of TileDB Cloud that make it a universal data management system.
We needed a single data format and storage engine that could handle all types of data. We observed that multi-dimensional arrays are a great candidate for that. We can prove that any data, from biomedical imaging to genomics to SAR (synthetic aperture radar) to tables, can be modeled very efficiently with dense or sparse multi-dimensional arrays. Multi-dimensional arrays are also the currency of data analytics and machine learning, because many advanced mathematical computations (e.g., using Linear Algebra) are applied to vectors, matrices and tensors, which are all multi-dimensional arrays.
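To make the dense/sparse distinction concrete, here is a minimal pure-Python sketch. It does not use the TileDB API; the names `image`, `points`, `slice_dense` and `slice_sparse` are hypothetical and only illustrate how both kinds of data can be addressed and sliced by coordinates.

```python
from typing import Dict, List, Tuple

# Dense 2D array: every cell in the domain holds a value
# (here, a tiny 3x4 grayscale image).
image: List[List[int]] = [
    [0, 10, 20, 30],
    [40, 50, 60, 70],
    [80, 90, 100, 110],
]

# Sparse 2D array: only non-empty cells are stored, keyed by their
# coordinates (here, point-cloud-like (x, y) positions with an elevation).
points: Dict[Tuple[int, int], float] = {
    (2, 7): 13.5,
    (5, 1): 9.0,
    (1000, 42): 27.25,
}


def slice_dense(arr: List[List[int]], rows: range, cols: range) -> List[List[int]]:
    """Return the sub-array covering the given row and column ranges."""
    return [[arr[r][c] for c in cols] for r in rows]


def slice_sparse(
    cells: Dict[Tuple[int, int], float], xs: range, ys: range
) -> Dict[Tuple[int, int], float]:
    """Return only the non-empty cells whose coordinates fall in the ranges."""
    return {xy: v for xy, v in cells.items() if xy[0] in xs and xy[1] in ys}


dense_window = slice_dense(image, range(0, 2), range(1, 3))
sparse_window = slice_sparse(points, range(0, 10), range(0, 10))
```

In both cases the query is the same kind of operation, a rectangular slice over dimensions, which is what allows one storage engine to serve very different data types.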
- Dense and sparse multi-dimensional array support
- "Columnar" format and compression
- Multi-threading and parallel IO
- Cloud-optimized implementation
- Rapid updates and array slicing
- Data versioning and time traveling
- Arbitrary metadata stored along with the array data
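One way to picture how rapid updates, versioning and time traveling can coexist is to keep every write as an immutable, timestamped batch and to merge batches at read time. The sketch below is a toy illustration under that assumption; `VersionedArray` is a hypothetical class, not the TileDB API or its on-disk format.

```python
from typing import Dict, List, Optional, Tuple


class VersionedArray:
    """Toy 1D array in which each write is an immutable, timestamped batch."""

    def __init__(self) -> None:
        self._batches: List[Tuple[int, Dict[int, float]]] = []

    def write(self, timestamp: int, cells: Dict[int, float]) -> None:
        # Writes never modify earlier batches; they only append a new one.
        self._batches.append((timestamp, dict(cells)))

    def read(self, start: int, end: int, at: Optional[int] = None) -> Dict[int, float]:
        """Read cells in [start, end]; if `at` is given, ignore later writes."""
        result: Dict[int, float] = {}
        for ts, cells in sorted(self._batches, key=lambda batch: batch[0]):
            if at is not None and ts > at:
                break
            for idx, value in cells.items():
                if start <= idx <= end:
                    result[idx] = value  # newer batches override older ones
        return result


arr = VersionedArray()
arr.write(1, {0: 1.0, 1: 2.0, 2: 3.0})
arr.write(2, {1: 20.0})  # update a single cell

latest = arr.read(0, 2)         # sees the update
as_of_1 = arr.read(0, 2, at=1)  # "time travels" to before the update
```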
TileDB Cloud relies on TileDB Embedded for data (and code) storage. All code written with TileDB Embedded can be used with TileDB Cloud by changing only a few configuration parameters. This allows the users to test their code locally, and transition to the cloud offering by changing 1-2 lines of code.
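The sketch below illustrates what "changing a few configuration parameters" might look like. It assumes that cloud arrays are addressed with a `tiledb://<namespace>/<array>` URI and authenticated with a `rest.token` configuration parameter; the helper `array_target`, the namespace `my_org` and the token placeholder are hypothetical. The TileDB calls appear only in comments so that the snippet runs without the `tiledb` package installed.

```python
from typing import Dict, Tuple


def array_target(
    name: str, namespace: str = "", token: str = ""
) -> Tuple[str, Dict[str, str]]:
    """Return (uri, config) for a local array or, if a namespace is given, a cloud one."""
    if namespace:
        # Assumption: cloud arrays use a tiledb:// URI plus an API token.
        return f"tiledb://{namespace}/{name}", {"rest.token": token}
    return name, {}


local_uri, local_cfg = array_target("my_array")
cloud_uri, cloud_cfg = array_target(
    "my_array", namespace="my_org", token="YOUR_API_TOKEN"
)

# With the tiledb Python package installed, the rest of the code would be
# the same for both targets, roughly:
#
#   ctx = tiledb.Ctx(tiledb.Config(cfg))
#   with tiledb.open(uri, ctx=ctx) as A:
#       data = A[0:10]
```

Only the URI and the configuration differ between the two targets; the read logic itself does not change.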
TileDB authentication and access control works as follows:
- The user stores their data in the open-source TileDB Embedded array format on some scalable shared storage backend (e.g., AWS S3).
- The user owns the data; TileDB Cloud does not do any hosting. The user only registers the array with TileDB Cloud, granting authentication keys to TileDB Cloud for accessing the data.
- The user can create organizations, and share data and code with other organizations and users with various access policies. There is no bound on the number of users and organizations one can share data and code with. Users can collaborate with anyone within or beyond their organization.
- When data and code are accessed, TileDB Cloud is responsible for securely checking and enforcing all the appropriate access policies.
There is no longer any need to manage IAM roles or Apache Sentry/Ranger setups. TileDB Cloud handles everything transparently.
In addition, TileDB Cloud allows users to make data and code public, attaching descriptions, metadata and arbitrary tags. The data and code can then be discovered and used by any other TileDB Cloud user on the planet.
All access to arrays and code is logged and can be viewed for auditing purposes. TileDB Cloud allows users to keep track of how their shared or public arrays are being used and gain valuable insights.
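The registration, sharing, public access and audit logging described above can be sketched as a single policy check that every access passes through. This is a simplified model, not TileDB Cloud's implementation; the `Registry` class, its method names and the action strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple


@dataclass
class Registry:
    owners: Dict[str, str] = field(default_factory=dict)
    grants: Dict[Tuple[str, str], Set[str]] = field(default_factory=dict)
    public: Set[str] = field(default_factory=set)
    audit_log: List[Tuple[str, str, str, bool]] = field(default_factory=list)

    def register(self, owner: str, array: str) -> None:
        self.owners[array] = owner

    def share(self, array: str, grantee: str, actions: Iterable[str]) -> None:
        self.grants.setdefault((array, grantee), set()).update(actions)

    def make_public(self, array: str) -> None:
        self.public.add(array)

    def check(self, user: str, array: str, action: str) -> bool:
        """Decide whether the access is allowed and log the attempt either way."""
        allowed = (
            self.owners.get(array) == user
            or action in self.grants.get((array, user), set())
            or (action == "read" and array in self.public)
        )
        self.audit_log.append((user, array, action, allowed))
        return allowed


reg = Registry()
reg.register("alice", "s3://bucket/genomes")
reg.share("s3://bucket/genomes", "bob", ["read"])

bob_read = reg.check("bob", "s3://bucket/genomes", "read")
bob_write = reg.check("bob", "s3://bucket/genomes", "write")
carol_read = reg.check("carol", "s3://bucket/genomes", "read")
```

Because the check sits in front of the array itself, every API or tool that reads the array gets the same policy enforcement and the same audit trail.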
TileDB Cloud enjoys the extreme interoperability offered by TileDB Embedded (i.e., numerous language APIs and tool integrations). In addition, TileDB Cloud is constantly being extended to support more languages and tools for the added cloud features it provides (e.g., see serverless compute and Jupyter notebooks).
All the APIs and integrations (existing and future ones) inherit the authentication, access control and logging functionality we built directly on top of array storage. In other words, modeling all data universally as arrays allowed us to build a single layer for authentication, access control and logging, instead of building custom support for all the data types and APIs/tools used across different applications.
Building a universal data management system that can provide extreme multi-tenancy and scale requires building an entire distributed system infrastructure from scratch. Our implementation revealed additional capabilities that proved to be very valuable for scalable data analytics, ease of use, extracting monetary value from data and code, and remaining relevant in the fast-paced data technology space. Therefore, we gradually exposed these capabilities within TileDB Cloud, as described below.
We outlined the following requirements around accessing arrays registered with TileDB Cloud:
- 1. Any user on the planet with appropriate access policies should be able to access data at any time.
- 2. There should be no limit on how many users can simultaneously access an array.
- 3. The user who accesses the array should not be responsible for spinning up dedicated machines.
- 4. The user who shares the array should not be responsible for spinning up dedicated machines.
The architecture we built to meet these requirements resulted in the following:
- 1. Totally "serverless" compute from the user's standpoint. Any request "just works" without reserving resources in advance.
- 2. TileDB Cloud uses an elastic compute infrastructure, which automatically expands and shrinks based on user demand.
- 3. The user is charged in a pay-as-you-go fashion, and only for the compute and data egress they consume.
- 4. The compute is sent to the data, respecting geographical cloud storage regions to eliminate cloud provider egress costs and maximize performance.
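The last two points can be illustrated with a small sketch: route each task to the region where its array is stored, and charge only for the compute time and egress actually consumed. The `Task` fields, the region names and the prices are hypothetical and are not TileDB Cloud's actual scheduler or pricing.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Task:
    array_region: str       # region where the array is stored
    compute_seconds: float  # compute time the task consumed
    result_bytes: int       # bytes returned to the user (egress)


def pick_compute_region(task: Task, available: Sequence[str]) -> str:
    """Prefer the array's own region; otherwise fall back to the first available one."""
    return task.array_region if task.array_region in available else available[0]


def charge(task: Task, price_per_second: float, price_per_gb_egress: float) -> float:
    """Pay-as-you-go cost: compute time plus data egress, nothing reserved up front."""
    egress_gb = task.result_bytes / 1e9
    return task.compute_seconds * price_per_second + egress_gb * price_per_gb_egress


task = Task(array_region="us-east-1", compute_seconds=10.0, result_bytes=2_000_000_000)
region = pick_compute_region(task, ["eu-west-1", "us-east-1"])
cost = charge(task, price_per_second=0.5, price_per_gb_egress=0.25)
```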
Users wanted to do more than just slice arrays. For example, they wished to run advanced SQL queries and user-defined functions (UDFs), i.e., arbitrary code in Python, R or another language, potentially using external libraries and integrations, and manipulating the data efficiently, securely and inexpensively. In the most general scenario, users wanted to create task graphs, i.e., task workflows that can implement sophisticated distributed algorithms to take advantage of the computational power and ease of use of TileDB Cloud. This functionality could readily be provided by the infrastructure we had already built. Therefore, we optimized it and exposed it.
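A task graph is a set of functions connected by dependencies. The sketch below runs such a graph locally with Python's standard `graphlib` module (Python 3.9+); `run_task_graph` and the task names are hypothetical, and a real service would dispatch independent tasks in parallel rather than sequentially as done here.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple

TaskSpec = Tuple[Callable[..., Any], List[str]]  # (function, dependency names)


def run_task_graph(tasks: Dict[str, TaskSpec]) -> Dict[str, Any]:
    """Run each task after its dependencies, passing their results as arguments."""
    sorter = TopologicalSorter({name: deps for name, (_, deps) in tasks.items()})
    results: Dict[str, Any] = {}
    for name in sorter.static_order():
        func, deps = tasks[name]
        results[name] = func(*(results[d] for d in deps))
    return results


data = list(range(1, 9))  # stands in for an array stored elsewhere

tasks: Dict[str, TaskSpec] = {
    # Two independent UDF-like tasks, each working on its own slice.
    "sum_left": (lambda: sum(data[:4]), []),
    "sum_right": (lambda: sum(data[4:]), []),
    # A final task that combines the partial results.
    "total": (lambda a, b: a + b, ["sum_left", "sum_right"]),
}

results = run_task_graph(tasks)
```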
We also took this one step further. SQL queries and UDFs are runnable code, but they are also shareable data. We stored UDFs as TileDB arrays (recall that arrays can model any data), which unlocked all the TileDB Cloud capabilities for UDFs as well, such as sharing, logging and exploration of public code.
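The idea of treating code as data can be sketched by serializing a function's source into a 1D array of bytes and loading it back before execution. This only illustrates the concept; the source does not describe how TileDB Cloud actually serializes UDFs, and `code_to_array` and `array_to_function` are hypothetical helpers.

```python
from typing import Any, Callable, Dict, List

UDF_SOURCE = '''
def mean(values):
    return sum(values) / len(values)
'''


def code_to_array(source: str) -> List[int]:
    """Encode source code as a 1D array of byte values (0-255)."""
    return list(source.encode("utf-8"))


def array_to_function(cells: List[int], name: str) -> Callable[..., Any]:
    """Decode the byte array back into source and return the named function."""
    namespace: Dict[str, Any] = {}
    # exec on untrusted code is unsafe; a real service would sandbox this step.
    exec(bytes(cells).decode("utf-8"), namespace)
    return namespace[name]


stored = code_to_array(UDF_SOURCE)        # now storable like any other array
mean = array_to_function(stored, "mean")  # loaded back as runnable code
average = mean([2, 4, 6])
```

Once the code is an array, the same sharing, access control and logging machinery that applies to data arrays can apply to it as well.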
A lot of data scientists find it convenient to use Jupyter notebooks for writing code and performing exploratory analysis. TileDB Cloud allows launching JupyterLab instances within the online console. The instances come with prepackaged images that include useful libraries, but the user can also install any library inside the Jupyter environment. TileDB Cloud handles this within its distributed infrastructure, without requiring the user to manually deploy servers.
Jupyter notebooks have become a standard tool for scientific analysis and reproducibility. Therefore, TileDB Cloud allows users to share notebooks in the same manner as arrays. Users can also make notebooks public, or explore notebooks that others have shared with the world.
TileDB Cloud logs everything for auditing purposes: the tasks, duration, cost, resources used, user information, etc. As such, it already had all the functionality needed to develop a full-fledged marketplace that allows users to monetize their code and data based on usage by other users. TileDB Cloud integrates with Stripe and handles all billing and accounting for users who wish to sell or buy data and code in a pay-as-you-go fashion.
This is convenient for several reasons:
- 1. Sellers do not need to ship potentially huge quantities of data to buyers.
- 2. Sellers do not need to build their own infrastructure to serve data and code, or to perform billing and accounting.
- 3. Buyers can perform exploration and analysis on data from multiple vendors in a single platform.
- 4. Buyers do not need to download and host the potentially massive quantities of data they are purchasing.
- 5. A pay-as-you-go model is a more flexible alternative to the standard annual license model, and it may be more economical for both buyers (who pay only for what they use) and sellers (who benefit from scale).
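Since usage is already logged, pay-as-you-go settlement reduces to aggregating usage records into amounts owed by buyers and amounts due to sellers. The sketch below shows that arithmetic only; `UsageRecord`, `settle`, the product names and the unit prices are hypothetical, and no payment provider API is called.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class UsageRecord:
    buyer: str
    seller: str
    product: str  # a shared array or UDF
    units: float  # e.g., GB read or seconds of compute


def settle(
    records: Iterable[UsageRecord], unit_prices: Dict[str, float]
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (amount owed per buyer, amount due per seller)."""
    invoices: Dict[str, float] = defaultdict(float)
    payouts: Dict[str, float] = defaultdict(float)
    for rec in records:
        amount = rec.units * unit_prices[rec.product]
        invoices[rec.buyer] += amount
        payouts[rec.seller] += amount
    return dict(invoices), dict(payouts)


usage_log = [
    UsageRecord("bob", "alice", "weather_array", 2.0),
    UsageRecord("carol", "alice", "weather_array", 4.0),
    UsageRecord("bob", "dave", "risk_udf", 4.0),
]
prices = {"weather_array": 1.5, "risk_udf": 0.5}

invoices, payouts = settle(usage_log, prices)
```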
So far we have described that TileDB Cloud is universal in the following respects:
- 1. Data: TileDB can manage any data type, and hence can support any future data type.
- 2. Storage backends: TileDB abstracts the backend layer and thus can easily add support for new backends (on the cloud, in memory, or other).
- 3. APIs and tools: TileDB is all about extreme interoperability and it is designed to easily add support for any new popular language and tool.
- 4. Deployment: TileDB is cloud and data center agnostic and therefore can be deployed anywhere.
- 5. Hardware: TileDB is being implemented in a way that can benefit from hardware accelerators, and boost performance in clusters with heterogeneous instances.
- 6. Algorithms: TileDB allows the development of any arbitrary distributed algorithm (from SQL to Linear Algebra to genomics pipelines), which can easily be shared and improved through collaboration.
To sum up, TileDB Cloud is flexible and can adapt to change throughout its lifetime in an organization's software stack. User requirements and creativity around data processing continually increase. TileDB Cloud remains valuable and relevant by evolving based on user feedback, rather than becoming obsolete.