Planet-scale Sharing

Access control is important for organizations that wish to protect their data with tiered access policies, which determine which members can access which parts of the data. Moreover, logging is critical so that organizations can audit access to their data at any point if needed. Traditional databases have always been very good at providing access control and logging.

Why are access control and logging challenging problems today?

  • The separation of storage from compute, and the focus on execution engines (such as Dask and Spark) rather than on data management, motivated users to adopt data lakes, i.e., to store their data in various open-source formats (e.g., CSV, Parquet) as simple files on cloud object stores. Cloud object stores provide file-based semantics for access control and logging. This makes it extremely cumbersome to manage access to datasets that consist of multiple files (especially in the presence of updates and time traveling), or to enforce fine-grained access policies (i.e., to constrain access to certain byte ranges in each file).

  • Users wish to share data beyond organizational boundaries, either to collaborate and conduct reproducible science, or to monetize their datasets. Although cloud object stores do offer extreme scalability, sharing data with appropriate access policies at planet-scale requires building extra infrastructure on top of the cloud object store capabilities (which operate with file-based semantics).
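To make the first point concrete, here is a minimal sketch that contrasts per-file grants with a single dataset-level grant. It is a toy model only: `FileAcl`, `ArrayAcl`, the file names and the user `bob` are hypothetical and do not correspond to any cloud provider's or TileDB's API.

```python
from dataclasses import dataclass, field


@dataclass
class FileAcl:
    """Toy file-based ACL (hypothetical): one grant per (object key, user)."""
    grants: set = field(default_factory=set)

    def share(self, keys, user):
        # Every file of the dataset needs its own grant.
        for key in keys:
            self.grants.add((key, user))

    def can_read(self, key, user):
        return (key, user) in self.grants


@dataclass
class ArrayAcl:
    """Toy dataset-level ACL (hypothetical): one grant per (array name, user)."""
    grants: set = field(default_factory=set)

    def share(self, array, user):
        self.grants.add((array, user))

    def can_read(self, array, user):
        return (array, user) in self.grants


# A dataset stored as three files on an object store.
files = ["sales/part-0.parquet", "sales/part-1.parquet", "sales/part-2.parquet"]

file_acl = FileAcl()
file_acl.share(files, "bob")      # three separate grants

array_acl = ArrayAcl()
array_acl.share("sales", "bob")   # a single grant

# An update adds a new file to the dataset.
new_file = "sales/part-3.parquet"
files.append(new_file)

# The file-based policy is now stale until someone adds another grant,
# while the dataset-level policy still covers the whole dataset.
stale = not file_acl.can_read(new_file, "bob")
still_ok = array_acl.can_read("sales", "bob")
```

In this toy model, every write that produces a new file forces a policy update under file-based semantics, whereas the dataset-level grant needs no change.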

TileDB Cloud addresses the above problems with an entire planet-scale infrastructure that allows anyone to share their data within or outside their organization, using array semantics rather than file-based semantics.

Here is how TileDB Cloud achieves planet-scale sharing:

  • The user stores their data in the open-source TileDB format in their own cloud object store buckets, and easily registers the resulting arrays on TileDB Cloud. The user continues to own the data without vendor lock-in, i.e., the data continues to be accessible by TileDB Embedded, outside TileDB Cloud.

  • The user can specify access policies on the arrays they registered on TileDB Cloud. TileDB Cloud is then responsible for securely enforcing these user-defined access policies.

  • Any user can easily slice data from an array they have access to, in a completely serverless manner. This means that the data owner does not need to set up any infrastructure (e.g., spin up a cluster of a given size) to accommodate other users' requests. TileDB Cloud provides the infrastructure to transparently scale array slicing to any number of users at planet-scale. Users can optionally continue to use their own compute infrastructure (e.g., a Spark cluster), in which case each worker can easily, efficiently and scalably slice any array data from TileDB Cloud with access control enforced. Alternatively, users can use TileDB Cloud's serverless compute capabilities.

  • Every action on TileDB Cloud is logged and accessible to users for further auditing.
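The steps above can be illustrated with a small, self-contained sketch of access control and logging expressed with array semantics. This is a toy in-memory model under stated assumptions, not TileDB Cloud's implementation or client API: the class `SharedArray`, its methods `share` and `read_slice`, and the users `alice`, `bob` and `carol` are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SharedArray:
    """Toy 2D array with per-user grants and an audit log (hypothetical)."""
    name: str
    owner: str
    data: list                                   # list of rows
    grants: dict = field(default_factory=dict)   # user -> set of actions
    log: list = field(default_factory=list)      # (timestamp, user, action, detail)

    def _record(self, user, action, detail):
        stamp = datetime.now(timezone.utc).isoformat()
        self.log.append((stamp, user, action, detail))

    def share(self, user, actions):
        # The policy is attached to the array as a whole, not to its files.
        self.grants.setdefault(user, set()).update(actions)
        self._record(self.owner, "share", f"{user}:{sorted(actions)}")

    def read_slice(self, user, rows, cols):
        # Enforce the policy first, and log the attempt either way.
        allowed = user == self.owner or "read" in self.grants.get(user, set())
        self._record(user, "read" if allowed else "denied", f"rows={rows} cols={cols}")
        if not allowed:
            raise PermissionError(f"{user} may not read array {self.name!r}")
        (r0, r1), (c0, c1) = rows, cols
        return [row[c0:c1] for row in self.data[r0:r1]]


# The owner creates an array and shares read access with one user.
arr = SharedArray("temperature", "alice", [[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr.share("bob", {"read"})

# A user with access slices a sub-region of the array.
window = arr.read_slice("bob", rows=(0, 2), cols=(1, 3))

# A user without access is rejected, and the attempt is still logged.
try:
    arr.read_slice("carol", rows=(0, 1), cols=(0, 1))
    denied = False
except PermissionError:
    denied = True

actions = [entry[2] for entry in arr.log]
```

Here the grant, the slicing request and the audit trail all refer to the array and a region of it, rather than to the individual files that happen to store it.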