
PrestoDB

Quickstart

TileDB-Presto is a data source connector for PrestoDB, which allows you to run SQL queries on TileDB arrays. The connector supports column subselection on attributes and predicate pushdown on dimension fields, so projection and range queries read only the data they need.

The TileDB-Presto connector supports most SQL operations from PrestoDB. Arrays can be referenced dynamically and are not required to be "pre-registered" with Presto. No external service (such as Apache Hive) is required.

A Docker image is provided for quick testing of the TileDB-Presto connector. The image starts a single-node Presto cluster and opens the Presto CLI, where SQL can be run. The image includes two example TileDB arrays:

/opt/tiledb_example_arrays/dense_global (dense array)
/opt/tiledb_example_arrays/sparse_global (sparse array)

Simply run:

# Run PrestoDB
docker run -it --rm tiledb/tiledb-presto

# Run PrestoDB adding your S3 access keys as env variables
docker run -e AWS_ACCESS_KEY_ID="<key>" -e AWS_SECRET_ACCESS_KEY="<secret>" -it tiledb/tiledb-presto

# Run PrestoDB by mounting an existing local array
docker run -it --rm -v /local/array/path:/data/local_array tiledb/tiledb-presto

You can run a quick example to see if it works:

show columns from "file:///opt/tiledb_example_arrays/dense_global";

 Column |  Type   | Extra |  Comment  
--------+---------+-------+-----------
 rows   | integer |       | Dimension 
 cols   | integer |       | Dimension 
 a      | integer |       | Attribute

select * from "file:///opt/tiledb_example_arrays/dense_global" 
WHERE rows = 3 AND cols between 1 and 2;

 rows | cols | a 
------+------+---
    3 |    1 | 5 
    3 |    2 | 6

It is possible to specify a file that contains SQL to be run from the docker image:

echo 'select * from "file:///opt/tiledb_example_arrays/dense_global" limit 10;' > example.sql
docker run -it --rm -v ${PWD}/example.sql:/tmp/example.sql tiledb/tiledb-presto /opt/presto/bin/entrypoint.sh --file /tmp/example.sql

You can also run a SQL statement directly:

docker run -it --rm tiledb/tiledb-presto /opt/presto/bin/entrypoint.sh --execute 'select * from "file:///opt/tiledb_example_arrays/dense_global" limit 10;'

Configuration

Plugin Parameters

A single configuration file is needed. The config file should be placed in the catalog folder (e.g., /etc/presto/conf/catalog on EMR) and named tiledb.properties.

Sample file contents:

connector.name=tiledb
# Set read buffer to 10M per attribute
read-buffer-size=10485760
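As a minimal sketch, the sample file above can be written into place from the shell. CATALOG_DIR is a hypothetical variable introduced here for illustration; it defaults to a local directory so the snippet can be tried without touching a real Presto installation.

```shell
# CATALOG_DIR is a hypothetical variable (not defined by TileDB-Presto).
# Point it at your Presto catalog folder, e.g. /etc/presto/conf/catalog on EMR.
CATALOG_DIR="${CATALOG_DIR:-./catalog}"
mkdir -p "$CATALOG_DIR"

# Write the sample configuration shown above.
cat > "$CATALOG_DIR/tiledb.properties" <<'EOF'
connector.name=tiledb
# Set read buffer to 10M per attribute
read-buffer-size=10485760
EOF

# Print the file to confirm what was written.
cat "$CATALOG_DIR/tiledb.properties"
```

The value 10485760 is 10 * 1024 * 1024 bytes.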

The following parameters can be configured in the tiledb.properties and are plugin-wide.

Session Parameters

These can be set as follows:

set session tiledb.<param>=<value>
-- E.g., set session tiledb.splits=10
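One way to try a session parameter is to put the set session statement and a query in the same SQL file, so both run in one CLI invocation. The sketch below only writes such a file; the file name session_example.sql is an arbitrary choice, and it assumes that a session value set earlier in a file applies to the statements that follow it.

```shell
# Write a SQL file that sets a session parameter and then queries
# one of the example arrays shipped in the Docker image.
cat > session_example.sql <<'EOF'
set session tiledb.splits=10;
select * from "file:///opt/tiledb_example_arrays/dense_global" limit 10;
EOF

cat session_example.sql

# With Docker available, the file can be passed to the Quickstart image:
# docker run -it --rm -v ${PWD}/session_example.sql:/tmp/session_example.sql tiledb/tiledb-presto /opt/presto/bin/entrypoint.sh --file /tmp/session_example.sql
```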

Unset session parameters inherit the plugin configuration defaults. The list of session parameters is summarized below.

Table properties

These are set upon table creation as follows:

create table my_table(
  ...
  ) with (uri = '<array-uri>', type='SPARSE', cell_order='ROW_MAJOR', ...);

Column Parameters

These are set upon table creation as follows:

create table my_table(
  dim0 bigint with (dimension=true, lower_bound=0, upper_bound=100, extent=10),
  ...
  ) with (uri = '<array-uri>', type = 'SPARSE');

Installation From Source

Currently, the TileDB-Presto connector is built as a plugin. It must be packaged and installed on the PrestoDB instances. You can download the latest release or build the connector from source using the following command from the top-level directory of the TileDB-Presto repo:

./mvnw package

# Tests can be skipped as follows
./mvnw package -DskipTests

To install the plugin on an existing Presto instance, you need to copy the path/to/TileDB-Presto/target/presto-tiledb-<version> folder to a tiledb directory under the plugin directory on each Presto node. On AWS EMR, this directory is /usr/lib/presto/plugin/tiledb/.
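The copy step can be sketched as a small shell function. install_tiledb_plugin is a hypothetical helper written for this example, not something shipped with TileDB-Presto, and it assumes the contents of the built folder belong directly inside the tiledb plugin directory.

```shell
# install_tiledb_plugin is a hypothetical helper, not part of TileDB-Presto.
# $1: built plugin folder, e.g. path/to/TileDB-Presto/target/presto-tiledb-<version>
# $2: Presto plugin directory, e.g. /usr/lib/presto/plugin on AWS EMR
install_tiledb_plugin() {
  src=$1
  plugin_dir=$2
  # Refuse to continue if the build output is missing.
  [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
  mkdir -p "$plugin_dir/tiledb"
  # Copy the contents of the built folder into <plugin_dir>/tiledb/.
  cp -R "$src"/. "$plugin_dir/tiledb/"
}

# Example call (run on every Presto node):
# install_tiledb_plugin path/to/TileDB-Presto/target/presto-tiledb-<version> /usr/lib/presto/plugin
```

Presto loads plugins at startup, so the server most likely needs a restart before the connector becomes available.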

Usage

Creating a New TileDB Array

The example below demonstrates creation of a TileDB array through Presto. Note that some array schema options are not currently supported from Presto (see Limitations for more details).

create table my_table(
  dim0 bigint with (dimension=true, lower_bound=0, upper_bound=100, extent=10),
  dim1 bigint with (dimension=true, lower_bound=0, upper_bound=100, extent=10),
  attr1 varchar
  ) with (uri = '<array-uri>', type = 'SPARSE');

<array-uri> can be any path, local (e.g., file://) or remote (e.g., s3://).

You can see the array schema as follows:

show create table tiledb.tiledb."<array-uri>";

A TileDB array created through PrestoDB is a regular TileDB array and behaves exactly like any other. Therefore, it is accessible by all TileDB APIs (e.g., Python) and integrations (e.g., Spark).

Querying TileDB Arrays

PrestoDB can dynamically discover existing TileDB arrays, i.e., even if they were created and populated externally from PrestoDB. Therefore, you can just insert data into a TileDB array or query it as follows:

insert into tiledb.tiledb."<array-uri>" (dim0, dim1, attr1)
values (1, 1, 'cell 1'), (1, 2, 'cell 2'), (2, 1, 'cell 3');

-- Read the array
select * from tiledb.tiledb."<array-uri>";

Presto uses the form of catalog.schema.<array-uri> for querying. TileDB does not have a concept of a table schema, so any valid string can be used for the schema name when querying and tiledb is used only for convenience in the examples. <array-uri> is the array URI and can be local (file://) or remote (s3://).
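When queries are assembled in the shell (for instance for the --execute option shown in the Quickstart), the double quotes around the array URI are easy to get wrong. The sketch below uses a hypothetical helper, tiledb_table, to build the catalog.schema."<array-uri>" form. It assumes the catalog is named tiledb, which follows from naming the config file tiledb.properties.

```shell
# tiledb_table is a hypothetical helper, not part of TileDB-Presto.
# $1: array URI (file://... or s3://...)
# $2: schema name; any valid string works, defaults to "tiledb"
tiledb_table() {
  printf 'tiledb.%s."%s"' "${2:-tiledb}" "$1"
}

query="select * from $(tiledb_table file:///opt/tiledb_example_arrays/dense_global) limit 10;"
echo "$query"
# prints: select * from tiledb.tiledb."file:///opt/tiledb_example_arrays/dense_global" limit 10;
```

The resulting string can then be passed as a single argument, e.g. --execute "$query".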

Supported Datatypes

PrestoDB and TileDB have slight differences in their supported datatypes. This document serves as a mapping between the (core) TileDB datatypes and the PrestoDB datatypes for easy reference.

Presto and all connectors are written in Java, and Java does not have unsigned values. As a result, an unsigned 64-bit integer can overflow if it is larger than 2^63 - 1. Unsigned integers that are 8, 16 or 32 bits are treated as larger integers. For instance, an unsigned 32-bit value is read into a Java type of long.

Special cases of char(1) or varchar(1) are stored on disk as fixed-sized attributes of size 1. Any char/varchar greater than 1 is stored as a variable-length attribute in TileDB. TileDB will not enforce the length parameter, but Presto will for inserts.

Decimal types are currently treated as doubles. TileDB does not enforce the precision or scale of the decimal types.

Limitations

The TileDB connector supports most Presto functionality. Below is a list of the features not currently supported.

Encrypted Arrays

The connector does not currently support creating, writing or reading encrypted arrays.

OpenAt Timestamp

The connector does not currently support the TileDB openAt functionality to open an array at a specific timestamp.

Datatypes

The TileDB Presto connector supports the following SQL datatypes:

BOOLEAN
TINYINT
INTEGER
BIGINT
REAL
DOUBLE
DECIMAL (treated as doubles)
STRING*
VARCHAR*
CHAR*
VARBINARY

No other datatypes are supported.

Unsigned Integers

The TileDB Presto connector does not have full support for unsigned values. Presto and all connectors are written in Java, and Java does not have unsigned values. As a result of this Java limitation, an unsigned 64-bit integer can overflow if it is larger than 2^63 - 1. Unsigned integers that are 8, 16 or 32 bits are treated as larger integers. For instance, an unsigned 32-bit value is read into a Java type of long.

Variable-length Char/Varchar fields

For the varchar and char datatypes, the special case of char(1) or varchar(1) is stored on disk as a fixed-sized attribute of size 1. Any char/varchar greater than 1 is stored as a variable-length attribute in TileDB. TileDB will not enforce the length parameter, but Presto will for inserts.

Decimal Type

Decimal types are currently treated as doubles. TileDB does not enforce the precision or scale of the decimal types.

Create Table

Create table is supported, but only for a limited subset of TileDB parameters:

No support for creating encrypted arrays
No support for setting custom filters on attributes, coordinates or offsets

Splits

The current split implementation is naive: it splits domains evenly, based on either the user-defined predicates (WHERE clause) or the non-empty domain. Even splitting will likely produce suboptimal splits for sparse domains. Future work will move splitting into core TileDB, where better heuristics will be used to produce even splits.

For now, if splits are highly uneven, consider increasing the number of splits via the tiledb.splits session parameter or adding WHERE clauses to limit the data set to non-empty regions of the array.
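To see why even splitting can be uneven in terms of work, the sketch below divides an inclusive dimension range into equal-width sub-ranges. It is a hypothetical illustration of the idea only and is not the connector's actual splitting code; the function name even_splits and the sample range are made up for this example.

```shell
# even_splits is a hypothetical illustration, NOT the connector's code.
# $1: lower bound, $2: upper bound (inclusive), $3: number of splits
even_splits() {
  lo=$1
  hi=$2
  n=$3
  width=$(( (hi - lo + 1) / n ))
  i=0
  while [ "$i" -lt "$n" ]; do
    start=$(( lo + i * width ))
    if [ "$i" -eq $(( n - 1 )) ]; then
      end=$hi                      # last split absorbs any remainder
    else
      end=$(( start + width - 1 ))
    fi
    echo "split $i: [$start, $end]"
    i=$(( i + 1 ))
  done
}

even_splits 0 99 4
# split 0: [0, 24]
# split 1: [25, 49]
# split 2: [50, 74]
# split 3: [75, 99]
```

Every sub-range has the same width regardless of where the data lies. If all non-empty cells of a sparse array fell within [0, 24], the first split would carry the whole read while the others would return nothing, which is the situation that more splits or a narrower WHERE clause is meant to mitigate.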

SQL

This document contains all custom SQL options defined by the TileDB Presto connector.

Create Table

The following properties can be configured for creating a TileDB array in Presto.

Table properties

Column Properties

Examples

Below are various examples for querying data with the TileDB Presto connector.

SQL Examples

Selecting Data

Typical select statements work as expected. This includes predicate pushdown for dimension fields.

Select all columns and all data from an array:

SELECT * FROM tiledb.tiledb."file:///opt/tiledb_example_arrays/dense_global"

Select subset of columns:

SELECT rows, cols FROM tiledb.tiledb."file:///opt/tiledb_example_arrays/dense_global"

Select with predicate pushdown:

SELECT * FROM tiledb.tiledb."file:///opt/tiledb_example_arrays/dense_global" WHERE rows between 1 and 2

Showing Query Plans

Get the query plan without running the query:

EXPLAIN SELECT * FROM tiledb.tiledb."file:///opt/tiledb_example_arrays/dense_global" WHERE rows between 1 and 2

Analyze the query by running and profiling it:

EXPLAIN ANALYZE SELECT * FROM tiledb.tiledb."file:///opt/tiledb_example_arrays/dense_global" WHERE rows between 1 and 2

Creating a TileDB Array

It is possible to create a TileDB array from Presto. However, not all array schema options are currently supported from Presto (see Limitations for more details).

Minimum create table:

CREATE TABLE region(
  regionkey bigint WITH (dimension=true),
  name varchar,
  comment varchar
  ) WITH (uri = 's3://bucket/region')

Create table with all options specified:

CREATE TABLE region(
  regionkey bigint WITH (dimension=true, lower_bound=0, upper_bound=3000, extent=50),
  name varchar,
  comment varchar
  ) WITH (uri = 's3://bucket/region', type = 'SPARSE', cell_order = 'COL_MAJOR', tile_order = 'ROW_MAJOR', capacity = 10)

Inserting Data

Data can be inserted into TileDB arrays through Presto. Inserts can be from another table or individual values.

Copy data from one table to another:

INSERT INTO tiledb.tiledb."s3://bucket/region" select * from tpch.tiny.region

Data can be inserted using the VALUES method for single row inserts. This is not recommended because each insert will create a new fragment and cause degraded read performance as the number of fragments increases.

INSERT INTO tiledb.tiledb."s3://bucket/region" VALUES (1, 'Test Region', 'Example')
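If rows have to be inserted as literal values, one way to limit the number of statements is to combine many rows into a single multi-row VALUES list. The sketch below generates such a statement from comma-separated input. rows_to_insert is a hypothetical helper, the three-column layout follows the region table above, and it rests on the assumption that one INSERT statement results in fewer fragments than many single-row statements.

```shell
# rows_to_insert is a hypothetical helper, not part of TileDB-Presto.
# $1: fully qualified table name
# stdin: one row per line as regionkey,name,comment
# Note: fields that themselves contain commas are not handled.
rows_to_insert() {
  awk -F',' -v table="$1" -v q="'" '
    {
      # Escape single quotes inside string values by doubling them.
      gsub(q, q q, $2)
      gsub(q, q q, $3)
      row = "(" $1 ", " q $2 q ", " q $3 q ")"
      rows = (NR == 1) ? row : (rows ", " row)
    }
    END { if (NR > 0) print "INSERT INTO " table " VALUES " rows ";" }
  '
}

printf '1,AFRICA,first\n2,ASIA,second\n' |
  rows_to_insert 'tiledb.tiledb."s3://bucket/region"'
# prints: INSERT INTO tiledb.tiledb."s3://bucket/region" VALUES (1, 'AFRICA', 'first'), (2, 'ASIA', 'second');
```

The generated statement can be saved to a file and run with the --file option, or passed to --execute.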