Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
The example below demonstrates creation of a TileDB array through Presto. Note that some array schema options are not currently supported from Presto (see Limitations for more details).
<array-uri>
can be any path, local (e.g., file://
) or remote (e.g., s3://
).
You can see the array schema as follows:
A TileDB array created through PrestoDB is and behaves exactly like any other TileDB array. Therefore, it is accessible by all TileDB APIs (e.g., Python) and integrations (e.g., Spark).
PrestoDB can dynamically discover existing TileDB arrays, i.e., even if they were created and populated externally from PrestoDB. Therefore, you can just insert data into a TileDB array or query it as follows:
Presto uses the form of catalog.schema.<array-uri>
for querying. TileDB does not have a concept of a table schema, so any valid string can be used for the schema name when querying and tiledb
is used only for convenience in the examples. <array-uri>
is the array URI and can be local (file://
) or remote (s3://
).
A single configuration file is needed. The config file should be placed in the catalog folder (e.g.,/etc/presto/conf/catalog
on EMR) and named tiledb.properties
.
Sample file contents:
The following parameters can be configured in the tiledb.properties
and are plugin-wide.
These can be set as follows:
Unset session parameters inherit the plugin configuration defaults. The list of session parameters is summarized below"
These are set upon table creation as follows:
These are set upon table creation as follows:
This document contains all custom SQL options defined by the TileDB Presto connector.
The following properties can be configured for creating a TileDB array in Presto.
Property | Description | Default Value | Possible Values | Required |
---|---|---|---|---|
Property | Description | Default Value | Possible Values | Required |
---|---|---|---|---|
is a data source connector for , which allows you to run SQL queries on TileDB arrays. The connector supports column subselection on attributes and predicate pushdown on dimension fields, leading to superb performance for projection and range queries.
The TileDB-Presto connector supports most SQL operations from PrestoDB. Arrays can be referenced dynamically and are not required to be "pre-registered" with Presto. No external service (such as ) is required.
A docker image is provided to allow for quick testing of the TileDB-Presto connector. The docker image starts a single-node Presto cluster and opens the CLI Presto interface where SQL can be run. The image includes two example tiledb arrays:
/opt/tiledb_example_arrays/dense_global
(dense array)
/opt/tiledb_example_arrays/sparse_global
(sparse array)
Simply run:
You can run a quick example to see if it works:
It is possible to specify a file that contains SQL to be run from the docker image:
You can also run a SQL statement directly:
Currently, the TileDB-Presto connector is built as a plugin. It must be packaged and installed on the PrestoDB instances. You can download the or build the connector from source using the following command from the top level directory of the TileDB-Presto repo.
To install the plugin on an existing Presto instance, you need to copy the path/to/TileDB-Presto/target/presto-tiledb-<version>
folder to a tiledb
directory under the plugin directory on echo Presto node. On AWS EMR, this directory is /usr/lib/presto/plugin/tiledb/
.
Below are various examples for querying data with the TileDB Presto connector.
Typical select statements work as expected. This include predicate pushdown for dimension fields.
Select all columns and all data from an array:
Select subset of columns:
Select with predicate pushdown:
Get the query plan without running the query:
Analyze the query but running and profiling:
It is possible to create TileDB array from Presto. Not all array schema options are currently supported from Presto though (see Limitations for more details).
Minimum create table:
Create table with all options specified:
Data can be inserted into TileDB arrays through Presto. Inserts can be from another table or individual values.
Copy data from one table to another:
Data can be inserted using the VALUES
method for single row inserts. This is not recommended because each insert will create a new fragment and cause degraded read performance as the number of fragments increases.
PrestoDB and TileDB have slight differences in their supported datatypes. This document serves as a mapping between the (core) TileDB datatypes and the MariaDB datatypes for easy reference.
Presto and all connectors are written in Java, and Java does not have unsigned values. As a result, an unsigned 64-bit integer can overflow if it is larger than 2^63 - 1
. Unsigned integers that are 8, 16 or 32 bits are treated as larger integers. For instance, an unsigned 32-bit value is read into a Java type oflong
.
Special cases ofchar(1)
orvarchar(1)
are stored on disk as fixed-sized attributes of size 1. Any char
/varchar
greater than 1 is stored as a variable-length attribute in TileDB. TileDB will not enforce the length parameter, but Presto will for inserts.
Decimal types are currently treated as doubles. TileDB does not enforce the precision or scale of the decimal types.
Parameter
Default
Datatype
Description
array-uris
""
String
CSV list of arrays to preload metadata on
read-buffer-size
10485760
Integer
Max read buffer size per attribute
write-buffer-size
10485760
Integer
Max write buffer size per attribute
aws-access-key-id
""
String
AWS_ACCESS_KEY_ID
for S3 access
aws-secret-access-key
""
String
AWS_SECRET_ACCESS_KEY
for S3 access
tiledb-config
""
String
TileDB config parameters in key1=value1,key2=value2 form
Name
Default
Datatype
Description
read_buffer_size
Plugin
Integer
Max read buffer size per attribute
write_buffer_size
Plugin
Integer
Max write buffer size per attribute
aws_access_key_id
Plugin
String
AWS_ACCESS_KEY_ID
for S3 access
aws_secret_access_key
Plugin
String
AWS_SECRET_ACCESS_KEY
for S3 access
splits
-1
Integer
Number of splits to use per query, -1 means splits will be equal to number of workers
split_only_predicates
false
Boolean
Split only based on predicates pushed down from where clause
enable_stats
false
Boolean
Enable collecting and dumping connector stats to Presto log
tiledb_config
""
String
TileDB config parameters in key1=value1,key2=value2 form
Name
Description
Default
Possible Values
Required
uri
Array URI
""
*
Yes
type
Array type
SPARSE
SPARSE
, DENSE
No
cell_order
Cell order
ROW_MAJOR
ROW_MAJOR
, COL_MAJOR
No
tile_order
Tile order
ROW_MAJOR
ROW_MAJOR
, COL_MAJOR
No
capacity
Tile capacity
10000L
>0
No
Name
Description
Default
Possible Values
Required
dimension
Column is a dimension
False
True, False
No
lower_bound
Domain lower bound
0L
Any Long Value
No
upper_bound
Domain upper bound
Long.MAX_VALUE
Any Long Value
No
extent
Tile extent
10L
Any Long Value
No
uri
URI for array to be created at
""
*
Yes
type
Array Type
SPARSE
SPARSE, DENSE
No
cell_order
Cell order for array
ROW_MAJOR
ROW_MAJOR, COL_MAJOR, GLOBAL_ORDER
No
tile_order
Tile order for array
ROW_MAJOR
ROW_MAJOR, COL_MAJOR, GLOBAL_ORDER
No
capacity
Capacity of sparse array
10000L
>0
No
dimension
Is column a dimension
False
True, False
No
lower_bound
Domain Lower Bound
0L
Any Long Value
No
upper_bound
Domain Upper Bound
Long.MAX_VALUE
Any Long Value
No
extent
Dimension Extent
10L
Any Long Value
No
TileDB Datatype | PrestoDB SQL Datatype |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The TileDB connector supports most Presto functionality. Below is a list of the features not currently supported.
The connector does not currently support creating/writing/reading encrypted arrays
The connector does not currently support the TileDB openAt
functionality to open an array at a specific timestamp.
TileDB Presto connector supports the following SQL datatypes:
BOOLEAN
TINYINT
INTEGER
BIGINT
REAL
DOUBLE
DECIMAL (treated as doubles)
STRING*
VARCHAR*
CHAR*
VARBINARY
No other datatypes are supported.
The TileDB Presto connector does not have full support for unsigned values. Presto and all connectors are written in Java, and Java does not have unsigned values. As a result of this Java limitation, an unsigned 64-bit integer can overflow if it is larger than 2^63 - 1
. Unsigned integers that are 8, 16 or 32 bits are treated as larger integers. For instance, an unsigned 32-bit value is read into a Java type of long
.
For varchar
, and char
datatypes the special case of char(1)
or varchar(1)
is stored on disk as a fixed-sized attribute of size 1. Any char
/varchar
greater than 1 is stored as a variable-length attribute in TileDB. TileDB will not enforce the length parameter but Presto will for inserts.
Decimal types are currently treated as doubles. TileDB does not enforce the precision or scale of the decimal types.
Create table is supported, however only a limited subset of TileDB parameters is supported.
No support for creating encrypted arrays
No support for setting custom filters on attributes, coordinates or offsets
The current split implementation is naive and splits domains evenly with user defined predicates (WHERE
clause) or from the non-empty domains. This even splitting will likely produce sub optimal splits for sparse domains. Future work will move splitting into core TileDB where better heuristics will be used to produce even splits.
For now, if splits are highly uneven consider increasing the number of splits via the tiledb.splits
session parameter or add where clauses to limit the data set to non-empty regions of the array.