Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
TileDB-Trino is a data source connector for Trino, which allows you to run SQL queries on TileDB arrays. The connector supports column subselection on attributes and predicate pushdown on dimension fields, leading to superb performance for projection and range queries.
The TileDB-Trino connector supports most SQL operations from Trino. Arrays can be referenced dynamically and are not required to be "pre-registered" with Trino. No external service (such as Apache Hive) is required.
The TileDB connector supports most of Trino functionality. Below is a list of the features not currently supported.
The connector does not currently support creating/writing/reading encrypted arrays
The connector does not currently support the TileDB openAt
functionality to open an array at a specific timestamp.
TileDB Trino connector supports the following SQL datatypes:
BOOLEAN
TINYINT
INTEGER
BIGINT
REAL
DOUBLE
DECIMAL (treated as doubles)
STRING*
VARCHAR*
CHAR*
VARBINARY
No other datatypes are supported.
The TileDB Trino connector does not have full support for unsigned values. Trinno and all connectors are written in Java, and Java does not have unsigned values. As a result of this Java limitation, an unsigned 64-bit integer can overflow if it is larger than 2^63 - 1
. Unsigned integers that are 8, 16 or 32 bits are treated as larger integers. For instance, an unsigned 32-bit value is read into a Java type of long
.
For varchar
, and char
datatypes the special case of char(1)
or varchar(1)
is stored on disk as a fixed-sized attribute of size 1. Any char
/varchar
greater than 1 is stored as a variable-length attribute in TileDB. TileDB will not enforce the length parameter but Trino will for inserts.
Decimal types are currently treated as doubles. TileDB does not enforce the precision or scale of the decimal types.
Create table is supported, however only a limited subset of TileDB parameters is supported.
No support for creating encrypted arrays
No support for setting custom filters on attributes, coordinates or offsets
The current split implementation is naive and splits domains evenly with user defined predicates (WHERE
clause) or from the non-empty domains. This even splitting will likely produce sub optimal splits for sparse domains. Future work will move splitting into core TileDB where better heuristics will be used to produce even splits.
For now, if splits are highly uneven consider increasing the number of splits via the tiledb.splits
session parameter or add where clauses to limit the data set to non-empty regions of the array.
Below are various examples for querying data with the TileDB Trino connector.
Typical select statements work as expected. This include predicate pushdown for dimension fields.
Select all columns and all data from an array:
Select subset of columns:
Select with predicate pushdown:
Get the query plan without running the query:
Analyze the query but running and profiling:
It is possible to create TileDB array from Trino. Not all array schema options are currently supported from Trino though (see Limitations for more details).
Minimum create table:
Create table with all options specified:
Data can be inserted into TileDB arrays through Trino. Inserts can be from another table or individual values.
Copy data from one table to another:
Data can be inserted using the VALUES
method for single row inserts. This is not recommended because each insert will create a new fragment and cause degraded read performance as the number of fragments increases.
Trino and TileDB have slight differences in their supported datatypes. This document serves as a mapping between the (core) TileDB datatypes and the MariaDB datatypes for easy reference.
Trino and all connectors are written in Java, and Java does not have unsigned values. As a result, an unsigned 64-bit integer can overflow if it is larger than 2^63 - 1
. Unsigned integers that are 8, 16 or 32 bits are treated as larger integers. For instance, an unsigned 32-bit value is read into a Java type oflong
.
Special cases ofchar(1)
orvarchar(1)
are stored on disk as fixed-sized attributes of size 1. Any char
/varchar
greater than 1 is stored as a variable-length attribute in TileDB. TileDB will not enforce the length parameter, but Trino will for inserts.
Decimal types are currently treated as doubles. TileDB does not enforce the precision or scale of the decimal types.
Currently, the TileDB-Τrino connector is built as a plugin. It must be packaged and installed on the Trino instances. You can download the or build the connector from source using the following command from the top level directory of the TileDB-Trino repo.
First clone Trino
Install Trino
Create a TileDB directory
Build and copy the TileDB-Trino jars to the TileDB directory
Create two nested directories "etc/catalog" which include the tiledb.properties file and move them to:
Launch the Trino Server
Launch the Trino-CLI with the TileDB plugin
This document contains all custom SQL options defined by the TileDB Trino connector.
The following properties can be configured for creating a TileDB array in Trino.
Property | Description | Default Value | Possible Values | Required |
---|
Property | Description | Default Value | Possible Values | Required |
---|
It is possible to create TileDB array from Trino. Not all array schema options are currently supported from Trino though (see for more details). An example is shown below.
Note that <array-uri>
can be any path, local (e.g., file://
) or remote (e.g., s3://
).
You can see the array schema as follows:
A TileDB array created through Trino is and behaves exactly like any other TileDB array. Therefore, it is accessible by all TileDB APIs (e.g., Python) and integrations (e.g., Spark).
Trino can dynamically discover existing TileDB arrays, i.e., even if they were created and populated externally from Trino. Therefore, you can just insert data into a TileDB array or query it as follows:
Trino uses the form of catalog.schema.<array-uri>
for querying. TileDB does not have a concept of a table schema, so any valid string can be used for the schema name when querying and tiledb
is used only for convenience in the examples. <array-uri>
is the array URI and can be local (file://
) or remote (s3://
).
TileDB Datatype | Trino SQL Datatype |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
uri | URI for array to be created at | "" | * | Yes |
type | Array Type | SPARSE | SPARSE, DENSE | No |
cell_order | Cell order for array | ROW_MAJOR | ROW_MAJOR, COL_MAJOR, GLOBAL_ORDER | No |
tile_order | Tile order for array | ROW_MAJOR | ROW_MAJOR, COL_MAJOR, GLOBAL_ORDER | No |
capacity | Capacity of sparse array | 10000L | >0 | No |
dimension | Is column a dimension | False | True, False | No |
lower_bound | Domain Lower Bound | 0L | Any Long Value | No |
upper_bound | Domain Upper Bound | Long.MAX_VALUE | Any Long Value | No |
extent | Dimension Extent | 10L | Any Long Value | No |
A single configuration file is needed. The config file should be placed in the catalog folder (e.g.,/etc/trino/conf/catalog
on EMR) and named tiledb.properties
.
Sample file contents:
The following parameters can be configured in the tiledb.properties
and are plugin-wide.
These can be set as follows:
Unset session parameters inherit the plugin configuration defaults. The list of session parameters is summarized below"
These are set upon table creation as follows:
These are set upon table creation as follows:
Parameter
Default
Datatype
Description
array-uris
""
String
CSV list of arrays to preload metadata on
read-buffer-size
10485760
Integer
Max read buffer size per attribute
write-buffer-size
10485760
Integer
Max write buffer size per attribute
aws-access-key-id
""
String
AWS_ACCESS_KEY_ID
for S3 access
aws-secret-access-key
""
String
AWS_SECRET_ACCESS_KEY
for S3 access
tiledb-config
""
String
TileDB config parameters in key1=value1,key2=value2 form
Name
Default
Datatype
Description
read_buffer_size
Plugin
Integer
Max read buffer size per attribute
write_buffer_size
Plugin
Integer
Max write buffer size per attribute
aws_access_key_id
Plugin
String
AWS_ACCESS_KEY_ID
for S3 access
aws_secret_access_key
Plugin
String
AWS_SECRET_ACCESS_KEY
for S3 access
splits
-1
Integer
Number of splits to use per query, -1 means splits will be equal to number of workers
split_only_predicates
false
Boolean
Split only based on predicates pushed down from where clause
enable_stats
false
Boolean
Enable collecting and dumping connector stats to Trino log
tiledb_config
""
String
TileDB config parameters in key1=value1,key2=value2 form
Name
Description
Default
Possible Values
Required
uri
Array URI
""
*
Yes
type
Array type
SPARSE
SPARSE
, DENSE
No
cell_order
Cell order
ROW_MAJOR
ROW_MAJOR
, COL_MAJOR
No
tile_order
Tile order
ROW_MAJOR
ROW_MAJOR
, COL_MAJOR
No
capacity
Tile capacity
10000L
>0
No
Name
Description
Default
Possible Values
Required
dimension
Column is a dimension
False
True, False
No
lower_bound
Domain lower bound
0L
Any Long Value
No
upper_bound
Domain upper bound
Long.MAX_VALUE
Any Long Value
No
extent
Tile extent
10L
Any Long Value
No