We have created two handy scripts for setting up TileDB-Spark and Apache Arrow on an EMR cluster. Arrow is optional but will increase performance if you use PySpark or SparkR.
EMR requires that the bootstrap scripts be copied to an S3 bucket. You can sync the scripts from our TileDB-Spark repo to S3 as follows:
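For example, using the AWS CLI (the bucket and paths are illustrative, assuming the scripts live under `emr_bootstrap/` in your checkout):

```bash
aws s3 sync /path/to/TileDB-Spark/emr_bootstrap/ s3://my_bucket/path/emr_bootstrap/
```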
Create the EMR cluster as follows:
From the AWS EMR console, click on "Create Cluster".
Click on link "Go to advanced options".
In Step 1, make sure Spark is selected.
In Step 3, click on "Bootstrap Actions", then select a custom action, and click on "Configure and add". For the "Script location", you will need to point to where you have uploaded the bootstrap scripts, e.g., `s3://my_bucket/path/emr_bootstrap/install-tiledb-spark.sh`.
Continue to create the cluster. It typically takes 10-20 minutes for the cluster to be ready.
Follow the same procedure as above, but in Step 3 add one more bootstrap action, providing the location of our CRAN packages script, e.g., `s3://my_bucket/path/emr_bootstrap/install-cran-packages.sh`. Moreover, under "Optional arguments" you must add `--packages arrow` (optionally adding any other CRAN packages of your choice).
TileDB-Spark provides a metric source to collect timing and input metric details. This can be helpful in tracking performance of TileDB and the TileDB-Spark driver.
In Step 1 of the EMR launch cluster console, there is a section "Edit software settings". Paste the following JSON config, which will enable the Spark metrics source from TileDB:
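A sketch of such a configuration, assuming the standard EMR `spark-metrics` classification (the class name is the metrics source shipped with TileDB-Spark; adjust as needed):

```json
[
  {
    "Classification": "spark-metrics",
    "Properties": {
      "*.source.tiledb.class": "org.apache.spark.metrics.TileDBMetricsSource"
    }
  }
]
```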
Spark is a very popular analytics engine for large-scale data processing. It allows users to create distributed arrays and dataframes, use machine learning libraries, perform SQL queries, etc. TileDB-Spark is TileDB's datasource driver for Spark, which allows the user to create distributed Spark dataframes from TileDB arrays and, thus, process TileDB data with familiar tooling at great scale with minimal code changes.
TileDB offers a prebuilt uber jar that contains all dependencies. This can be used on most Spark clusters to enable the TileDB-Spark datasource driver.
The latest jars can be downloaded from GitHub.
Compiling TileDB-Spark from source is simple:
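A sketch of the build, assuming a Gradle checkout (the `build/libs` output path below indicates Gradle):

```bash
git clone https://github.com/TileDB-Inc/TileDB-Spark.git
cd TileDB-Spark
./gradlew assemble   # produces build/libs/tiledb-spark-<version>.jar
```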
This will create a jar file `/path/to/TileDB-Spark/build/libs/tiledb-spark-<version>.jar`.
To launch a Spark shell with TileDB-Spark enabled, simply point Spark to the jar you have obtained:
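For example:

```bash
spark-shell --jars /path/to/TileDB-Spark/build/libs/tiledb-spark-<version>.jar
```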
Reporting metrics is supported via Dropwizard and the default Spark metrics setup. The metrics can be enabled by adding the following lines to your `/etc/spark/conf/metrics.properties` file:
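A sketch of the lines to add, registering the source for all instances (driver and executors); the class name is the one discussed below:

```properties
*.source.tiledb.class=org.apache.spark.metrics.TileDBMetricsSource
```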
When loading an application jar (i.e., via the `--jars` CLI flag when launching a Spark shell), the metrics are available to the master node, and the driver metrics will be reported. However, the executors will fail with a class-not-found error, because on each worker node a jar containing `org.apache.spark.metrics.TileDBMetricsSource` must be provided on the class path. To address this, you must copy our dedicated `path/to/TileDB-Spark/build/libs/tiledb-spark-metrics-<version>.jar` to `$SPARK_HOME/jars/`.
Spark has a large number of configuration parameters that can affect the performance of both the TileDB driver and the user application. On this page we provide some performance tuning tips.
TileDB-Spark uses TileDB-Java and the underlying C++ `libtiledb` for parallel I/O, compression and encryption. As such, for optimized read performance you should limit the Spark executors to one per machine and give that single executor all the resources of the machine.
Set the following Spark configuration parameters:
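A sketch for a hypothetical 16-core, 64GB machine (values are illustrative; adjust to your hardware):

```properties
spark.executor.instances  <number of machines>
spark.executor.cores      16
spark.task.cpus           16
spark.executor.memory     51g
```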
It is important to set the Spark task CPUs to be the same as the number of executor cores. This prevents Spark from putting more than one read partition on each machine. The executor memory is set to 80% of available memory to allow for overhead on the host itself.
If using Yarn, the above configuration parameters are likely not enough. You will need to also configure Yarn similarly.
There are two main TileDB driver options to tweak for optimizing reads: `partition_count` and `read_buffer_size`. The `partition_count` should be set to the number of executors, which is the number of machines in the cluster. The `read_buffer_size` should be set to at least 104857600 (100MB). Larger read buffers are critical to reduce the number of incomplete queries. The maximum size of the read buffers is limited by the available RAM. If you use Yarn, the maximum buffer is also constrained by `spark.yarn.executor.memoryOverhead`; TileDB read/write buffers are stored off-heap.
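For example, on a hypothetical 8-machine cluster (assuming the `io.tiledb.spark` data source name and an active SparkSession `spark`):

```python
df = (spark.read
      .format("io.tiledb.spark")
      .option("uri", "s3://my_bucket/my_array")   # hypothetical array URI
      .option("partition_count", 8)               # one partition per machine
      .option("read_buffer_size", 104857600)      # 100MB per attribute
      .load())
```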
There are applications that rely on using multiple Spark tasks for parallelism, constraining each task to run on a single thread. This is common for most PySpark and SparkR applications. Below we describe how to configure Spark and the TileDB data source for optimized performance in this case.
Set the following Spark configuration parameters:
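A sketch for a hypothetical 16-core machine, yielding one single-threaded task per core:

```properties
spark.executor.cores  16
spark.task.cpus       1
```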
If you use Yarn, the above configuration parameters are likely not enough. You will need to also configure Yarn similarly.
The `read_buffer_size` should be set to the largest value possible given the executor's available memory. TileDB typically has a memory overhead of 3x, so 3 * `read_buffer_size` should be less than Spark's off-heap maximum memory. If you use Yarn, this value is defined in `spark.yarn.executor.memoryOverhead`. The default `read_buffer_size` of 10MB is usually sufficient.
The `partition_count` should be set to the data size being read / `read_buffer_size`. If the data size is not known, then set the partition count to the number of executors. This might lead to over-partitioning, so you might want to try different values until you find an optimal partitioning for your dataset.
Finally, it is important to set several of the TileDB parallelism configuration parameters in the Spark `option()` dataframe commands upon reading:
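A sketch using the `tiledb.*` option pass-through (the array URI is hypothetical; `tiledb.vfs.num_threads` is the example parameter from the driver options list):

```python
df = (spark.read
      .format("io.tiledb.spark")
      .option("uri", "s3://my_bucket/my_array")
      .option("tiledb.vfs.num_threads", 1)   # single-threaded VFS I/O per task
      .load())
```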
The TileDB-Spark data source allows you to specify a partition count when reading a TileDB array into a distributed Spark dataframe, via the `partition_count` option. An example is shown below. This creates evenly sized partitions across all array dimensions, based on the volume of the subarray, in order to balance the computational load across the workers.
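A minimal sketch (the array URI is hypothetical):

```python
df = (spark.read
      .format("io.tiledb.spark")
      .option("uri", "s3://my_bucket/my_array")
      .option("partition_count", 4)
      .load())
```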
`dask.delayed` is a powerful feature of Dask that allows you to create arbitrary task graphs and submit them to Dask's scheduler for execution. You can be truly creative with this functionality and implement sophisticated out-of-core computations (i.e., on larger-than-RAM datasets) and handle highly distributed workloads.
There is no special integration needed with TileDB, as `dask.delayed` is quite generic and can work with any user-defined task. We just point out here that you can use TileDB array slicing in a delayed task, which allows you to process truly large TileDB arrays on your laptop or on a large cluster.
We include a very simple example below, stressing though that one can implement much more complex algorithms on arbitrarily large TileDB arrays.
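A minimal sketch, assuming a 2D dense array with an `int32` attribute named `attr` (the URI, names and shapes are illustrative):

```python
import dask
import numpy as np
import tiledb

@dask.delayed
def partial_sum(uri, slc):
    # Each task opens the array, slices a region, and reduces it locally
    with tiledb.open(uri) as A:
        return np.sum(A[slc]["attr"])

uri = "my_dense_array"  # hypothetical array URI
tasks = [partial_sum(uri, np.s_[i : i + 250, :]) for i in range(0, 1000, 250)]
total = dask.delayed(sum)(tasks).compute()  # executes the task graph in parallel
print(total)
```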
TileDB integrates very well with `dask.array`. We demonstrate with the example below, where attribute `attr` stores an `int32` value per cell:
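A sketch using `dask.array.from_tiledb` (the array URI is hypothetical):

```python
import dask.array as da

a = da.from_tiledb(
    "my_dense_array",                          # hypothetical array URI
    attribute="attr",                          # the int32 attribute to read
    storage_options={"vfs.num_threads": "4"},  # any TileDB config parameter
)
print(a.sum().compute())
```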
You can add any TileDB configuration parameter in `storage_options`. Moreover, `storage_options` accepts an additional `key` option, where you can pass an encryption key if your array is encrypted (see Encryption).
You can also set array chunking similar to Dask's chunking. For example, you can do the following:
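For example (the chunk shape is illustrative):

```python
a = da.from_tiledb("my_dense_array", attribute="attr", chunks=(1000, 1000))
```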
You can also write a Dask array into TileDB as follows:
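A minimal sketch (the target URI is hypothetical):

```python
import dask.array as da

d = da.random.random((1000, 1000), chunks=(100, 100))
d.to_tiledb("my_output_array")  # creates the array, inferring the schema
```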
Note that the TileDB array does not need to exist. The above function call will create it if it does not by inferring the schema from the Dask array. To write to an existing array, you should open the array for writing as follows, which will create new fragment(s):
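A sketch of writing to an existing array opened in write mode:

```python
import tiledb

with tiledb.open("my_output_array", mode="w") as A:
    d.to_tiledb(A)  # writes new fragment(s) into the existing array
```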
Using an existing `Array` object allows extra customization of the array schema beyond what is possible with the automatic array creation shown earlier. For example, to create an array with a compression filter applied to the attribute, create the schema and array first, then write to the open `Array`:
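A sketch, assuming a 2D dense double array with a gzip-compressed anonymous attribute (all names and sizes are illustrative):

```python
import dask.array as da
import numpy as np
import tiledb

d = da.random.random((100, 100), chunks=(10, 10))

dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(0, 99), tile=10, dtype=np.int64),
    tiledb.Dim(name="cols", domain=(0, 99), tile=10, dtype=np.int64),
)
att = tiledb.Attr(dtype=np.float64,
                  filters=tiledb.FilterList([tiledb.GzipFilter(level=5)]))
schema = tiledb.ArraySchema(domain=dom, attrs=[att])
tiledb.Array.create("my_compressed_array", schema)  # hypothetical URI

with tiledb.open("my_compressed_array", mode="w") as A:
    d.to_tiledb(A)
```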
The example below demonstrates creation of a TileDB array through Presto. Note that some array schema options are not currently supported from Presto (see Limitations for more details).
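A sketch using the table and column properties documented later (column names are hypothetical):

```sql
CREATE TABLE tiledb.tiledb."<array-uri>" (
    x BIGINT WITH (dimension = true, lower_bound = 0, upper_bound = 100, extent = 10),
    a1 INTEGER
) WITH (uri = '<array-uri>', type = 'SPARSE');
```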
`<array-uri>` can be any path, local (e.g., `file://`) or remote (e.g., `s3://`).
You can see the array schema as follows:
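For example:

```sql
SHOW CREATE TABLE tiledb.tiledb."<array-uri>";
```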
A TileDB array created through PrestoDB is and behaves exactly like any other TileDB array. Therefore, it is accessible by all TileDB APIs (e.g., Python) and integrations (e.g., Spark).
PrestoDB can dynamically discover existing TileDB arrays, i.e., even if they were created and populated externally from PrestoDB. Therefore, you can just insert data into a TileDB array or query it as follows:
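For example (column names are hypothetical):

```sql
INSERT INTO tiledb.tiledb."<array-uri>" (x, a1) VALUES (1, 100);
SELECT * FROM tiledb.tiledb."<array-uri>";
```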
Presto uses the form `catalog.schema.<array-uri>` for querying. TileDB does not have a concept of a table schema, so any valid string can be used for the schema name when querying; `tiledb` is used only for convenience in the examples. `<array-uri>` is the array URI and can be local (`file://`) or remote (`s3://`).
Currently, the TileDB-Presto connector is built as a plugin. It must be packaged and installed on the PrestoDB instances. You can download the latest release or build the connector from source using the following command from the top-level directory of the TileDB-Presto repo.
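The repo is a Maven project (as Presto plugins typically are), so a build along these lines should produce the plugin folder; the exact flags may differ:

```bash
mvn clean package -DskipTests
```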
To install the plugin on an existing Presto instance, you need to copy the `path/to/TileDB-Presto/target/presto-tiledb-<version>` folder to a `tiledb` directory under the plugin directory on each Presto node. On AWS EMR, this directory is `/usr/lib/presto/plugin/tiledb/`.
TileDB-Presto is a data source connector for PrestoDB, which allows you to run SQL queries on TileDB arrays. The connector supports column subselection on attributes and predicate pushdown on dimension fields, leading to superb performance for projection and range queries.
The TileDB-Presto connector supports most SQL operations from PrestoDB. Arrays can be referenced dynamically and are not required to be "pre-registered" with Presto. No external service (such as Apache Hive) is required.
A docker image is provided to allow for quick testing of the TileDB-Presto connector. The docker image starts a single-node Presto cluster and opens the Presto CLI, where SQL can be run. The image includes two example TileDB arrays:

- `/opt/tiledb_example_arrays/dense_global` (dense array)
- `/opt/tiledb_example_arrays/sparse_global` (sparse array)
Simply run:
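Assuming the image is published as `tiledb/tiledb-presto` (the image name is an assumption):

```bash
docker run -it --rm tiledb/tiledb-presto
```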
You can run a quick example to see if it works:
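For example, against one of the bundled arrays:

```sql
SELECT * FROM tiledb.tiledb."file:///opt/tiledb_example_arrays/dense_global" LIMIT 5;
```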
It is possible to specify a file that contains SQL to be run from the docker image:
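A sketch, assuming the image's entrypoint forwards arguments to the Presto CLI (whose standard flag for a SQL file is `-f`):

```bash
docker run -it --rm -v /path/to/query.sql:/query.sql tiledb/tiledb-presto -f /query.sql
```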
You can also run a SQL statement directly:
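Similarly, with the CLI's standard `--execute` flag:

```bash
docker run -it --rm tiledb/tiledb-presto --execute 'SELECT 1'
```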
You can create a new TileDB array from an existing Spark dataframe as follows. See Driver Options for a summary on the options you can use.
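A sketch in PySpark, assuming an existing dataframe `df` (the URI and dimension mapping are illustrative):

```python
(df.write
   .format("io.tiledb.spark")
   .option("uri", "s3://my_bucket/new_array")   # hypothetical target URI
   .option("schema.dim.0.name", "id")           # use dataframe column "id" as dimension 0
   .save())
```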
You can write a Spark dataframe to an existing TileDB array by simply adding an "append" mode.
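For example:

```python
(df.write
   .format("io.tiledb.spark")
   .option("uri", "s3://my_bucket/existing_array")  # hypothetical existing array
   .mode("append")
   .save())
```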
You can read a TileDB array into a Spark dataframe as follows. See Driver Options for a summary on the options you can use.
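For example:

```python
df = (spark.read
      .format("io.tiledb.spark")
      .option("uri", "s3://my_bucket/my_array")  # hypothetical array URI
      .load())
```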
You can run SQL queries with Spark on TileDB arrays as follows:
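For example, by registering the dataframe as a temporary view:

```python
df.createOrReplaceTempView("my_array")
spark.sql("SELECT COUNT(*) FROM my_array").show()
```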
Dask is a great library for parallel computing in Python. It can work on your laptop with multiple threads and processes, or on a large cluster. We will take advantage of two very appealing Dask features:

- Dynamic task scheduling. We can create arbitrarily complex task graphs using `dask.delayed` and let Dask execute them in parallel on our cluster.
- Parallel arrays and dataframes. `dask.array` and `dask.dataframe` work similarly to NumPy arrays and Pandas dataframes, respectively, but are extended to work on datasets larger than main memory and to perform computations in a distributed manner across multiple processes and machines.
TileDB currently integrates only with Dask arrays, but we are working on adding support for Dask dataframes. See our roadmap for updates.
Our examples focus only on a single machine, but will work on an arbitrary Dask cluster. Describing how to deploy a Dask cluster, though, is beyond the scope of these docs.
You can install TileDB and Dask as follows:
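For example, with pip:

```bash
pip install tiledb "dask[complete]"
```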
Below are various examples for querying data with the TileDB Presto connector.
Typical select statements work as expected. This includes predicate pushdown for dimension fields.
Select all columns and all data from an array:
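For example:

```sql
SELECT * FROM tiledb.tiledb."<array-uri>";
```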
Select subset of columns:
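For example (column names are hypothetical):

```sql
SELECT x, a1 FROM tiledb.tiledb."<array-uri>";
```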
Select with predicate pushdown:
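For example, a range predicate on a dimension (names are hypothetical):

```sql
SELECT * FROM tiledb.tiledb."<array-uri>" WHERE x BETWEEN 1 AND 10;
```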
Get the query plan without running the query:
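Using standard `EXPLAIN`:

```sql
EXPLAIN SELECT * FROM tiledb.tiledb."<array-uri>";
```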
Analyze the query by running and profiling it:
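Using standard `EXPLAIN ANALYZE`:

```sql
EXPLAIN ANALYZE SELECT * FROM tiledb.tiledb."<array-uri>";
```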
It is possible to create a TileDB array from Presto. Not all array schema options are currently supported from Presto, though (see Limitations for more details).
Minimum create table:
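A minimal sketch (column names are hypothetical; `uri` is the only required table property):

```sql
CREATE TABLE tiledb.tiledb."<array-uri>" (
    x BIGINT WITH (dimension = true),
    a1 INTEGER
) WITH (uri = '<array-uri>');
```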
Create table with all options specified:
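A sketch with all table and column properties from the options tables below (values are illustrative):

```sql
CREATE TABLE tiledb.tiledb."<array-uri>" (
    x BIGINT WITH (dimension = true, lower_bound = 0, upper_bound = 100, extent = 10),
    a1 INTEGER
) WITH (
    uri = '<array-uri>',
    type = 'SPARSE',
    cell_order = 'ROW_MAJOR',
    tile_order = 'ROW_MAJOR',
    capacity = 10000
);
```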
Data can be inserted into TileDB arrays through Presto. Inserts can be from another table or individual values.
Copy data from one table to another:
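For example (the URIs are placeholders):

```sql
INSERT INTO tiledb.tiledb."<destination-array-uri>"
SELECT * FROM tiledb.tiledb."<source-array-uri>";
```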
Data can be inserted using the `VALUES` method for single-row inserts. This is not recommended, because each insert will create a new fragment and cause degraded read performance as the number of fragments increases.
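For example (column names are hypothetical):

```sql
INSERT INTO tiledb.tiledb."<array-uri>" (x, a1) VALUES (1, 100);
```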
TileDB-Trino is a data source connector for Trino, which allows you to run SQL queries on TileDB arrays. The connector supports column subselection on attributes and predicate pushdown on dimension fields, leading to superb performance for projection and range queries.
The TileDB-Trino connector supports most SQL operations from Trino. Arrays can be referenced dynamically and are not required to be "pre-registered" with Trino. No external service (such as Apache Hive) is required.
Below are various examples for querying data with the TileDB Trino connector.
Typical select statements work as expected. This includes predicate pushdown for dimension fields.
Select all columns and all data from an array:
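For example:

```sql
SELECT * FROM tiledb.tiledb."<array-uri>";
```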
Select subset of columns:
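For example (column names are hypothetical):

```sql
SELECT x, a1 FROM tiledb.tiledb."<array-uri>";
```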
Select with predicate pushdown:
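For example, a range predicate on a dimension (names are hypothetical):

```sql
SELECT * FROM tiledb.tiledb."<array-uri>" WHERE x BETWEEN 1 AND 10;
```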
Get the query plan without running the query:
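Using standard `EXPLAIN`:

```sql
EXPLAIN SELECT * FROM tiledb.tiledb."<array-uri>";
```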
Analyze the query by running and profiling it:
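Using standard `EXPLAIN ANALYZE`:

```sql
EXPLAIN ANALYZE SELECT * FROM tiledb.tiledb."<array-uri>";
```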
It is possible to create a TileDB array from Trino. Not all array schema options are currently supported from Trino, though (see Limitations for more details).
Minimum create table:
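A minimal sketch (column names are hypothetical; `uri` is the only required table property):

```sql
CREATE TABLE tiledb.tiledb."<array-uri>" (
    x BIGINT WITH (dimension = true),
    a1 INTEGER
) WITH (uri = '<array-uri>');
```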
Create table with all options specified:
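A sketch with all table and column properties from the options tables below (values are illustrative):

```sql
CREATE TABLE tiledb.tiledb."<array-uri>" (
    x BIGINT WITH (dimension = true, lower_bound = 0, upper_bound = 100, extent = 10),
    a1 INTEGER
) WITH (
    uri = '<array-uri>',
    type = 'SPARSE',
    cell_order = 'ROW_MAJOR',
    tile_order = 'ROW_MAJOR',
    capacity = 10000
);
```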
Data can be inserted into TileDB arrays through Trino. Inserts can be from another table or individual values.
Copy data from one table to another:
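For example (the URIs are placeholders):

```sql
INSERT INTO tiledb.tiledb."<destination-array-uri>"
SELECT * FROM tiledb.tiledb."<source-array-uri>";
```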
Data can be inserted using the `VALUES` method for single-row inserts. This is not recommended, because each insert will create a new fragment and cause degraded read performance as the number of fragments increases.
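For example (column names are hypothetical):

```sql
INSERT INTO tiledb.tiledb."<array-uri>" (x, a1) VALUES (1, 100);
```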
The TileDB connector supports most Presto functionality. Below is a list of the features not currently supported.
The connector does not currently support creating/writing/reading encrypted arrays
The connector does not currently support the TileDB `openAt` functionality to open an array at a specific timestamp.
TileDB Presto connector supports the following SQL datatypes:
BOOLEAN
TINYINT
INTEGER
BIGINT
REAL
DOUBLE
DECIMAL (treated as doubles)
STRING*
VARCHAR*
CHAR*
VARBINARY
No other datatypes are supported.
The TileDB Presto connector does not have full support for unsigned values. Presto and all connectors are written in Java, and Java does not have unsigned values. As a result of this Java limitation, an unsigned 64-bit integer can overflow if it is larger than 2^63 - 1. Unsigned integers that are 8, 16 or 32 bits are treated as larger integer types; for instance, an unsigned 32-bit value is read into a Java `long`.
For the `varchar` and `char` datatypes, the special case of `char(1)` or `varchar(1)` is stored on disk as a fixed-sized attribute of size 1. Any `char`/`varchar` greater than 1 is stored as a variable-length attribute in TileDB. TileDB will not enforce the length parameter, but Presto will for inserts.
Decimal types are currently treated as doubles. TileDB does not enforce the precision or scale of the decimal types.
Create table is supported; however, only a limited subset of TileDB parameters is supported.
No support for creating encrypted arrays
No support for setting custom filters on attributes, coordinates or offsets
The current split implementation is naive and splits domains evenly based on user-defined predicates (the `WHERE` clause) or on the non-empty domain. This even splitting will likely produce suboptimal splits for sparse domains. Future work will move splitting into core TileDB, where better heuristics will be used to produce even splits.

For now, if splits are highly uneven, consider increasing the number of splits via the `tiledb.splits` session parameter, or add `WHERE` clauses to limit the data set to non-empty regions of the array.
Currently, the TileDB-Trino connector is built as a plugin. It must be packaged and installed on the Trino instances. You can download the latest release or build the connector from source using the following command from the top-level directory of the TileDB-Trino repo.
1. Clone Trino.
2. Install Trino.
3. Create a TileDB directory.
4. Build and copy the TileDB-Trino jars into the TileDB directory.
5. Create two nested directories, `etc/catalog`, place the `tiledb.properties` file inside, and move them into the Trino installation directory.
6. Launch the Trino server.
7. Launch the Trino CLI with the TileDB plugin.

A sketch of these steps is shown below.
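A rough sketch of the steps above; the repository layout, version numbers, and install paths are illustrative, and the launcher/CLI invocations assume a standard Trino server layout:

```bash
git clone https://github.com/trinodb/trino.git                            # 1. clone Trino
cd trino && ./mvnw clean install -DskipTests                              # 2. install Trino
mkdir -p plugin/tiledb                                                    # 3. create a TileDB directory
cp /path/to/TileDB-Trino/target/trino-tiledb-<version>/* plugin/tiledb/  # 4. copy the TileDB-Trino jars
mkdir -p etc/catalog && cp /path/to/tiledb.properties etc/catalog/       # 5. add the catalog configuration
bin/launcher run                                                          # 6. launch the Trino server
trino --catalog tiledb                                                    # 7. connect with the Trino CLI
```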
A single configuration file is needed. The config file should be placed in the catalog folder (e.g., `/etc/presto/conf/catalog` on EMR) and named `tiledb.properties`.
Sample file contents:
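A minimal sketch, assuming the connector is registered under the name `tiledb` (the buffer sizes shown are the defaults from the table below):

```properties
connector.name=tiledb
read-buffer-size=10485760
write-buffer-size=10485760
```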
The following parameters can be configured in the `tiledb.properties` file and are plugin-wide.
These can be set as follows:
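For example, with the standard `SET SESSION` syntax:

```sql
SET SESSION tiledb.read_buffer_size = 20971520;
```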
Unset session parameters inherit the plugin configuration defaults. The list of session parameters is summarized below:
These are set upon table creation as follows:
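For example, table properties go in the `WITH` clause (names are hypothetical):

```sql
CREATE TABLE tiledb.tiledb."<array-uri>" (x BIGINT WITH (dimension = true), a1 INTEGER)
WITH (uri = '<array-uri>', type = 'SPARSE', capacity = 10000);
```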
These are set upon table creation as follows:
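For example, column properties go in a `WITH` clause on the column definition (names are hypothetical):

```sql
CREATE TABLE tiledb.tiledb."<array-uri>" (
    x BIGINT WITH (dimension = true, lower_bound = 0, upper_bound = 100, extent = 10),
    a1 INTEGER
) WITH (uri = '<array-uri>');
```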
Spark and TileDB have slight variations in their supported datatypes. The table below shows a mapping between the (core) TileDB and Spark datatypes for easy reference.
This document contains all custom SQL options defined by the TileDB Presto connector.
The following properties can be configured for creating a TileDB array in Presto.
| Property | Description | Default Value | Possible Values | Required |
|---|---|---|---|---|
| `uri` | URI for array to be created at | "" | * | Yes |
| `type` | Array Type | SPARSE | SPARSE, DENSE | No |
| `cell_order` | Cell order for array | ROW_MAJOR | ROW_MAJOR, COL_MAJOR, GLOBAL_ORDER | No |
| `tile_order` | Tile order for array | ROW_MAJOR | ROW_MAJOR, COL_MAJOR, GLOBAL_ORDER | No |
| `capacity` | Capacity of sparse array | 10000L | >0 | No |

The following properties can be configured per column:

| Property | Description | Default Value | Possible Values | Required |
|---|---|---|---|---|
| `dimension` | Is column a dimension | False | True, False | No |
| `lower_bound` | Domain Lower Bound | 0L | Any Long Value | No |
| `upper_bound` | Domain Upper Bound | Long.MAX_VALUE | Any Long Value | No |
| `extent` | Dimension Extent | 10L | Any Long Value | No |
PrestoDB and TileDB have slight differences in their supported datatypes. This document serves as a mapping between the (core) TileDB datatypes and the PrestoDB datatypes for easy reference.

Presto and all connectors are written in Java, and Java does not have unsigned values. As a result, an unsigned 64-bit integer can overflow if it is larger than 2^63 - 1. Unsigned integers that are 8, 16 or 32 bits are treated as larger integer types; for instance, an unsigned 32-bit value is read into a Java `long`.
Special cases of `char(1)` or `varchar(1)` are stored on disk as fixed-sized attributes of size 1. Any `char`/`varchar` greater than 1 is stored as a variable-length attribute in TileDB. TileDB will not enforce the length parameter, but Presto will for inserts.
Decimal types are currently treated as doubles. TileDB does not enforce the precision or scale of the decimal types.
A single configuration file is needed. The config file should be placed in the catalog folder (e.g., `/etc/trino/conf/catalog` on EMR) and named `tiledb.properties`.
Sample file contents:
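A minimal sketch, assuming the connector is registered under the name `tiledb` (the buffer sizes shown are the defaults from the table below):

```properties
connector.name=tiledb
read-buffer-size=10485760
write-buffer-size=10485760
```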
The following parameters can be configured in the `tiledb.properties` file and are plugin-wide.
These can be set as follows:
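For example, with the standard `SET SESSION` syntax:

```sql
SET SESSION tiledb.read_buffer_size = 20971520;
```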
Unset session parameters inherit the plugin configuration defaults. The list of session parameters is summarized below:
These are set upon table creation as follows:
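For example, table properties go in the `WITH` clause (names are hypothetical):

```sql
CREATE TABLE tiledb.tiledb."<array-uri>" (x BIGINT WITH (dimension = true), a1 INTEGER)
WITH (uri = '<array-uri>', type = 'SPARSE', capacity = 10000);
```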
These are set upon table creation as follows:
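For example, column properties go in a `WITH` clause on the column definition (names are hypothetical):

```sql
CREATE TABLE tiledb.tiledb."<array-uri>" (
    x BIGINT WITH (dimension = true, lower_bound = 0, upper_bound = 100, extent = 10),
    a1 INTEGER
) WITH (uri = '<array-uri>');
```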
Trino and TileDB have slight differences in their supported datatypes. This document serves as a mapping between the (core) TileDB datatypes and the Trino datatypes for easy reference.

Trino and all connectors are written in Java, and Java does not have unsigned values. As a result, an unsigned 64-bit integer can overflow if it is larger than 2^63 - 1. Unsigned integers that are 8, 16 or 32 bits are treated as larger integer types; for instance, an unsigned 32-bit value is read into a Java `long`.
Special cases of `char(1)` or `varchar(1)` are stored on disk as fixed-sized attributes of size 1. Any `char`/`varchar` greater than 1 is stored as a variable-length attribute in TileDB. TileDB will not enforce the length parameter, but Trino will for inserts.
Decimal types are currently treated as doubles. TileDB does not enforce the precision or scale of the decimal types.
This document contains all custom SQL options defined by the TileDB Trino connector.
The following properties can be configured for creating a TileDB array in Trino.
| Property | Description | Default Value | Possible Values | Required |
|---|---|---|---|---|
| `uri` | URI for array to be created at | "" | * | Yes |
| `type` | Array Type | SPARSE | SPARSE, DENSE | No |
| `cell_order` | Cell order for array | ROW_MAJOR | ROW_MAJOR, COL_MAJOR, GLOBAL_ORDER | No |
| `tile_order` | Tile order for array | ROW_MAJOR | ROW_MAJOR, COL_MAJOR, GLOBAL_ORDER | No |
| `capacity` | Capacity of sparse array | 10000L | >0 | No |

The following properties can be configured per column:

| Property | Description | Default Value | Possible Values | Required |
|---|---|---|---|---|
| `dimension` | Is column a dimension | False | True, False | No |
| `lower_bound` | Domain Lower Bound | 0L | Any Long Value | No |
| `upper_bound` | Domain Upper Bound | Long.MAX_VALUE | Any Long Value | No |
| `extent` | Dimension Extent | 10L | Any Long Value | No |
It is possible to create a TileDB array from Trino. Not all array schema options are currently supported from Trino, though (see Limitations for more details). An example is shown below.
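A sketch (column names are hypothetical):

```sql
CREATE TABLE tiledb.tiledb."<array-uri>" (
    x BIGINT WITH (dimension = true, lower_bound = 0, upper_bound = 100, extent = 10),
    a1 INTEGER
) WITH (uri = '<array-uri>', type = 'SPARSE');
```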
Note that `<array-uri>` can be any path, local (e.g., `file://`) or remote (e.g., `s3://`).
You can see the array schema as follows:
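For example:

```sql
SHOW CREATE TABLE tiledb.tiledb."<array-uri>";
```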
A TileDB array created through Trino is and behaves exactly like any other TileDB array. Therefore, it is accessible by all TileDB APIs (e.g., Python) and integrations (e.g., Spark).
Trino can dynamically discover existing TileDB arrays, i.e., even if they were created and populated externally from Trino. Therefore, you can just insert data into a TileDB array or query it as follows:
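For example (column names are hypothetical):

```sql
INSERT INTO tiledb.tiledb."<array-uri>" (x, a1) VALUES (1, 100);
SELECT * FROM tiledb.tiledb."<array-uri>";
```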
Trino uses the form `catalog.schema.<array-uri>` for querying. TileDB does not have a concept of a table schema, so any valid string can be used for the schema name when querying; `tiledb` is used only for convenience in the examples. `<array-uri>` is the array URI and can be local (`file://`) or remote (`s3://`).
The TileDB connector supports most Trino functionality. Below is a list of the features not currently supported.
The connector does not currently support creating/writing/reading encrypted arrays
The connector does not currently support the TileDB `openAt` functionality to open an array at a specific timestamp.
TileDB Trino connector supports the following SQL datatypes:
BOOLEAN
TINYINT
INTEGER
BIGINT
REAL
DOUBLE
DECIMAL (treated as doubles)
STRING*
VARCHAR*
CHAR*
VARBINARY
No other datatypes are supported.
The TileDB Trino connector does not have full support for unsigned values. Trino and all connectors are written in Java, and Java does not have unsigned values. As a result of this Java limitation, an unsigned 64-bit integer can overflow if it is larger than 2^63 - 1. Unsigned integers that are 8, 16 or 32 bits are treated as larger integer types; for instance, an unsigned 32-bit value is read into a Java `long`.
For the `varchar` and `char` datatypes, the special case of `char(1)` or `varchar(1)` is stored on disk as a fixed-sized attribute of size 1. Any `char`/`varchar` greater than 1 is stored as a variable-length attribute in TileDB. TileDB will not enforce the length parameter, but Trino will for inserts.
Decimal types are currently treated as doubles. TileDB does not enforce the precision or scale of the decimal types.
Create table is supported; however, only a limited subset of TileDB parameters is supported.
No support for creating encrypted arrays
No support for setting custom filters on attributes, coordinates or offsets
The current split implementation is naive and splits domains evenly based on user-defined predicates (the `WHERE` clause) or on the non-empty domain. This even splitting will likely produce suboptimal splits for sparse domains. Future work will move splitting into core TileDB, where better heuristics will be used to produce even splits.

For now, if splits are highly uneven, consider increasing the number of splits via the `tiledb.splits` session parameter, or add `WHERE` clauses to limit the data set to non-empty regions of the array.
| Option | Required | Description |
|---|---|---|
| `uri` | Yes | URI of a TileDB sparse or dense array (required) |
| `tiledb.*` | No | Set a TileDB configuration parameter, e.g., `option("tiledb.vfs.num_threads", 4)` |
| Option | Required | Description |
|---|---|---|
| `write_buffer_size` | No | Set the TileDB write buffer size in bytes per attribute/coordinates. Defaults to 10MB |
| `schema.dim.<D>.name` | Yes | Specify which of the Spark dataframe columns will be dimension `D`. |
| `schema.dim.<D>.min` | No | Specify the lower bound for the TileDB domain on dimension `D`. |
| `schema.dim.<D>.max` | No | Specify the upper bound for the TileDB domain on dimension `D`. |
| `schema.dim.<D>.extent` | No | Specify the tile extent on dimension `D`. |
| `schema.attr.<A>.filter_list` | No | Specify a filter list for attribute `A`. The filter list is a list of tuples of the form (name, option), e.g., "(byteshuffle, -1), (gzip, 9)". |
| `schema.capacity` | No | Specify the tile capacity for sparse fragments. |
| `schema.tile_order` | No | Specify the tile order. |
| `schema.cell_order` | No | Specify the cell order. |
| `schema.coords_filter_list` | No | Specify the coordinates filter list, in the same (name, option) tuple form. |
| `schema.offsets_filter_list` | No | Specify the offsets filter list, in the same (name, option) tuple form. |
| Option | Required | Description |
|---|---|---|
| `order` | No | Result layout order: "row-major"/"TILEDB_ROW_MAJOR", "col-major"/"TILEDB_COL_MAJOR", "global-order"/"TILEDB_GLOBAL_ORDER", or "unordered"/"TILEDB_UNORDERED". Default: "unordered". |
| `read_buffer_size` | No | Set the TileDB read buffer size in bytes per attribute/coordinates. Defaults to 10MB |
| `allow_read_buffer_realloc` | No | If the read buffer size is too small, allow reallocation. Default: True |
| `partition_count` | No | Number of partitions. |
| Parameter | Default | Datatype | Description |
|---|---|---|---|
| `array-uris` | "" | String | CSV list of arrays to preload metadata on |
| `read-buffer-size` | 10485760 | Integer | Max read buffer size per attribute |
| `write-buffer-size` | 10485760 | Integer | Max write buffer size per attribute |
| `aws-access-key-id` | "" | String | AWS_ACCESS_KEY_ID for S3 access |
| `aws-secret-access-key` | "" | String | AWS_SECRET_ACCESS_KEY for S3 access |
| `tiledb-config` | "" | String | TileDB config parameters in key1=value1,key2=value2 form |
| Name | Default | Datatype | Description |
|---|---|---|---|
| `read_buffer_size` | Plugin | Integer | Max read buffer size per attribute |
| `write_buffer_size` | Plugin | Integer | Max write buffer size per attribute |
| `aws_access_key_id` | Plugin | String | AWS_ACCESS_KEY_ID for S3 access |
| `aws_secret_access_key` | Plugin | String | AWS_SECRET_ACCESS_KEY for S3 access |
| `splits` | -1 | Integer | Number of splits to use per query; -1 means splits will equal the number of workers |
| `split_only_predicates` | false | Boolean | Split only based on predicates pushed down from the WHERE clause |
| `enable_stats` | false | Boolean | Enable collecting and dumping connector stats to the Presto log |
| `tiledb_config` | "" | String | TileDB config parameters in key1=value1,key2=value2 form |
| Name | Description | Default | Possible Values | Required |
|---|---|---|---|---|
| `uri` | Array URI | "" | * | Yes |
| `type` | Array type | SPARSE | SPARSE, DENSE | No |
| `cell_order` | Cell order | ROW_MAJOR | ROW_MAJOR, COL_MAJOR | No |
| `tile_order` | Tile order | ROW_MAJOR | ROW_MAJOR, COL_MAJOR | No |
| `capacity` | Tile capacity | 10000L | >0 | No |
| Name | Description | Default | Possible Values | Required |
|---|---|---|---|---|
| `dimension` | Column is a dimension | False | True, False | No |
| `lower_bound` | Domain lower bound | 0L | Any Long Value | No |
| `upper_bound` | Domain upper bound | Long.MAX_VALUE | Any Long Value | No |
| `extent` | Tile extent | 10L | Any Long Value | No |
| TileDB Datatype | Spark SQL Datatype |
|---|---|
| TILEDB_INT8 | BYTE |
| TILEDB_UINT8 | SHORT |
| TILEDB_INT16 | SHORT |
| TILEDB_UINT16 | INTEGER |
| TILEDB_INT32 | INTEGER |
| TILEDB_UINT32 | LONG |
| TILEDB_INT64 | LONG |
| TILEDB_UINT64 | LONG |
| TILEDB_FLOAT32 | FLOAT |
| TILEDB_FLOAT64 | DOUBLE |
| TILEDB_DATETIME_DAY | DATE |
| TILEDB_DATETIME_MS | TIMESTAMP |
| TileDB Datatype | PrestoDB SQL Datatype |
|---|---|
| TILEDB_INT8 | BOOLEAN |
| TILEDB_INT16 | TINYINT |
| TILEDB_INT32 | INTEGER |
| TILEDB_INT64 | BIGINT |
| TILEDB_FLOAT64 | REAL |
| TILEDB_FLOAT64 | DOUBLE |
| TILEDB_FLOAT64 | DECIMAL (treated as DOUBLE) |
| TILEDB_CHAR (var) | STRING |
| TILEDB_CHAR (var) | VARCHAR |
| TILEDB_CHAR (var) | CHAR |
| TILEDB_CHAR (var) | VARBINARY |
| Parameter | Default | Datatype | Description |
|---|---|---|---|
| `array-uris` | "" | String | CSV list of arrays to preload metadata on |
| `read-buffer-size` | 10485760 | Integer | Max read buffer size per attribute |
| `write-buffer-size` | 10485760 | Integer | Max write buffer size per attribute |
| `aws-access-key-id` | "" | String | AWS_ACCESS_KEY_ID for S3 access |
| `aws-secret-access-key` | "" | String | AWS_SECRET_ACCESS_KEY for S3 access |
| `tiledb-config` | "" | String | TileDB config parameters in key1=value1,key2=value2 form |
| Name | Default | Datatype | Description |
|---|---|---|---|
| `read_buffer_size` | Plugin | Integer | Max read buffer size per attribute |
| `write_buffer_size` | Plugin | Integer | Max write buffer size per attribute |
| `aws_access_key_id` | Plugin | String | AWS_ACCESS_KEY_ID for S3 access |
| `aws_secret_access_key` | Plugin | String | AWS_SECRET_ACCESS_KEY for S3 access |
| `splits` | -1 | Integer | Number of splits to use per query; -1 means splits will equal the number of workers |
| `split_only_predicates` | false | Boolean | Split only based on predicates pushed down from the WHERE clause |
| `enable_stats` | false | Boolean | Enable collecting and dumping connector stats to the Trino log |
| `tiledb_config` | "" | String | TileDB config parameters in key1=value1,key2=value2 form |
| Name | Description | Default | Possible Values | Required |
|---|---|---|---|---|
| `uri` | Array URI | "" | * | Yes |
| `type` | Array type | SPARSE | SPARSE, DENSE | No |
| `cell_order` | Cell order | ROW_MAJOR | ROW_MAJOR, COL_MAJOR | No |
| `tile_order` | Tile order | ROW_MAJOR | ROW_MAJOR, COL_MAJOR | No |
| `capacity` | Tile capacity | 10000L | >0 | No |
| Name | Description | Default | Possible Values | Required |
|---|---|---|---|---|
| `dimension` | Column is a dimension | False | True, False | No |
| `lower_bound` | Domain lower bound | 0L | Any Long Value | No |
| `upper_bound` | Domain upper bound | Long.MAX_VALUE | Any Long Value | No |
| `extent` | Tile extent | 10L | Any Long Value | No |
| TileDB Datatype | Trino SQL Datatype |
|---|---|
| TILEDB_INT8 | BOOLEAN |
| TILEDB_INT16 | TINYINT |
| TILEDB_INT32 | INTEGER |
| TILEDB_INT64 | BIGINT |
| TILEDB_FLOAT64 | REAL |
| TILEDB_FLOAT64 | DOUBLE |
| TILEDB_FLOAT64 | DECIMAL (treated as DOUBLE) |
| TILEDB_CHAR (var) | STRING |
| TILEDB_CHAR (var) | VARCHAR |
| TILEDB_CHAR (var) | CHAR |
| TILEDB_CHAR (var) | VARBINARY |