TileDB currently offers CSV functionality only in Python.
TileDB-Py provides the from_csv function, which can perform three tasks:
schema_only
ingest (default)
append
In this section, we will cover the schema_only mode, which creates only the array schema for a dataframe stored in a CSV file, without ingesting any rows or cells into it.
You can create a dataframe as a 1D dense array from the data stored in a CSV file as follows:
You can see the schema as follows:
Modeling dataframes as dense arrays is useful if there is no particular column that receives conditions in most query workloads. The underlying 1D dense array enables rapid slicing of any set of contiguous rows.
You can set the tile extent of the 1D array upon creation as follows:
You can create a dataframe as an ND sparse array by setting sparse=True and selecting any subset of columns as dimensions in the index_dims parameter.
The resulting array can have any number of dimensions. You can see the schema as follows:
You can set other parameters upon creation, such as cell_order and tile_order, the tile capacity (capacity), and whether the array allows duplicates (allows_duplicates). For example:
If you wish to parse and store certain columns as datetime types, you can specify them using parse_dates as follows (applicable to both dense and sparse arrays):
You can add filters (e.g., compression) for dimensions and attributes using dim_filters and attr_filters, and you can also add filters for the offsets of variable-length attributes using filter_offsets (applicable to both dense and sparse arrays).
The from_csv function does not support encryption. Let us know if this is important to you and we will prioritize its development.
The from_csv function uses pandas.read_csv under the hood to parse the CSV file and infer the data types of the columns. You can bypass pandas type inference and force the types of the columns using column_types (applicable to both dense and sparse dataframes):
You can set an attribute as nullable by casting it to a pandas data type that can handle NA values. Applicable data types include: pandas.StringDtype(), pandas.Int8Dtype(), pandas.Int16Dtype(), pandas.Int32Dtype(), pandas.Int64Dtype(), pandas.UInt8Dtype(), pandas.UInt16Dtype(), pandas.UInt32Dtype(), pandas.UInt64Dtype(), pandas.BooleanDtype().
Essentially, dtype above creates a pandas dataframe in which col1 has a string data type that handles missing values; TileDB picks this up and defines col1 as a nullable attribute. See how pandas handles missing values for more information.
You can create a dataframe from a CSV file stored in cloud storage (e.g., AWS S3), without the need to download the file on your local machine. It is as simple as providing the cloud URI for the CSV file:
For this to work, you will need to store your cloud credentials in the appropriate environment variables, for example (for AWS):
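The standard AWS variables are AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. They are usually exported in the shell before starting Python; the sketch below sets them from within Python instead, under the assumption that this happens before the first S3 access. The values are placeholders.

```python
import os

# Placeholder credentials; replace with real values or export them in the shell.
# setdefault leaves any value that is already present in the environment untouched.
os.environ.setdefault("AWS_ACCESS_KEY_ID", "<your-access-key-id>")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "<your-secret-access-key>")
```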
Alternatively, you can create a VFS object, properly configured with the cloud credentials, and pass it to from_csv as follows (for AWS):
Similarly, you can ingest into an array on cloud storage by creating a context configured with the cloud credentials and passing it to from_csv. An example for AWS is shown below:
The above can be combined if both the input CSV and output array reside on cloud storage. An example for AWS is shown below: