Create Dataframe From CSV

TileDB currently offers CSV functionality only in Python.

TileDB-Py provides a function from_csv that can operate in one of three modes, sketched briefly below:

  1. schema_only

  2. ingest (default)

  3. append
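
For a quick sense of how these modes fit together, here is a minimal sketch (assuming a local data.csv; note that appending to a dense array may also require a row_start_idx offset to place the new rows):

import tiledb

# Create only the array schema, without ingesting any rows
tiledb.from_csv("my_array", "data.csv", mode="schema_only")

# Append the CSV rows into the array created above
tiledb.from_csv("my_array", "data.csv", mode="append")

# Or create the schema and ingest the rows in a single step (the default mode)
tiledb.from_csv("my_other_array", "data.csv", mode="ingest")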

In this section, we will cover the schema_only mode, which creates only the array schema for a dataframe stored in a CSV file, without ingesting any rows / cells into it.

Dataframe as a Dense Array

You can create a dataframe as a 1D dense array from the data stored in a CSV file as follows:

tiledb.from_csv("my_array", "data.csv", mode="schema_only")

You can see the schema as follows:

A = tiledb.open("my_array", mode="r")
A.schema 

Modeling dataframes as dense arrays is useful if no particular column receives conditions in most query workloads. The underlying 1D dense array enables rapid slicing of any contiguous range of rows.
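
For instance, once data has been ingested (i.e., beyond schema_only), a contiguous range of rows can be read back into a pandas dataframe; a minimal sketch, assuming a populated array:

A = tiledb.open("my_array", mode="r")
df = A.df[0:999]  # rows 0 through 999; df[] slicing bounds are inclusive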

You can set the tile extent of the 1D array upon creation as follows:

tiledb.from_csv("my_array", "data.csv", tile=1000000)

Dataframe as a Sparse Array

You can create a dataframe as an N-D sparse array by setting sparse=True and selecting any subset of columns as dimensions via the index_dims parameter.

tiledb.from_csv("my_array", 
                "data.csv", 
                mode="schema_only",
                sparse=True, 
                index_dims=['col3'])

The resulting array can have any number of dimensions. You can see the schema as follows:

A = tiledb.open("taxi_array", mode="r")
A.schema

You can set other parameters upon creation, such as the cell_order and tile_order, the tile capacity, and whether the array allows_duplicates. For example:

tiledb.from_csv("my_array", 
                "data.csv",
                mode="schema_only",
                sparse=True, 
                index_dims=['col3'],
                cell_order='hilbert',
                capacity=100000,
                allows_duplicates=False)
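
You can again verify these settings by inspecting the schema, for example:

A = tiledb.open("my_array", mode="r")
print(A.schema.cell_order)          # 'hilbert'
print(A.schema.capacity)            # 100000
print(A.schema.allows_duplicates)   # False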

Parse Dates

If you wish to parse and store certain columns as datetime types, you can specify them using parse_dates as follows (applicable to both dense and sparse arrays):

tiledb.from_csv("my_array", 
                "data.csv", 
                mode="schema_only",
                capacity=100000, 
                sparse=True, 
                index_dims=['col3'], 
                parse_dates=['col1', 'col2'])

Add Filters

You can add filters (e.g., compression) for dimensions and attributes using dim_filters and attr_filters, and you can also add filters for the offsets of variable-length attributes using filter_offsets (applicable to both dense and sparse arrays).

tiledb.from_csv("my_array", 
                "data.csv", 
                mode="schema_only",
                capacity=100000, 
                sparse=True, 
                index_dims=['col3'], 
                filter_offsets=tiledb.FilterList([tiledb.GzipFilter(level=-1)]),        # all var-length attribute offsets will get this filter
                dim_filters=tiledb.FilterList([tiledb.ZstdFilter(level=-1)]),           # all dims will get this filter
                attr_filters={'col1': tiledb.FilterList([tiledb.GzipFilter(level=-1)])}) # you can set a different filter per attribute (applies also to dimensions)

Function from_csv does not support encryption. Let us know if this is important to you and we will prioritize its development.

Set Column Types

Function from_csv uses pandas.read_csv under the hood to parse the CSV file and infer the data types of the columns. You can bypass the pandas type inference and force the types of the columns using column_types (applicable to both dense and sparse dataframes):

tiledb.from_csv("my_array", 
                "data.csv", 
                mode="schema_only",
                capacity=100000, 
                sparse=True, 
                index_dims=['col3'], 
                column_types={"col1" : np.int32, "col2": np.float64})

Set Nullable Attributes

You can set an attribute as nullable by casting it to a pandas datatype that can handle NA values. Applicable data types include: pandas.StringDtype(), pandas.Int8Dtype(), pandas.Int16Dtype(), pandas.Int32Dtype(), pandas.Int64Dtype(), pandas.UInt8Dtype(), pandas.UInt16Dtype(), pandas.UInt32Dtype(), pandas.UInt64Dtype(), pandas.BooleanDtype().

tiledb.from_csv("my_array", 
                "data.csv", 
                mode="schema_only",
                capacity=100000, 
                sparse=True, 
                index_dims=['col3'], 
                dtype={"col1" : pandas.StringDtype()})

Essentially, dtype above makes pandas read col1 with a string datatype that can handle missing values, which TileDB picks up in order to define col1 as a nullable attribute. See the pandas documentation on missing values for more information.
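
You can confirm that the attribute was indeed created as nullable by inspecting it in the schema; a small sketch:

A = tiledb.open("my_array", mode="r")
print(A.schema.attr("col1").isnullable)  # True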

Create From / To Cloud Storage

From Cloud Storage

You can create a dataframe from a CSV file stored in cloud storage (e.g., AWS S3), without needing to download the file to your local machine. It is as simple as providing the cloud URI for the CSV file:

tiledb.from_csv("my_array", "s3://my_bucket/data.csv", mode="schema_only")

For this to work, you will need to store your cloud credentials in the appropriate environment variables, for example (for AWS):

os.environ["AWS_ACCESS_KEY_ID"] = # ADD KEY ID
os.environ["AWS_SECRET_ACCESS_KEY"] = # ADD SECRET KEY
os.environ["AWS_DEFAULT_REGION"] = # ADD REGION

Alternatively, you can create a VFS object properly configured with the cloud credentials, open the CSV file with it, and pass the resulting file handle to from_csv as follows (for AWS):

# Create a properly configured VFS object
cfg = tiledb.Config()
cfg["vfs.s3.aws_access_key_id"] = # ADD KEY ID
cfg["vfs.s3.aws_secret_access_key"] = # ADD SECRET KEY
cfg["vfs.s3.region"] = # ADD REGION
vfs = tiledb.VFS(config=cfg)

# Open the CSV file with `vfs`
csv_file = vfs.open("s3://my_bucket/data.csv")

# Invoke `from_csv` passing `csv_file`
tiledb.from_csv("my_array", csv_file, mode="schema_only")

To Cloud Storage

Similarly, you can ingest into an array on cloud storage by creating a context configured with the cloud credentials and passing it to from_csv. An example for AWS is shown below:

# Create a properly configured Ctx object
cfg = tiledb.Config()
cfg["vfs.s3.aws_access_key_id"] = # ADD KEY ID
cfg["vfs.s3.aws_secret_access_key"] = # ADD SECRET KEY
cfg["vfs.s3.region"] = # ADD REGION
ctx = tiledb.Ctx(cfg)

# Invoke `from_csv` passing `ctx`
tiledb.from_csv("s3://my_bucket/my_array", "data.csv", ctx=ctx)

From and To Cloud Storage

The above can be combined if both the input CSV and output array reside on cloud storage. An example for AWS is shown below:

# Create a properly configured VFS object for the CSV file
cfg_csv = tiledb.Config()
cfg_csv["vfs.s3.aws_access_key_id"] = # ADD KEY ID FOR THE CSV FILE
cfg_csv["vfs.s3.aws_secret_access_key"] = # ADD SECRET KEY FOR THE CSV FILE
cfg_csv["vfs.s3.region"] = # ADD REGION FOR THE CSV FILE
vfs = tiledb.VFS(config=cfg_csv)

# Open the CSV file with `vfs`
csv_file = vfs.open("s3://my_bucket/data.csv")

# Create a properly configured Ctx object for the array
cfg_array = tiledb.Config()
cfg_array["vfs.s3.aws_access_key_id"] = # ADD KEY ID FOR THE ARRAY
cfg_array["vfs.s3.aws_secret_access_key"] = # ADD SECRET KEY FOR THE ARRAY
cfg_array["vfs.s3.region"] = # ADD REGION FOR THE ARRAY
ctx_array = tiledb.Ctx(cfg_array)

# Invoke `from_csv` passing `csv_file` and `ctx_array`
tiledb.from_csv("my_array", csv_file, mode="schema_only", ctx=ctx_array)
