Byte Format

With an understanding of the data model, there are a few things to note about the byte format (i.e. persisted format) of the values for some attributes.

Variant Data Attributes

Wherever possible, the attributes in the variant data array are stored in a straightforward way. Several attributes, however, use a specific format for their cell values; these formats are detailed below.

alleles

This attribute stores a null-terminated CSV string of the REF and ALT alleles for every cell.

For example, the VCF record:

1 69762 . T <NON_REF> . . END=69770 GT:DP:GQ:MIN_DP:PL 0/0:23:60:23:0,60,900

Would contain the ASCII bytes T , < N O N _ R E F > \0 in the corresponding cell's alleles attribute. This is stored simply as a variable-length char attribute. The null terminating character is included to allow for easy reconstruction of the VCF record during export (htslib requires the null terminator).
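As a minimal sketch of the layout just described, the following Python builds and parses an alleles cell value. The function names encode_alleles and decode_alleles are hypothetical helpers for illustration only; they are not part of TileDB-VCF or htslib.

```python
def encode_alleles(ref: str, alts: list[str]) -> bytes:
    """Join REF and ALT alleles with commas and append the null terminator."""
    return ",".join([ref, *alts]).encode("ascii") + b"\x00"


def decode_alleles(cell: bytes) -> tuple[str, list[str]]:
    """Split a cell value back into (REF, [ALT, ...])."""
    text = cell.rstrip(b"\x00").decode("ascii")
    ref, *alts = text.split(",")
    return ref, alts


cell = encode_alleles("T", ["<NON_REF>"])
print(cell)                  # b'T,<NON_REF>\x00'
print(decode_alleles(cell))  # ('T', ['<NON_REF>'])
```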

filter_ids

This attribute stores the list of filter IDs for the VCF record. Because there can be a variable-length (or empty) list of filters per record, the attribute is stored with the format:

<nfilters> [<filter_id_0> <filter_id_1> ...]

Each value above (nfilters as well as each filter ID) is an int32_t. The list is stored as a variable-length int32_t TileDB attribute.

During export, the VCF header for the sample is used for converting integer filter IDs back to their string representation.
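The count-prefixed layout can be sketched in Python as follows. The helper names are hypothetical, and little-endian byte order is an assumption made here for concreteness; the text above does not state the byte order for this attribute. Mapping the integer IDs back to filter names requires the sample's VCF header and is not shown.

```python
import struct


def encode_filter_ids(ids: list[int]) -> bytes:
    """Pack <nfilters> followed by each filter ID as int32 values.

    Little-endian order is an assumption of this sketch.
    """
    return struct.pack(f"<{len(ids) + 1}i", len(ids), *ids)


def decode_filter_ids(cell: bytes) -> list[int]:
    """Unpack a cell value and check the count prefix against the payload."""
    nfilters, *ids = struct.unpack(f"<{len(cell) // 4}i", cell)
    if nfilters != len(ids):
        raise ValueError(f"expected {nfilters} filter IDs, found {len(ids)}")
    return ids


print(list(encode_filter_ids([])))              # [0, 0, 0, 0]
print(decode_filter_ids(encode_filter_ids([2, 5])))  # [2, 5]
```

An empty filter list still occupies four bytes, because the nfilters count is always present.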

info_* / fmt_*

These attributes store specific INFO or FMT field values from VCF records. In either case, the value for each cell is stored with the format:

<type> <nvals> <val0> [<val1> ...]

The <type> value is stored as an int32_t, and is the htslib datatype value for the field values that follow. The <nvals> value is also an int32_t, and stores the number of field values that follow. The <valN> values are the typed values of the field.

For example, the VCF record from earlier:

1 69762 . T <NON_REF> . . END=69770 GT:DP:GQ:MIN_DP:PL 0/0:23:60:23:0,60,900

If the PL field were extracted as an explicit attribute, the cell's fmt_PL attribute would store the following values:

1 3 0 60 900

Because the PL field is declared as an integer in the VCF header, the first value 1 (an int32_t) corresponds to htslib's BCF_HT_INT. The second int32_t value, 3, indicates that three values follow (which are known to be integers). The subsequent values 0, 60 and 900 are the actual field values, which in this case are also all int32_t values.

Although all the values stored in the fmt_PL attribute in this example are int32_t values, this is not true in general. Thus, the TileDB datatype for info_* / fmt_* attributes is var<uint8_t>, and the above list of values is accessed as a sequence of bytes storing the little-endian int32_t values.
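To make the byte-level view concrete, here is a small Python sketch that serialises the fmt_PL example into the byte sequence described above and reads it back. The helper names are hypothetical, and only the integer case is handled; other htslib types would need their own value encoding.

```python
import struct

BCF_HT_INT = 1  # htslib type code for integer fields, as noted in the text


def encode_int_field(values: list[int]) -> bytes:
    """Pack <type> <nvals> <val0> ... as little-endian int32 values."""
    return struct.pack(f"<{len(values) + 2}i", BCF_HT_INT, len(values), *values)


def decode_int_field(cell: bytes) -> list[int]:
    """Read the type and count header, then the integer values."""
    type_code, nvals = struct.unpack_from("<2i", cell, 0)
    if type_code != BCF_HT_INT:
        raise NotImplementedError("this sketch only handles integer fields")
    return list(struct.unpack_from(f"<{nvals}i", cell, 8))


cell = encode_int_field([0, 60, 900])
print(list(cell))
# [1, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 60, 0, 0, 0, 132, 3, 0, 0]
print(decode_int_field(cell))  # [0, 60, 900]
```

The final four bytes 132 3 0 0 are the little-endian encoding of 900 (132 + 3 * 256).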

info / fmt

Remaining INFO or FMT fields that are not extracted as explicit attributes are instead stored in the info or fmt attributes.

The byte format is mostly the same as when the field values are stored in the explicit attributes. However, because each cell may have multiple info or fmt fields stored in these generic attributes, we must store the null-terminated name of the field as well.

For the same VCF record example:

1 69762 . T <NON_REF> . . END=69770 GT:DP:GQ:MIN_DP:PL 0/0:23:60:23:0,60,900

Suppose that fmt_GT, fmt_PL, and fmt_MIN_DP were stored as explicit attributes, leaving DP and GQ. The fmt attribute value for this cell would store the following bytes:

D P \0 1 0 0 0 1 0 0 0 23 0 0 0 G Q \0 1 0 0 0 1 0 0 0 60 0 0 0 0

Note the above is a byte listing and, therefore, the int32_t values are listed as four bytes in little-endian order.
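A reader for this layout might look like the following Python sketch, which walks the byte listing above and recovers the DP and GQ values. The function name is hypothetical and only integer fields are handled. The listing ends with one extra zero byte whose purpose the text does not state; as an assumption, the sketch stops when it meets an empty field name.

```python
import struct

BCF_HT_INT = 1  # htslib type code for integer fields


def parse_generic_fields(cell: bytes) -> dict[str, list[int]]:
    """Parse repeated <name>\\0 <type> <nvals> <val0> ... entries."""
    fields: dict[str, list[int]] = {}
    offset = 0
    while offset < len(cell):
        end = cell.index(b"\x00", offset)  # null terminator of the name
        name = cell[offset:end].decode("ascii")
        if not name:
            break  # assumption: an empty name marks the end of the data
        offset = end + 1
        type_code, nvals = struct.unpack_from("<2i", cell, offset)
        offset += 8
        if type_code != BCF_HT_INT:
            raise NotImplementedError("this sketch only handles integer fields")
        fields[name] = list(struct.unpack_from(f"<{nvals}i", cell, offset))
        offset += 4 * nvals
    return fields


listing = bytes([
    ord("D"), ord("P"), 0, 1, 0, 0, 0, 1, 0, 0, 0, 23, 0, 0, 0,
    ord("G"), ord("Q"), 0, 1, 0, 0, 0, 1, 0, 0, 0, 60, 0, 0, 0, 0,
])
print(parse_generic_fields(listing))  # {'DP': [23], 'GQ': [60]}
```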