Array storage

File format

All array elements are stored as Parquet files. Parquet is supported in a wide range of environments and provides built-in compression, self-documentation through embedded metadata, and the ability to read and write files in chunks.

Parquet file writing options

You are required to use these options when writing Parquet files (see the sketch after the list):

  • Format version: 2.4
  • Flavor: default (not set)
  • Data page size: default (1MB)
  • Compression: gzip
  • Encryption: none
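
As a concrete illustration, the sketch below writes a small table with these options using pyarrow (an assumed tooling choice; any Parquet writer with equivalent settings works). The file name and column names are hypothetical.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical two-column table of float64 values.
table = pa.table({
    "x": pa.array([0.0, 1.0, 2.0], type=pa.float64()),
    "y": pa.array([0.5, 1.5, 2.5], type=pa.float64()),
})

pq.write_table(
    table,
    "example.parquet",
    version="2.4",       # Format version: 2.4
    compression="gzip",  # Compression: gzip
    # flavor is left unset (default), data_page_size is left at the
    # pyarrow default (1MB), and no encryption options are passed.
)
```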

Dictionary encoding

Dictionary encoding is optional and depends on the type of data. You should evaluate whether it is beneficial for each dataset.
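
For instance, pyarrow (assumed here, as above) can enable dictionary encoding per column, which tends to pay off for low-cardinality data; the column names and values are hypothetical.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "category": ["ore", "waste", "ore", "waste"],  # few distinct values
    "grade": [1.2, 0.1, 1.4, 0.2],                 # mostly unique values
})

# Enable dictionary encoding only where it is likely to help.
pq.write_table(table, "example.parquet", version="2.4",
               compression="gzip", use_dictionary=["category"])
```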

Row groups

Tables should be written in row groups of reasonable size: small enough that single rows can be accessed with minimal overhead, yet large enough that the overhead introduced by the row groups themselves remains low.
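
A sketch of this trade-off, again assuming pyarrow; the row-group size of 64k rows is a hypothetical starting point, not a prescribed value.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"value": list(range(1_000_000))})

# Smaller row groups -> cheaper single-row access; larger row groups ->
# less per-group metadata. 64k rows is one possible middle ground.
pq.write_table(table, "example.parquet", version="2.4",
               compression="gzip", row_group_size=64 * 1024)
```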

Ordered data

Generalised column-major order must be used when flattening multidimensional array data. This means that the first spatial dimension (x for an unrotated grid) is contiguous.
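
For example, with NumPy (an assumption for illustration), Fortran-order flattening gives exactly this layout:

```python
import numpy as np

# A hypothetical 2x3 grid with dimensions (nx, ny); nx = 2.
grid = np.arange(6, dtype=np.float64).reshape(2, 3)

# Generalised column-major (Fortran) order: the first dimension (x)
# varies fastest, so x values are contiguous in the flattened array.
flat = grid.flatten(order="F")

# flat[k] corresponds to grid[k % 2, k // 2].
```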

Data types

The schema elements define the logical data types to be used.

  • Floating point values: float64 (double)
  • Integer values: int64, primarily used as indices; indices are always 0-based
  • Strings: UTF-8 encoded
  • Timestamps: stored in microseconds (unit MICROS) and normalized to UTC
  • Colors: stored as uint32 in ABGR32 order with 8 bits per channel
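
Expressed as a pyarrow schema (an assumed tooling choice; the field names are hypothetical), these logical types look like this:

```python
import pyarrow as pa

schema = pa.schema([
    ("value", pa.float64()),                      # floating point
    ("index", pa.int64()),                        # 0-based index
    ("name", pa.string()),                        # UTF-8 string
    ("timestamp", pa.timestamp("us", tz="UTC")),  # MICROS, normalized to UTC
    ("color", pa.uint32()),                       # ABGR32, 8 bits per channel
])
```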

Nullable data

Schema elements stored as Parquet can be nullable or not.

Generally, measured data and attributes may have null values. Data used for geometry description (vertices, meshes, lines, points, and so forth) should not have null values.

If the scenario is ambiguous, the element description should indicate whether a data column is nullable.
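
Nullability can be declared per field, for example in pyarrow (the column names are hypothetical):

```python
import pyarrow as pa

schema = pa.schema([
    pa.field("x", pa.float64(), nullable=False),     # geometry: no nulls
    pa.field("grade", pa.float64(), nullable=True),  # measured data: nulls allowed
])
```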

Column names

Schemas referring to a Parquet array with more than one column define the required order of those columns. A statement such as "Columns: x, y, z" means that the Parquet file is expected to have three columns, where the first column represents the X values, the second column represents the Y values, and the third column represents the Z values. Arbitrary column names are allowed, and no assumptions about column names should be made when consuming an object.
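
A consumer should therefore address columns by position. A minimal sketch, assuming pyarrow and a hypothetical file with "Columns: x, y, z":

```python
import pyarrow.parquet as pq

table = pq.read_table("example.parquet")

# Columns are identified by position, never by name.
x, y, z = (table.column(i) for i in range(3))
```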

Hash vs UUID

Blobs can be identified by their hash or by a UUID. Using a hash is the preferred option.

Hash

The SHA-256 algorithm is used to calculate the hash of the blob (the final Parquet file). The hex-digest of the hash (64 characters) is the identifier passed to the service. This allows automatic deduplication for all scenarios where it is feasible to produce the entire blob prior to uploading.
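
A minimal sketch of computing the identifier, assuming Python's standard hashlib; the function name is hypothetical:

```python
import hashlib

def blob_identifier(path: str) -> str:
    """Return the 64-character SHA-256 hex digest of a finished blob."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1MB chunks so large blobs do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```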

UUID

A UUID can be used if you need to upload the blob in chunks. In this case, it is impossible to calculate the hash prior to uploading the data. This does not allow automatic deduplication and should only be used if calculating the hash is not possible.
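
In that case a random identifier is generated before any data exists, for example with Python's standard uuid module:

```python
import uuid

# The identifier must exist before the blob is complete, so it cannot
# be derived from the content; deduplication is therefore not possible.
blob_id = str(uuid.uuid4())
```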