Conventions
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
Versioning
This specification utilises semantic versioning.
Before version 1.0.0, all minor changes may be break compatibility.
Otherwise,
- Patch versions are used for non-substantive or bug-fix changes to the text of the specification.
- Minor versions are used for additions which do not break backward compatibility, i.e.
- A v1.2 parser SHOULD be able to read all v1.1 files
- A v1.1 parser MAY be able to partially read v1.2 files
- Major versions are used for changes which break compatibility, i.e.
- A v1.x parser MAY not be able to read a v2.y file
- A v2.x parser MAY not be able to read a v1.y file
Extensions MAY use other versioning schemes.
All version strings MUST conform to the PEP-440 version specifier specification.
Naming
Unless there is a very good reason not to, fields and metadata keys SHOULD use snake_case names.
Data types
Primitives
This specification uses arrow’s nomenclature for primitives. Note that strings MUST be in UTF-8 encoding.
Real values SHOULD be stored as a float64 field, or as a decimal string (e.g. 3.14) in a metadata value.
Integer IDs SHOULD be stored as uint64 field, or as a decimal string (e.g. 123) in a custom metadata value.
Attributes
Here, attributes are defined as arbitrary unstructured data and metadata.
Consider defining an extension to make your additional data discoverable and re-usable.
Attribute arrow metadata
Schema metadata and field metadata MAY contain unstructured arbitrary attributes,
whose keys MUST be prefixed by attr:.
Nested attributes MAY be encoded with :-separated elements
(e.g. attr:parent_container:child_field),
although storing a structure in a serialised form like JSON is also acceptable.
Attribute fields
Arbitrary attribute fields MAY be added to any schema.
The name of the field MUST be prefixed by attr:.
Additionally, all schemas MAY have an attr field,
which MUST be a nullable map from variable-length string keys to variable-length bytes values.
These keys SHOULD NOT have an attr: prefix.
Neurarrow-specific metadata
Contexts
A single logical dataset may span multiple on-disk tables,
either due to partitioning of a single logical table,
or where tables of multiple types refer to each other.
Different datasets may repeat IDs (e.g. for low integers).
The context identifier is an arbitrary UTF-8 string identifying a shared context in which all IDs MUST be unique;
strictly it is the (context, ID) pair which is globally unique.
It is RECOMMENDED that the identifier be an IRI or (hex-encoded) UUID to ensure uniqueness.
Spaces
Samples taken from two spatial experiments MAY have the same coordinates, but these may not actually be in the same physical location. Spatial datasets are routinely transformed from one “space” to another for comparison. It is important to track which space a dataset belongs to, to know whether it can be compared to another.
The space identifier is an arbitrary UTF-8 string identifying the space from which spatial data are taken (e.g. animals, transforms). Data sets from different spaces SHOULD NOT be spatially overlaid without transformation.
It is RECOMMENDED that the identifier be an IRI or UUID to ensure uniqueness.
Data from two contexts MAY share the same space (e.g. after transforming one experiment’s data to another’s space). Data from one context MUST share a space.
Schemas
Schemas have metadata, containing:
- required keys which MUST exist
- optional keys which MAY exist
- attribute keys
- extension keys
Metadata keys and values MUST be strings, but MAY encode non-string data (e.g. numbers in decimal representation).
Schema metadata MAY contain arbitrary attributes;
their keys MUST be prefixed by attr: as described in the Attributes section.
Schema metadata MAY contain extension metadata; their keys MUST be prefixed by the unique name of the extension as described in Extensions.
Schema fields
Schemas have fields (a.k.a. columns) described in this specification:
- required fields which MUST exist
- optional fields which MAY exist
- derived fields which MAY exist, but MUST be calculable from other fields in the same context
- derived fields MAY be invalidated if the fields they depend on are updated
- extension fields whose name MUST be prefixed by the extension’s unique name as described in Extensions
- attribute fields which MAY exist and whose name MUST be prefixed with
attr:as described in Attributes- the
attrfield described in Attributes is also an attribute field
- the
In Arrow, fields can have metadata. This feature is currently unused by neurarrow, but MAY be in future. Writers and extensions SHOULD NOT add or rely on field metadata.
Inheritance
Certain (“child”) schemas inherit from another (“parent”) schema. This means that the child schema:
- MUST have all of the parents’ required fields and metadata
- MAY have all of the parents’ optional and derived fields and metadata
Child schemas MAY inherit from more than one parent schema.
The Base and Spatial schemas are provided as abstract schemas. They SHOULD NOT be written as files themselves, but define a parent schema other (concrete) schemas MAY inherit from.
Storage
Individual tables SHOULD be stored as:
- Arrow IPC File format
- Also known as Feather format
- Best for inter-process communication or memory mapped I/O applications
- Extension
.arrow
- Parquet
- Best for longer-term, potentially more space-efficient storage
- Extension
.parquet
- Hive partitioned parquet
- Parquet files split into chunks in directories based on a particular column’s value
- Best for very large datasets, particularly in high-latency environments like cloud storage
- Paths like
.../fragment_id=123/*.skeleton.parquet - Note that modifying schema and field metadata on hive-partitioned data can mean updating a lot of files
The extension SHOULD be prefixed with the name of the file schema, like brain.skeletons.parquet or cns.connections.feather.
WARNING
Parquet does not directly support
uint64data (used for IDs in neurarrow). Largeuint64values are usually mapped to the negative range of parquet’sint64data type, and then parsed back touint64when read back into arrow.