Introduction
Specifications for storing and transmitting neuronal morphology and connectivity data using the Apache Arrow data model, and models compatible with it (e.g. Apache Parquet).
About Apache Arrow
From arrow.apache.org:
Apache Arrow defines a language-independent columnar memory format for flat and nested data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.
Using Apache Arrow gives neurarrow implementors access to a large ecosystem of existing software libraries across languages, as well as the ability to exchange that data between language runtimes and processes with minimal serialisation cost.
The use of standard binary formats such as Parquet also allows the data to be read now and in the future without neurarrow-specific implementations.
Prior art
These software packages manage tabular neuroscience data:
- navis
- The neurarrow specification was originally based on navis’ parquet IO
- CATMAID
- natverse
These file formats describe tabular neuroscience data:
These specifications build on Apache Arrow with domain-specific schemas:
- geoarrow and geoparquet
Links
Development happens on GitHub, and the rendered specification is at https://clbarnes.github.io/neurarrow/.
Conventions
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
Versioning
This specification uses semantic versioning.
Before version 1.0.0, any minor version change may break compatibility.
Otherwise,
- Patch versions are used for non-substantive or bug-fix changes to the text of the specification.
- Minor versions are used for additions which do not break backward compatibility, i.e.
- A v1.2 parser SHOULD be able to read all v1.1 files
- A v1.1 parser MAY be able to partially read v1.2 files
- Major versions are used for changes which break compatibility, i.e.
- A v1.x parser might not be able to read a v2.y file
- A v2.x parser might not be able to read a v1.y file
Extensions MAY use other versioning schemes.
All version strings MUST conform to the PEP-440 version specifier specification.
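This example is not normative. The compatibility rules above could be checked by a reader roughly as follows. It is a minimal sketch which assumes plain `major.minor.patch` version strings (a subset of PEP 440); the function names and the returned labels are hypothetical and not part of this specification.

```python
def parse_release(version: str) -> tuple[int, int, int]:
    """Parse a plain 'major.minor.patch' string; missing parts default to 0.

    Assumption: no PEP 440 pre-release, post-release or local segments.
    """
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))
    return parts[0], parts[1], parts[2]


def compatibility(reader: str, file: str) -> str:
    """Return 'full', 'partial' or 'unsupported' for a reader/file version pair."""
    r_major, r_minor, _ = parse_release(reader)
    f_major, f_minor, _ = parse_release(file)
    if r_major != f_major:
        return "unsupported"  # major versions break compatibility
    if r_major == 0:
        # before 1.0.0, any minor version change may break compatibility
        return "full" if r_minor == f_minor else "unsupported"
    # a newer-or-equal reader should read the file; an older one may read part
    return "full" if r_minor >= f_minor else "partial"
```

A reader would typically refuse "unsupported" files and warn on "partial" ones.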
Naming
Unless there is a very good reason not to, fields and metadata keys SHOULD use snake_case names.
Data types
Primitives
This specification uses Arrow’s nomenclature for primitives. Note that strings MUST be in UTF-8 encoding.
Real values SHOULD be stored as a float64 field, or as a decimal string (e.g. 3.14) in a metadata value.
Integer IDs SHOULD be stored as a uint64 field, or as a decimal string (e.g. 123) in a custom metadata value.
Attributes
Here, attributes are defined as arbitrary unstructured data and metadata.
Consider defining an extension to make your additional data discoverable and re-usable.
Attribute arrow metadata
Schema metadata and field metadata MAY contain unstructured arbitrary attributes,
whose keys MUST be prefixed by attr:.
Nested attributes MAY be encoded with :-separated elements
(e.g. attr:parent_container:child_field),
although storing a structure in a serialised form like JSON is also acceptable.
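This example is not normative. The flat, `:`-separated encoding of nested attributes can be sketched as a pair of helpers. The function names are hypothetical, and the sketch assumes that attribute names do not themselves contain a colon.

```python
def flatten_attrs(attrs: dict, prefix: str = "attr") -> dict[str, str]:
    """Flatten a nested dict into ':'-separated, 'attr:'-prefixed string keys."""
    flat = {}
    for key, value in attrs.items():
        full_key = f"{prefix}:{key}"
        if isinstance(value, dict):
            flat.update(flatten_attrs(value, full_key))
        else:
            flat[full_key] = str(value)  # metadata values are strings
    return flat


def unflatten_attrs(metadata: dict[str, str]) -> dict:
    """Rebuild the nested dict from 'attr:'-prefixed keys, ignoring other keys."""
    nested: dict = {}
    for key, value in metadata.items():
        if not key.startswith("attr:"):
            continue
        *parents, leaf = key.split(":")[1:]
        node = nested
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return nested
```

Values come back as strings, so the reader is responsible for parsing numbers or other types.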
Attribute fields
Arbitrary attribute fields MAY be added to any schema.
The name of the field MUST be prefixed by attr:.
Additionally, all schemas MAY have an attr field,
which MUST be a nullable map from variable-length string keys to variable-length bytes values.
These keys SHOULD NOT have an attr: prefix.
Neurarrow-specific metadata
Contexts
A single logical dataset may span multiple on-disk tables,
either due to partitioning of a single logical table,
or where tables of multiple types refer to each other.
Different datasets may repeat IDs (e.g. for low integers).
The context identifier is an arbitrary UTF-8 string identifying a shared context in which all IDs MUST be unique;
strictly it is the (context, ID) pair which is globally unique.
It is RECOMMENDED that the identifier be an IRI or (hex-encoded) UUID to ensure uniqueness.
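This example is not normative. A writer might generate a context identifier and check ID uniqueness across the partitions of one logical table as sketched below. The function names are hypothetical, and each table is reduced to its context identifier plus a list of IDs of one kind (e.g. sample IDs).

```python
import uuid
from collections import Counter


def new_context_id() -> str:
    """Generate a hex-encoded random UUID for use as a context identifier."""
    return uuid.uuid4().hex


def duplicated_ids(tables: list[tuple[str, list[int]]]) -> set[tuple[str, int]]:
    """Return the (context, ID) pairs that occur more than once across tables.

    Each table is given as (context identifier, list of IDs).
    """
    counts = Counter((context, id_) for context, ids in tables for id_ in ids)
    return {pair for pair, n in counts.items() if n > 1}
```

The same ID appearing under two different contexts is not reported, because only the (context, ID) pair needs to be unique.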
Spaces
Samples taken from two spatial experiments MAY have the same coordinates, but these may not actually be in the same physical location. Spatial datasets are routinely transformed from one “space” to another for comparison. It is important to track which space a dataset belongs to, to know whether it can be compared to another.
The space identifier is an arbitrary UTF-8 string identifying the space from which spatial data are taken (e.g. animals, transforms). Data sets from different spaces SHOULD NOT be spatially overlaid without transformation.
It is RECOMMENDED that the identifier be an IRI or UUID to ensure uniqueness.
Data from two contexts MAY share the same space (e.g. after transforming one experiment’s data to another’s space). Data from one context MUST share a space.
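This example is not normative. The rule that data from one context share a space could be checked over a collection of tables as sketched below. Each table is represented only by its schema metadata as a dict of strings, using the context and space keys defined in the Base and Spatial schemas; the function names are hypothetical.

```python
def spaces_by_context(tables: list[dict[str, str]]) -> dict[str, set[str]]:
    """Collect the space identifiers seen for each context.

    Tables without a 'space' key are skipped, since that key is optional.
    """
    seen: dict[str, set[str]] = {}
    for metadata in tables:
        if "space" in metadata:
            seen.setdefault(metadata["context"], set()).add(metadata["space"])
    return seen


def contexts_with_mixed_spaces(tables: list[dict[str, str]]) -> list[str]:
    """Return the contexts whose tables disagree about their space."""
    mixed = (c for c, spaces in spaces_by_context(tables).items() if len(spaces) > 1)
    return sorted(mixed)
```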
Schemas
Schemas have metadata, containing:
- required keys which MUST exist
- optional keys which MAY exist
- attribute keys
- extension keys
Metadata keys and values MUST be strings, but MAY encode non-string data (e.g. numbers in decimal representation).
Schema metadata MAY contain arbitrary attributes;
their keys MUST be prefixed by attr: as described in the Attributes section.
Schema metadata MAY contain extension metadata; their keys MUST be prefixed by the unique name of the extension as described in Extensions.
Schema fields
Schemas have fields (a.k.a. columns) described in this specification:
- required fields which MUST exist
- optional fields which MAY exist
- derived fields which MAY exist, but MUST be calculable from other fields in the same context
- derived fields MAY be invalidated if the fields they depend on are updated
- extension fields whose name MUST be prefixed by the extension’s unique name as described in Extensions
- attribute fields which MAY exist and whose name MUST be prefixed with attr: as described in Attributes
- the attr field described in Attributes is also an attribute field
In Arrow, fields can have metadata. This feature is currently unused by neurarrow, but MAY be in future. Writers and extensions SHOULD NOT add or rely on field metadata.
Inheritance
Certain (“child”) schemas inherit from another (“parent”) schema. This means that the child schema:
- MUST have all of the parents’ required fields and metadata
- MAY have all of the parents’ optional and derived fields and metadata
Child schemas MAY inherit from more than one parent schema.
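This example is not normative. Inheritance, including inheritance from more than one parent, can be modelled by collecting fields from every ancestor. The `SchemaSpec` class and the schema and field names below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaSpec:
    """Hypothetical description of a schema and the schemas it inherits from."""

    name: str
    required: set[str] = field(default_factory=set)
    optional: set[str] = field(default_factory=set)
    parents: list["SchemaSpec"] = field(default_factory=list)

    def all_required(self) -> set[str]:
        """Own required fields plus those of every ancestor: all MUST exist."""
        names = set(self.required)
        for parent in self.parents:
            names |= parent.all_required()
        return names

    def all_optional(self) -> set[str]:
        """Own optional fields plus those of every ancestor: all MAY exist."""
        names = set(self.optional)
        for parent in self.parents:
            names |= parent.all_optional()
        return names - self.all_required()


# Hypothetical schemas, named only for illustration.
parent_a = SchemaSpec("parent_a", required={"id"}, optional={"note"})
parent_b = SchemaSpec("parent_b", required={"x", "y"})
child = SchemaSpec("child", required={"weight"}, parents=[parent_a, parent_b])
```

Here `child.all_required()` contains the required fields of both parents as well as its own.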
The Base and Spatial schemas are provided as abstract schemas. They SHOULD NOT be written as files themselves, but define parent schemas from which other (concrete) schemas MAY inherit.
Storage
Individual tables SHOULD be stored as:
- Arrow IPC File format
- Also known as Feather format
- Best for inter-process communication or memory mapped I/O applications
- Extension .arrow
- Parquet
- Best for longer-term, potentially more space-efficient storage
- Extension .parquet
- Hive partitioned parquet
- Parquet files split into chunks in directories based on a particular column’s value
- Best for very large datasets, particularly in high-latency environments like cloud storage
- Paths like .../fragment_id=123/*.skeleton.parquet
- Note that modifying schema and field metadata on hive-partitioned data can mean updating a lot of files
The extension SHOULD be prefixed with the name of the file schema, like brain.skeletons.parquet or cns.connections.feather.
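This example is not normative. The naming convention and the hive-partitioned layout can be handled with a couple of small helpers, sketched here with hypothetical function names and a hypothetical example path.

```python
from pathlib import PurePosixPath


def table_filename(stem: str, schema: str, extension: str) -> str:
    """Join a dataset stem, schema name and format extension with dots."""
    return f"{stem}.{schema}.{extension.lstrip('.')}"


def hive_partitions(path: str) -> dict[str, str]:
    """Extract 'column=value' directory components from a hive-partitioned path."""
    partitions = {}
    for part in PurePosixPath(path).parent.parts:
        column, sep, value = part.partition("=")
        if sep:  # only directories of the form column=value are partitions
            partitions[column] = value
    return partitions
```

Partition values are returned as strings; converting `fragment_id` back to an integer is left to the caller.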
WARNING
Parquet does not directly support uint64 data (used for IDs in neurarrow). Large uint64 values are usually mapped to the negative range of Parquet’s int64 data type, and then parsed back to uint64 when read back into Arrow.
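This example is not normative. The mapping described in the warning can be illustrated in plain Python, assuming it is a two's complement reinterpretation of the same 64 bits. The function names are hypothetical; Arrow and Parquet libraries are expected to do this internally.

```python
UINT64_MODULUS = 1 << 64
INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


def uint64_to_int64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a signed one (two's complement)."""
    if not 0 <= value < UINT64_MODULUS:
        raise ValueError(f"{value} does not fit in uint64")
    return value - UINT64_MODULUS if value > INT64_MAX else value


def int64_to_uint64(value: int) -> int:
    """Reverse the mapping: negative int64 values become large uint64 values."""
    if not INT64_MIN <= value <= INT64_MAX:
        raise ValueError(f"{value} does not fit in int64")
    return value % UINT64_MODULUS
```

Tools which read the Parquet file without restoring the unsigned type will therefore show IDs above 2^63 - 1 as negative numbers.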
Implementations
Software implementing neurarrow should be listed here.
Software libraries
- neurarrow.py
- simple Python/pyarrow validator for neurarrow data
- swc2na
- Rust crate for reading SWC files into neurarrow skeletons
- uses the net.clbarnes.swc extension
User-facing tools
- swc2na
- CLI for converting SWC files into neurarrow feather and parquet files
- uses the net.clbarnes.swc extension
Extensions
Neurarrow is designed to be extensible to support different use cases.
Naming extensions
Extensions MUST have unique names,
and SHOULD ensure this by incorporating the web domain of the controlling entity in reverse DNS format,
like com.example.my_extension.
Extension metadata
All metadata keys associated with an extension MUST be prefixed with the unique name of the extension and a colon,
like com.example.my_extension:my_key.
Extensions MAY add any number of required or optional keys.
All extensions MUST define a version metadata key,
whose value MUST be a string conforming to the PEP-440 specification.
Extension authors MAY use whatever versioning scheme they prefer
(e.g. semantic, effort, calendar)
and document it separately.
The extension version is required in all files using the extension, so that readers can check whether an extension is in use by checking for the existence of this key.
Extension authors SHOULD document all required and optional metadata keys and the format of their values.
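This example is not normative. Because every extension defines a version key, a reader can discover the extensions in use from the schema metadata alone. The sketch below uses hypothetical function names and skips the attr: and frag: prefixes used elsewhere in this specification.

```python
RESERVED_PREFIXES = ("attr:", "frag:")


def extension_key(extension: str, key: str) -> str:
    """Build a metadata key from an extension name and a key name."""
    return f"{extension}:{key}"


def extensions_in_use(metadata: dict[str, str]) -> dict[str, str]:
    """Map each extension name found in the metadata to its declared version."""
    found = {}
    for key, value in metadata.items():
        if key.startswith(RESERVED_PREFIXES):
            continue
        name, sep, suffix = key.rpartition(":")
        if sep and suffix == "version":
            found[name] = value
    return found
```

The bare `version` key of neurarrow itself has no prefix and is therefore not reported as an extension.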
Extension fields
Extensions MAY add new fields to any schema.
These MUST have names prefixed by the extension name and a colon
(com.example.my_extension:my_field).
Extension authors SHOULD document all required, optional, and derived fields.
Extension example
This section is not normative.
- The owner of https://example.com develops an extension to represent spatially transformed skeletons
- Their extension is named com.example.transform
- The schema metadata of tables using this extension includes com.example.transform:version and com.example.transform:uuid
- The skeletons schema now includes com.example.transform:original_xyz, a struct{x:float64, y:float64, z:float64} field
Extension by inheritance
This section is not normative.
Rather than extending an existing schema, new schemas MAY be created which inherit from another. This is NOT RECOMMENDED, unless a new type of data is being represented. Data from extension A and extension B can both exist in the same table if they extend the same schema, but child schema X and child schema Y can only be composed by creating a third child schema Z which inherits from both.
For example, meshes can be represented as a table of vertices and a table of polygons referencing those vertices. The vertex table can re-use the point cloud schema, possibly with an extension encoding features like vertex normals. The polygon table could then create a new schema inheriting from base because there is no similar schema already.
Developing extensions
This section is non-normative.
Extensions should have a single concern, to facilitate modularity and re-use. For example, rather than defining a single general “metadata we like to keep track of in our lab” extension, with multiple fields for different purposes, consider multiple extensions.
Your extension documentation should list
- which schemas it affects
- any added fields
- include their data types
- include whether they are required by the extension, optional, or derived
- if they are derived, include how to calculate them
- any added schema or field metadata keys
- include how to interpret their value
- which version(s) of neurarrow it targets
- whether it depends on any other extensions
- if so, which version(s)
- any known implementations
Ideally, extension documentation should be based on a public version-controlled repository (e.g. codeberg, gitlab, github), and listed below. Raise an issue, submit a PR, or contact the neurarrow developers to get your extension listed.
Known extensions
If you publish a neurarrow extension, please list it here.
| Name | Description | Status | URL |
|---|---|---|---|
| net.clbarnes.swc | Interoperability with the SWC skeleton format | Experimental | https://github.com/clbarnes/neurarrow-ext/blob/main/extensions/swc.md |
Wishlist
The below are examples of tabular neuromorphology data which could be defined as extensions of neurarrow.
- navis uses a SWC-like pandas dataframe for its TreeNeuron morphology, and more dataframes for connectivity and Dotprops
- standardise schema metadata fields
- add derived fields for caching purposes
- navis uses a connector table which has its own location; this could be an extension of the connections schema or a new schema
Changelog
All notable changes to this project are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Unreleased
0.2.1 - 2026-02-05
Major build process overhaul, minor corrections
0.2.0 - 2026-02-02
Added
- Extension scheme
- attr fields
- context metadata
- Connections schema
- Inheritance concept
Removed
- Connectors schema
- Connectors-related fields in skeletons schema
- labels field in skeletons schema
- these can be re-added by an extension
0.1.0 - 2024-01-12
Initial implementation.
- Skeletons
- Dotprops
- Connectors
Base (abstract)
This abstract schema defines metadata and fields available to all neurarrow tables.
Parent schemas
None
Schema metadata
These metadata keys are defined in addition to those defined by any parent schemas.
Required schema metadata
These keys MUST exist in the schema’s metadata.
version
- encoding: UTF-8 string
The version of the neurarrow specification used to write the data.
context
- encoding: UTF-8 string
The identifier for the context shared by all elements in this file, and possibly other files too; see Contexts.
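This example is not normative. A validator could check for the required Base keys as sketched below. Some Arrow bindings (for example pyarrow) expose schema metadata as bytes, so a decoding step is included; the function names are hypothetical.

```python
REQUIRED_BASE_KEYS = ("version", "context")


def decode_metadata(raw: dict[bytes, bytes]) -> dict[str, str]:
    """Decode UTF-8 metadata keys and values into Python strings."""
    return {key.decode("utf-8"): value.decode("utf-8") for key, value in raw.items()}


def missing_base_keys(metadata: dict[str, str]) -> list[str]:
    """List the required Base metadata keys that are absent."""
    return [key for key in REQUIRED_BASE_KEYS if key not in metadata]
```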
Optional schema metadata
These keys MAY exist in the schema’s metadata.
attr:*
- encoding: various
Unstructured arbitrary attributes beneath the attr: prefix.
Nested attributes MAY be stored in a flat representation with : separators,
e.g. attr:parent:child:key = value.
However, storing a structure in a serialised form like JSON is also acceptable.
Fields
These fields are defined in addition to those defined by any parent schemas.
Required fields
These fields MUST exist in the file.
- None
Optional fields
These fields MAY exist in the file.
attr
- data type: map with variable-length string keys and variable-length string values
- nullable: yes
Arbitrary attributes set on a per-row basis.
Keys SHOULD NOT use the attr: prefix.
attr:*
- data type: various
- nullable: various
Arbitrary fields beneath the attr: prefix.
Derived fields
These fields MAY exist in the file, but MUST be calculable from other fields, and MAY be invalidated if the source fields are updated.
- None
Spatial (abstract)
This abstract schema defines metadata and fields available to all neurarrow types containing spatial data.
Parent schemas
This schema inherits all fields and metadata from the following schemas:
Schema metadata
These metadata keys are defined in addition to those defined by any parent schemas.
Required schema metadata
These metadata MUST exist in the schema’s metadata.
unit
- encoding: UTF-8 string
Empty for arbitrary units (e.g. voxels with unknown resolution), or the full name of a spatial unit according to UDUNITS-2 as below:
yoctometer, zeptometer, attometer, femtometer, picometer, nanometer, angstrom, micrometer, millimeter, centimeter, inch, decimeter, foot, yard, meter, dekameter, hectometer, kilometer, mile, megameter, gigameter, terameter, petameter, parsec, exameter, zettameter, yottameter
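This example is not normative. Checking a unit value against the list above is a simple set lookup; the constant and function names are hypothetical.

```python
SPATIAL_UNITS = frozenset(
    """
    yoctometer zeptometer attometer femtometer picometer nanometer angstrom
    micrometer millimeter centimeter inch decimeter foot yard meter dekameter
    hectometer kilometer mile megameter gigameter terameter petameter parsec
    exameter zettameter yottameter
    """.split()
)


def is_valid_unit(unit: str) -> bool:
    """Accept the empty string (arbitrary units) or a listed full unit name."""
    return unit == "" or unit in SPATIAL_UNITS
```

Abbreviations such as `nm` are rejected, because only full unit names are listed.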
Optional schema metadata
These metadata MAY exist in the schema’s metadata.
space
- encoding: UTF-8 string
Unique identifier for the space in which the data exists.
Fields
These fields are defined in addition to those defined by any parent schemas.
Required fields
These fields MUST exist in the file.
Optional fields
These fields MAY exist in the file.
Derived fields
These fields MAY exist in the file, but MUST be calculable from other fields, and MAY be invalidated if the source fields are updated.
Point clouds
Point clouds are generic points in 3D space. Multiple point clouds (fragments) can be stored in one table.
Parent schemas
This schema inherits all fields and metadata from the following schemas:
Schema metadata
These metadata keys are defined in addition to those defined by any parent schemas.
Required schema metadata
These metadata MUST exist in the schema’s metadata.
- None
Optional schema metadata
These metadata MAY exist in the schema’s metadata.
frag:*:*
- encoding: various
Individual fragments MAY have arbitrary metadata set with keys like frag:{fragment_id}:{key}, e.g. frag:619:name.
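This example is not normative. Per-fragment metadata can be gathered from the flat schema metadata as sketched below. The function name is hypothetical, and the sketch assumes that fragment IDs are written as decimal integers.

```python
def fragment_metadata(metadata: dict[str, str]) -> dict[int, dict[str, str]]:
    """Group 'frag:{fragment_id}:{key}' entries by fragment ID."""
    fragments: dict[int, dict[str, str]] = {}
    for full_key, value in metadata.items():
        parts = full_key.split(":", 2)  # the key itself may contain colons
        if len(parts) != 3 or parts[0] != "frag":
            continue
        _, fragment_id, key = parts
        fragments.setdefault(int(fragment_id), {})[key] = value
    return fragments
```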
Fields
These fields are defined in addition to those defined by any parent schemas.
Required fields
These fields MUST exist in the file.
sample_id
- data type: uint64
- nullable: no
An ID for this point, which MUST be unique within the context.
fragment_id
- data type: uint64
- nullable: no
The ID of the point cloud to which the sample belongs.
x, y, z
- data type: float64
- nullable: no
The location of the point in 3D, in the units given in the schema metadata.
Optional fields
These fields MAY exist in the file.
Derived fields
These fields MAY exist in the file, but MUST be calculable from other fields, and MAY be invalidated if the source fields are updated.
Skeletonised cells
Cells are often described in skeletonised form, as a rooted tree graph. The root SHOULD be the cell body.
Parent schemas
This schema inherits all fields and metadata from the following schemas:
Schema metadata
These metadata keys are defined in addition to those defined by any parent schemas.
Required schema metadata
These metadata MUST exist in the schema’s metadata.
- None
Optional schema metadata
These metadata MAY exist in the schema’s metadata.
- None
Fields
These fields are defined in addition to those defined by any parent schemas.
Required fields
These fields MUST exist in the file.
parent_id
- data type: uint64
- nullable: yes
- exactly one sample per fragment MUST be null, which MUST be the root; conventionally the cell body or nearest node to it
The ID of the parent node for this sample. MUST be defined elsewhere in the file.
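This example is not normative. The constraints on parent_id can be checked as sketched below, with each row reduced to a (sample_id, fragment_id, parent_id) tuple and None standing in for null. The names are hypothetical, and the sketch does not detect cycles.

```python
from collections import Counter
from typing import Optional

Row = tuple[int, int, Optional[int]]  # (sample_id, fragment_id, parent_id)


def skeleton_problems(rows: list[Row]) -> list[str]:
    """Return human-readable problems with root counts and parent references."""
    problems = []
    sample_ids = {sample_id for sample_id, _, _ in rows}
    fragment_ids = {fragment_id for _, fragment_id, _ in rows}
    # Count the rows with a null parent (roots) in each fragment.
    roots = Counter(frag for _, frag, parent in rows if parent is None)
    for fragment_id in sorted(fragment_ids):
        if roots[fragment_id] != 1:
            problems.append(
                f"fragment {fragment_id} has {roots[fragment_id]} roots, expected 1"
            )
    for sample_id, _, parent_id in rows:
        if parent_id is not None and parent_id not in sample_ids:
            problems.append(f"sample {sample_id} has undefined parent {parent_id}")
    return problems
```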
Optional fields
These fields MAY exist in the file.
radius
- data type: float64
- nullable: yes
- where radius is not known
An approximation of the radius of the cell around this sample, in the units given in the schema metadata.
Derived fields
These fields MAY exist in the file, but MUST be calculable from other fields, and MAY be invalidated if the source fields are updated.
child_ids
- data type: list[uint64]
- nullable: yes
- where children are not calculated
- nodes known to be leaves (no children) SHOULD use an empty list instead of null
The IDs of samples which have this sample as a parent.
n_children
- data type: uint32
- nullable: yes
- where children are not calculated
- nodes known to be leaves (no children) SHOULD use 0 instead of null
How many child nodes a particular sample has.
strahler
- data type: uint32
- nullable: yes
- where strahler index is not calculated
The Strahler number of this sample.
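This example is not normative. The derived fields above can all be calculated from sample_id and parent_id. The sketch below takes a mapping from sample ID to parent ID (None for the root), assumes every non-null parent is present in the mapping, and uses hypothetical function names; n_children is simply the length of each child list.

```python
from typing import Optional


def derive_children(parents: dict[int, Optional[int]]) -> dict[int, list[int]]:
    """Build child_ids for every sample from a sample_id -> parent_id mapping."""
    children: dict[int, list[int]] = {sample_id: [] for sample_id in parents}
    for sample_id, parent_id in parents.items():
        if parent_id is not None:
            children[parent_id].append(sample_id)
    return children


def derive_strahler(parents: dict[int, Optional[int]]) -> dict[int, int]:
    """Compute the Strahler number of every sample, iteratively (no recursion)."""
    children = derive_children(parents)
    # Depth-first order from the roots: every parent appears before its children.
    order = []
    stack = [s for s, p in parents.items() if p is None]
    while stack:
        sample_id = stack.pop()
        order.append(sample_id)
        stack.extend(children[sample_id])
    strahler: dict[int, int] = {}
    for sample_id in reversed(order):  # children are processed before parents
        child_values = [strahler[c] for c in children[sample_id]]
        if not child_values:
            strahler[sample_id] = 1  # leaf
            continue
        highest = max(child_values)
        # Increment only where two or more children share the highest number.
        shared = child_values.count(highest) > 1
        strahler[sample_id] = highest + 1 if shared else highest
    return strahler
```

These values need recalculating whenever parent_id changes, which is why derived fields may be invalidated by updates.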
Connections
Connections are relationships between point cloud samples other than morphological continuity.
Schema metadata
Required schema metadata
These metadata MUST exist at the schema level.
- None
Optional schema metadata
These metadata MAY exist at the schema level.
- None
Fields
Required fields
These fields MUST exist in the file.
connection_id
- data type: uint64
- nullable: no
An ID for this connection, which MUST be unique within the context.
src_sample_id
- data type: uint64
- nullable: no
An ID for the point cloud sample at the start of this edge.
MUST exist as a sample_id in an accessible skeleton table.
If the edge is directed, the logical direction of the connection is from src to tgt.
If the edge is undirected, the samples are interchangeable.
tgt_sample_id
- data type: uint64
- nullable: no
An ID for the point cloud sample at the end of this edge.
MUST exist as a sample_id in an accessible skeleton table.
If the edge is directed, the logical direction of the connection is from src to tgt.
If the edge is undirected, the samples are interchangeable.
type
- data type: dictionary with index type uint16 and value type variable-length string
- nullable: no
The type of this connection. Acceptable values in this specification are
- synapse (directed): a chemical synapse where the src sample is presynaptic and the tgt sample is postsynaptic
- gap_junction (undirected): an electrical synapse
Undirected connection types SHOULD NOT be repeated to represent both directions
(i.e. if 1 =gap_junction=> 2 is defined, do not explicitly define 2 =gap_junction=> 1).
Extensions MAY add additional connection types,
which MUST be prefixed by the extension name and a colon, e.g. com.example.my_extension:my_type.
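This example is not normative. The rules for connection types can be sketched with edges reduced to (src_sample_id, tgt_sample_id, type) tuples. The names are hypothetical, and only gap_junction is treated as undirected here because the directedness of extension types is not known to this sketch.

```python
BUILTIN_TYPES = {"synapse", "gap_junction"}
UNDIRECTED_TYPES = {"gap_junction"}

Edge = tuple[int, int, str]  # (src_sample_id, tgt_sample_id, type)


def is_acceptable_type(connection_type: str) -> bool:
    """Built-in types are accepted as-is; others need an 'extension:' prefix."""
    if connection_type in BUILTIN_TYPES:
        return True
    extension, sep, name = connection_type.rpartition(":")
    return bool(extension and sep and name)


def drop_mirrored_undirected(edges: list[Edge]) -> list[Edge]:
    """Keep the first copy of each undirected edge, whichever way it was written."""
    seen = set()
    kept = []
    for src, tgt, connection_type in edges:
        if connection_type in UNDIRECTED_TYPES:
            key = (min(src, tgt), max(src, tgt), connection_type)
            if key in seen:
                continue  # the same undirected edge was already kept
            seen.add(key)
        kept.append((src, tgt, connection_type))
    return kept
```

Directed edges are left untouched, so reciprocal synapses between the same two samples are both kept.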
Optional fields
These fields MAY exist in the file.
- None
Derived fields
These fields MAY exist in the file, but MUST be calculable from other data.
src_fragment_id
- data type: uint64
- nullable: yes
The ID of the skeleton to which the source sample belongs. MUST be the fragment ID associated with the sample ID in the skeleton table of this context.
Note that while sample IDs SHOULD be stable, the fragment to which they belong MAY not be if the data evolves.
tgt_fragment_id
- data type: uint64
- nullable: yes
As for src_fragment_id, in reference to the tgt_sample_id.
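This example is not normative. Both derived fields are a lookup of each sample ID in the sample table of the same context. The sketch below uses a hypothetical function name and a plain dict in place of that table; samples which cannot be found give None, matching the nullable field.

```python
from typing import Optional


def derive_fragment_ids(
    sample_ids: list[int], sample_to_fragment: dict[int, int]
) -> list[Optional[int]]:
    """Look up the fragment of each sample; unknown samples give None (null)."""
    return [sample_to_fragment.get(sample_id) for sample_id in sample_ids]
```

The same function serves for src_fragment_id and tgt_fragment_id, given the corresponding column of sample IDs.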
Dotprops
Dotprops are point clouds used in NBLAST and related calculations.
Parent schemas
Schema metadata
Required schema metadata
These metadata MUST exist at the schema level.
neighborhood_size
- encoding: ASCII base-10 unsigned integer
How many nearest neighbors were used to calculate the tangent vector (referred to as k in literature).
Optional schema metadata
These metadata MAY exist at the schema level.
- None
Fields
These fields are defined in addition to those defined by any parent schemas.
Required fields
These fields MUST exist in the file.
tangent_x, tangent_y, tangent_z
- data type: float64
- nullable: no
The normalised tangent vector of the neighborhood around the point in 3D, in the units given in the schema metadata.
Optional fields
These fields MAY exist in the file.
colinearity
- data type: float64
- nullable: no
A value between 0 and 1 representing how colinear the points in the neighborhood are (referred to as α / alpha in literature).
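This example is not normative. A validator could check the Dotprops metadata and per-row values as sketched below. The function names and the tolerance are hypothetical choices, not requirements of this specification.

```python
import math


def parse_neighborhood_size(value: str) -> int:
    """Parse the ASCII base-10 unsigned integer stored in the schema metadata."""
    if not (value.isascii() and value.isdigit()):
        raise ValueError(f"not an ASCII base-10 unsigned integer: {value!r}")
    return int(value)


def is_unit_vector(x: float, y: float, z: float, tolerance: float = 1e-6) -> bool:
    """Check that a tangent vector is normalised, within a tolerance."""
    return math.isclose(math.hypot(x, y, z), 1.0, abs_tol=tolerance)


def is_valid_colinearity(value: float) -> bool:
    """Check that a colinearity value lies between 0 and 1 inclusive."""
    return 0.0 <= value <= 1.0
```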