Introduction

Specifications for storing and transmitting neuronal morphology and connectivity data using the Apache Arrow data model, and models compatible with it (e.g. Apache Parquet).

About Apache Arrow

From arrow.apache.org:

Apache Arrow defines a language-independent columnar memory format for flat and nested data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.

Using Apache Arrow gives neurarrow implementors access to a large ecosystem of existing software libraries across languages, as well as the ability to exchange that data between language runtimes and processes with minimal serialisation cost.

The use of standard binary formats such as Parquet also allows the data to be read now and in the future without neurarrow-specific implementations.

Prior art

These software packages manage tabular neuroscience data:

These file formats describe tabular neuroscience data:

These specifications build on Apache Arrow with domain-specific schemas:

Development happens on GitHub, and the rendered specification is at https://clbarnes.github.io/neurarrow/.

Conventions

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Versioning

This specification utilises semantic versioning.

Before version 1.0.0, any minor version change may break compatibility. Otherwise,

  • Patch versions are used for non-substantive or bug-fix changes to the text of the specification.
  • Minor versions are used for additions which do not break backward compatibility, i.e.
    • A v1.2 parser SHOULD be able to read all v1.1 files
    • A v1.1 parser MAY be able to partially read v1.2 files
  • Major versions are used for changes which break compatibility, i.e.
    • A v1.x parser MAY not be able to read a v2.y file
    • A v2.x parser MAY not be able to read a v1.y file

Extensions MAY use other versioning schemes.

All version strings MUST conform to the PEP-440 version specifier specification.
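
As a non-normative illustration, the compatibility rules above might be checked as follows. This is a minimal sketch which assumes plain major.minor.patch version strings (only a subset of what PEP 440 allows); parse_version and can_read are hypothetical helper names, not part of this specification.

```python
def parse_version(text):
    """Parse a plain 'major.minor.patch' string; missing parts default to 0."""
    parts = [int(p) for p in text.split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unsupported version string: {text!r}")
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def can_read(reader, file):
    """Return 'yes', 'partial' or 'unknown' for a reader/file version pair."""
    r_major, r_minor, _ = parse_version(reader)
    f_major, f_minor, _ = parse_version(file)
    if r_major != f_major:
        return "unknown"  # major versions may break compatibility
    if r_major == 0 and r_minor != f_minor:
        return "unknown"  # before 1.0.0, minor versions may break it too
    if f_minor <= r_minor:
        return "yes"  # newer (or equal) parser, older file
    return "partial"  # older parser, newer file


print(can_read("1.2.0", "1.1.0"))
print(can_read("1.1.0", "1.2.0"))
```

Patch versions are ignored because they only cover non-substantive or bug-fix changes to the text.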

Naming

Unless there is a very good reason not to, fields and metadata keys SHOULD use snake_case names.

Data types

Primitives

This specification uses arrow’s nomenclature for primitives. Note that strings MUST be in UTF-8 encoding.

Real values SHOULD be stored as a float64 field, or as a decimal string (e.g. 3.14) in a metadata value. Integer IDs SHOULD be stored as a uint64 field, or as a decimal string (e.g. 123) in a custom metadata value.

Attributes

Here, attributes are defined as arbitrary unstructured data and metadata.

Consider defining an extension to make your additional data discoverable and re-usable.

Attribute arrow metadata

Schema metadata and field metadata MAY contain unstructured arbitrary attributes, whose keys MUST be prefixed by attr:. Nested attributes MAY be encoded with :-separated elements (e.g. attr:parent_container:child_field), although storing a structure in a serialised form like JSON is also acceptable.
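
The flat, :-separated encoding of nested attributes could be produced as in the sketch below. flatten_attrs is a hypothetical helper, and serialising non-string leaf values as JSON is an assumption of this example rather than a requirement.

```python
import json


def flatten_attrs(nested, prefix="attr"):
    """Flatten a nested dict into ':'-separated keys beneath the given prefix."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{prefix}:{key}"
        if isinstance(value, dict):
            # Recurse, extending the key with another ':'-separated element.
            flat.update(flatten_attrs(value, full_key))
        elif isinstance(value, str):
            flat[full_key] = value
        else:
            # Assumption of this sketch: other leaves are stored as JSON text.
            flat[full_key] = json.dumps(value)
    return flat


metadata = flatten_attrs(
    {"parent_container": {"child_field": "hello"}, "n_trials": 3}
)
print(metadata)
```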

Attribute fields

Arbitrary attribute fields MAY be added to any schema. The name of the field MUST be prefixed by attr:.

Additionally, all schemas MAY have an attr field, which MUST be a nullable map from variable-length string keys to variable-length bytes values. These keys SHOULD NOT have an attr: prefix.

Neurarrow-specific metadata

Contexts

A single logical dataset may span multiple on-disk tables, either due to partitioning of a single logical table, or where tables of multiple types refer to each other. Different datasets may repeat IDs (e.g. for low integers). The context identifier is an arbitrary UTF-8 string identifying a shared context in which all IDs MUST be unique; strictly it is the (context, ID) pair which is globally unique.

It is RECOMMENDED that the identifier be an IRI or (hex-encoded) UUID to ensure uniqueness.
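
As an illustration of the (context, ID) pairing, the sketch below generates a hex-encoded UUID as a context identifier and looks for IDs repeated within a context across several tables. The function names and the (context, ids) input layout are hypothetical.

```python
import uuid
from collections import Counter


def new_context_id():
    """Generate a hex-encoded UUID for use as a context identifier."""
    return uuid.uuid4().hex


def duplicated_ids(tables):
    """Return (context, ID) pairs which occur more than once.

    `tables` is an iterable of (context, ids) pairs, one per table.
    """
    counts = Counter((context, i) for context, ids in tables for i in ids)
    return {pair for pair, n in counts.items() if n > 1}


ctx_a, ctx_b = new_context_id(), new_context_id()
tables = [(ctx_a, [1, 2, 3]), (ctx_b, [1, 2]), (ctx_a, [3, 4])]
# IDs 1 and 2 appear in both contexts, which is allowed;
# ID 3 appears twice within ctx_a, which is not.
print(duplicated_ids(tables))
```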

Spaces

Samples taken from two spatial experiments MAY have the same coordinates, but these may not actually be in the same physical location. Spatial datasets are routinely transformed from one “space” to another for comparison. It is important to track which space a dataset belongs to, to know whether it can be compared to another.

The space identifier is an arbitrary UTF-8 string identifying the space from which spatial data are taken (e.g. animals, transforms). Data sets from different spaces SHOULD NOT be spatially overlaid without transformation.

It is RECOMMENDED that the identifier be an IRI or UUID to ensure uniqueness.

Data from two contexts MAY share the same space (e.g. after transforming one experiment’s data to another’s space). Data from one context MUST share a space.
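
A reader might guard against overlaying data from different spaces with a check like the one sketched here. can_overlay is a hypothetical helper operating on schema metadata as a plain dict; treating a missing space identifier as "not comparable" is a conservative assumption of this sketch, not a rule of the specification.

```python
def can_overlay(meta_a, meta_b):
    """True only if both tables declare the same space identifier."""
    space_a = meta_a.get("space")
    space_b = meta_b.get("space")
    # Assumption: without a declared space, comparability is unknown,
    # so the safe answer is False.
    return space_a is not None and space_a == space_b


animal_1 = {"context": "experiment-1", "space": "https://example.com/spaces/template"}
animal_2 = {"context": "experiment-2", "space": "https://example.com/spaces/template"}
raw_scan = {"context": "experiment-3", "space": "https://example.com/spaces/raw-scan"}

print(can_overlay(animal_1, animal_2))
print(can_overlay(animal_1, raw_scan))
```

Note that sharing a space does not imply sharing a unit; coordinates may still need rescaling before comparison.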

Schemas

Schemas have metadata, containing:

  • required keys which MUST exist
  • optional keys which MAY exist
  • attribute keys
  • extension keys

Metadata keys and values MUST be strings, but MAY encode non-string data (e.g. numbers in decimal representation).

Schema metadata MAY contain arbitrary attributes; their keys MUST be prefixed by attr: as described in the Attributes section.

Schema metadata MAY contain extension metadata; their keys MUST be prefixed by the unique name of the extension as described in Extensions.

Schema fields

Schemas have fields (a.k.a. columns) described in this specification:

  • required fields which MUST exist
  • optional fields which MAY exist
  • derived fields which MAY exist, but MUST be calculable from other fields in the same context
    • derived fields MAY be invalidated if the fields they depend on are updated
  • extension fields whose name MUST be prefixed by the extension’s unique name as described in Extensions
  • attribute fields which MAY exist and whose name MUST be prefixed with attr: as described in Attributes
    • the attr field described in Attributes is also an attribute field
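
The naming rules in the list above allow a reader to sort field names into categories without knowing every extension in advance. The sketch below shows one way of doing so; classify_field and its category labels are hypothetical.

```python
def classify_field(name):
    """Return (kind, extension_name) for a field name."""
    if name == "attr" or name.startswith("attr:"):
        return ("attribute", None)
    prefix, sep, _ = name.partition(":")
    if sep:
        # Any other 'prefix:' is taken to be an extension's unique name.
        return ("extension", prefix)
    # Unprefixed names are required, optional or derived fields
    # defined by the specification itself.
    return ("specification", None)


for field_name in ["parent_id", "attr", "attr:lab_notes",
                   "com.example.transform:original_xyz"]:
    print(field_name, classify_field(field_name))
```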

In Arrow, fields can have metadata. This feature is currently unused by neurarrow, but MAY be used in future. Writers and extensions SHOULD NOT add or rely on field metadata.

Inheritance

Certain (“child”) schemas inherit from another (“parent”) schema. This means that the child schema:

  • MUST have all of the parents’ required fields and metadata
  • MAY have all of the parents’ optional and derived fields and metadata

Child schemas MAY inherit from more than one parent schema.

The Base and Spatial schemas are provided as abstract schemas. They SHOULD NOT be written as files themselves, but define parent schemas from which other (concrete) schemas MAY inherit.
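
The inheritance rule for required fields can be sketched as below. SchemaSpec, missing_fields and the three example schemas (parent_a, parent_b, child) are hypothetical and chosen only to show inheritance from more than one parent; they do not reproduce the actual hierarchy of this specification.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaSpec:
    name: str
    required_fields: set = field(default_factory=set)
    parents: list = field(default_factory=list)

    def all_required_fields(self):
        """Own required fields plus those of every ancestor."""
        result = set(self.required_fields)
        for parent in self.parents:
            result |= parent.all_required_fields()
        return result


def missing_fields(schema, present):
    """Required fields (including inherited ones) absent from a table."""
    return schema.all_required_fields() - set(present)


parent_a = SchemaSpec("parent_a", {"sample_id"})
parent_b = SchemaSpec("parent_b", {"x", "y", "z"})
child = SchemaSpec("child", {"parent_id"}, [parent_a, parent_b])

print(missing_fields(child, ["sample_id", "x", "y", "parent_id"]))
```

The same approach extends to required metadata keys by adding a second set to the dataclass.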

Storage

Individual tables SHOULD be stored as:

  • Arrow IPC File format
    • Also known as Feather format
    • Best for inter-process communication or memory mapped I/O applications
    • Extension .arrow
  • Parquet
    • Best for longer-term, potentially more space-efficient storage
    • Extension .parquet
  • Hive partitioned parquet
    • Parquet files split into chunks in directories based on a particular column’s value
    • Best for very large datasets, particularly in high-latency environments like cloud storage
    • Paths like .../fragment_id=123/*.skeleton.parquet
    • Note that modifying schema and field metadata on hive-partitioned data can mean updating a lot of files

The extension SHOULD be prefixed with the name of the file schema, like brain.skeletons.parquet or cns.connections.feather.

WARNING

Parquet does not directly support uint64 data (used for IDs in neurarrow). Large uint64 values are usually mapped to the negative range of parquet’s int64 data type, and then parsed back to uint64 when read back into arrow.
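
The mapping described in this warning is a two's-complement reinterpretation of the same 64 bits. Libraries normally perform it transparently; the sketch below only shows the arithmetic, using hypothetical function names, for readers who meet negative IDs when inspecting a file with tools that ignore the unsigned annotation.

```python
def uint64_to_int64(value):
    """Reinterpret an unsigned 64-bit value as a signed one."""
    if not 0 <= value < 2**64:
        raise ValueError("value out of uint64 range")
    return value - 2**64 if value >= 2**63 else value


def int64_to_uint64(value):
    """Reinterpret a signed 64-bit value as an unsigned one."""
    if not -(2**63) <= value < 2**63:
        raise ValueError("value out of int64 range")
    return value + 2**64 if value < 0 else value


largest_id = 2**64 - 1
stored = uint64_to_int64(largest_id)
print(stored, int64_to_uint64(stored) == largest_id)
```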

Implementations

Software implementing neurarrow should be listed here.

Software libraries

User-facing tools

Extensions

Neurarrow is designed to be extensible to support different use cases.

Naming extensions

Extensions MUST have unique names, and SHOULD ensure this by incorporating the web domain of the controlling entity in reverse DNS format, like com.example.my_extension.

Extension metadata

All metadata keys associated with an extension MUST be prefixed with the unique name of the extension and a colon, like com.example.my_extension:my_key. Extensions MAY add any number of required or optional keys.

All extensions MUST define a version metadata key, whose value MUST be a string conforming to the PEP-440 specification. Extension authors MAY use whatever versioning scheme they prefer (e.g. semantic, effort, calendar) and document it separately.

The extension version is required in all files using the extension, so that readers can check whether an extension is in use by checking for the existence of this key.

Extension authors SHOULD document all required and optional metadata keys and the format of their values.
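
Because every extension defines a version key, a reader can discover which extensions a table uses from its schema metadata alone. The sketch below assumes the metadata has been loaded into a plain dict of strings; extensions_in_use is a hypothetical helper.

```python
def extensions_in_use(metadata):
    """Map extension name to version string, found via '<name>:version' keys."""
    found = {}
    for key, value in metadata.items():
        name, sep, rest = key.partition(":")
        # 'attr:version' is an arbitrary attribute, not an extension.
        if sep and rest == "version" and name != "attr":
            found[name] = value
    return found


schema_metadata = {
    "version": "0.2.1",
    "context": "0123456789abcdef0123456789abcdef",
    "com.example.transform:version": "1.0",
    "com.example.transform:uuid": "not-a-real-uuid",
    "attr:version": "lab-internal",
}
print(extensions_in_use(schema_metadata))
```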

Extension fields

Extensions MAY add new fields to any schema. These MUST have names prefixed by the extension name and a colon (com.example.my_extension:my_field).

Extension authors SHOULD document all required, optional, and derived fields.

Extension example

This section is not normative.

  • The owner of https://example.com develops an extension to represent spatially transformed skeletons
  • Their extension is named com.example.transform
  • The schema metadata of tables using this extension includes com.example.transform:version and com.example.transform:uuid
  • The skeletons schema now includes com.example.transform:original_xyz, a struct{x:float64, y:float64, z:float64} field

Extension by inheritance

This section is not normative.

Rather than extending an existing schema, new schemas MAY be created which inherit from another. This is NOT RECOMMENDED, unless a new type of data is being represented. Data from extension A and extension B can both exist in the same table if they extend the same schema, but child schema X and child schema Y can only be composed by creating a third child schema Z which inherits from both.

For example, meshes can be represented as a table of vertices and a table of polygons referencing those vertices. The vertex table can re-use the point cloud schema, possibly with an extension encoding features like vertex normals. The polygon table could then create a new schema inheriting from base because there is no similar schema already.

Developing extensions

This section is non-normative.

Extensions should have a single concern, to facilitate modularity and re-use. For example, rather than defining a single general “metadata we like to keep track of in our lab” extension, with multiple fields for different purposes, consider multiple extensions.

Your extension documentation should list

  • which schemas it affects
  • any added fields
    • include their data types
    • include whether they are required by the extension, optional, or derived
      • if they are derived, include how to calculate them
  • any added schema or field metadata keys
    • include how to interpret their value
  • which version(s) of neurarrow it targets
  • whether it depends on any other extensions
    • if so, which version(s)
  • any known implementations

Ideally, extension documentation should be based on a public version-controlled repository (e.g. codeberg, gitlab, github), and listed below. Raise an issue, submit a PR, or contact the neurarrow developers to get your extension listed.

Known extensions

If you publish a neurarrow extension, please list it here.

Name | Description | Status | URL
net.clbarnes.swc | Interoperability with the SWC skeleton format | Experimental | https://github.com/clbarnes/neurarrow-ext/blob/main/extensions/swc.md

Wishlist

The below are examples of tabular neuromorphology data which could be defined as extensions of neurarrow.

  • navis uses an SWC-like pandas dataframe for its TreeNeuron morphology, and more dataframes for connectivity and Dotprops
    • standardise schema metadata fields
    • add derived fields for caching purposes
    • navis uses a connector table which has its own location; this could be an extension of the connections schema or a new schema

Changelog

All notable changes to this project are documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

0.2.1 - 2026-02-05

Major build process overhaul, minor corrections

0.2.0 - 2026-02-02

Added

  • Extension scheme
  • attr fields
  • context metadata
  • Connections schema
  • Inheritance concept

Removed

  • Connectors schema
  • Connectors-related fields in skeletons schema
  • labels field in skeletons schema
    • these can be re-added by an extension

0.1.0 - 2024-01-12

Initial implementation.

  • Skeletons
  • Dotprops
  • Connectors

Base (abstract)

This abstract schema defines metadata and fields available to all neurarrow tables.

Parent schemas

None

Schema metadata

These metadata keys are defined in addition to those defined by any parent schemas.

Required schema metadata

These keys MUST exist in the schema’s metadata.

version

  • encoding: UTF-8 string

The version of the neurarrow specification used to write the data.

context

  • encoding: UTF-8 string

The identifier for the context shared by all elements in this file, and possibly other files too; see Contexts.

Optional schema metadata

These keys MAY exist in the schema’s metadata.

attr:*

  • encoding: various

Unstructured arbitrary attributes beneath the attr: prefix. Nested attributes MAY be stored in a flat representation with : separators, e.g. attr:parent:child:key = value. However, storing a structure in a serialised form like JSON is also acceptable.

Fields

These fields are defined in addition to those defined by any parent schemas.

Required fields

These fields MUST exist in the file.

  • None

Optional fields

These fields MAY exist in the file.

attr

  • data type: map with variable-length string keys and variable-length string values
  • nullable: yes

Arbitrary attributes set on a per-row basis. Keys SHOULD NOT use the attr: prefix.

attr:*

  • data type: various
  • nullable: various

Arbitrary fields beneath the attr: prefix.

Derived fields

These fields MAY exist in the file, but MUST be calculable from other fields, and MAY be invalidated if the source fields are updated.

  • None

Spatial (abstract)

This abstract schema defines metadata and fields available to all neurarrow types containing spatial data.

Parent schemas

This schema inherits all fields and metadata from the following schemas:

Schema metadata

These metadata keys are defined in addition to those defined by any parent schemas.

Required schema metadata

These metadata MUST exist in the schema’s metadata.

unit

  • encoding: UTF-8 string

Empty for arbitrary units (e.g. voxels with unknown resolution), or the full name of a spatial unit according to UDUNITS-2 as below:

yoctometer, zeptometer, attometer, femtometer, picometer, nanometer, angstrom, micrometer, millimeter, centimeter, inch, decimeter, foot, yard, meter, dekameter, hectometer, kilometer, mile, megameter, gigameter, terameter, petameter, parsec, exameter, zettameter, yottameter
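
A reader could validate the unit value, and rescale between the SI-prefixed units, along the lines of the sketch below. The names are hypothetical, and the conversion helper deliberately covers only the metre-based units in the list above; it is an addition for illustration and not something this specification defines.

```python
# Powers of ten relative to the metre for the SI-prefixed units listed above.
SI_EXPONENTS = {
    "yoctometer": -24, "zeptometer": -21, "attometer": -18,
    "femtometer": -15, "picometer": -12, "nanometer": -9,
    "micrometer": -6, "millimeter": -3, "centimeter": -2,
    "decimeter": -1, "meter": 0, "dekameter": 1, "hectometer": 2,
    "kilometer": 3, "megameter": 6, "gigameter": 9, "terameter": 12,
    "petameter": 15, "exameter": 18, "zettameter": 21, "yottameter": 24,
}
# Listed units which are not a power of ten of the metre.
OTHER_UNITS = {"angstrom", "inch", "foot", "yard", "mile", "parsec"}


def is_valid_unit(unit):
    """An empty string means arbitrary units; otherwise the name must be listed."""
    return unit == "" or unit in SI_EXPONENTS or unit in OTHER_UNITS


def si_scale(from_unit, to_unit):
    """Factor by which to multiply coordinates when changing SI unit."""
    return 10.0 ** (SI_EXPONENTS[from_unit] - SI_EXPONENTS[to_unit])


print(is_valid_unit("nanometer"), is_valid_unit("nm"))
print(si_scale("micrometer", "nanometer"))
```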

Optional schema metadata

These metadata MAY exist in the schema’s metadata.

space

  • encoding: UTF-8 string

Unique identifier for the space in which the data exists.

Fields

These fields are defined in addition to those defined by any parent schemas.

Required fields

These fields MUST exist in the file.

Optional fields

These fields MAY exist in the file.

Derived fields

These fields MAY exist in the file, but MUST be calculable from other fields, and MAY be invalidated if the source fields are updated.

Point clouds

Point clouds are generic points in 3D space. Multiple point clouds (fragments) can be stored in one table.

Parent schemas

This schema inherits all fields and metadata from the following schemas:

Schema metadata

These metadata keys are defined in addition to those defined by any parent schemas.

Required schema metadata

These metadata MUST exist in the schema’s metadata.

  • None

Optional schema metadata

These metadata MAY exist in the schema’s metadata.

frag:*:*

  • encoding: various

Individual fragments MAY have arbitrary metadata set with keys like frag:{fragment_id}:{key}, e.g. frag:619:name.
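
Per-fragment metadata keys of this form can be grouped by fragment ID as sketched below. fragment_metadata is a hypothetical helper which assumes the schema metadata is available as a plain dict of strings.

```python
def fragment_metadata(metadata):
    """Group 'frag:{fragment_id}:{key}' entries by integer fragment ID."""
    grouped = {}
    for full_key, value in metadata.items():
        parts = full_key.split(":", 2)
        if len(parts) != 3 or parts[0] != "frag":
            continue
        id_text, key = parts[1], parts[2]
        if not (id_text.isascii() and id_text.isdigit()):
            continue  # not a decimal fragment ID; ignored by this sketch
        grouped.setdefault(int(id_text), {})[key] = value
    return grouped


schema_metadata = {
    "unit": "nanometer",
    "frag:619:name": "example neuron",
    "frag:619:annotator": "someone",
    "frag:7:name": "another neuron",
}
print(fragment_metadata(schema_metadata))
```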

Fields

These fields are defined in addition to those defined by any parent schemas.

Required fields

These fields MUST exist in the file.

sample_id

  • data type: uint64
  • nullable: no

An ID for this point, which MUST be unique within the context.

fragment_id

  • data type: uint64
  • nullable: no

The ID of the point cloud to which the sample belongs.

x, y, z

  • data type: float64
  • nullable: no

The location of the point in 3D, in the units given in the schema metadata.

Optional fields

These fields MAY exist in the file.

Derived fields

These fields MAY exist in the file, but MUST be calculable from other fields, and MAY be invalidated if the source fields are updated.

Skeletonised cells

Cells are often described in skeletonised form, as a rooted tree graph. The root SHOULD be the cell body.

Parent schemas

This schema inherits all fields and metadata from the following schemas:

Schema metadata

These metadata keys are defined in addition to those defined by any parent schemas.

Required schema metadata

These metadata MUST exist in the schema’s metadata.

  • None

Optional schema metadata

These metadata MAY exist in the schema’s metadata.

  • None

Fields

These fields are defined in addition to those defined by any parent schemas.

Required fields

These fields MUST exist in the file.

parent_id

  • data type: uint64
  • nullable: yes
    • exactly one sample per fragment MUST be null, which MUST be the root; conventionally the cell body or nearest node to it

The ID of the parent node for this sample. MUST be defined elsewhere in the file.
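
The two constraints on parent_id (exactly one root per fragment, and every parent defined in the file) could be checked as in this sketch. check_skeleton and its (sample_id, fragment_id, parent_id) row layout are hypothetical; the sketch does not check for cycles, or that a parent belongs to the same fragment as its child.

```python
def check_skeleton(rows):
    """Return a list of problems found in (sample_id, fragment_id, parent_id) rows.

    A null parent_id is represented by None.
    """
    rows = list(rows)
    known_ids = {sample_id for sample_id, _, _ in rows}
    roots_per_fragment = {}
    problems = []
    for sample_id, fragment_id, parent_id in rows:
        roots_per_fragment.setdefault(fragment_id, 0)
        if parent_id is None:
            roots_per_fragment[fragment_id] += 1
        elif parent_id not in known_ids:
            problems.append(f"sample {sample_id}: unknown parent {parent_id}")
    for fragment_id, n_roots in sorted(roots_per_fragment.items()):
        if n_roots != 1:
            problems.append(f"fragment {fragment_id}: {n_roots} roots, expected 1")
    return problems


good = [(1, 10, None), (2, 10, 1), (3, 10, 2)]
bad = [(1, 10, None), (2, 10, None), (3, 11, 99)]
print(check_skeleton(good))
print(check_skeleton(bad))
```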

Optional fields

These fields MAY exist in the file.

radius

  • data type: float64
  • nullable: yes
    • where radius is not known

An approximation of the radius of the cell around this sample, in the units given in the schema metadata.

Derived fields

These fields MAY exist in the file, but MUST be calculable from other fields, and MAY be invalidated if the source fields are updated.

child_ids

  • data type: list[uint64]
  • nullable: yes
    • where children are not calculated
    • nodes known to be leaves (no children) SHOULD use an empty list instead of null

The IDs of samples which have this sample as a parent.

n_children

  • data type: uint32
  • nullable: yes
    • where children are not calculated
    • nodes known to be leaves (no children) SHOULD use 0 instead of null

How many child nodes a particular sample has.

strahler

  • data type: uint32
  • nullable: yes
    • where strahler index is not calculated

The Strahler number of this sample.
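
All three derived fields can be calculated from parent_id alone. The sketch below does so for a mapping of sample ID to parent ID (None for roots); derive_tree_fields is a hypothetical name, the input is assumed to be a valid tree or forest, and leaves are given a Strahler number of 1, which is the usual convention but is an assumption here.

```python
def derive_tree_fields(parents):
    """Return (child_ids, n_children, strahler) dicts keyed by sample ID."""
    child_ids = {sample_id: [] for sample_id in parents}
    for sample_id, parent_id in parents.items():
        if parent_id is not None:
            child_ids[parent_id].append(sample_id)
    n_children = {s: len(c) for s, c in child_ids.items()}

    # Depth-first order from the roots puts every node after its parent,
    # so the reversed order visits children before their parents.
    order = []
    stack = [s for s, p in parents.items() if p is None]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(child_ids[node])

    strahler = {}
    for node in reversed(order):
        ranks = sorted((strahler[c] for c in child_ids[node]), reverse=True)
        if not ranks:
            strahler[node] = 1  # leaf
        elif len(ranks) > 1 and ranks[0] == ranks[1]:
            strahler[node] = ranks[0] + 1  # two or more children tie for highest
        else:
            strahler[node] = ranks[0]
    return child_ids, n_children, strahler


# Root 1 has children 2 and 5; node 2 has children 3 and 4.
parents = {1: None, 2: 1, 3: 2, 4: 2, 5: 1}
child_ids, n_children, strahler = derive_tree_fields(parents)
print(child_ids, n_children, strahler)
```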

Connections

Connections are relationships between point cloud samples other than morphological continuity.

Schema metadata

Required schema metadata

These metadata MUST exist at the schema level.

  • None

Optional schema metadata

These metadata MAY exist at the schema level.

  • None

Fields

Required fields

These fields MUST exist in the file.

connection_id

  • data type: uint64
  • nullable: no

An ID for this connection, which MUST be unique within the context.

src_sample_id

  • data type: uint64
  • nullable: no

An ID for the point cloud sample at the start of this edge. MUST exist as a sample_id in an accessible skeleton table.

If the edge is directed, the logical direction of the connection is from src to tgt. If the edge is undirected, the samples are interchangeable.

tgt_sample_id

  • data type: uint64
  • nullable: no

An ID for the point cloud sample at the end of this edge. MUST exist as a sample_id in an accessible skeleton table.

If the edge is directed, the logical direction of the connection is from src to tgt. If the edge is undirected, the samples are interchangeable.

type

  • data type: dictionary with index type uint16 and value type variable-length string
  • nullable: no

The type of this connection. Acceptable values in this specification are

  • synapse (directed): a chemical synapse where the src sample is presynaptic and the tgt sample is postsynaptic
  • gap_junction (undirected): an electrical synapse

Undirected connection types SHOULD NOT be repeated to represent both directions (i.e. if 1 =gap_junction=> 2 is defined, do not explicitly define 2 =gap_junction=> 1).

Extensions MAY add additional connection types, which MUST be prefixed by the extension name and a colon, e.g. com.example.my_extension:my_type.
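
A writer might detect undirected connections which have been repeated in both directions with a check like the one sketched below. The names and the (src_sample_id, tgt_sample_id, type) tuple layout are hypothetical; connection types added by extensions are not known to this sketch and are therefore skipped.

```python
# Undirected connection types defined by this specification.
UNDIRECTED_TYPES = {"gap_junction"}


def reversed_duplicates(edges):
    """Return undirected edges whose reverse appeared earlier in the input."""
    seen = set()
    flagged = []
    for src, tgt, conn_type in edges:
        if conn_type not in UNDIRECTED_TYPES:
            continue  # directed or unknown types are left alone
        if (tgt, src, conn_type) in seen:
            flagged.append((src, tgt, conn_type))
        seen.add((src, tgt, conn_type))
    return flagged


edges = [
    (1, 2, "gap_junction"),
    (2, 1, "gap_junction"),  # redundant: same undirected edge as above
    (1, 2, "synapse"),
    (2, 1, "synapse"),  # fine: a distinct, reciprocal directed edge
]
print(reversed_duplicates(edges))
```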

Optional fields

These fields MAY exist in the file.

  • None

Derived fields

These fields MAY exist in the file, but MUST be calculable from other data.

src_fragment_id

  • data type: uint64
  • nullable: yes

The ID of the skeleton to which the source sample belongs. MUST be the fragment ID associated with the sample ID in the skeleton table of this context.

Note that while sample IDs SHOULD be stable, the fragment to which they belong MAY not be if the data evolves.

tgt_fragment_id

  • data type: uint64
  • nullable: yes

As for src_fragment_id, in reference to the tgt_sample_id.
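
Both derived fields amount to a lookup of each sample ID in the corresponding sample table. The sketch below assumes that table has already been reduced to a sample_id to fragment_id mapping; derive_fragment_ids is a hypothetical helper, and samples missing from the mapping yield None (null).

```python
def derive_fragment_ids(connections, sample_to_fragment):
    """Return (src_fragment_id, tgt_fragment_id) for each (src, tgt) sample pair."""
    return [
        (sample_to_fragment.get(src), sample_to_fragment.get(tgt))
        for src, tgt in connections
    ]


sample_to_fragment = {1: 100, 2: 100, 3: 200}
connections = [(1, 3), (3, 2), (2, 4)]
print(derive_fragment_ids(connections, sample_to_fragment))
```

Since fragment membership may change as the data evolves, these values should be recalculated whenever the sample table is updated.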

Dotprops

Dotprops are point clouds used in NBLAST and related calculations.

Parent schemas

Schema metadata

Required schema metadata

These metadata MUST exist at the schema level.

neighborhood_size

  • encoding: ASCII base-10 unsigned integer

How many nearest neighbors were used to calculate the tangent vector (referred to as k in literature).

Optional schema metadata

These metadata MAY exist at the schema level.

  • None

Fields

These fields are defined in addition to those defined by any parent schemas.

Required fields

These fields MUST exist in the file.

tangent_x, tangent_y, tangent_z

  • data type: float64
  • nullable: no

The normalised tangent vector of the neighborhood around the point in 3D, in the units given in the schema metadata.

Optional fields

These fields MAY exist in the file.

colinearity

  • data type: float64
  • nullable: no

A value between 0 and 1 representing how colinear the points in the neighborhood are (referred to as α / alpha in literature).
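
The constraints on Dotprops data can be checked as in the following sketch. The function names are hypothetical and the tolerance used when testing for unit length is an assumption of this example.

```python
import math


def parse_neighborhood_size(text):
    """Parse the neighborhood_size metadata value (ASCII base-10 unsigned integer)."""
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not an unsigned base-10 integer: {text!r}")
    return int(text)


def check_dotprop(tangent, colinearity=None, tol=1e-6):
    """Return a list of problems for one row's tangent vector and colinearity."""
    problems = []
    length = math.sqrt(sum(c * c for c in tangent))
    if not math.isclose(length, 1.0, abs_tol=tol):
        problems.append("tangent is not unit length")
    if colinearity is not None and not 0.0 <= colinearity <= 1.0:
        problems.append("colinearity is outside [0, 1]")
    return problems


print(parse_neighborhood_size("5"))
print(check_dotprop((1.0, 0.0, 0.0), 0.5))
print(check_dotprop((1.0, 1.0, 0.0), 1.5))
```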