Repository
Each atproto account has a repository (or "repo") which stores all of their public data records. Repository contents are entirely public and verifiable ("self-certifying"). Record deletion is supported without leaving a trace or "tombstone" of previous contents.
The repository data structure is a content-addressed Merkle-tree. Creating, updating, or deleting records (or any other mutations to the repository) changes the root hash value of the overall repository tree. Each published version of the repository tree structure is represented as a commit. Commits are cryptographically signed, with rotatable signing keys, which allows recursive authentication of either the repository structure as a whole, or compact "proof chains" for individual records.
Repositories and their contents are represented as a graph of data objects, encoded in DRISL CBOR and referencing each other by content hash (CID Links). Larger binary media files ("blobs") are also referenced by content hash, but are not stored directly in the repository. Complete repositories can be exported as CAR files for synchronization, offline backup, account migration, or other purposes.
In the atproto network architecture, the authoritative location of an account's repository is the associated Personal Data Server (PDS). An account's current PDS location is declared in the DID Document.
Repository Data Structure
At a high level, a repository is a key/value mapping where the keys are path names (as strings) and the values are records (CBOR objects).
A Merkle Search Tree (MST) is used to store this mapping. This content-addressed deterministic data structure stores data in key-sorted order. It is reasonably efficient for key lookups, key range scans, and appends (assuming sorted record paths). Merkle Search Trees and their performance properties were originally described in this research publication:
Alex Auvolat, François Taïani. Merkle Search Trees: Efficient State-Based CRDTs in Open Networks. SRDS 2019 - 38th IEEE International Symposium on Reliable Distributed Systems, Oct 2019, Lyon, France. pp.1-10, doi:10.1109/SRDS.2019.00032
You do not need to read the above publication to implement MSTs as they are used in atproto.
Repositories are intended to store up to single-digit millions of records. Beyond that they become unwieldy to distribute and process.
This document describes version 3 of the repository binary format. Version 2 had a slightly different commit object schema, but is mostly compatible with 3. Version 1 had a different MST fanout configuration, and an incompatible schema for commits and repository metadata. Version 1 is deprecated, no repositories in this format exist in the network, and implementations do not need to support it.
Repository Paths
Repo paths are strings, while MST keys are byte arrays. Neither may be empty (zero-length). While repo path strings are currently limited to a subset of ASCII (making encoding a no-op), the encoding is specified as UTF-8.
Repo paths currently have a fixed structure of <collection>/<record-key>. This means a valid, normalized Namespace ID (NSID), followed by a /, followed by a valid Record Key. The path should not start with a leading /, and should always have exactly two path segments. The ASCII characters allowed in the entire path string are currently: letters (A-Za-z), digits (0-9), slash (/), period (.), hyphen (-), underscore (_), and tilde (~). The specific path segments . and .. are not valid NSIDs or record keys, and will always be disallowed in repo paths.
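As a rough illustration of the path rules above, the following sketch checks only the overall shape of a repo path: exactly two segments, the currently allowed ASCII characters, and no "." or ".." segments. It does not validate full NSID or Record Key syntax, and the function names are hypothetical, not part of any atproto library.

```python
import re

# Characters currently allowed inside a path segment, per the list above
# (the slash is handled separately as the segment separator).
_SEGMENT = re.compile(r"[A-Za-z0-9.\-_~]+")


def is_plausible_repo_path(path: str) -> bool:
    """Structural check only; NSID and Record Key syntax are NOT fully validated."""
    segments = path.split("/")
    # Exactly <collection>/<record-key>: a leading or trailing slash adds a segment.
    if len(segments) != 2:
        return False
    for segment in segments:
        if segment in (".", ".."):
            return False
        if _SEGMENT.fullmatch(segment) is None:
            return False
    return True


def repo_path_to_mst_key(path: str) -> bytes:
    """Map a repo path string to an MST key (byte array) via UTF-8."""
    if not is_plausible_repo_path(path):
        raise ValueError(f"not a valid repo path: {path!r}")
    return path.encode("utf-8")
```

A real implementation would additionally run the collection segment through an NSID validator and the second segment through a Record Key validator.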
Note that repo paths for all records in the same collection are sorted together in the MST, making enumeration (via key scan) and export efficient. Additionally, the Timestamp ID (TID) record key scheme was intentionally selected to provide chronological sorting of MST keys within the scope of a collection. Appends are more efficient than random insertions/mutations within the tree, and when enumerating records within a collection they will be in chronological order (assuming that TID generation was done correctly, which cannot be relied on in general).
Commit Objects
The top-level data object in a repository is a signed commit. The data fields are:
- did (string, required): the account DID associated with the repo, in strictly normalized form (eg, lowercase as appropriate)
- version (integer, required): fixed value of 3 for this repo format version
- data (CID link, required): pointer to the top of the repo contents tree structure (MST)
- rev (string, TID format, required): revision of the repo, used as a logical clock. Must increase monotonically. Recommend using current timestamp as TID; rev values in the "future" (beyond a fudge factor) should be ignored and not processed.
- prev (CID link, nullable): pointer (by hash) to a previous commit object for this repository. Could be used to create a chain of history, but largely unused (included for v2 backwards compatibility). In version 3 repos, this field must exist in the CBOR object, but is virtually always null. NOTE: previously specified as nullable and optional, but this caused interoperability issues.
- sig (byte array, required): cryptographic signature of this commit, as raw bytes
An unsigned commit object has all the same fields except for sig. The process for signing a commit is to populate all the data fields, and then serialize the unsigned commit with DRISL CBOR. The output bytes are then hashed with SHA-256, and the binary hash output (without hex encoding) is then signed using the current "signing key" for the account. The signature is then stored as raw bytes in a commit object, along with all the other data fields.
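The signing steps can be sketched as below. This is only an outline of the control flow: the DRISL CBOR encoder and the signing primitive are passed in as callables (encode_drisl and sign_digest, both hypothetical names), because neither is part of the Python standard library and the choice of curve depends on the account's signing key.

```python
import hashlib
from typing import Callable


def sign_commit(
    unsigned: dict,
    encode_drisl: Callable[[dict], bytes],  # assumed: a DRISL CBOR encoder
    sign_digest: Callable[[bytes], bytes],  # assumed: signs a 32-byte hash with the account signing key
) -> dict:
    """Return a signed copy of an unsigned commit object."""
    if "sig" in unsigned:
        raise ValueError("expected an unsigned commit (no 'sig' field)")
    # All data fields must be populated before signing; prev may be None (null).
    for field in ("did", "version", "data", "rev", "prev"):
        if field not in unsigned:
            raise ValueError(f"missing commit field: {field}")
    # Serialize, hash with SHA-256 (binary output, no hex), then sign the hash.
    digest = hashlib.sha256(encode_drisl(unsigned)).digest()
    # The signature is stored as raw bytes alongside the other fields.
    return {**unsigned, "sig": sign_digest(digest)}
```

Note that some cryptography libraries hash the message internally as part of signing. When adapting this sketch, take care that the data is hashed with SHA-256 exactly once.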
The CID for a commit overall is generated by serializing a signed commit object as DRISL CBOR. The DRISL CBOR (not "raw") codec should be used for CIDs linking to commit objects. See notes on the "blessed" CID format below.
Note that neither the signature itself nor the signed commit indicate either the type of key used (curve type), or the specific public key used. That information must be fetched from the account's DID document. With key rotation, verification of older commit signatures can become ambiguous. The most recent commit should always be verifiable using the current DID document. This implies that a new repository commit should be created every time the signing key is rotated. Such a commit does not need to update the data CID link.
MST Structure
At a high level, the repository MST is a key/value mapping where the keys are non-empty byte arrays, and the values are CID links to records. The MST data structure should be fully reproducible from such a mapping of bytestrings-to-CIDs, with exactly reproducible root CID hash (aka, the data field in commit object).
Every node in the tree structure contains a set of key/CID mappings, as well as links to other sub-tree nodes. The entries and links are in key-sorted order, with all of the keys of a linked sub-tree (recursively) falling in the range corresponding to the link location. The sort order is from left (lexically first) to right (lexically latter). Each key has a depth derived from the key itself, which determines which sub-tree it ends up in. The top node in the tree contains all of the keys with the highest depth value (which for a small tree may be all depth zero, so a single node). Links to the left or right of the entire node, or between any two keys in the node, point to a sub-tree node containing keys that fall in the corresponding key range.
An empty repository with no records is represented as a single MST node with an empty array of entries. This is the only situation in which a tree may contain an empty leaf node which does not either contain keys ("entries") or point to a sub-tree containing entries. The top of the tree must not be an empty node which only points to a sub-tree. Empty intermediate nodes are allowed, as long as they point to a sub-tree which does contain entries. In other words, empty nodes must be pruned from the top and bottom of the tree, but empty intermediate nodes must be kept, such that sub-tree links do not skip a level of depth. The overall structure and shape of the MST is deterministic based on the current key/value content, regardless of the history of insertions and deletions that led to the current contents.
For the atproto MST implementation, the hash algorithm used is SHA-256 (binary output), counting "prefix zeros" in 2-bit chunks, giving a fanout of 4. To compute the depth of a key:
- hash the key (a byte array) with SHA-256, with binary output
- count the number of leading binary zeros in the hash, and divide by two, rounding down
- the resulting non-negative integer is the depth of the key
Some examples, with the given ASCII strings mapping to byte arrays:
- 2653ae71: depth "0"
- blue: depth "1"
- app.bsky.feed.post/454397e440ec: depth "4"
- app.bsky.feed.post/9adeb165882c: depth "8"
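The depth rule above can be sketched in a few lines of Python using only the standard library. The function name key_depth is a hypothetical choice for this example.

```python
import hashlib


def key_depth(key: bytes) -> int:
    """Depth of an MST key: leading zero bits of SHA-256(key), divided by two."""
    digest = hashlib.sha256(key).digest()
    leading_zero_bits = 0
    for byte in digest:
        if byte == 0:
            # A whole zero byte contributes 8 leading zero bits; keep scanning.
            leading_zero_bits += 8
            continue
        # First non-zero byte: count the zero bits above its highest set bit.
        leading_zero_bits += 8 - byte.bit_length()
        break
    # Two bits per level gives a fanout of 4; integer division rounds down.
    return leading_zero_bits // 2
```

Calling key_depth(b"blue") or key_depth(b"app.bsky.feed.post/454397e440ec") should reproduce the depths listed in the examples above.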
There are many MST nodes in repositories, so it is important that they have a compact binary representation, for storage efficiency. Within every node, keys (byte arrays) are compressed by eliding common prefixes, with each entry indicating how many bytes it shares with the previous key in the array. The first entry in the array for a given node must contain the full key, and a common prefix length of 0. This key compaction is internal to nodes; it does not extend across multiple nodes in the tree. The compaction scheme is mandatory, to ensure that the MST structure is deterministic across implementations.
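A minimal sketch of the prefix compaction for the keys of a single node follows. It takes the node's keys already in sorted order and returns (prefix length, key suffix) pairs; the function names are illustrative only.

```python
from typing import List, Tuple


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes shared by a and b."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def compress_keys(sorted_keys: List[bytes]) -> List[Tuple[int, bytes]]:
    """Elide the prefix each key shares with the previous key in the same node."""
    entries = []
    previous = b""  # nothing precedes the first key, so its prefix length is 0
    for key in sorted_keys:
        shared = common_prefix_len(previous, key)
        entries.append((shared, key[shared:]))
        previous = key
    return entries
```

For example, two adjacent keys in the same collection that differ only in their final byte compress to a full first key followed by an entry holding a single suffix byte.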
The node data schema fields are:
- l ("left", CID link, nullable): link to sub-tree Node on a lower level and with all keys sorting before keys at this node
- e ("entries", array of objects, required): ordered list of TreeEntry objects
  - p ("prefixlen", integer, required): count of bytes shared with previous TreeEntry in this Node (if any)
  - k ("keysuffix", byte array, required): remainder of key for this TreeEntry, after "prefixlen" have been removed
  - v ("value", CID Link, required): link to the record data (CBOR) for this entry
  - t ("tree", CID Link, nullable): link to a sub-tree Node at a lower level which has keys sorting after this TreeEntry's key (to the "right"), but before the next TreeEntry's key in this Node (if any)
When parsing MST data structures, the depth and sort order of keys should be verified. This is particularly true for untrusted inputs, but it is simplest to just verify every time. Node size and other parameters of the tree structure also need to be limited; see the "Security Considerations" section of this document.
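The per-node checks could look roughly like the sketch below. It expands the p/k fields of one node's entries back into full keys, then verifies strict sort order, maximal prefix compaction, and that all keys in the node share one depth. The function names and the max_entries default are placeholders chosen for this example, not values from the specification.

```python
import hashlib
from typing import List


def key_depth(key: bytes) -> int:
    """Leading zero bits of SHA-256(key), divided by two (rounding down)."""
    digest = hashlib.sha256(key).digest()
    leading_zero_bits = 256 - int.from_bytes(digest, "big").bit_length()
    return leading_zero_bits // 2


def expand_and_check_keys(entries: List[dict], max_entries: int = 1000) -> List[bytes]:
    """Rebuild full keys from one node's entries and verify them."""
    if len(entries) > max_entries:  # placeholder limit, see lead-in
        raise ValueError("too many entries in node")
    keys: List[bytes] = []
    previous = b""
    for index, entry in enumerate(entries):
        shared, suffix = entry["p"], entry["k"]
        if shared < 0 or shared > len(previous) or (index == 0 and shared != 0):
            raise ValueError("invalid prefix length")
        # Compaction is mandatory: the shared prefix must be as long as possible.
        if shared < len(previous) and suffix[:1] == previous[shared:shared + 1]:
            raise ValueError("prefix length is not maximal")
        key = previous[:shared] + suffix
        if not key:
            raise ValueError("empty key")
        if index > 0 and key <= previous:
            raise ValueError("keys are not in strictly sorted order")
        keys.append(key)
        previous = key
    if len({key_depth(key) for key in keys}) > 1:
        raise ValueError("keys in one node must all have the same depth")
    return keys
```

A complete verifier would also compare this depth against the node's position in the tree and check that sub-tree keys fall within the range implied by their link location.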
CID Formats
The blessed CID format described in Data Model is used for references to commit objects, MST node objects, and records (eg, MST leaf nodes to records).
In the context of repositories, it is desirable for the overall data structure to be reproducible given the contents, so the CID types should be strictly constrained and enforced. Commit objects with non-compliant prev or data links are considered invalid. MST Node objects with non-compliant links to other MST Node objects are considered invalid, and the entire MST data structure invalid.
More flexibility is allowed in processing the "leaf" links from MST to records, and implementations should retain the exact CID links used for these mappings. Implementations should strictly follow the blessed CID format when generating new CID Links to records.
CAR File Serialization
The standard file format for storing data objects is Content Addressable aRchives (CAR). The standard repository export format for atproto repositories is CAR v1, which has file suffix .car and mimetype application/vnd.ipld.car. This aligns with the DASL CAR specification.
The CARv1 format is very simple. It contains a small metadata header (which can indicate one or more "root" CID links), and then a series of binary "blocks", each of which is a data object. In the context of atproto repositories:
- The first element of the CAR roots metadata array must be the CID of the most relevant Commit object. For a generic export, this is the current (most recent) commit. Additional CIDs may also be present in the roots array, with (for now) undefined meaning or order
- For full exports, the full repo structure must be included for the indicated commit, which includes all records and all MST nodes
- The preferred order of blocks within the CAR file is described below. At this time, this order is not required, and parsing implementations (and services) must be tolerant of CAR files with arbitrary block ordering.
- Additional blocks, including records, may or may not be included in the CAR file
When importing CAR files, note that there may be dangling CID references. For example, repositories may contain CID Links to blobs or records in other repositories, and the blocks corresponding to those blobs or references would likely not be included in the CAR file.
The CARv1 specification is agnostic about the same block appearing multiple times in the same file ("Duplicate Blocks"). Implementations should be robust to both duplication and de-duplication of blocks, and should also ignore any unnecessary or unlinked blocks.
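To make the CARv1 layout concrete, here is a hedged sketch of a streaming reader. It assumes the usual CARv1 framing (a varint-length-prefixed header, then varint-length-prefixed sections each holding a CID followed by block data) and only handles binary CIDv1. The header is returned as undecoded bytes because decoding it requires a CBOR library. MAX_SECTION_BYTES is an arbitrary placeholder, and all function names are hypothetical.

```python
import hashlib
import io
from typing import BinaryIO, Iterator, Optional, Tuple

MAX_SECTION_BYTES = 2 * 1024 * 1024  # arbitrary placeholder limit, not from the spec


def read_varint(stream: BinaryIO) -> Optional[int]:
    """Read an unsigned LEB128 varint; return None on clean end-of-file."""
    result, shift = 0, 0
    while True:
        raw = stream.read(1)
        if not raw:
            if shift == 0:
                return None
            raise ValueError("truncated varint")
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return result
        shift += 7
        if shift > 63:
            raise ValueError("varint too long")


def read_car_header(stream: BinaryIO) -> bytes:
    """Return the raw (still CBOR-encoded) header, which carries the roots array."""
    length = read_varint(stream)
    if length is None or length > MAX_SECTION_BYTES:
        raise ValueError("missing or oversized CAR header")
    header = stream.read(length)
    if len(header) != length:
        raise ValueError("truncated CAR header")
    return header


def split_cid(section: bytes) -> Tuple[bytes, bytes]:
    """Split one section into (binary CIDv1, block data), checking SHA-256 hashes."""
    view = io.BytesIO(section)
    version = read_varint(view)
    codec = read_varint(view)
    hash_code = read_varint(view)
    digest_len = read_varint(view)
    if version != 1 or codec is None or hash_code is None or digest_len is None:
        raise ValueError("unsupported or truncated CID")
    start = view.tell()
    cid_end = start + digest_len
    if cid_end > len(section):
        raise ValueError("truncated CID digest")
    cid, data = section[:cid_end], section[cid_end:]
    # 0x12 is the multihash code for SHA-256.
    if hash_code == 0x12 and hashlib.sha256(data).digest() != section[start:cid_end]:
        raise ValueError("block data does not match its CID")
    return cid, data


def iter_car_blocks(stream: BinaryIO) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (cid, data) pairs; call after read_car_header on the same stream."""
    while True:
        length = read_varint(stream)
        if length is None:
            return
        if length > MAX_SECTION_BYTES:
            raise ValueError("oversized CAR section")
        section = stream.read(length)
        if len(section) != length:
            raise ValueError("truncated CAR section")
        yield split_cid(section)
```

Because blocks are yielded as they are read, a caller can de-duplicate or discard unlinked blocks without holding the whole file in memory.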
Streamable CAR Block Ordering
The block ordering scheme described here is still work-in-progress. As of February 2026, it has not been included in popular implementations or deployed broadly in the atproto network. Details may change based on implementation experience.
Repository export CAR files are usually parsed into an MST structure. If the blocks in the CAR file are in a consistent order, this can be done by "walking" references in a single pass, without buffering the entire CAR file in memory. This memory efficiency is helpful when working with large repositories, or processing many repositories in parallel.
The consistent block ordering rules to support this use case are:
- The commit object must be the first block.
- MST nodes are included in "pre-order", meaning that parent nodes precede child nodes and leaf nodes. Record blocks are interleaved between MST nodes. This starts with the root MST node (referenced from the commit data field) as the second block.
- Following each MST node in the tree, include the blocks corresponding to the entries in that node, in the order they are listed in the node. If an entry slot is a child MST node, include that node (and recurse depth-first). If the entry slot is a record, include the record block.
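One reading of these ordering rules is sketched below, over an in-memory model where CIDs are plain strings and decoded MST nodes are dicts using the l, e, v, and t field names from the node schema. The function names are hypothetical, and since the ordering scheme is still a work in progress this should be treated as an interpretation rather than a reference.

```python
from typing import Dict, Iterator


def walk_node(nodes: Dict[str, dict], node_cid: str) -> Iterator[str]:
    """Yield block CIDs for one MST node and everything beneath it, in pre-order."""
    node = nodes[node_cid]
    yield node_cid  # the parent node precedes its children and records
    if node["l"] is not None:
        # The left sub-tree sorts before every entry in this node.
        yield from walk_node(nodes, node["l"])
    for entry in node["e"]:
        yield entry["v"]  # the record block for this entry
        if entry["t"] is not None:
            # The sub-tree to the right of this entry, recursing depth-first.
            yield from walk_node(nodes, entry["t"])


def streamable_block_order(commit_cid: str, commit: dict, nodes: Dict[str, dict]) -> Iterator[str]:
    """Commit first, then the root MST node, then interleaved nodes and records."""
    yield commit_cid
    yield from walk_node(nodes, commit["data"])
```

If the same record CID appears under more than one path, this sketch emits it more than once; whether to de-duplicate is left to the caller.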
Repository Diffs
Mutations to a repository tree can be encapsulated as a "diff". The basic idea is that all new data blocks (including the new commit object, MST nodes, and records) that have changed since a previous revision can be bundled together and serialized in CAR format. Conceptually, diffs could be "applied" to mirrors of a repository to keep them updated. This forms the basis of the Data Synchronization part of the specification.
Some details about representing repository diffs as "CAR slices":
- uses the same CAR format described above for full repository exports
- the root CID indicated in the CAR header (the first element of roots) must point to the new commit block (which must be included)
- all MST nodes in the current repo revision which did not exist in the previous repo revision must be included
- any additional MST nodes needed to support "operation inversion" must be included (described below)
- all required blocks must be included even if they appeared previously in the history of the repository.
- for example, if a record is created in rev C, deleted in rev F, and re-created in rev N, the diff "since F" must include the record block
- all "created" records must be included
- any records which have been "deleted" should not have the record value included
- any records which have been "updated" should include the new value, and should not include the previous version
- if a record value has been deleted from one path and created (or updated) at a new path in the same diff (eg, was "moved"), it should be included
- parsing implementations must be tolerant of additional unexpected blocks in the diff, which they should ignore
- note that unreasonable quantities of unnecessary block data may be considered a form of resource abuse
The diff is a partial Merkle tree, including a signed commit, and can be partially verified in isolation. For example, the diff contains a "proof chain" to verify any created or updated records. If both metadata about the previous state of the repository and a complete set of record operations are available, then it becomes possible to fully verify the integrity of the diff.
The sync "firehose" mechanism uses a process called "operation inversion" to validate a list of record operations against a diff. In this process, the diff is parsed as a partial MST, and then each operation is applied in reverse: a "create" is applied as a "delete", etc. After all operations have been applied, the root CID (hash) of the MST is recomputed, and should match the data field of the previous revision of the repository. If the list of record operations is inaccurate or incomplete, the inverted MST will not match.
In some cases the inversion process requires the inclusion of contextual MST nodes in the diff, which would not otherwise need to be included. For example, MST nodes that reference record paths directly adjacent (in sorted order) to those mutated in the diff. The definition of which nodes need to be included in the diff is ultimately tautological: those MST nodes which are necessary for the inversion process. A set of test vectors is included in the atproto test case git repository.
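The inversion idea can be illustrated with a deliberately simplified model. In the sketch below the repository is a flat dict from path to CID string, and stand_in_root is a plain hash over the sorted mapping. It is NOT a real MST root computation, and a real implementation would operate on the partial MST parsed from the diff. The operation field names (action, path, cid, prev) are assumptions made for this example.

```python
import hashlib
from typing import Dict, List


def stand_in_root(state: Dict[str, str]) -> str:
    """Stand-in for the MST root CID: a hash over the sorted path/CID mapping."""
    hasher = hashlib.sha256()
    for path in sorted(state):
        hasher.update(path.encode("utf-8") + b"\x00" + state[path].encode("utf-8") + b"\x00")
    return hasher.hexdigest()


def invert_ops(current: Dict[str, str], ops: List[dict]) -> Dict[str, str]:
    """Apply each operation in reverse to recover the previous mapping."""
    state = dict(current)
    for op in reversed(ops):
        path = op["path"]
        if op["action"] == "create":
            # Inverse of create is delete; the created value must be present.
            if state.get(path) != op["cid"]:
                raise ValueError(f"create does not match current state: {path}")
            del state[path]
        elif op["action"] == "update":
            # Inverse of update restores the previous value.
            if state.get(path) != op["cid"]:
                raise ValueError(f"update does not match current state: {path}")
            state[path] = op["prev"]
        elif op["action"] == "delete":
            # Inverse of delete re-creates the previous value.
            if path in state:
                raise ValueError(f"deleted path still present: {path}")
            state[path] = op["prev"]
        else:
            raise ValueError(f"unknown action: {op['action']}")
    return state


def ops_match_diff(current: Dict[str, str], ops: List[dict], previous_root: str) -> bool:
    """True if inverting ops against the current state reproduces the previous root."""
    try:
        return stand_in_root(invert_ops(current, ops)) == previous_root
    except ValueError:
        return False
```

An incomplete operation list leaves extra or missing entries after inversion, so the recomputed root no longer matches the previous revision.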
Security Considerations
Repositories are untrusted input: accounts have full control over repository contents, and PDS instances have full control over binary encoding. It is important to handle possible denial of service vectors from both hostile actors and accidental situations (eg, corrupted data or buggy implementations).
Generic precautions should be followed with CBOR decoding: a maximum serialized object size, a maximum recursion depth for nested fields, maximum memory budget for deserialized data, etc. Some CBOR libraries include these precautions by default, but others do not.
The efficiency of the MST data structure depends on key hashes being relatively randomly dispersed. Because accounts have control over record keys, they can mine for sets of record keys with particular depths and sorting order, which result in inefficient tree shapes, which can cause both large storage overhead, and network amplification in the context of firehose event streams. To protect against these attacks, implementations should limit the number of TreeEntries per Node to a statistically unlikely maximum length. It may also be necessary to limit the overall depth of the repo, or other parameters, to prevent more sophisticated key mining attacks.
When importing CAR files, the completeness of the repository structure should be verified. Additional unrelated blocks might be included in the CAR structure; care should be taken when injecting CAR contents directly into backend block storage, to ensure resources are not wasted on un-referenced blocks. There may also be issues with cross-account contamination from CAR imports, for example previously-deleted records re-appearing via CAR import from an unrelated account.
Possible Future Changes
An optional in-repo mechanism for storing multiple versions of the same record (by path) may be implemented. Eg, adding an additional path field to indicate the version by CID, timestamp, or monotonically increasing version integer.
Repo path restrictions may be relaxed in other ways, including fewer or additional path segments, more allowed characters (including non-ASCII), etc. Paths will always be valid Unicode strings, mapped to MST keys (byte arrays) by UTF-8 encoding.
At the overall atproto specification level, additional "blessed" cryptographic algorithms may be added over time. Likewise, additional CID formats for referencing blobs and records may be added. MST node CID format changes would require a repo format version bump.
Repository CAR exports could include linked "blobs" (larger binary files). This might become the default, or a configurable option, or some other mechanism for blob export might be chosen (eg, .tar or .zip export).
Record content could conceivably be something other than CBOR some day. This would probably be a repo format version bump. Note that it is possible to efficiently wrap other data formats in a CBOR wrapper (via a byte array field), or to have a small CBOR record type that links to a blob in arbitrary format.
Adding optional fields to commit and MST node objects may or may not result in a repo format version change. Changing the MST fanout, or any changes to the current MST fields, would be a full repo version change.