Data Synchronization
Synchronization of public data between independent network participants is one of the defining features of atproto (which is named for "Authenticated Transfer"). Data redistribution is trustworthy, low-latency, and resource-efficient at large scale.
There are two main data synchronization mechanisms in atproto. Batch data transfer is supported using full-repository exports as CAR files over HTTP. Real-time synchronization can be provided by a repository event stream (over WebSocket), commonly referred to as a "firehose". Used together, they enable a complete, live-updated, and authenticated copy of network data.
Repository Revisions
As described in the Repository specification, each commit to a repository has a revision value, encoded as a TID string. Revisions function as a logical clock to track the synchronization status of individual repositories. The revision must always increase between commits for the same repository, even if the account migrates between hosts or has an extended period of inactivity. To simplify revision management, PDS implementations should use the current wall time (converted to a TID) as the revision for every commit. When using wall time, it is still important to check that the new revision is higher than the previous revision, to account for clock drift or concurrent updates.
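The wall-time revision scheme can be sketched as follows. This is a hypothetical illustration, not a reference implementation; it assumes the TID bit layout of one zero bit, 53 bits of microseconds since the UNIX epoch, and a 10-bit clock identifier, encoded with a base32-sortable alphabet.

```python
import time

# base32-sortable alphabet used for TID strings
B32_SORTABLE = "234567abcdefghijklmnopqrstuvwxyz"

def encode_tid(micros: int, clock_id: int = 0) -> str:
    """Pack microseconds-since-epoch plus a 10-bit clock ID into a 13-char TID."""
    n = (micros << 10) | (clock_id & 0x3FF)
    # emit 13 characters of 5 bits each, most-significant first
    return "".join(B32_SORTABLE[(n >> (60 - 5 * i)) & 0x1F] for i in range(13))

class RevisionClock:
    """Issue strictly increasing commit revisions derived from wall time."""

    def __init__(self) -> None:
        self.last_micros = 0

    def next_rev(self) -> str:
        now = int(time.time() * 1_000_000)
        # guard against clock drift or multiple commits in one microsecond
        if now <= self.last_micros:
            now = self.last_micros + 1
        self.last_micros = now
        return encode_tid(now)
```

Because the alphabet is sorted and TIDs are fixed-length, later revisions always compare higher as plain strings.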
Synchronizing services should reject or ignore repository updates with revision values corresponding to future timestamps (beyond a short fuzzy time drift window). Services can track the commit revision for every account they have seen, and use this to verify synchronization status.
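A consuming service might implement the future-timestamp check roughly as below. This is an illustrative sketch; the five-minute drift window is an assumed value, not one the specification mandates.

```python
B32_SORTABLE = "234567abcdefghijklmnopqrstuvwxyz"

def tid_to_micros(tid: str) -> int:
    """Decode a 13-char TID back to microseconds since the UNIX epoch."""
    n = 0
    for c in tid:
        n = (n << 5) | B32_SORTABLE.index(c)
    return n >> 10  # drop the 10-bit clock identifier

def is_future_rev(rev: str, now_micros: int,
                  drift_micros: int = 5 * 60 * 1_000_000) -> bool:
    """True if the revision timestamp lies beyond the allowed drift window."""
    return tid_to_micros(rev) > now_micros + drift_micros
```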
Clients making API requests to a service may want to know if that service has synchronized and indexed recent updates to the authenticated account's repository. For example, if a client updates a profile record for their account (by writing to the PDS), and then immediately requests an updated view of their profile from a separate service, the client would want to know if the record update was applied. Services can indicate synchronization status using the Atproto-Repo-Rev HTTP response header, which should contain a single commit revision (TID) of the account making the API request.
Repository Exports
Full repository exports can be fetched from the account's PDS host using the com.atproto.sync.getRepo XRPC endpoint (HTTP GET). This endpoint is not authenticated, and returns all repo records, MST nodes, and the current signed commit object, all in a single CAR file.
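Constructing the request URL for this endpoint is straightforward; the host name and DID below are placeholders.

```python
from urllib.parse import urlencode

def get_repo_url(pds_host: str, did: str) -> str:
    """Build the unauthenticated com.atproto.sync.getRepo request URL."""
    query = urlencode({"did": did})
    return f"https://{pds_host}/xrpc/com.atproto.sync.getRepo?{query}"
```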
Other servers may provide cached or mirrored copies of full repository CAR files. It is important for such mirrors to respect repository updates (eg, record deletion) and account status changes (eg, account deactivation or deletion) in a timely manner (within seconds or minutes). This generally means that static repository snapshots should not be redistributed publicly in bulk form (eg, archival datasets or torrent files).
Repository Event Stream
A repository event stream ("firehose") provides real-time updates about changes to repository state (#commit and #sync events), DID documents and handles (#identity events), and account hosting status (#account events). The wire format is WebSockets with CBOR-encoded messages and sequence-based resumption cursors, as described in the Event Stream specification.
Multiple network services provide firehose endpoints under the same com.atproto.sync.subscribeRepos stream endpoint, with compatible message types and semantics. PDS hosts provide a firehose that includes updates for all hosted accounts. Relays are network services which subscribe to multiple upstream firehoses (eg, multiple PDS hosts) and aggregate them into a single combined event stream. A relay which attempts to aggregate accounts from all PDS instances in the network outputs a "full-network" firehose.
Repository data synchronized over a firehose is self-certifying and contains verifiable signatures. Consuming services can verify synchronized data without making additional requests to the account's PDS host. Missing updates to a single repository are detectable as gaps in the stream of #commit messages. However, if an intermediary service were to filter all messages pertaining to a repository, that would not necessarily be detectable by consuming services.
Identity and account information is not self-certifying, and consuming services are responsible for verifying it. This usually means independent DID resolution and handle resolution. Account hosting status can be checked against the account's PDS host, though it is legitimate for intermediary services to apply takedowns.
The event message types are specified below, and are also declared in the com.atproto.sync.subscribeRepos lexicon schema. A few fields are the same for all event types:
- seq (integer, required): used to ensure reliable consumption, as described in Event Streams
- did (string with DID syntax, required): the account associated with the event. The #commit message is inconsistent and uses repo as the field name
- time (string with datetime syntax, required): an informal and non-authoritative estimate of when the event was received. Intermediary services may decide to pass this field through as-is, or update it to the current time
Firehose event stream messages have a hard maximum size limit of 5 MBytes, measured as WebSocket frames. This is inclusive of all encoding and nesting overhead, and rules out some messages which would not otherwise exceed lexicon schema limits. Messages are generally expected to be well below this hard size limit. This limit may evolve over time.
#identity Events
Indicates that there may have been a change to the indicated identity (meaning the DID document or handle), and optionally what the current handle is. Does not indicate what changed, or reliably indicate what the current state of the identity is.
Event fields:
- seq (integer, required): same for all event types
- did (string with DID syntax, required): same for all event types
- time (string with datetime syntax, required): same for all event types
- handle (string with handle syntax, optional): the current handle for this identity. May be handle.invalid if the handle does not currently resolve correctly
The presence or absence of the handle field does not indicate whether it was the handle that changed.
The semantics and expected behavior are that downstream services should update any cached identity metadata (including DID document and handle) for the indicated DID. They might mark caches as stale, immediately purge cached data, or attempt to re-resolve metadata.
Identity events are emitted on a "best-effort" basis. It is possible for the DID document or handle resolution status to change without any atproto service detecting the change, in which case an event would not be emitted. It is also possible for the event to be emitted redundantly, when nothing has actually changed.
Intermediary services (eg, relays) may choose to modify or pass through identity events:
- they may replace the handle with the result of their own resolution; or always remove the handle field; or always pass it through unaltered
- they may filter out identity events if they observe that the identity has not actually changed
- they may emit identity events based on changes they became aware of independently (eg, via periodic re-validation of handles)
#account Events
Indicates that there may have been a change in Account Hosting status at the service which emits the event, and what the new status is. For example, it could be the result of creation, deletion, or temporary suspension of an account. The event describes the current hosting status, not what changed.
Event Fields:
- seq (integer, required): same for all event types
- did (string with DID syntax, required): same for all event types
- time (string with datetime syntax, required): same for all event types
- active (boolean, required): whether the repository is currently available and can be redistributed
- status (string, optional): string status code which describes the account state in more detail. Known values include:
  - takendown: indefinite removal of the repository by a service provider, due to a terms or policy violation
  - suspended: temporary or time-limited variant of takendown
  - deleted: account has been deleted, possibly permanently
  - deactivated: temporary or indefinite removal of all public data by the account themselves
When coming from any service which redistributes account data, the event describes what the new status is at that service, and is authoritative in that context. In other words, the event is hop-by-hop for repository hosts and mirrors.
See the Account Hosting specification for more context.
#commit Events
This event indicates that there has been a new repository commit for the indicated account. The event usually contains the "diff" of repository data, in the form of a CAR slice. See the Repository specification for details on "diffs" and the CAR file format.
Event Fields:
- seq (integer, required): same for all event types
- repo (string with DID syntax, required): the account associated with the event; equivalent to the did field in other event types
- time (string with datetime syntax, required): same for all event types
- rev (string with TID syntax, required): the revision of the commit. Must match the rev in the commit block itself
- since (string with TID syntax, nullable): indicates the rev of a preceding commit, which the repo diff contains differences from
- commit (cid-link, required): CID of the commit object (in blocks)
- tooBig (boolean, required): this field is deprecated, but still technically required. Producers should always set it to false, and consumers should ignore it
- blocks (bytes, required): CAR "slice" for the corresponding repo diff. The commit object must always be included, and the CAR header must indicate the commit block as the first "root". See size limit note below
- ops (array of objects, required): list of record-level operations in this commit: specific records created, updated, deleted. See length limitation below
- blobs (array of cid-link, required): this field is deprecated, but still technically required. Producers should set it to an empty array, and consumers should ignore it
- prevData (cid-link, semi-optional): the root CID of the MST tree for the previous commit of this repository. Similar to the since field, which indicates the previous rev. Effectively required for MST inversion, despite being marked "optional"
ops object fields:
- action (string, required): indicates the operation type. One of: 'create', 'update', or 'delete'
- path (string, required): record path within the repository (collection and record key)
- cid (cid-link, required, nullable): indicates the new version of the record, or null if the record has been deleted
- prev (cid-link, optional): indicates the previous version of the record (for 'update' and 'delete'), or is not defined (for 'create')
Commit events are broadcast when the account repository changes. Commits can be "empty", meaning no actual record content changed, and only the rev was incremented. They can contain a single record update, or multiple updates.
Commit events have the following size limits:
- the blocks bytes field has a hard size limit of 2 million bytes
- individual record blocks within the blocks field have a hard limit of one million bytes
- at most 200 record operations can be included in a commit
As an example, a single commit cannot contain 50 record operations of 60 KBytes each: the limits on the number of operations and on per-record size would be satisfied, but the blocks field would exceed its size limit.
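The three limits can be checked together; the helper below is a hypothetical sketch, not a reference implementation.

```python
MAX_BLOCKS_BYTES = 2_000_000   # hard limit on the blocks CAR slice
MAX_RECORD_BYTES = 1_000_000   # hard limit per individual record block
MAX_OPS = 200                  # hard limit on record operations per commit

def check_commit_limits(blocks_len: int, record_lens: list[int],
                        num_ops: int) -> list[str]:
    """Return a list of violated limits; an empty list means within limits."""
    violations = []
    if blocks_len > MAX_BLOCKS_BYTES:
        violations.append("blocks too large")
    if any(n > MAX_RECORD_BYTES for n in record_lens):
        violations.append("record block too large")
    if num_ops > MAX_OPS:
        violations.append("too many ops")
    return violations
```

Run against the example from the text, 50 records of 60 KBytes each only trips the blocks-size limit.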
#sync Events
This event asserts the current status of an account's repository. This may be a confirmation or clarification of the state (if nothing changed), or may reset the repository to a new state.
Event Fields:
- seq (integer, required): same for all event types
- did (string with DID syntax, required): same for all event types
- time (string with datetime syntax, required): same for all event types
- rev (string with TID syntax, required): the revision of the commit. Must match the rev in the commit block itself
- blocks (bytes, required): CAR slice containing the current commit block. The CAR header must indicate the commit block as the first "root"
Sync events are broadcast when the account repository state has been reset to a new state, or in situations where there might be ambiguity about the current state of the repository. For example, a #sync event could be emitted for an account reactivating after data corruption.
Note that the repository contents are not included in the sync event: the blocks field only contains the repo commit object. Downstream services would need to fetch the full repo CAR file to re-synchronize.
Reliable Repository Synchronization
If a service receives and processes every #commit message from a repository, it should have a complete and coherent view of all records in the repository. If messages are missing or mangled, the receiving service might have an incomplete or incorrect view of some records. How can a service be confident it has successfully synchronized a complete repository?
If the receiving service maintained a full copy of the repository data structure (MST), it could apply the diff from each #commit and verify the integrity of the complete tree structure. But this process is resource-intensive in both compute and storage, and unrealistically expensive at scale.
Instead, receiving services can verify that each #commit diff is consistent with the previous state of the repository, creating a chain of verification. Only a small amount of state needs to be stored for each repository.
The trick to this process is record operation inversion, as described in the Repository specification. #commit messages contain both a repo diff (CAR slice), and an array of record operations. The operations can be applied in reverse against a copy of the partial repo tree contained in the diff blocks. If the list of operations is complete, the root of the tree should be exactly that of the previous commit object of the repository. Note that the tree root (data field) is different from the hash of the signed commit object itself. The #commit message contains both a reference to the previous repo revision (in the since field), and a copy of the previous root tree hash (in the prevData field). Those fields are neither authenticated (signed) nor self-certifying, but they can be used to check the consistency of the #commit message in isolation.
To check that the "chain" of messages is consistent, receiving services should track the repo revision and tree root (data) for each repository. If a #commit message is received which is internally consistent, but the since and prevData references do not match the previous state of the repository, then something has gone wrong.
If the chain of #commit messages is found to have broken, or a #sync message indicates that the repository state has changed, then the service will need to re-synchronize the repository. This often means fetching the full repo CAR export.
If many services attempt to re-synchronize a repository at the same time, the upstream PDS host may be overwhelmed with a "thundering herd" of requests. To mitigate this, receiving services should first attempt to fetch the repo CAR file from their direct upstream (often a relay instance). That gives the upstream an opportunity to coalesce and cache the repository export, which distributes load. The upstream may instead simply use an HTTP redirect to the PDS instance.
The "Record-Level Synchronization" section below describes a design pattern for tracking record state and handling re-synchronization incidents.
Message Validation Checklist
Services consuming firehose event streams may have different validation and verification needs. Intermediary services (like public relays) may do some checks to reduce network abuse, but ultimately consuming services are responsible for validating message structure and verifying authenticity.
Here is a summary of validation rules and behaviors:
- services should independently resolve identity data for each DID. They should ignore #commit and #sync events for accounts which do not have a functioning atproto identity (eg, lacking a signing key, or lacking a PDS service entry, or for which the DID has been tombstoned)
- services which subscribe directly to PDS instances should keep track of which PDS is authoritative for each DID. They should remember the host each subscription (WebSocket) is connected to, and reject #commit and #sync events for accounts if they come from a stream which does not correspond to the current PDS host for that DID
- services should track account hosting status for each DID, and ignore #commit and #sync events for accounts which are not active
- services should verify commit signatures for each #commit and #sync event, using the current identity data. If the signature initially fails to verify, the service should refresh the identity metadata in case it had recently changed. Events with conclusively invalid signatures should be rejected.
- services should verify #commit and #sync message fields against the actual signed commit object (within the blocks CAR slice), and reject messages with mismatching values
- services should verify the #commit message blocks field (CAR diff) against the ops list and prevData field, using record operation MST inversion
- services should reject any event messages which exceed reasonable size limits.
- services should verify that repository data structures are valid against the specification. Missing fields, incorrect MST structure, or other protocol-layer violations should result in events being rejected.
- services may apply rate-limits to identity, account, commit, and sync events, and throttle accounts or upstream services which violate those limits. Rate limits might also be applied to recovery modes such as invalid signatures resulting in an identity refresh, missing or out-of-order commits, etc.
- services should ignore #commit and #sync events with a rev lower than or equal to the most recent successfully processed rev for that DID, and should reject commit events with a rev corresponding to a future timestamp (beyond a clock drift window of a few minutes)
- services should check the since value in #commit events, and if it is not consistent with the most recent rev for that DID, mark the repo as out-of-sync
- similarly, services should check #commit message prevData values against the most recent commit object data field
- data limits on records specifically should be verified. Events containing corrupt or entirely invalid records may be rejected. For example, a record not being CBOR at all, or exceeding normal data size limits (eg, the one million byte limit on record size).
- more subtle data validation of records may be enforced, or may be ignored, depending on the service. For example, unsupported CID hash types embedded in records should probably be ignored by relays (even if they violate the atproto data model), but may result in the record or commit event being rejected by an AppView
- relays (specifically) should not validate records against lexicons
Record-Level Synchronization
Indexing services often care about processing all records of specific collections, across all accounts in the network. This section describes a design pattern for services to bootstrap existing accounts and records, and then maintain record-granularity synchronization, including re-synchronization incidents.
The service should track the usual account-level state:
- identity resolution cache: DID document and handle
- account hosting status: both "upstream" and "local" (to support local takedowns)
- repo sync metadata: last revision and commit data field
- repo sync status
The repo sync status field tracks overall processing status:
- desynchronized (default): out-of-sync with the current revision; re-synchronization is needed
- in-progress: the service is actively re-synchronizing the repo
- synchronized: all records have been processed for the current repository revision
The service also tracks record-level state for records in all relevant repositories and collections in a sorted table:
- repository account DID
- record path (collection and record key)
- record version (CID)
The service consumes from a full-network relay firehose. When #identity events are received, it updates resolution caches as usual. When #account events are received, it updates the "upstream" part of account hosting status. Processing of #commit and #sync events depends on the repo sync status.
When a valid #commit is received for a synchronized account, the service processes all of the record operations, and updates the record state table. For example, created records are inserted into the table, deleted records are removed from the table, and updated records have the version field updated.
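Applying the record operations from a valid #commit to a path-to-CID state table might look like this sketch:

```python
def apply_commit_ops(table: dict[str, str], ops: list[dict]) -> None:
    """Apply #commit record operations to a record path -> CID state table."""
    for op in ops:
        if op["action"] in ("create", "update"):
            # inserts new records, or replaces the version CID of existing ones
            table[op["path"]] = op["cid"]
        elif op["action"] == "delete":
            table.pop(op["path"], None)
```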
If a #sync message is received that matches the current repo status, the message is ignored. If an authenticated #sync or #commit message is received that indicates the chain of commit synchronization messages has been broken for a repository, the repo sync state is updated to desynchronized.
Any #commit or #sync messages received for a repo which is in desynchronized state are ignored (dropped). Any events for a repository in in-progress state are enqueued for processing (eg, in a database).
The service runs a pool of re-synchronization workers. These check for repositories in the desynchronized state, and mark them as in-progress when they start work. The resync process is to fetch the account's full repository CAR file. The worker then "walks" all records from the parsed repo tree, scans the sorted record-level state table for the repository, and compares record states. If a record exists in the CAR file but not the record table, that is treated as a creation. If the record versions mismatch, that is processed as an update. If a record is in the table, but not the CAR file, that is treated as a deletion. The table is updated as records are processed. Once the CAR processing is complete, the repo sync metadata is updated, and then any enqueued #commit and #sync messages for that repository are processed. If those are successful, the repo sync status is updated to synchronized, and the worker returns to the pool.
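The worker's comparison between records walked from the CAR file and the state table is a three-way diff, roughly as in this sketch (function and variable names are illustrative):

```python
def diff_repo_against_table(car_records: dict[str, str],
                            table: dict[str, str]):
    """Compare records walked from a full repo CAR against the state table.

    Both inputs map record path -> record version CID.
    Returns (creates, updates, deletes); deletes is a list of paths.
    """
    creates = {p: c for p, c in car_records.items() if p not in table}
    updates = {p: c for p, c in car_records.items()
               if p in table and table[p] != c}
    deletes = [p for p in table if p not in car_records]
    return creates, updates, deletes
```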
The service may want to filter records to only relevant collections. Only those records need to be tracked in the record state table. Every #commit message for in-scope accounts does need to be fully validated and have the repo sync table updated, even if that commit does not contain any in-scope records. This is to maintain the chain of validated repo states.
Discovering accounts from a full-network firehose will result in data from currently-active accounts getting processed. This is a good place to start for many services, but it is usually desirable to also backfill previously-active accounts, which can be done by inserting them into the repository state table with status desynchronized. The com.atproto.sync.listReposByCollection XRPC endpoint can be used to get a list of accounts (by DID) which have records in a specific collection (lexicon schema).
Usage and Implementation Guidelines
The Account Lifecycle Best Practices document provides guidance on the sequence of messages to emit during different scenarios, including account creation and account migration between PDS hosts.
Stream events can be processed concurrently across accounts, but they should be processed sequentially in-order for any given account.
Event stream consumers need to track and persist the sequence number of events they have successfully processed, to be used as a cursor value when reconnecting. They may want to track the "last received" sequence separately from a "high water mark" of fully processed events, to account for messages which have been received but are still being processed. Note that relay instances generally have distinct sequencing offsets and message ordering, and that connection-level sequences are distinct from repository-level revision numbers.
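A minimal cursor tracker along these lines might look like the following sketch. For simplicity it assumes contiguous sequence numbers starting at 1; real streams may contain gaps, in which case the watermark logic would instead track the last fully processed seq directly.

```python
class CursorTracker:
    """Track firehose sequence numbers for safe resumption after reconnect."""

    def __init__(self) -> None:
        self.last_received = 0   # highest seq seen on the wire
        self.high_water = 0      # highest seq with all prior seqs processed
        self._done = set()       # processed seqs above the high-water mark

    def received(self, seq: int) -> None:
        self.last_received = max(self.last_received, seq)

    def processed(self, seq: int) -> None:
        self._done.add(seq)
        # advance the high-water mark over contiguously processed seqs
        while self.high_water + 1 in self._done:
            self._done.discard(self.high_water + 1)
            self.high_water += 1

    def resume_cursor(self) -> int:
        # safe cursor: everything up to high_water has been fully processed
        return self.high_water
```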
Security Concerns
General mitigations for resource exhaustion attacks are recommended: event rate-limits, data quotas per account, limits on data object sizes and deserialized data complexity, etc.
Care should always be taken when making network requests to unknown or untrusted hosts, especially when the network locators for those hosts come from untrusted input. This includes validating URLs so as not to connect to local or internal hosts (including via HTTP redirects), avoiding SSRF in browser contexts, etc.
To prevent traffic amplification attacks, outbound network requests should be rate-limited by host. For example, identity resolution requests when consuming from the firehose, including DNS TXT traffic volume and DID resolution requests.
Future Work
The subscribeRepos lexicon is likely to be tweaked, with deprecated fields removed, even if this breaks lexicon evolution rules.
The repository export mechanism is likely to support partial synchronization of repository subsets. For example, sub-tree CAR files covering specific collections or repo path ranges.
The firehose event stream sequence/cursor scheme may be iterated on to support sharding, timestamp-based resumption, and easier failover between servers with different sequences.
Alternatives to the full authenticated firehose may be added to the protocol. For example, simple JSON serialization, filtering by record collection type, omitting MST nodes, and other changes which would simplify development and reduce resource consumption for use-cases where full authentication is not necessary or desired.