Updates to Repository Sync Semantics
Published on: Aug 24, 2023
We’re excited to announce that we’re rolling out a new version of atproto repositories that removes history from the canonical structure of repositories, and replaces it with a logical clock. We’ll start rolling out this update next week (August 28, 2023).
For most developers with projects subscribed to the firehose, such as feed generators, this change shouldn’t affect you. It will only affect you if you’re doing commit-aware repo sync (a good rule of thumb is whether you’ve ever passed latest to the com.atproto.sync.getRepo method) or are explicitly checking the repo version when processing commits.
Removing Repository History
Repositories on the AT Protocol are like Git repositories, but for structured records. Just like Git, each commit to an atproto repository currently includes a pointer to the previous commit. However, this approach has caused a couple of pain points:
- Record deletions are difficult to process. If a user deletes a record, that commit needs to be erased from their repository to match their intent.
- Increased storage cost. Maintaining repo history can increase repo size by 5-10x.
We attempted to resolve both of these in the current model through rebases (discrete moments when the history of a repository is deleted/mutated, like in Git). However, this is a tricky and sensitive operation that is expensive to conduct and complex to communicate across the network.
Using a Logical Clock for Repositories
To address the above issues, we’re replacing the prev pointer in commits with a logical clock. We originally published our intention to do so a few weeks ago. These are the changes we’re making to the way we handle repository history:
- Incrementing the repo version to 3
- Making the prev field on repo commits optional
- Adding a new required rev (revision) field which is a logical clock
- Removing or adjusting commit-aware repo sync mechanisms
Note: If you explicitly verify the version of a repo commit or do strict type checking on repo commits (which you shouldn’t, since the spec allows unspecified fields!), you will need to make that check inclusive of version 3.
To facilitate backwards compatibility with software that is still running repo v2, we will continue setting the prev field on commits in the interim.
Even though we are setting the prev field, this can be considered a “hint” and the history is no longer considered a canonical part of the repository.
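A lenient commit check along the lines of the note above might look like the following sketch. It assumes commits have already been decoded into dict-like objects; the function name check_commit and the choice to keep accepting version 2 during the transition are illustrative assumptions, not part of the spec.

```python
from typing import Any, Mapping, Optional

# Assumption: this consumer accepts both v2 and v3 commits during the rollout.
SUPPORTED_VERSIONS = {2, 3}


def check_commit(commit: Mapping[str, Any]) -> Optional[str]:
    """Return the commit's rev (None for a v2 commit without one).

    Raises ValueError if the version is unsupported or a v3 commit lacks rev.
    """
    version = commit.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported repo version: {version!r}")

    rev = commit.get("rev")
    if version == 3 and not isinstance(rev, str):
        raise ValueError("v3 commits require a rev field")

    # prev is optional and only a hint, so it is not checked here.
    # Unknown extra fields are ignored rather than rejected.
    return rev
```

Because the check reads only the fields it needs, a commit carrying prev or other unspecified fields passes unchanged.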
The new sync semantics for the repository rely on a logical clock included in each signed commit.
This “revision” takes the form of a TID and must be monotonically increasing.
The included revision serves a few functions:
The clock provides a simple ordering mechanism for encountered repos or commits. If a consumer encounters the same repo from two different sources, each with a valid signature and structure, the revision gives a simple mechanism to determine which is the most recent repository.
When syncing a repository, revisions give a series of signposts that allow you to request everything from a given repo since a previously seen version. Because revisions are ordered and monotonically increasing, the provider does not necessarily need the exact revision that the consumer is asking for (as it would with a commit hash). Instead, it can provide all repo contents since the latest version of the repo that it remembers that is before the requested revision.
The PDS for instance will track the revision at which each repo block or record was introduced into a repository. If a consumer asks for every block or record since a given revision, the PDS has a simple mechanism by which to give that information, without needing a complicated sync algorithm.
Finally, a logical clock on the repo gives us a mechanism through which we can detect stale reads. (We actually already snuck this in with an optional revision field on v2 repos!)
Repo revisions may be returned in response headers to most requests. A client will know their own repo’s current revision and can compare that with the upstream service’s revision.
We use this today on the PDS to paper over some read-after-write concerns that are inherent in eventually consistent architectures. Some clients may use these headers to alert their users that their PDS is “out of sync” with other services in the network (for instance an AppView).
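The ordering and stale-read checks described above can be sketched as plain string comparisons. This assumes revisions are TIDs in their usual fixed-length, sortable string encoding, so that lexicographic order matches clock order. The function names and the example revision values are hypothetical, and how the upstream revision is read from a response is left to the caller.

```python
from typing import Optional


def newest_rev(rev_a: str, rev_b: str) -> str:
    """Pick the more recent of two revisions seen for the same repo.

    Assumption: TID strings sort lexicographically in clock order.
    """
    return max(rev_a, rev_b)


def is_stale(local_rev: str, upstream_rev: Optional[str]) -> bool:
    """Report whether an upstream service is behind the client's own repo.

    upstream_rev is whatever revision the service reported for the repo;
    None means it reported nothing, in which case staleness is unknown
    and this sketch returns False.
    """
    if upstream_rev is None:
        return False
    return upstream_rev < local_rev
```

A client could call is_stale with its own repo’s current revision after each request and surface an “out of sync” notice when it returns True.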
Available sync methods
Below is an enumeration of the available sync methods in the com.atproto.sync namespace along with the changes entailed in this repo update and their deprecation status.
This is the primary RPC sync method. It allows a consumer to download an entire copy of a repository. Optionally, it allows them to signal the last revision they saw so that the provider may be able to send less data.
- Remove optional latest & earliest params
- Add optional since param (rev of the last seen commit)
- If a consumer sends latest or earliest, they are simply ignored & the consumer will get the full copy of the repo
- With the optional since param, there is no expectation that a service provides only the blocks created since that rev. We call this a “coarse diff” as additional blocks may be provided.
- The PDS has a simple way of calculating blocks since some rev. If a service has no such mechanism, it is free to send the entire repository.
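One way a provider could compute a coarse diff is to remember the revision at which each block was first introduced. The sketch below is an illustration under that assumption, not the PDS implementation. BlockStore and its methods are hypothetical names, blocks are keyed by a CID string, and revisions are assumed to compare correctly as strings.

```python
from typing import Dict, Optional, Tuple


class BlockStore:
    """Hypothetical in-memory store that records when each block appeared."""

    def __init__(self) -> None:
        # cid -> (rev at which the block was introduced, block bytes)
        self._blocks: Dict[str, Tuple[str, bytes]] = {}

    def put(self, cid: str, data: bytes, rev: str) -> None:
        # Keep the first revision seen for a block; later puts are no-ops.
        self._blocks.setdefault(cid, (rev, data))

    def blocks_since(self, since: Optional[str] = None) -> Dict[str, bytes]:
        """Return blocks introduced after `since`, or everything if unset.

        The result is a coarse diff: it may include more blocks than the
        consumer strictly needs, but never fewer.
        """
        return {
            cid: data
            for cid, (rev, data) in self._blocks.items()
            if since is None or rev > since
        }
```

A service without this bookkeeping can ignore the parameter and return every block, which is still a valid response.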
This is the primary streaming sync method. It provides a stream of repo commits and their related diffs.
- Added new required rev field to the commit event (rev of the current commit)
- Added new required since field to the commit event (_previously_ emitted rev for the repo of the current commit)
- We no longer send out rebase events (though they are still technically supported in the schema)
- We continue sending
- Now events will validate against the previous schema
- Deprecate support for rebases
- Possibly deprecate the required
- Possibly deprecate the full route in favor of a new streaming v2 endpoint (TBD)
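A stream consumer could use the rev and since fields on commit events to notice missed commits. The sketch below assumes events arrive as dicts with repo, rev, and since keys; the repo key, the RevTracker class, and the returned status strings are illustrative assumptions rather than a documented API.

```python
from typing import Any, Dict, Mapping


class RevTracker:
    """Hypothetical tracker for the last revision seen per repo."""

    def __init__(self) -> None:
        self.last_rev: Dict[str, str] = {}

    def observe(self, event: Mapping[str, Any]) -> str:
        """Classify a commit event as "ok", "duplicate", or "gap"."""
        repo, rev = event["repo"], event["rev"]
        since = event.get("since")
        last = self.last_rev.get(repo)

        # Revisions only increase, so an equal or older rev is a replay.
        if last is not None and rev <= last:
            return "duplicate"

        self.last_rev[repo] = rev

        # If the event's since does not match what we last saw, at least
        # one commit was missed and the caller should resync the repo.
        if last is not None and since != last:
            return "gap"
        return "ok"
```

On a "gap" result, a consumer could fall back to the RPC sync method and pass the last revision it had seen as since.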
Takes the place of getHead (we’re moving away from “head” as a term).
- Changed name of root property on response to
- Added new rev property to response
Same changes as getRepo - switch from latest & earliest to rev.
These methods will continue to be supported for an interim period but will eventually be fully deprecated.
Deprecated in favor of the new
The functionality is the same as getRepo with no rev set.
Renamed to (and thus deprecated in favor of)
These methods will be removed immediately upon release of repo v3.
The method no longer has meaning with history-less repos.
If you have questions about these changes, join us on GitHub Discussions here.