When to backfill
Backfilling is the process of syncing all the data in the network from scratch. You may want to do this if you're running a service that requires a complete copy of the data in the network.
Backfilling the Bluesky Lexicons in particular is resource-intensive and time-consuming. Backfilling other Lexicons typically is not. If you are developing your own application, using its own set of Lexicons, you can backfill all of the data that you are writing, as you write it, using the same code paths. Our Statusphere tutorial provides an example of this.
If you are doing large-scale lookups, you can also make use of backlinks in combination with selective backfilling.
Making the entire network backfillable by third parties at all is a novel concept in atproto. Other, monolithic social networks generally only offer a large-scale event stream (like our firehose) from the current date and time, making it difficult to perform data analysis without an intermediary.
With the AT Protocol and adequate resources, you can always backfill the entire network on your own. This, in turn, benefits researchers and other forms of data analysis — if you can provision enough storage, you can have your own local copy of the entire Atmosphere.
Feed generators, labelers, and bots that consume data directly from the firehose may or may not require backfilling.
When backfilling, you generally need to maintain an up-to-date replica of the data, which requires cutting over to streaming firehose data once the backfill is complete. We created tap to streamline this process.
Using tap
tap simplifies AT sync by handling the firehose connection, verification, backfill, and filtering. Your application connects to a tap instance and receives simple JSON events for only the repos and collections you care about, with no need to parse binary formats or validate cryptographic signatures yourself.
Install tap from the indigo repo after installing Go:
brew install go
go install github.com/bluesky-social/indigo/cmd/tap@latest
Then, run tap from the command line:
# Run tap
tap run --disable-acks=true
# By default, the service uses SQLite at `./tap.db` and binds to port `:2480`.
# In a separate terminal, connect to receive events:
websocat ws://localhost:2480/channel
# Add a repo to track
curl -X POST http://localhost:2480/repos/add \
-H "Content-Type: application/json" \
-d '{"dids": ["did:plc:ewvi7nxzyoun6zhxrhs64oiz"]}' # @atproto.com repo
When a repo is added, tap provides:
- Historical backfill: Tap fetches the full repo from the account's PDS using com.atproto.sync.getRepo
- Live event buffering: Any firehose events for this repo during backfill are held in memory
- Ordering guarantee: Historical events (marked live: false) are delivered first
- Cutover: After historical events complete, buffered live events are drained
- Live streaming: New firehose events are delivered immediately (marked live: true)
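The buffering and cutover steps above can be sketched as a small per-repo state machine. This is an illustration of the ordering guarantee, not tap's implementation, and the event shape here is an assumption for the sake of the example:

```typescript
// Hypothetical minimal event shape; tap's real events carry more fields.
type TapEvent = { live: boolean; seq: number };

class RepoCutover {
  private buffer: TapEvent[] = [];
  private backfillDone = false;
  readonly delivered: TapEvent[] = [];

  // Historical (live: false) events are delivered immediately, in order.
  onHistorical(event: TapEvent): void {
    this.delivered.push(event);
  }

  // Live firehose events arriving during backfill are held in memory.
  onLive(event: TapEvent): void {
    if (this.backfillDone) this.delivered.push(event);
    else this.buffer.push(event);
  }

  // Once backfill completes, drain the buffered live events, then
  // deliver new live events as they arrive.
  finishBackfill(): void {
    this.backfillDone = true;
    this.delivered.push(...this.buffer);
    this.buffer = [];
  }
}
```

The point of the buffer is that no live event can be delivered ahead of the historical events it depends on, so consumers never observe a record update before the record itself.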
We also provide a TypeScript library for working with Tap. For more information, refer to The AT Stack and the tap repository. If you are backfilling a large number of records with Tap and need a scalable database service, we've had good experiences with ClickHouse.
You can backfill from individual PDSes or from a Relay — they implement the same endpoints. For a list of available Relays, see The AT Stack.
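Enumerating the repos on a relay or PDS is a paginated call to com.atproto.sync.listRepos: you pass the cursor from each response back until the server stops returning one. A minimal sketch of the cursor loop, with the page fetcher injected so transport details stay out of scope (the function names and simplified page shape are ours):

```typescript
// One page of com.atproto.sync.listRepos results (simplified shape).
type RepoPage = { cursor?: string; repos: { did: string }[] };

// Walk every page until the server stops returning a cursor.
// `fetchPage` wraps the actual HTTP call against a relay or PDS.
async function* listAllRepos(
  fetchPage: (cursor?: string) => Promise<RepoPage>,
): AsyncGenerator<string> {
  let cursor: string | undefined;
  do {
    const page = await fetchPage(cursor);
    for (const repo of page.repos) yield repo.did;
    cursor = page.cursor;
  } while (cursor);
}
```

Injecting the fetch function also makes it easy to insert the client-side rate limiting discussed below without touching the pagination logic.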
How backfilling works
If you are implementing backfilling on your own, the general process is:
- Given a DID, check your current 'revision' for that DID. (Each change to a repo is tagged with a 'revision' or 'rev' string, which is a lexicographically sortable timestamp.)
- If you do not have a rev for that repo, download and process the user's repo checkpoint from the com.atproto.sync.getRepo endpoint.
- While you are doing that, buffer any events for the repo, to be processed after the checkpoint has been applied.
- The checkpoint will contain a rev value that you can use to skip any buffered events that have already been included in the checkpoint.
- For each buffered event, if its rev is at or below the checkpoint's rev, you can safely skip it.
Do the above process for each repo and you will end up with a complete replica of the network. To get a list of all the repos, you can use the com.atproto.sync.listRepos endpoint on the relay, or on each PDS.
- This is a fairly large amount of data (hundreds of GBs at the time of writing), and will be somewhat demanding in terms of resources.
- Be careful not to get rate limited. You will be making one call to getRepo per user, so implement client-side rate limiting to prevent your requests from being blocked by firewalls on the PDS or relay you are requesting data from.
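The rev comparison in the steps above is a plain string comparison, since revs are lexicographically sortable. A minimal sketch of draining the buffer after a checkpoint (the helper name and event shape are ours, not part of the protocol):

```typescript
// A buffered firehose event, reduced to the one field that matters here.
type BufferedEvent = { rev: string };

// Drop buffered events already covered by the downloaded checkpoint.
// Revs sort lexicographically, so plain string comparison is correct;
// an event whose rev equals the checkpoint's rev is already included.
function drainBuffer(
  buffered: BufferedEvent[],
  checkpointRev: string,
): BufferedEvent[] {
  return buffered.filter((event) => event.rev > checkpointRev);
}
```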
Further Reading and Resources
- Sync
- Streaming data
- Feeds
- Repository spec
- Event Stream spec
- Sync spec
- The Microcosm community project maintains tools for working with AT records at scale without local mirroring.