About backfilling
Backfilling is the process of syncing all the data in the network from scratch. You may want to do this if you're running a service that requires a complete copy of the data in the network. This is not generally necessary for running feed generators, labelers, or bots, as most of the time they are fine handling live data off of the firehose. However, backfilling may be of interest if you want to perform large-scale data analysis.
That third parties can backfill the entire network at all is novel to AT. Other, monolithic social networks generally offer only a large-scale event stream (like our firehose) starting from the current date and time, making it difficult to perform longitudinal data analysis without additional data vendors. With the AT Protocol and adequate resources, you can always backfill the entire network on your own. This, in turn, benefits researchers and others doing data analysis: if you can provision enough storage, you can have your own local copy of the entire Atmosphere.
When backfilling, you generally need to maintain an up-to-date replica of the data, which requires a cutover to streaming firehose data once the backfill is complete. We created tap to streamline this process.
Using tap
tap simplifies AT sync by handling the firehose connection, verification, backfill, and filtering. Your application connects to a Tap and receives simple JSON events for only the repos and collections you care about, with no need to worry about binary formats or validating cryptographic signatures.
Tap can be run from the command line:
# Run tap
go run ./cmd/tap run --disable-acks=true
# By default, the service uses SQLite at `./tap.db` and binds to port `:2480`.
# In a separate terminal, connect to receive events:
websocat ws://localhost:2480/channel
# Add a repo to track
curl -X POST http://localhost:2480/repos/add \
-H "Content-Type: application/json" \
-d '{"dids": ["did:plc:ewvi7nxzyoun6zhxrhs64oiz"]}' # @atproto.com repo
When a repo is added, tap provides:
- Historical backfill: Tap fetches the full repo from the account's PDS using com.atproto.sync.getRepo
- Live event buffering: Any firehose events for this repo during backfill are held in memory
- Ordering guarantee: Historical events (marked live: false) are delivered first
- Cutover: After historical events complete, buffered live events are drained
- Live streaming: New firehose events are delivered immediately (marked live: true)
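A consumer can use this ordering to tell when a repo has caught up. The following is a minimal sketch, assuming each event has already been decoded from JSON into a dict with a boolean `live` field as described above. The `cutover_index` helper is hypothetical and is not part of Tap or its TypeScript library.

```python
from typing import Iterable, Optional


def cutover_index(events: Iterable[dict]) -> Optional[int]:
    """Return the index of the first live event for one repo, or None.

    Assumes `events` is the sequence for a single repo in delivery order
    and that every event carries a boolean `live` field. All other fields
    are ignored.
    """
    first_live = None
    for i, event in enumerate(events):
        if event["live"]:
            if first_live is None:
                first_live = i
        elif first_live is not None:
            # A historical event after a live one would contradict the
            # ordering guarantee described above.
            raise ValueError(f"historical event at index {i} after cutover")
    return first_live
```

For example, `cutover_index([{"live": False}, {"live": True}])` returns `1`, the position at which the repo switched from backfill to live data.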
We also provide a TypeScript library for working with Tap. For more information, refer to The AT Stack and the tap repository.
How backfilling works
If you are implementing backfilling on your own, the general process is:
- Given a DID, check your current 'revision' for that DID. (Each change to a repo is tagged with a 'revision' or 'rev' string, which is a lexicographically sortable timestamp.)
- If you do not have a rev for that repo, download and process the user's repo checkpoint from the com.atproto.sync.getRepo endpoint.
- While you are doing that, buffer any events for the repo so they can be processed after the checkpoint.
- The checkpoint will contain a rev value that you can use to skip any buffered events that have already been included in said checkpoint.
- For each buffered event, if the rev is less than the current rev you have, you can safely skip it.
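The steps above can be sketched in a few lines. This is a simplified illustration, not a reference implementation: `Event`, `RepoState`, `backfill_repo`, and the `fetch_checkpoint` callback are hypothetical names, and `fetch_checkpoint` stands in for a real getRepo call plus parsing of the downloaded repo. The sketch relies only on rev strings being lexicographically sortable, so plain string comparison is enough.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Event:
    """A hypothetical, already-decoded firehose event for one repo."""
    did: str
    rev: str      # lexicographically sortable revision string
    payload: str  # placeholder for the actual record operations


@dataclass
class RepoState:
    rev: Optional[str] = None  # None means the repo has not been backfilled
    applied: List[str] = field(default_factory=list)


def backfill_repo(
    did: str,
    state: RepoState,
    fetch_checkpoint: Callable[[str], Tuple[str, List[str]]],
    buffered: List[Event],
) -> RepoState:
    """Apply the checkpoint if needed, then replay newer buffered events."""
    if state.rev is None:
        # In a real client this would download and parse the repo.
        rev, records = fetch_checkpoint(did)
        state.applied.extend(records)
        state.rev = rev
    for event in buffered:  # events are assumed to be in arrival order
        # The text above says "less than". This sketch also skips an event
        # whose rev equals the current rev (an assumption), because that
        # commit is the one the checkpoint was taken at.
        if event.rev <= state.rev:
            continue
        state.applied.append(event.payload)
        state.rev = event.rev
    return state
```

In a real consumer the buffer would be filled concurrently by the firehose reader while the checkpoint downloads; passing it in as a list keeps the sketch easy to follow and test.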
Do the above process for each repo and you will end up with a complete replica of the network. To get a list of all the repos, you can use the com.atproto.sync.listRepos endpoint on the relay, or on each PDS.
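Enumerating repos means following a pagination cursor until it runs out. The sketch below assumes the endpoint accepts `limit` and `cursor` query parameters and returns a JSON object with a `repos` array (each entry carrying a `did`) and an optional `cursor`; check the Sync spec for the authoritative shape. The `host` argument, `fetch_page`, and `iter_repo_dids` are hypothetical names chosen for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional


def fetch_page(host: str, cursor: Optional[str], limit: int = 1000) -> dict:
    """Fetch one page of com.atproto.sync.listRepos from `host`."""
    params = {"limit": str(limit)}
    if cursor:
        params["cursor"] = cursor
    url = (
        f"https://{host}/xrpc/com.atproto.sync.listRepos?"
        + urllib.parse.urlencode(params)
    )
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_repo_dids(
    host: str,
    fetch: Callable[[str, Optional[str]], dict] = fetch_page,
) -> Iterator[str]:
    """Yield every DID the host lists, following the cursor page by page."""
    cursor = None
    while True:
        page = fetch(host, cursor)
        repos = page.get("repos", [])
        for repo in repos:
            yield repo["did"]
        cursor = page.get("cursor")
        # Stop when the server gives no cursor or an empty page.
        if not cursor or not repos:
            break
```

Passing `fetch` as a parameter makes it possible to swap in a rate-limited or retrying fetcher without changing the pagination loop.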
- This is a fairly large amount of data (hundreds of GBs at the time of writing), and will be somewhat demanding in terms of resources.
- Be careful not to get rate limited. You will be making one call to getRepo per user. It is recommended to implement client-side rate limiting to prevent your requests from being blocked by firewalls on the PDS or relay you are requesting data from.
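One simple form of client-side rate limiting is to enforce a minimum interval between requests. The sketch below is one possible approach, not a recommendation from the protocol. The class name `MinIntervalLimiter` is hypothetical, and the interval to use is an assumption you would need to tune, since this page does not state the limits a PDS or relay enforces.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Allow at most one call every `interval` seconds."""

    def __init__(
        self,
        interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed and return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay
```

Calling `limiter.wait()` immediately before each getRepo request spaces the requests out. Keeping one limiter per host would avoid one slow PDS throttling requests to all the others.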
Further Reading and Resources
- Sync
- Streaming data
- Feeds
- Repository spec
- Event Stream spec
- Sync spec
- The Microcosm community project maintains tools for working with AT records at scale without local mirroring.