Lexicon is a schema definition language used to describe atproto records, HTTP endpoints (XRPC), and event stream messages. It builds on top of the atproto Data Model.
This specification describes version 1 of the Lexicon definition language.
Overview of Types
| Lexicon Type | Data Model Type | Category |
Lexicons are JSON files associated with a single NSID. A file contains one or more definitions, each with a distinct short name. A definition with the name main optionally describes the "primary" definition for the entire file. A Lexicon with zero definitions is invalid.
A Lexicon JSON file is an object with the following fields:
lexicon(integer, required): indicates Lexicon language version. In this version, a fixed value of 1
id(string, required): the NSID of the Lexicon
revision(integer, optional): indicates the version of this Lexicon, if changes have occurred
description(string, optional): short overview of the Lexicon, usually one or two sentences
defs(map of strings-to-objects, required): set of definitions, each with a distinct name (key)
Schema definitions under defs all have a type field to distinguish their type. A file can have at most one definition with one of the "primary" types. Primary types should always have the name main. It is possible for main to describe a non-primary type.
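The file-level rules above can be sketched as a small structural check. This is a minimal illustration, not a full validator: the function name, the example NSID `com.example.getThing`, and the error strings are hypothetical, and the check assumes the fixed version value is 1, in line with "version 1" of the language.

```python
PRIMARY_TYPES = {"query", "procedure", "subscription", "record"}


def check_lexicon_file(doc: dict) -> list:
    """Return a list of problems found in the top-level Lexicon structure."""
    problems = []
    version = doc.get("lexicon")
    # bool is a subclass of int in Python, so compare the exact type.
    if type(version) is not int or version != 1:
        problems.append("lexicon must be the integer 1")
    if not isinstance(doc.get("id"), str):
        problems.append("id (NSID string) is required")
    defs = doc.get("defs")
    if not isinstance(defs, dict) or not defs:
        problems.append("defs must contain at least one definition")
        return problems
    primary = [
        name
        for name, definition in defs.items()
        if isinstance(definition, dict) and definition.get("type") in PRIMARY_TYPES
    ]
    if len(primary) > 1:
        problems.append("at most one primary-type definition is allowed")
    for name in primary:
        if name != "main":
            problems.append(f"primary definition {name!r} should be named 'main'")
    return problems


# Hypothetical Lexicon document, written as a Python dict instead of JSON.
example = {
    "lexicon": 1,
    "id": "com.example.getThing",
    "defs": {
        "main": {"type": "query", "description": "Fetch a thing."},
        "someView": {"type": "object", "properties": {}},
    },
}
```

Calling `check_lexicon_file(example)` returns an empty list. A document with an empty `defs` map is reported as invalid.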
References to specific definitions within a Lexicon use fragment syntax, like com.example.defs#someView. If a main definition exists, it can be referenced without a fragment, just using the NSID.
The semantics of the revision field have not been worked out yet, but are intended to help third parties identify the most recent among multiple versions or copies of a Lexicon.
Related Lexicons are often grouped together in the NSID hierarchy. As a convention, any definitions used by multiple Lexicons are defined in a dedicated *.defs Lexicon (eg, com.atproto.server.defs) within the group. A *.defs Lexicon should not include a definition named main, though it is not strictly invalid to do so.
Primary Type Definitions
The primary types are:
query: describes an XRPC Query (HTTP GET)
procedure: describes an XRPC Procedure (HTTP POST)
subscription: Event Stream (WebSocket)
record: describes an object that can be stored in a repository record
Each primary definition schema object includes these fields:
type(string, required): the type value (eg, record)
description(string, optional): short, usually only a sentence or two
Record
key(string, required): specifies the Record Key type
record(object, required): a schema definition with type object, which specifies this type of record
Query and Procedure (HTTP API)
parameters(object, optional): a schema definition with type params, describing the HTTP query parameters for this endpoint
output(object, optional): describes the HTTP response body
description(string, optional): short description
encoding(string, required): MIME type for body contents. Use application/json for JSON responses.
schema(object, optional): schema definition, either an object, a ref, or a union of refs. Used to describe JSON encoded responses, though schema is optional even for JSON responses.
input(object, optional, only for procedure): describes HTTP request body schema, with the same format as the output field
errors(array of objects, optional): set of string error codes which might be returned
name(string, required): short name for the error type, with no whitespace
description(string, optional): short description, one or two sentences
Subscription (Event Stream)
parameters(object, optional): same as Query and Procedure
message(object, optional): specifies what messages can be
description(string, optional): short description
schema(object, required): schema definition, which must be a union of refs
errors(array of objects, optional): same as Query and Procedure
Subscription schemas (referenced by the schema field under message) must be a union of refs, not an object type.
Field Type Definitions
As with the primary definitions, every schema object includes these fields:
type(string, required): fixed value for each type
description(string, optional): short, usually only a sentence or two
null
No additional fields.
boolean
default(boolean, optional): a default value for this field
const(boolean, optional): a fixed (constant) value for this field
When included as an HTTP query parameter, should be rendered as true or false (no quotes).
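As a rough sketch of the query-parameter rendering described above, the helper below turns a dict of parameters into a query string, writing booleans as unquoted lower-case words. The function name and the parameter names `includeReplies` and `limit` are hypothetical.

```python
from urllib.parse import urlencode


def render_query(params: dict) -> str:
    """Render parameters as an HTTP query string.

    Booleans become the bare words true / false. Python's own str(False)
    would give "False", which is not the intended rendering.
    """
    pairs = []
    for key, value in params.items():
        if isinstance(value, bool):
            pairs.append((key, "true" if value else "false"))
        else:
            pairs.append((key, str(value)))
    return urlencode(pairs)


query = render_query({"includeReplies": False, "limit": 25})
# query == "includeReplies=false&limit=25"
```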
integer
A signed integer number.
minimum(integer, optional): minimum acceptable value
maximum(integer, optional): maximum acceptable value
enum(array of integers, optional): a closed set of allowed values
default(integer, optional): a default value for this field
const(integer, optional): a fixed (constant) value for this field
string
format(string, optional): string format restriction
maxLength(integer, optional): maximum length of value, in UTF-8 bytes
minLength(integer, optional): minimum length of value, in UTF-8 bytes
maxGraphemes(integer, optional): maximum length of value, counted as Unicode Grapheme Clusters
minGraphemes(integer, optional): minimum length of value, counted as Unicode Grapheme Clusters
knownValues(array of strings, optional): a set of suggested or common values for this field. Values are not limited to this set (aka, not a closed enum).
enum(array of strings, optional): a closed set of allowed values
default(string, optional): a default value for this field
const(string, optional): a fixed (constant) value for this field
Strings are Unicode. For non-Unicode encodings, use bytes instead. The basic minLength and maxLength validation constraints count UTF-8 bytes. The minGraphemes and maxGraphemes validation constraints work with Grapheme Clusters, which have a complex technical and linguistic definition, but loosely correspond to "distinct visual characters" like Latin letters, CJK characters, punctuation, digits, or emoji (which might comprise multiple Unicode codepoints and many UTF-8 bytes).
format constrains the string format and provides additional semantic context. Refer to the Data Model specification for the available format types and their definitions.
const and default are mutually exclusive.
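The difference between the length measures can be illustrated with a short sketch. It checks only minLength and maxLength (counted in UTF-8 bytes); the function name is hypothetical. Grapheme counting is left out on purpose, because the Python standard library has no grapheme cluster segmentation and a third-party Unicode segmentation library would be needed.

```python
def check_string_length(value: str, schema: dict) -> bool:
    """Check minLength / maxLength, both counted in UTF-8 bytes."""
    size = len(value.encode("utf-8"))
    if "minLength" in schema and size < schema["minLength"]:
        return False
    if "maxLength" in schema and size > schema["maxLength"]:
        return False
    return True


# A "family" emoji: three emoji joined by two zero-width joiners.
# It displays as a single grapheme cluster, but is 5 code points
# and 18 UTF-8 bytes.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
code_points = len(family)               # 5
utf8_bytes = len(family.encode("utf-8"))  # 18
```

With `{"maxLength": 10}` the emoji above fails the check even though a reader sees one character, which is why a schema may want maxGraphemes in addition to maxLength.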
bytes
minLength(integer, optional): minimum size of value, as raw bytes with no encoding
maxLength(integer, optional): maximum size of value, as raw bytes with no encoding
cid-link
No type-specific fields.
See Data Model spec for CID restrictions.
array
items(object, required): describes the schema elements of this array
minLength(integer, optional): minimum count of elements in array
maxLength(integer, optional): maximum count of elements in array
In theory arrays have homogeneous types (meaning every element has the same type). However, with union types this restriction is meaningless, so implementations cannot assume that all the elements have the same type.
object
A generic object schema which can be nested inside other definitions by reference.
properties(map of strings-to-objects, required): defines the properties (fields) by name, each with their own schema
required(array of strings, optional): indicates which properties are required
nullable(array of strings, optional): indicates which properties can have null as a value
As described in the data model specification, there is a semantic difference in data between omitting a field; including the field with the value null; and including the field with a "false-y" value (false, 0, empty array, etc).
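The three cases (omitted, null, false-y) can be told apart with a sentinel value. The sketch below checks only the required and nullable lists of an object schema; the function name, the field names, and the error strings are hypothetical, and per-field type validation is left out.

```python
_MISSING = object()  # sentinel: distinguishes "omitted" from "present but null"


def check_object_fields(value: dict, schema: dict) -> list:
    """Check presence and nullability of an object's declared properties."""
    problems = []
    required = set(schema.get("required", []))
    nullable = set(schema.get("nullable", []))
    for name in schema.get("properties", {}):
        field = value.get(name, _MISSING)
        if field is _MISSING:
            if name in required:
                problems.append(f"{name}: required field is missing")
        elif field is None and name not in nullable:
            problems.append(f"{name}: null is not allowed")
        # A false-y value such as "" or 0 is present and non-null: accepted here.
    return problems


schema = {
    "type": "object",
    "properties": {
        "text": {"type": "string"},
        "langs": {"type": "array", "items": {"type": "string"}},
        "count": {"type": "integer"},
    },
    "required": ["text"],
    "nullable": ["langs"],
}
```

For example, `{"text": "", "langs": None}` passes, while `{"text": "hi", "count": None}` is rejected because count is not listed as nullable.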
blob
accept(array of strings, optional): list of acceptable MIME types. Each may end in * as a glob pattern. Use */* to indicate that any MIME type is accepted.
maxSize(integer, optional): maximum size in bytes
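One way to read the accept patterns is as exact strings with an optional trailing-star prefix match. The sketch below follows that reading; the function name is hypothetical, and the behaviour when accept is absent is left to the caller because it is not stated here.

```python
def mime_accepted(mime: str, accept: list) -> bool:
    """Return True if mime matches any pattern in the accept list."""
    for pattern in accept:
        if pattern == "*/*":
            return True  # any MIME type is accepted
        if pattern.endswith("*"):
            # Trailing glob: compare everything before the star as a prefix.
            if mime.startswith(pattern[:-1]):
                return True
        elif mime == pattern:
            return True
    return False
```

For example, `mime_accepted("image/png", ["image/*"])` is true, while `mime_accepted("video/mp4", ["image/*", "text/plain"])` is false.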
params
This is a limited-scope type which is only ever used for the parameters field on query, procedure, and subscription primary types. These map to HTTP query parameters.
required(array of strings, optional): same semantics as field on object
properties: similar to properties under object, but can only include the types boolean, integer, string, and unknown; or an array of one of these types
Note that unlike object, there is no nullable field on params.
token
Tokens are empty data values which exist only to be referenced by name. They are used to define a set of values with specific meanings. The description field should clarify the meaning of the token.
Tokens are similar to the concept of a "symbol" in some programming languages, distinct from strings, variables, built-in keywords, or other identifiers.
For example, tokens could be defined to represent the state of an entity (in a state machine), or to enumerate a list of categories.
No type-specific fields.
ref
ref(string, required): reference to another schema definition
Refs are a mechanism for re-using a schema definition in multiple places. The ref string can be a global reference to a Lexicon type definition (an NSID, optionally with a #-delimited name indicating a definition other than main), or can indicate a local definition within the same Lexicon file (a # followed by a name).
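The three forms of ref string (global with fragment, global without fragment, and local) can be normalized to an (NSID, definition name) pair. This is a minimal sketch: the function name and the example NSIDs are hypothetical, and NSID syntax itself is not validated.

```python
def resolve_ref(ref: str, current_nsid: str) -> tuple:
    """Split a ref string into (nsid, definition_name).

    current_nsid is the NSID of the Lexicon file containing the ref,
    and is used for local refs that start with '#'.
    """
    nsid, sep, name = ref.partition("#")
    if sep and not name:
        raise ValueError("ref has a '#' but no definition name")
    if not nsid:
        nsid = current_nsid  # local ref: "#someView"
    return (nsid, name if sep else "main")


global_ref = resolve_ref("com.example.defs#someView", "com.example.post")
# ("com.example.defs", "someView")
bare_ref = resolve_ref("com.example.profile", "com.example.post")
# ("com.example.profile", "main")
local_ref = resolve_ref("#someView", "com.example.post")
# ("com.example.post", "someView")
```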
union
refs(array of strings, required): references to schema definitions
closed(boolean, optional): indicates if a union is "open" or "closed". Defaults to false (open union)
Unions represent that multiple possible types could be present at this location in the schema. The references follow the same syntax as ref, allowing references to either global or local schema definitions. Actual data will validate against a single specific type: the union does not combine fields from multiple schemas, or define a new hybrid data type. The different types are referred to as variants.
By default unions are "open", meaning that future revisions of the schema could add more types to the list of refs (though can not remove types). This means that implementations should be permissive when validating, in case they do not have the most recent version of the Lexicon. The closed flag (boolean) can indicate that the set of types is fixed and can not be extended in the future.
A union schema definition with no refs is allowed and similar to unknown, as long as the closed flag is false (the default). An empty refs list with closed set to true is an invalid schema.
The schema definitions pointed to by a union are generally objects or types with a clear mapping to an object, like a record. All the variants must be represented by a CBOR map (or JSON Object) and include a $type field indicating the variant type.
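The open and closed behaviour can be sketched as a small classifier. It assumes the refs list has already been normalized to the same full form that appears in $type values; the function name, the return labels, and the example type names are hypothetical.

```python
def check_union(value, refs: list, closed: bool = False) -> str:
    """Classify a union value by its $type.

    Returns "known" when $type is listed in refs, and "unknown-variant"
    when it is not listed but the union is open. Raises ValueError for an
    invalid schema or an invalid value.
    """
    if closed and not refs:
        raise ValueError("closed union with empty refs is an invalid schema")
    if not isinstance(value, dict) or not isinstance(value.get("$type"), str):
        raise ValueError("union variants must be objects with a $type string")
    if value["$type"] in refs:
        return "known"
    if closed:
        raise ValueError("variant is not allowed by this closed union")
    # Open union: a newer Lexicon revision may have added this variant.
    return "unknown-variant"


refs = ["com.example.defs#imageView", "com.example.defs#linkView"]
```

An open union lets a consumer with an out-of-date Lexicon accept `{"$type": "com.example.defs#videoView"}` and decide separately how to handle the unrecognized variant. The same value is rejected when closed is true.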
unknown
Indicates that any data could appear at this location, with no specific validation. Note that the data must still be valid under the data model: it can't contain unsupported things like floats.
No type-specific fields.
Strings can optionally be constrained to one of the following format types:
at-identifier: either a Handle or a DID, details described below
cid: CID in string format, details specified in Data Model
datetime: timestamp, details specified below
did: generic DID Identifier
handle: Handle Identifier
nsid: Namespaced Identifier
uri: generic URI, details specified below
language: language code, details specified below
For the various identifier formats, when doing Lexicon schema validation the most expansive identifier syntax format should be permitted. Problems with identifiers which do pass basic syntax validation should be reported as application errors, not lexicon data validation errors. For example, data with any kind of DID in a did format string field should pass Lexicon validation, with unsupported DID methods being raised separately as an application error.
at-identifier
A string type which is either a DID (type: did) or a handle (handle). Mostly used in XRPC query parameters. It is unambiguous whether an at-identifier is a handle or a DID because a DID always starts with did:, and the colon character (:) is not allowed in handles.
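The disambiguation rule can be written as a one-step prefix check. This sketch only classifies the value; it does not validate DID or handle syntax beyond the colon rule, and the function name and example identifiers are hypothetical.

```python
def classify_at_identifier(value: str) -> str:
    """Return "did" or "handle" for an at-identifier string."""
    if value.startswith("did:"):
        return "did"
    if ":" in value:
        # Handles never contain a colon, so this is neither form.
        raise ValueError("not a DID, and colons are not allowed in handles")
    return "handle"
```

For example, `classify_at_identifier("did:example:abc123")` returns "did" and `classify_at_identifier("alice.example.com")` returns "handle".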
datetime
Full-precision date and time, with timezone information.
Datetime format standards are notoriously flexible and overlapping. Datetime strings in atproto should meet the intersecting requirements of RFC 3339, ISO 8601, and the WHATWG HTML standard.
Best practice is to use UTC timezone, and represent this with a capitalized Z suffix. An upper-case T is required for separating the "date" and "time" parts.
Whole seconds precision is required, and arbitrary fractional precision digits are allowed. Best practice is to use at least millisecond precision, and to pad with zeros to the generated precision (eg, trailing :12.340Z instead of :12.34Z). Not all datetime formatting libraries support trailing zero formatting. Both millisecond and microsecond precision have reasonable cross-language support; nanosecond precision does not.
Implementations should be aware of two ambiguities when round-tripping records containing datetimes: loss of precision, and ambiguity with trailing fractional second zeros. If de-serializing Lexicon records into native types, and then re-serializing, the string representation may not be the same, which could result in broken hash references, sanity check failures, or repository update churn. A safer thing to do is to deserialize the datetime as a simple string, which ensures round-trip re-serialization.
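The trailing-zero ambiguity is easy to reproduce. The sketch below uses Python's standard datetime type as one example of a native type; the helper name is hypothetical, and other languages and libraries will differ in the details.

```python
from datetime import datetime


def roundtrip(text: str) -> str:
    """Parse a datetime string into a native type and serialize it again."""
    return datetime.fromisoformat(text).isoformat()


original = "1985-04-12T23:20:50.120+00:00"
reserialized = roundtrip(original)
# reserialized == "1985-04-12T23:20:50.120000+00:00"
# Same instant, different string: any hash computed over the record changes.
```

Keeping the field as a plain string in the record structure, and parsing it only when a native value is actually needed, avoids the problem.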
// preferred
1985-04-12T23:20:50.123Z
1985-04-12T23:20:50.123456Z
1985-04-12T23:20:50.120Z
1985-04-12T23:20:50.120000Z
// supported
1985-04-12T23:20:50.1235678912345Z
1985-04-12T23:20:50.100Z
1985-04-12T23:20:50Z
1985-04-12T23:20:50.0Z
1985-04-12T23:20:50.123+00:00
1985-04-12T23:20:50.123-07:00
// invalid
1985-04-12 23:20:50.123Z
1985-04-12t23:20:50.123Z
1985-04-12T23:20:50.123z
1985-04-12
1985-04-12T23:20Z
1985-04-12T23:20:5Z
1985-04-12T23:20:50.123
+001985-04-12T23:20:50.123Z
23:20:50.123Z
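The examples above can be separated with a single regular expression. This is a syntax-only sketch written for this illustration, not a pattern taken from any atproto library: it does not check that the month, day, or hour values are in range, and the names `DATETIME_RE` and `is_datetime_syntax` are hypothetical.

```python
import re

# Four-digit year, upper-case T, whole seconds, optional fractional digits,
# and a mandatory timezone: either upper-case Z or a +hh:mm / -hh:mm offset.
DATETIME_RE = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
    r"T[0-9]{2}:[0-9]{2}:[0-9]{2}"
    r"(\.[0-9]+)?"
    r"(Z|[+-][0-9]{2}:[0-9]{2})"
)


def is_datetime_syntax(value: str) -> bool:
    """Return True if value has the expected datetime shape."""
    return DATETIME_RE.fullmatch(value) is not None


accepted = [
    "1985-04-12T23:20:50.123Z",
    "1985-04-12T23:20:50Z",
    "1985-04-12T23:20:50.123-07:00",
]
rejected = [
    "1985-04-12 23:20:50.123Z",  # space instead of T
    "1985-04-12T23:20:50.123z",  # lower-case z
    "1985-04-12T23:20:50.123",   # no timezone
]
```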
uri
Flexible to any URI scheme, following the generic RFC-3986 on URIs. This includes, but isn’t limited to: ipfs (for CIDs), dns, and of course at.
Maximum length in Lexicons is 8 KBytes.
language
An IETF Language Tag string, compliant with BCP 47, defined in RFC 5646 ("Tags for Identifying Languages"). This is the same standard used to identify languages in HTTP, HTML, and other web standards. The Lexicon string must validate as a "well-formed" language tag, as defined in the RFC. Clients should ignore language strings which are "well-formed" but not "valid" according to the RFC.
As specified in the RFC, ISO 639 two-character and three-character language codes can be used on their own, lower-cased, such as ja (Japanese) or ban (Balinese). Regional sub-tags can be added, like pt-BR (Brazilian Portuguese). Additional subtags can also be added, such as
Language codes generally need to be parsed, normalized, and matched semantically, not simply string-compared. For example, a search engine might simplify language tags to ISO 639 codes for indexing and filtering, while a client application (user agent) would retain the full language code for presentation (text rendering) locally.
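As a deliberately simplified sketch of the indexing case mentioned above, the helper below reduces a tag to its first subtag. It assumes the input is already a well-formed tag whose first subtag is a language code; real matching and normalization should use a proper BCP 47 implementation. The function name is hypothetical.

```python
def primary_language(tag: str) -> str:
    """Return the lower-cased first subtag of a language tag."""
    return tag.split("-", 1)[0].lower()


index_key = primary_language("pt-BR")
# index_key == "pt"; a client would still keep "pt-BR" for presentation
```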
When to use $type
Data objects sometimes include a $type field which indicates their Lexicon type. The general principle is that this field needs to be included any time there could be ambiguity about the content type when validating data.
The specific rules are:
record objects must always include $type. While the type is often known from context (eg, the collection part of the path for records stored in a repository), record objects can also be passed around outside of repositories and need to be self-describing
union variants must always include $type, except at the top level of
blob objects always include $type, which allows generic processing.
Lexicons are allowed to change over time, within some bounds to ensure both forwards and backwards compatibility. The basic principle is that all old data must still be valid under the updated Lexicon, and new data must be valid under the old Lexicon.
- Any new fields must be optional
- Non-optional fields can not be removed. A best practice is to retain all fields in the Lexicon and mark them as deprecated if they are no longer used.
- Types can not change
- Fields can not be renamed
If larger breaking changes are necessary, a new Lexicon name must be used.
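The evolution rules above can be approximated by comparing two revisions of an object schema. This is a rough sketch under stated assumptions: it looks only at the top-level properties and required lists, compares only the type field of each property, and cannot tell a rename from a removal followed by an addition. The function name and messages are hypothetical.

```python
def evolution_problems(old: dict, new: dict) -> list:
    """List compatibility problems when moving from the old to the new schema."""
    problems = []
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    old_required = set(old.get("required", []))
    new_required = set(new.get("required", []))

    for name in sorted(old_props):
        if name not in new_props:
            if name in old_required:
                problems.append(f"{name}: required field was removed")
            continue
        if new_props[name].get("type") != old_props[name].get("type"):
            problems.append(f"{name}: type changed")

    # Anything newly required (a new field, or an old optional one)
    # would make some previously valid data invalid.
    for name in sorted(new_required - old_required):
        problems.append(f"{name}: newly required field breaks old data")
    return problems


old = {
    "properties": {"text": {"type": "string"}, "count": {"type": "integer"}},
    "required": ["text"],
}
compatible = {
    "properties": {
        "text": {"type": "string"},
        "count": {"type": "integer"},
        "langs": {"type": "array"},  # new and optional: allowed
    },
    "required": ["text"],
}
```

Here `evolution_problems(old, compatible)` returns an empty list, since the only change is a new optional field.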
It can be ambiguous when a Lexicon has been published and becomes "set in stone". At a minimum, public adoption and implementation by a third party, even without explicit permission, indicates that the Lexicon has been released and should not break compatibility. A best practice is to clearly indicate in the Lexicon type name any experimental or development status. Eg,
Authority and Control
The authority for a Lexicon is determined by the NSID, and rooted in DNS control of the domain authority. That authority has ultimate control over the Lexicon definition, and responsibility for maintenance and distribution of Lexicon schema definitions.
In a crisis, such as unintentional loss of DNS control to a bad actor, the protocol ecosystem could decide to disregard this chain of authority. This should only be done in exceptional circumstances, and not as a mechanism to subvert an active authority. The primary mechanism for resolving protocol disputes is to fork Lexicons into a new namespace.
Protocol implementations should generally consider data which fails to validate against the Lexicon to be entirely invalid, and should not try to repair or do partial processing on the individual piece of data.
Unexpected fields in data which otherwise conforms to the Lexicon should be ignored. When doing schema validation, they should be treated at worst as warnings. This is necessary to allow evolution of the schema by the controlling authority, and to be robust in the case of out-of-date Lexicons.
Third parties can technically insert any additional fields they want into data. This is not the recommended way to extend applications, but it is not specifically disallowed. One danger with this is that the Lexicon may be updated to include fields with the same field names but different types, which would make existing data invalid.
Usage and Implementation Guidelines
It should be possible to translate Lexicon schemas to JSON Schema or OpenAPI and use tools and libraries from those ecosystems to work with atproto data in JSON format.
Implementations which serialize and deserialize data from JSON or CBOR into structures derived from specific Lexicons should be aware of the risk of "clobbering" unexpected fields. For example, if a Lexicon is updated to add a new (optional) field, old implementations would not be aware of that field, and might accidentally strip the data when de-serializing and then re-serializing. Depending on the context, one way to avoid this problem is to retain any "extra" fields, or to pass through the original data object instead of re-serializing it.
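One way to retain "extra" fields is to keep them in a side dictionary on the typed structure and merge them back on output. The sketch below uses JSON only; the Post class and its field names are hypothetical and stand in for a structure generated from an older revision of some Lexicon.

```python
import json


class Post:
    """Typed view of a record that knows only about text and createdAt."""

    KNOWN = ("text", "createdAt")

    def __init__(self, text: str, created_at: str, extra: dict):
        self.text = text
        self.created_at = created_at  # kept as a string so it round-trips exactly
        self.extra = extra            # fields this implementation does not understand

    @classmethod
    def from_json(cls, raw: str) -> "Post":
        data = json.loads(raw)
        extra = {k: v for k, v in data.items() if k not in cls.KNOWN}
        return cls(data["text"], data["createdAt"], extra)

    def to_json(self) -> str:
        data = dict(self.extra)  # start from the unknown fields
        data["text"] = self.text
        data["createdAt"] = self.created_at
        return json.dumps(data)


raw = '{"text": "hi", "createdAt": "1985-04-12T23:20:50.120Z", "langs": ["en"]}'
post = Post.from_json(raw)
```

After `Post.from_json(raw).to_json()`, the langs field added by a newer Lexicon revision is still present in the output, even though this class has no attribute for it.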
Possible Future Changes
The validation rules for unexpected additional fields may change. For example, a mechanism for Lexicons to indicate that the schema is "closed" and unexpected fields are not allowed, or a convention around field name prefixes (x-) to indicate unofficial extension.