Event Platform/Schemas
Motivation and Overview
Event Schemas are essential for an Event Streaming Platform. They allow disparate, continuously changing producers and consumers to reliably communicate with each other. By explicitly declaring the shape of data, schemas ease integration between various systems.
Schemas should be readily available to any producer or consumer code that might need them. Schemas are needed to validate data, but they can also be used to automate data integration tasks, e.g. auto creation of SQL tables into which events will be imported. Access to those schemas should be reliable, and the schemas should be immutable for any given deployed service.
WMF uses JSON as our preferred in-flight data serialization format, and as such we have chosen* to use JSONSchema for our event schemas. Schema evolution is necessary to be able to reliably upgrade producer and consumer code, but unfortunately, JSONSchema does not have any built-in features for schema evolution. Therefore, each change (even a small one) requires the creation of a totally separate JSONSchema file.
WMF has chosen to distribute schemas using Git. This allows us to do development, CI, versioning and deployment for schemas the same way we do any code project. However, even though we use Git, we do not rely on Git history for schema versioning. Each schema version is an explicit static file in the schema repository. For more background, see RFC: Modern Event Platform: Schema Registry.
To make development of many schema version files in Git easier, WMF has developed the jsonschema-tools library. This tooling makes it easier for developers to design and evolve schemas dynamically while allowing production services to use static and immutable versions of those schemas.
jsonschema-tools will be used in the rest of this documentation to set up and develop schemas in a Git schema repository. Please skim the jsonschema-tools README before proceeding.
jsonschema-tools is a NodeJS module, so you'll need a recent (Node 10 or greater) version of NodeJS and npm installed. You can get NodeJS and npm at nodejs.org. Once installed, cd to the schema repository and run npm install. Heads-up: the full path to the directory cannot contain spaces. For example, ~/Documents/analytics\ engineering/event\ schemas/primary is likely to yield errors, but ~/Documents/analytics-engineering/event-schemas/primary would be fine.
*There are plenty of other schema technologies out there (Avro, Thrift, etc.), but JSON and JSONSchema fit our use cases better than any of those. For more information about how JSONSchema was chosen, see RFC: Modern Event Platform - Choose Schema Tech, https://techblog.wikimedia.org/2020/09/10/wikimedias-event-data-platform-or-json-is-ok-too/ and https://blog.wikimedia.org/2017/01/13/json-hadoop-kafka/.
Event Schema Design Rules and Conventions
Event Platform/Schemas/Guidelines
Schema Repositories
A schema repository is a Git repository with a hierarchy of versioned JSONSchema files, with a file layout something like:
jsonschema
└── analytics
    ├── button
    │   ├── click
    │   │   ├── 1.0.0 -> 1.0.0.yaml
    │   │   ├── 1.0.0.yaml
    │   │   ├── current.yaml
    │   │   └── latest -> 1.0.0
    │   └── release
    │       ├── 1.0.0 -> 1.0.0.yaml
    │       ├── 1.0.0.yaml
    │       ├── 1.0.1 -> 1.0.1.yaml
    │       ├── 1.0.1.yaml
    │       ├── current.yaml
    │       └── latest -> 1.0.1
    └── page_preview
        └── visibility_change
            ├── 1.0.0 -> 1.0.0.yaml
            ├── 1.0.0.yaml
            ├── 2.0.0 -> 2.0.0.yaml
            ├── 2.0.0.yaml
            ├── current.yaml
            └── latest -> 2.0.0
JSONSchema has title and $id fields that we use to associate event data with a schema, as well as for semantically versioning schemas. The actual hierarchy layout shown here is arbitrary, but each schema's title and $id must match the layout in a specific way. More on this below.
Note the 'current.yaml' files. These files represent the current working version of the schema. The current schemas are never themselves used as a schema for validation or data integration. Instead, they are 'materialized' by jsonschema-tools into static versioned schema files. These versioned schema files are the canonical schemas used by event processing systems.
Hierarchy Rules
Each schema's title should match its relative path in the schema repository. E.g. all schema version files in namespace1/entity1/verbB should have title: namespace1/entity1/verbB. Each schema's $id field should be set to the path (starting with /) and (extensionless) version. E.g. namespace1/entity1/verbB/1.0.1.yaml should have $id: /namespace1/entity1/verbB/1.0.1.
This layout, combined with the title and $id, allows event data to point specifically to its schema via a relative URI. By semantically versioning schema files, jsonschema-tools is able to associate schemas with the same title and enforce backwards compatibility. The relative and versioned $id URIs can also be used as JSON $ref links and with JSON Pointers. More on this below as well.
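As a rough illustration of the hierarchy rules, the sketch below derives the title and $id that a versioned schema file would be expected to declare from its path relative to the schema base directory. The function name expectedTitleAndId is hypothetical and is not part of jsonschema-tools.

```javascript
'use strict';

// Hypothetical helper (not part of jsonschema-tools): given a schema file path
// relative to the schema base directory, derive the title and $id that the
// hierarchy rules say the file should declare.
function expectedTitleAndId(relativePath) {
    // Drop the file extension, e.g. '1.0.1.yaml' -> '1.0.1'.
    const parts = relativePath.replace(/\.(yaml|json)$/, '').split('/');
    // The last path segment is the version; everything before it is the title.
    const version = parts.pop();
    const title = parts.join('/');
    return { title, $id: `/${title}/${version}` };
}

const expected = expectedTitleAndId('namespace1/entity1/verbB/1.0.1.yaml');
console.log(expected.title); // namespace1/entity1/verbB
console.log(expected.$id);   // /namespace1/entity1/verbB/1.0.1
```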
Creating a new schema repository
Most likely you will already be working with a schema repository. If so, skip to Creating a new schema or Modifying schemas.
jsonschema-tools is a NodeJS library and CLI for managing JSONSchema Git repositories. To create a new schema repository, you'll create a package.json file, install and configure jsonschema-tools, and set up jsonschema-tools tests.
mkdir my_schema_repository
cd my_schema_repository
git init .
# Our schemas will go in the jsonschema/ directory
mkdir jsonschema
# Create a configuration file for jsonschema-tools.
echo -e 'schemaBasePath: ./jsonschema/\nlogLevel: info' > .jsonschema-tools.yaml
# Create a package.json file. (Modify this as desired.)
echo '
{
  "name": "my_schema_repository",
  "scripts": {
    "test": "mocha test/jsonschema",
    "build-modified": "jsonschema-tools materialize-modified --no-git-add",
    "build-new": "jsonschema-tools materialize"
  },
  "devDependencies": {
    "@wikimedia/jsonschema-tools": "^0.6.0",
    "mocha": "^6.2.0"
  }
}
' > package.json
# Install jsonschema-tools.
npm install .
# Install jsonschema-tools tests.
mkdir -p test/jsonschema
echo "
'use strict';
require('@wikimedia/jsonschema-tools').tests.all({ logLevel: 'info' });
" > test/jsonschema/repository.test.js
# Create the first git commit.
echo 'node_modules**' >> .gitignore
git add .
git commit -m 'New schema repository'
Creating a new schema
Once you are working in a repository with jsonschema-tools, you can create new schemas. By 'new schema', we mean a brand new schema lineage, not just a new schema version. To create a new schema, we first need to decide on its title (and hierarchy), create the directory structure, write a new current.yaml schema file, and materialize the schema. For this example, we'll create a new event schema that represents a MediaWiki UI button click.
NOTE: since you will be writing JSONSchema, you should probably know how to do that. See this tutorial and reference for help working with JSONSchema.
mkdir -p jsonschema/mediawiki/desktop/button/click
Open jsonschema/mediawiki/desktop/button/click/current.yaml. We'll build this up piece by piece and explain each part.
Schema metadata
First we need some schema metadata that describe and identify the schema. Note that this schema metadata is not describing any aspect of your event data.
# This is the title of the schema.
# It should match the relative path to this file's parent directory.
title: mediawiki/desktop/button/click
# Document what the schema represents.
description: Mediawiki desktop web button clicked
# The $id uniquely identifies this schema. It should be a versioned (and extensionless) URI.
$id: /mediawiki/desktop/button/click/1.0.0
# This is the meta-schema of this schema. This should probably always be the same
# for every schema, and should point to the main JSONSchema meta-schema at json-schema.org.
$schema: https://json-schema.org/draft-07/schema#
Event fields
...continuing on to event data fields. Your event should be a JSON object with each field explicitly declared here.
type: object
additionalProperties: false
properties:
Required event data
In addition to the $schema field, WMF has defined common fields for event data. These common fields give us some consistency across all event data, and are also used to support backend functionality (deduplication, Hive table ingestion, etc.).
$schema
Each event needs to identify its schema. Right now we are just writing the schema, but later on your code will produce JSON event data that conforms to this schema. We need to be able to look up the schema for any given event just from the event data itself. To do this, we re-use the JSONSchema $schema field in the event properties.
  $schema:
    type: string
    description: >
      The URI identifying the JSONSchema for this event. This should be
      a short URI containing only the name and version at the end of the
      URI path. e.g. /schema_name/1.0.0 is acceptable. This should match
      the schema's $id field.
Timestamps: meta.dt and dt
These timestamps have different semantics, but in most cases they will be very close, if not the same. These are both ISO-8601 UTC datetime strings, e.g. '2020-07-01T00:00:00Z'.
Every event happens at a certain date-time. That event time should be stored in the dt field.
meta.dt can be used as the event ingestion time, i.e. the time at which the intake system has received the event. Depending on the pipeline your event is flowing through, this might be set at different levels. For events that are received first by our intake service (EventGate), this will be set by it, if it is not already set by the client.
NOTE: meta.dt will be used as the Kafka timestamp as well as for Hive hourly partitioning. If you don't have strict control over your event producers (e.g. remote browser clients), you should allow EventGate to fill in this field so that you don't end up with incorrect timestamps.
NOTE: As of 2020-12, these meta.dt and dt conventions are not fully adopted in existing schemas, but all new schemas should use these conventions. See T240460 and T267648.
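The difference between the two timestamps can be sketched as follows. This is only an illustration of the "set meta.dt if the client did not" behaviour described above, not EventGate's actual code; the function name withIngestionTime and the sample event are hypothetical.

```javascript
'use strict';

// Hypothetical sketch: an intake step that leaves the event time (dt) alone
// and fills in the ingestion time (meta.dt) only when the client did not set it.
function withIngestionTime(event, receivedAt = new Date()) {
    const meta = { ...(event.meta || {}) };
    if (meta.dt === undefined) {
        // toISOString() always produces an ISO-8601 UTC datetime string.
        meta.dt = receivedAt.toISOString();
    }
    return { ...event, meta };
}

// An event from a remote client that only knows its own event time.
const fromBrowser = {
    dt: '2020-07-01T00:00:00Z',
    meta: { stream: 'mediawiki.desktop.button-click' }
};

const received = withIngestionTime(fromBrowser, new Date('2020-07-01T00:00:02Z'));
console.log(received.dt);      // 2020-07-01T00:00:00Z (event time, untouched)
console.log(received.meta.dt); // 2020-07-01T00:00:02.000Z (ingestion time)
```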
meta.stream
Every event should belong to a named dataset. While events are in flight, this dataset is called a stream of events. Each event needs to specify which stream it belongs to. A single schema can be shared by several streams: for example, the resource_change schema is re-used in the `mediawiki.resource_change`, `transcludes.resource_change`, `change-prop.retry.resource_change`, etc. streams. Likewise, you might design a single generic button_clicked schema for all button clicks, but keep the different types of button click events in different streams.
We do this using the meta.stream field. meta.stream is used for routing incoming events to specific streams and downstream 'datasets'. Each distinct meta.stream will correspond with certain Kafka topics and a Hive table. In most cases, the Kafka topic will be the stream name prefixed with the datacenter name where the event was received.
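A minimal sketch of that routing rule is shown below. The function name routeEvent, the datacenter name 'eqiad' and the '.' separator between datacenter and stream name are assumptions made for illustration; the text above only says that the topic is usually the stream name prefixed with the datacenter name.

```javascript
'use strict';

// Hypothetical sketch: derive the stream and a datacenter-prefixed Kafka topic
// name from an event's meta.stream. The '.' separator is an assumption.
function routeEvent(event, datacenter) {
    const stream = event.meta && event.meta.stream;
    if (typeof stream !== 'string' || stream.length === 0) {
        throw new Error('Event has no meta.stream, so it cannot be routed');
    }
    return { stream, topic: `${datacenter}.${stream}` };
}

const routed = routeEvent({ meta: { stream: 'mediawiki.resource_change' } }, 'eqiad');
console.log(routed.topic); // eqiad.mediawiki.resource_change
```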
There are a few more common and optional meta fields that WMF defines, but we don't need to explain them all here. For now we will write out just these 2 example meta fields.
Later we will show how to include the event meta schema using $ref.
  ### Metadata object. All event schemas should have this.
  meta:
    type: object
    properties:
      dt:
        type: string
        # Whenever a format is used on a field, we require that maxLength is also set.
        # See https://github.com/epoberezkin/ajv#security-risks-of-trusted-schemas
        format: date-time
        maxLength: 128
        description: Time stamp of the event, in ISO-8601 format
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
    required:
      - dt
      - stream
Event data fields
Finally we can add any fields that we really want our event to have.
  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked
The new schema
Here is the new schema we just wrote:
title: mediawiki/desktop/button/click
description: Mediawiki desktop web button clicked
$id: /mediawiki/desktop/button/click/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
properties:
  $schema:
    type: string
    description: >
      The URI identifying the JSONSchema for this event. This should be
      a short URI containing only the name and version at the end of the
      URI path. e.g. /schema_name/1.0.0 is acceptable. This often will
      (and should) match the schema's $id field.
  ### Metadata object. All event schemas should have this.
  meta:
    type: object
    properties:
      dt:
        type: string
        format: date-time
        maxLength: 128
        description: Time stamp of the event, in ISO-8601 format
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
    required:
      - dt
      - stream
  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked
examples:
  - {"$schema": "/mediawiki/desktop/button/click/1.0.0", "meta": {"dt": "2019-01-01T00:00:00Z", "stream": "mediawiki.desktop.button-click"}, "button_name": "Edit source", "page_title": "Delayed-choice quantum eraser"}
Note the examples. This is optional, but can be nice if you want to give schema readers an example of what you expect event data to look like. Notice how the event's $schema matches exactly the schema's $id.
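To make the relationship between an event and its schema concrete, here is a deliberately simplified check. It only compares the event's $schema with the schema's $id and looks for the required meta fields; real validation should be done with a full JSONSchema validator. The function name quickCheck and the trimmed-down schema object are hypothetical.

```javascript
'use strict';

// A trimmed-down, in-memory copy of the schema written above (only the parts
// this sketch looks at).
const buttonClickSchema = {
    $id: '/mediawiki/desktop/button/click/1.0.0',
    properties: {
        meta: { required: ['dt', 'stream'] }
    }
};

// Hypothetical helper: returns a list of problems, empty if none were found.
// This is NOT a substitute for a real JSONSchema validator.
function quickCheck(schema, event) {
    const problems = [];
    if (event.$schema !== schema.$id) {
        problems.push(`$schema ${event.$schema} does not match $id ${schema.$id}`);
    }
    for (const field of schema.properties.meta.required) {
        if (!event.meta || !(field in event.meta)) {
            problems.push(`missing required field meta.${field}`);
        }
    }
    return problems;
}

const exampleEvent = {
    $schema: '/mediawiki/desktop/button/click/1.0.0',
    meta: { dt: '2019-01-01T00:00:00Z', stream: 'mediawiki.desktop.button-click' },
    button_name: 'Edit source',
    page_title: 'Delayed-choice quantum eraser'
};
console.log(quickCheck(buttonClickSchema, exampleEvent).length); // 0
```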
Materializing the schema
jsonschema-tools calls the process of dereferencing, merging and generating the static versioned files 'materializing'. So far, we've saved our new schema as ./jsonschema/mediawiki/desktop/button/click/current.yaml. current.yaml will be the 'current working copy' of a schema. It can contain $ref URI pointers (more on this below). Any changes we make to schemas should always be done on their current.yaml files. We'll use jsonschema-tools to materialize current.yaml into a statically versioned schema file.
WMF's schema repositories are set up with npm scripts to help materialize schemas. (These scripts are just wrappers of the jsonschema-tools CLI).
To materialize a new schema, you'll run npm run build-new:
# materialize the new current.yaml schema
npm run build-new ./jsonschema/mediawiki/desktop/button/click/current.yaml
[2022-03-21 13:57:14.397 +0000]: Dereferencing schema with $id /mediawiki/desktop/button/click/1.0.0 using schema base URIs ./jsonschema/,https://schema.wikimedia.org/repositories/primary/jsonschema/
[2022-03-21 13:57:14.424 +0000]: Materialized schema at jsonschema/mediawiki/desktop/button/click/1.0.0.json.
[2022-03-21 13:57:14.425 +0000]: Materialized schema at jsonschema/mediawiki/desktop/button/click/1.0.0.yaml.
[2022-03-21 13:57:14.425 +0000]: Created latest symlink jsonschema/mediawiki/desktop/button/click/latest.json -> 1.0.0.json.
[2022-03-21 13:57:14.426 +0000]: Created latest symlink jsonschema/mediawiki/desktop/button/click/latest.yaml -> 1.0.0.yaml.
[2022-03-21 13:57:14.426 +0000]: Created extensionless symlink jsonschema/mediawiki/desktop/button/click/1.0.0 -> 1.0.0.yaml.
[2022-03-21 13:57:14.427 +0000]: Created latest symlink jsonschema/mediawiki/desktop/button/click/latest -> 1.0.0.yaml.
# Git add the new current.yaml schema and the materialized schema files.
git add ./jsonschema/mediawiki/desktop/button/click/*
git commit -m 'Created mediawiki/desktop/button/click 1.0.0 schema'
The version to materialize will be obtained from the value of $id in current.yaml. Both yaml and json (by default) files will be materialized, and the versioned extensionless symlink will point to the versioned yaml file (by default).
Alternatively you can manually materialize a schema using the jsonschema-tools CLI. See $(npm bin)/jsonschema-tools --help for more information (on newer npm versions that no longer provide npm bin, npx jsonschema-tools --help should work instead).
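The mapping from $id to materialized files can be sketched as below. The file and symlink names mirror the log output shown above for the default settings; the function name materializedNames is hypothetical and this is not how jsonschema-tools is implemented internally.

```javascript
'use strict';

// Hypothetical sketch: split a versioned $id into title and version, and list
// the file and symlink names the log output above shows for that version.
function materializedNames($id) {
    const match = /^\/(.+)\/(\d+\.\d+\.\d+)$/.exec($id);
    if (!match) {
        throw new Error(`Not a versioned schema $id: ${$id}`);
    }
    const [, title, version] = match;
    return {
        title,
        version,
        files: [`${version}.yaml`, `${version}.json`],
        // symlink name -> target, all within the schema's own directory
        symlinks: {
            [version]: `${version}.yaml`,
            latest: `${version}.yaml`,
            'latest.yaml': `${version}.yaml`,
            'latest.json': `${version}.json`
        }
    };
}

const names = materializedNames('/mediawiki/desktop/button/click/1.0.0');
console.log(names.title);   // mediawiki/desktop/button/click
console.log(names.version); // 1.0.0
```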
Modifying schemas
Versioned schemas should be (mostly) immutable. Once committed and merged, they may be used by many active producers and consumers. An existing version should not be changed (if you think you need to do so, get in touch with the Analytics or Core Platform Engineering teams). Instead, to modify a schema you should create a new backwards compatible version.
Let's add a user_id to our event data. Edit jsonschema/mediawiki/desktop/button/click/current.yaml and add the following at the bottom of the schema.
# ...
  user_id:
    type: string
    description: ID of the user
# Add a user_id onto our examples field too, and point its $schema at the new version:
examples:
  - {"$schema": "/mediawiki/desktop/button/click/1.1.0", "meta": {"dt": "2019-01-01T00:00:00Z", "stream": "mediawiki.desktop.button-click"}, "button_name": "Edit source", "page_title": "Delayed-choice quantum eraser", "user_id": "123"}
Since we've changed the schema, we MUST manually change the version in the schema's $id field. According to semantic versioning, our addition of the user_id field should be a minor version increment. So change $id to:
$id: /mediawiki/desktop/button/click/1.1.0
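The version change itself is a plain semantic-version increment on the last path segment of $id. The helper below is a hypothetical illustration of that arithmetic (jsonschema-tools does not bump the version for you, which is why the edit above is manual).

```javascript
'use strict';

// Hypothetical helper: increment the semantic version at the end of a schema $id.
// level is one of 'major', 'minor' or 'patch'.
function bumpSchemaId($id, level) {
    const parts = $id.split('/');
    let [major, minor, patch] = parts.pop().split('.').map(Number);
    if (level === 'major') {
        major += 1; minor = 0; patch = 0;
    } else if (level === 'minor') {
        minor += 1; patch = 0;
    } else if (level === 'patch') {
        patch += 1;
    } else {
        throw new Error(`Unknown version level: ${level}`);
    }
    return [...parts, `${major}.${minor}.${patch}`].join('/');
}

// Adding a new optional field is a minor increment.
console.log(bumpSchemaId('/mediawiki/desktop/button/click/1.0.0', 'minor'));
// /mediawiki/desktop/button/click/1.1.0
```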
npm run build-modified is able to detect any current.yaml files that have been modified by checking their git status. Before staging the modified schema in Git, run this to materialize all modified current.yaml files:
npm run build-modified
> schemas-event-secondary@1.0.0 build-modified /home/user/my_schema_repository
> jsonschema-tools materialize-modified -G
[2022-03-21 13:59:30.321 +0000]: Looking for modified current.yaml schema files in ./jsonschema/
[2022-03-21 13:59:30.380 +0000]: Materializing /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/current.yaml...
[2022-03-21 13:59:30.385 +0000]: Dereferencing schema with $id /mediawiki/desktop/button/click/1.1.0 using schema base URIs ./jsonschema/,https://schema.wikimedia.org/repositories/primary/jsonschema/
[2022-03-21 13:59:30.405 +0000]: Materialized schema at /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.yaml.
[2022-03-21 13:59:30.407 +0000]: Materialized schema at /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.json.
[2022-03-21 13:59:30.409 +0000]: Created latest symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest.yaml -> 1.1.0.yaml.
[2022-03-21 13:59:30.409 +0000]: Created latest symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest.json -> 1.1.0.json.
[2022-03-21 13:59:30.409 +0000]: Created extensionless symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0 -> 1.1.0.yaml.
[2022-03-21 13:59:30.411 +0000]: Created latest symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest -> 1.1.0.yaml.
[2022-03-21 13:59:30.411 +0000]: New schema files have been materialized. Adding them to git: /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.yaml,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest.yaml,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.json,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest.json
git add jsonschema/mediawiki/desktop/button/click/*
git commit -m '1.1.0 version of mediawiki/desktop/button/click'
Including sub schemas
When materializing schemas, jsonschema-tools will dereference any $ref pointers and merge any allOf it finds. This allows us to DRY up common subschemas to avoid copy/paste bugs. It also allows us to standardize and reuse common fields, e.g. these MediaWiki entity fragment schemas.
For WMF, all event schemas should have a $schema event field, as well as use a common event meta sub object. The Wikimedia common schema is in the primary schema repository (https://schema.wikimedia.org/#!/primary/jsonschema) at /fragment/common.
In our example schema repository, assume we have a common schema at jsonschema/fragment/common/2.0.0 as:
title: fragment/common
description: Common schema fields for event schemas
$id: /fragment/common/2.0.0
$schema: 'https://json-schema.org/draft-07/schema#'
type: object
additionalProperties: false
required:
  - $schema
  - meta
  - dt
properties:
  $schema:
    description: >
      A URI identifying the JSONSchema for this event. This should match a
      schema's $id in a schema repository. E.g. /schema/title/1.0.0
    type: string
  dt:
    description: >
      ISO-8601 formatted timestamp of when the event occurred/was generated
      (in UTC), AKA 'event time'. This is different than meta.dt, which is
      used as the time the system received this event.
    type: string
    format: date-time
    maxLength: 128
  meta:
    type: object
    required:
      - stream
    properties:
      domain:
        description: Domain the event or entity pertains to
        type: string
        minLength: 1
      dt:
        description: 'Time the event was received by the system, in UTC ISO-8601 format'
        type: string
        format: date-time
        maxLength: 128
      id:
        description: Unique ID of this event
        type: string
      request_id:
        description: Unique ID of the request that caused the event
        type: string
      stream:
        description: Name of the stream (dataset) that this event belongs in
        type: string
        minLength: 1
      uri:
        description: Unique URI identifying the event or entity
        type: string
        format: uri-reference
        maxLength: 8192
We want to include this schema (including its required properties) in our button/click example schema. Let's make a new version of the button/click schema and include the common schema using $ref. Edit jsonschema/mediawiki/desktop/button/click/current.yaml to:
title: mediawiki/desktop/button/click
description: Mediawiki desktop web button clicked
$id: /mediawiki/desktop/button/click/1.2.0
$schema: https://json-schema.org/draft-07/schema#
type: object
allOf:
  - $ref: /fragment/common/2.0.0
properties:
  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked
  user_id:
    type: string
    description: ID of the user
examples:
  - {"$schema": "/mediawiki/desktop/button/click/1.2.0", "meta": {"dt": "2019-01-01T00:00:00Z", "stream": "mediawiki.desktop.button-click", "id": "12345678-1234-5678-1234-567812345678"}, "button_name": "Edit source", "page_title": "Delayed-choice quantum eraser", "user_id": "123"}
Notice that we've bumped the version number in $id again to 1.2.0. Materialize and commit this new schema version.
npm run build-modified
# ...
git add ./jsonschema/mediawiki/desktop/button/click/*
git commit -m 'Using $ref to common in new version mediawiki/desktop/button/click 1.2.0'
...
The newly materialized ./jsonschema/mediawiki/desktop/button/click/1.2.0.yaml now has both our schema and the included common schema merged together.
How this works
When jsonschema-tools encounters a $ref, it will attempt to resolve it and then replace it with the resolved content. After dereferencing, anything in allOf is merged together with the top level schema fields to create a fully dereferenced and merged schema without any $ref or allOf keywords.
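The two steps can be sketched in a few lines. This is a simplified illustration, not the jsonschema-tools implementation: it resolves $ref against a hypothetical in-memory map instead of files or URLs, and it only merges a top-level allOf, combining required and properties shallowly.

```javascript
'use strict';

// Hypothetical in-memory stand-in for a schema repository, keyed by $ref path.
const repository = {
    '/fragment/common/2.0.0': {
        type: 'object',
        required: ['$schema', 'meta'],
        properties: {
            $schema: { type: 'string' },
            meta: { type: 'object', required: ['stream'] }
        }
    }
};

// Step 1: replace every { $ref: ... } node with the schema it points to.
function dereference(node) {
    if (Array.isArray(node)) return node.map((item) => dereference(item));
    if (node === null || typeof node !== 'object') return node;
    if (typeof node.$ref === 'string') {
        const target = repository[node.$ref];
        if (target === undefined) throw new Error(`Cannot resolve $ref ${node.$ref}`);
        return dereference(target);
    }
    return Object.fromEntries(
        Object.entries(node).map(([key, value]) => [key, dereference(value)])
    );
}

// Step 2: merge each allOf entry with the top-level fields (top level wins),
// taking the union of 'required' and combining 'properties'.
function mergeAllOf(schema) {
    const { allOf = [], ...rest } = schema;
    return [...allOf, rest].reduce((merged, part) => ({
        ...merged,
        ...part,
        required: [...new Set([...(merged.required || []), ...(part.required || [])])],
        properties: { ...(merged.properties || {}), ...(part.properties || {}) }
    }), {});
}

const current = {
    title: 'mediawiki/desktop/button/click',
    $id: '/mediawiki/desktop/button/click/1.2.0',
    type: 'object',
    allOf: [{ $ref: '/fragment/common/2.0.0' }],
    properties: { button_name: { type: 'string' } }
};

const materialized = mergeAllOf(dereference(current));
console.log(Object.keys(materialized.properties).join(', '));
// $schema, meta, button_name
```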
Absolute $ref
If the $ref starts with a URI protocol (e.g. http://, https:// or file://), jsonschema-tools will attempt to load it as is. For example, $ref: https://schema.wikimedia.org/repositories/primary/jsonschema/fragment/common/1.0.0 will load the content at that URL.
Relative to baseSchemaUris
jsonschema-tools can be configured (in .jsonschema-tools.yaml) with multiple baseSchemaUris, the default of which is just the schemaBasePath (in our case, ./jsonschema). When a $ref starts with a slash (/), jsonschema-tools will iterate through each of the configured baseSchemaUris, prepend the base URI to the $ref value, and attempt to resolve it. If your configuration has baseSchemaUris: [./jsonschema, https://schema.wikimedia.org/repositories/primary/jsonschema/], jsonschema-tools will look for your $ref path in both of those locations.
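The lookup order described in the two cases above can be sketched as a function that lists the locations to try for a given $ref. The function name candidateUris is hypothetical, and stripping a trailing slash from the base URI before prepending it is an assumption made to avoid double slashes; the actual loading of each candidate is left out.

```javascript
'use strict';

// Hypothetical sketch: list the locations that would be tried for a $ref,
// in order. Actually fetching or reading them is out of scope here.
function candidateUris(ref, baseSchemaUris) {
    // Absolute $ref: starts with a URI protocol, so it is used as is.
    if (/^[a-z][a-z0-9+.-]*:\/\//i.test(ref)) {
        return [ref];
    }
    // $ref starting with '/': prepend each configured base URI in turn.
    if (ref.startsWith('/')) {
        return baseSchemaUris.map((base) => base.replace(/\/+$/, '') + ref);
    }
    // Other forms of $ref are not covered by this sketch.
    return [ref];
}

const bases = [
    './jsonschema',
    'https://schema.wikimedia.org/repositories/primary/jsonschema/'
];
for (const uri of candidateUris('/fragment/common/2.0.0', bases)) {
    console.log(uri);
}
// ./jsonschema/fragment/common/2.0.0
// https://schema.wikimedia.org/repositories/primary/jsonschema/fragment/common/2.0.0
```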
Testing schemas
jsonschema-tools comes with a series of tests that ensure your schema repository is nice and clean. We showed how to install these tests in the section above about Creating a new schema repository. These are mocha tests, so all we need to do is run npm test. These tests will ensure that your schema repository structure is correct, that your schemas have required fields, and that schema versions are backwards compatible.
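To give a feel for what a backwards compatibility check involves, here is a simplified sketch that compares two versions of a schema. The rules used here (no field may be removed and no field may change type) are an assumption chosen for illustration; the authoritative rules are whatever the jsonschema-tools tests enforce. The function name incompatibilities is hypothetical.

```javascript
'use strict';

// Hypothetical sketch: report top-level fields that were removed or changed
// type between two versions of a schema. Adding new fields is allowed.
function incompatibilities(oldSchema, newSchema) {
    const problems = [];
    const oldProps = oldSchema.properties || {};
    const newProps = newSchema.properties || {};
    for (const [name, oldField] of Object.entries(oldProps)) {
        const newField = newProps[name];
        if (newField === undefined) {
            problems.push(`removed field: ${name}`);
        } else if (newField.type !== oldField.type) {
            problems.push(`changed type of ${name}: ${oldField.type} -> ${newField.type}`);
        }
    }
    return problems;
}

const v100 = { properties: { button_name: { type: 'string' } } };
const v110 = { properties: { button_name: { type: 'string' }, user_id: { type: 'string' } } };
const broken = { properties: { button_name: { type: 'integer' } } };

console.log(incompatibilities(v100, v110).length); // 0
console.log(incompatibilities(v100, broken)[0]);
// changed type of button_name: string -> integer
```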