Event Platform/Schemas

From Wikitech

Motivation and Overview

Event Schemas are essential for an Event Streaming Platform. They allow disparate continuously changing producers and consumers to reliably communicate with each other. By explicitly declaring the shape of data, schemas ease integration between various systems.

Schemas should be readily available for any producer or consumer code that might need them. Schemas are needed to validate data, but they can also be used to automate data integration tasks, e.g. auto creation of SQL tables into which events will be imported. Access to those schemas should be reliable, and the schemas should be immutable for any given deployed service.

WMF uses JSON as our preferred in-flight data serialization format, and as such we have chosen* to use JSONSchema for our event schemas. Schema evolution is necessary to be able to reliably upgrade producer and consumer code, but unfortunately, JSONSchema does not have any built-in features for schema evolution. Therefore, each change (even a small one) requires the creation of a totally separate JSONSchema file.

WMF has chosen to distribute schemas using Git. This allows us to do development, CI, versioning and deployment for schemas the same way we do any code project. However, even though we use Git, we do not rely on Git history for schema versioning. Each schema version is an explicit static file in the schema repository. For more background, see RFC: Modern Event Platform: Schema Registry.

To make development of many schema version files in Git easier, WMF has developed the jsonschema-tools library. This tooling makes it easier for developers to design and evolve schemas dynamically while allowing production services to use static and immutable versions of those schemas.

jsonschema-tools will be used in the rest of this documentation to set up and develop schemas in a Git schema repository. Please skim the jsonschema-tools README before proceeding.

jsonschema-tools is a NodeJS module, so you'll need a recent (Node 10 or greater) version of NodeJS and npm installed. You can get NodeJS and npm at nodejs.org. Once installed, cd to the schema repository and run npm install. Heads-up: the full path to the directory cannot contain spaces. For example, ~/Documents/analytics\ engineering/event\ schemas/primary is likely to yield errors, but ~/Documents/analytics-engineering/event-schemas/primary would be fine.

*There are plenty of other schema technologies out there (Avro, Thrift, etc.), but JSON and JSONSchema fit our use cases better than any of those. For more information about how JSONSchema was chosen, see RFC: Modern Event Platform - Choose Schema Tech, https://techblog.wikimedia.org/2020/09/10/wikimedias-event-data-platform-or-json-is-ok-too/ and https://blog.wikimedia.org/2017/01/13/json-hadoop-kafka/.

Event Schema Design Rules and Conventions

Event Platform/Schemas/Guidelines

Schema Repositories

A schema repository is a Git repository with a hierarchy of versioned JSONSchema files, with a file layout something like:

jsonschema
└── analytics
    ├── button
    │   ├── click
    │   │   ├── 1.0.0 -> 1.0.0.yaml
    │   │   ├── 1.0.0.yaml
    │   │   ├── current.yaml
    │   │   └── latest -> 1.0.0
    │   └── release
    │       ├── 1.0.0 -> 1.0.0.yaml
    │       ├── 1.0.0.yaml
    │       ├── 1.0.1 -> 1.0.1.yaml
    │       ├── 1.0.1.yaml
    │       ├── current.yaml
    │       └── latest -> 1.0.1
    └── page_preview
        └── visibility_change
            ├── 1.0.0 -> 1.0.0.yaml
            ├── 1.0.0.yaml
            ├── 2.0.0 -> 2.0.0.yaml
            ├── 2.0.0.yaml
            ├── current.yaml
            └── latest -> 2.0.0

JSONSchema has title and $id fields that we use to associate event data with a schema, as well as for semantically versioning schemas. The actual hierarchy layout shown here is arbitrary, but each schema's title and $id must match the layout in a specific way. More on this below.

Note the 'current.yaml' files. These files represent the current working version of the schema. The current schemas are never themselves used as a schema for validation or data integration. Instead, they are 'materialized' by jsonschema-tools into static versioned schema files. These versioned schema files are the canonical schemas used by event processing systems.

Hierarchy Rules

Each schema's title should match its relative path in the schema repository. E.g. all schema version files in namespace1/entity1/verbB should have title: namespace1/entity1/verbB. Each schema's $id field should be set to the path (starting with /) and (extensionless) version. E.g. namespace1/entity1/verbB/1.0.1.yaml should have $id: /namespace1/entity1/verbB/1.0.1.

This layout, combined with the title and $id, allows event data to point specifically to its schema via relative URIs. By semantically versioning schema files, jsonschema-tools is able to associate schemas with the same title and enforce backwards compatibility. The relative and versioned $id URIs can also be used as JSON $ref links and with JSON Pointers. More on this below as well.
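The hierarchy rules can be expressed as a small check. The following is a minimal sketch, not part of jsonschema-tools: the function names expectedTitleAndId and checkHierarchy are hypothetical, and the input is assumed to be a versioned file path relative to the schema base directory.

```javascript
'use strict';

// Given a versioned schema file path relative to the schema base path
// (e.g. 'namespace1/entity1/verbB/1.0.1.yaml'), derive the title and $id
// that the hierarchy rules call for. Hypothetical helper for illustration.
function expectedTitleAndId(relativePath) {
  const parts = relativePath.split('/');
  const fileName = parts.pop();                            // '1.0.1.yaml'
  const version = fileName.replace(/\.(yaml|json)$/, '');  // '1.0.1'
  const title = parts.join('/');                           // 'namespace1/entity1/verbB'
  return { title, $id: `/${title}/${version}` };
}

// Returns a list of human-readable problems; an empty list means the schema
// object agrees with its location in the repository.
function checkHierarchy(relativePath, schema) {
  const expected = expectedTitleAndId(relativePath);
  const problems = [];
  if (schema.title !== expected.title) {
    problems.push(`title should be ${expected.title}`);
  }
  if (schema.$id !== expected.$id) {
    problems.push(`$id should be ${expected.$id}`);
  }
  return problems;
}

console.log(expectedTitleAndId('namespace1/entity1/verbB/1.0.1.yaml'));
```

In practice the jsonschema-tools tests described later on this page are the place where such rules are enforced; the sketch only illustrates what "matching the layout" means.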

Creating a new schema repository

Most likely you will already be working with a schema repository. If so, skip to Creating a new schema or Modifying schemas.

jsonschema-tools is a NodeJS library and CLI for managing JSONSchema Git repositories. To create a new schema repository, you'll create a package.json file, install and configure jsonschema-tools, and set up jsonschema-tools tests.

mkdir my_schema_repository
cd my_schema_repository
git init .

# Our schemas will go in the jsonschema/ directory
mkdir jsonschema

# Create a configuration file for jsonschema-tools.
echo -e 'schemaBasePath: ./jsonschema/\nlogLevel: info' > .jsonschema-tools.yaml

# Create a package.json file.  (Modify this as desired.)
echo '
{
  "name": "my_schema_repository",
  "scripts": {
    "test": "mocha test/jsonschema",
    "build-modified": "jsonschema-tools materialize-modified --no-git-add",
    "build-new": "jsonschema-tools materialize"
  },
  "devDependencies": {
    "@wikimedia/jsonschema-tools": "^0.6.0",
    "mocha": "^6.2.0"
  }
}
' > package.json

# Install jsonschema-tools.
npm install .

# Install jsonschema-tools tests.
mkdir -p test/jsonschema
echo "
'use strict';
require('@wikimedia/jsonschema-tools').tests.all({ logLevel: 'info' });
" > test/jsonschema/repository.test.js

# Create the first git commit.
echo 'node_modules/' >> .gitignore
git add .
git commit -m 'New schema repository'

Creating a new schema

Once you are working in a repository with jsonschema-tools, you can create new schemas. By 'new schema', we mean a brand new schema lineage, not just a new schema version. To create a new schema, we first need to decide on its title (and hierarchy), create the directory structure, write a new current.yaml schema file, and materialize the schema. For this example, we'll create a new event schema that represents a MediaWiki UI button click.

NOTE: since you will be writing JSONSchema, you should probably know how to do that. See this tutorial and reference for help working with JSONSchema.

mkdir -p jsonschema/mediawiki/desktop/button/click

Open jsonschema/mediawiki/desktop/button/click/current.yaml. We'll build this up piece by piece and explain each part.

Schema metadata

First we need some schema metadata that describe and identify the schema. Note that this schema metadata is not describing any aspect of your event data.

# This is the title of the schema.
# It should match the relative path to this file's parent directory.
title: mediawiki/desktop/button/click

# Document what the schema represents.
description: Mediawiki desktop web button clicked

# The $id uniquely identifies this schema.  It should be a versioned (and extensionless) URI.
$id: /mediawiki/desktop/button/click/1.0.0

# This is the meta-schema of this schema.  This should probably always be the same
# for every schema, and should point to the main JSONSchema meta-schema at json-schema.org.
$schema: https://json-schema.org/draft-07/schema#


Event fields

...continuing on to event data fields. Your event should be a JSON object with each field explicitly declared here.

type: object
additionalProperties: false
properties:

Required event data

In addition to the $schema field, WMF has defined common fields for event data. These common fields give us some consistency across all event data, and are also used to support backend functionality (deduplication, Hive table ingestion, etc.).

$schema

Each event needs to identify its schema. Right now we are just writing the schema, but later on your code will produce JSON event data that conforms to this schema. We need to be able to look up the schema for any given event just from the event data itself. To do this, we re-use the JSONSchema $schema field in the event properties.

  $schema:
    type: string
    description: >
      The URI identifying the JSONSchema for this event. This should be
      a short URI containing only the name and version at the end of the
      URI path.  e.g. /schema_name/1.0.0 is acceptable. This should match
      the schema's $id field.

Timestamps: meta.dt and dt

These timestamps have different semantics, but in most cases they will be very close, if not the same. These are both ISO-8601 UTC datetime strings, e.g. '2020-07-01T00:00:00Z'.

Every event happens at a certain date-time. That event time should be stored in the dt field.

meta.dt can be used as the event ingestion time, i.e. the time at which the intake system has received the event. Depending on the pipeline your event is flowing through, this might be set at different levels. For events that are received first by our intake service (EventGate), EventGate will set it if the client has not already done so.

NOTE: meta.dt will be used as the Kafka timestamp as well as for Hive hourly partitioning. If you don't have strict control over your event producers (e.g. remote browser clients), you should allow EventGate to fill in this field so that you don't end up with incorrect timestamps.

NOTE: As of 2020-12, these meta.dt and dt conventions are not fully adopted in existing schemas, but all new schemas should use these conventions. See T240460 and T267648.
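To make the "set it only if the client has not" behaviour concrete, here is a minimal sketch. It is not EventGate's actual code: withIngestionTime is a hypothetical helper, and the event below is made-up example data.

```javascript
'use strict';

// Hypothetical helper, not EventGate's actual code: sets meta.dt to the
// receive time only when the producer did not already provide one.
function withIngestionTime(event, receivedAt = new Date()) {
  const meta = event.meta || {};
  if (meta.dt) {
    return event; // already set by the client, leave it alone
  }
  // Date#toISOString always yields a UTC ISO-8601 string ending in 'Z'.
  return { ...event, meta: { ...meta, dt: receivedAt.toISOString() } };
}

// Example: the event happened at 00:00:00 and was received 5 seconds later.
const received = new Date(Date.UTC(2020, 6, 1, 0, 0, 5));
const event = {
  $schema: '/mediawiki/desktop/button/click/1.0.0',
  dt: '2020-07-01T00:00:00Z',
  meta: { stream: 'mediawiki.desktop.button-click' }
};

console.log(withIngestionTime(event, received).meta.dt);
// prints 2020-07-01T00:00:05.000Z
```

The helper returns a copy instead of mutating its input, so the original event (with only dt set) stays available for comparison.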

meta.stream

Every event should belong to a named dataset. While events are in flight, this dataset is called a stream of events. Each event needs to specify which stream it belongs to, and it does so using the meta.stream field. A single schema can be shared by several streams: for example, the resource_change schema is re-used in the `mediawiki.resource_change`, `transcludes.resource_change`, `change-prop.retry.resource_change`, etc. streams. Likewise, you might want to design a single generic button_clicked schema for all button clicks, but keep the different types of button click events in different streams. meta.stream is used for routing incoming events to specific streams and downstream 'datasets'. Each distinct meta.stream will correspond with certain Kafka topics and a Hive table. In most cases, the Kafka topic will be the stream name prefixed with the datacenter name where the event was received.

There are a few more common and optional meta fields that WMF defines, but we don't need to explain them all here. For now we will write out just these two example meta fields. Later we will show how to include the event meta schema using $ref.

  ### Metadata object.  All event schemas should have this.
  meta:
    type: object
    properties:
      dt:
        type: string
        # Whenever a format is used on a field, we require that maxLength is also set.
        # See https://github.com/epoberezkin/ajv#security-risks-of-trusted-schemas
        format: date-time
        maxLength: 128
        description: Time stamp of the event, in ISO-8601 format
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
    required:
      - dt
      - stream

Event data fields

Finally we can add any fields that we really want our event to have.

  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked

The new schema

Here is the new schema we just wrote:

title: mediawiki/desktop/button/click
description: Mediawiki desktop web button clicked
$id: /mediawiki/desktop/button/click/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
additionalProperties: false
properties:
  $schema:
    type: string
    description: >
      The URI identifying the JSONSchema for this event. This should be
      a short URI containing only the name and version at the end of the
      URI path.  e.g. /schema_name/1.0.0 is acceptable. This often will
      (and should) match the schema's $id field.
  ### Metadata object.  All event schemas should have this.
  meta:
    type: object
    properties:
      dt:
        type: string
        format: date-time
        maxLength: 128
        description: Time stamp of the event, in ISO-8601 format
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
    required:
      - dt
      - stream
  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked

examples:
  - {"$schema": "/mediawiki/desktop/button/click/1.0.0", "meta": {"dt": "2019-01-01T00:00:00Z", "stream": "mediawiki.desktop.button-click"}, "button_name": "Edit source", "page_title": "Delayed-choice quantum eraser"}

Note the examples. This is optional, but can be nice if you want to give schema readers an example of what you expect event data to look like. Notice how the event's $schema matches exactly the schema's $id.
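The relationship between an example's $schema and the schema's $id can be checked mechanically. This is a minimal sketch with a hypothetical helper name (mismatchedExamples), operating on a schema that has already been parsed into a JavaScript object; it is not a jsonschema-tools API.

```javascript
'use strict';

// Returns the examples whose $schema does not equal the schema's $id.
// Hypothetical helper for illustration only.
function mismatchedExamples(schema) {
  return (schema.examples || []).filter(
    (example) => example.$schema !== schema.$id
  );
}

// The schema from this section, reduced to the fields the check needs.
const schema = {
  title: 'mediawiki/desktop/button/click',
  $id: '/mediawiki/desktop/button/click/1.0.0',
  examples: [
    {
      $schema: '/mediawiki/desktop/button/click/1.0.0',
      meta: { dt: '2019-01-01T00:00:00Z', stream: 'mediawiki.desktop.button-click' },
      button_name: 'Edit source',
      page_title: 'Delayed-choice quantum eraser'
    }
  ]
};

console.log(mismatchedExamples(schema).length); // prints 0
```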

Materializing the schema

jsonschema-tools calls the process of dereferencing, merging and generating the static versioned files 'materializing'. So far, we've saved our new schema as ./jsonschema/mediawiki/desktop/button/click/current.yaml. current.yaml will be the 'current working copy' of a schema. It can contain $ref URI pointers (more on this below). Any changes we make to schemas should always be done on their current.yaml files. We'll use jsonschema-tools to materialize current.yaml into a statically versioned schema file.

WMF's schema repositories are set up with npm scripts to help materialize schemas. (These scripts are just wrappers of the jsonschema-tools CLI).

To materialize a new schema, you'll run npm run build-new:

# materialize the new current.yaml schema
npm run build-new ./jsonschema/mediawiki/desktop/button/click/current.yaml

[2022-03-21 13:57:14.397 +0000]: Dereferencing schema with $id /mediawiki/desktop/button/click/1.0.0 using schema base URIs ./jsonschema/,https://schema.wikimedia.org/repositories/primary/jsonschema/
[2022-03-21 13:57:14.424 +0000]: Materialized schema at jsonschema/mediawiki/desktop/button/click/1.0.0.json.
[2022-03-21 13:57:14.425 +0000]: Materialized schema at jsonschema/mediawiki/desktop/button/click/1.0.0.yaml.
[2022-03-21 13:57:14.425 +0000]: Created latest symlink jsonschema/mediawiki/desktop/button/click/latest.json -> 1.0.0.json.
[2022-03-21 13:57:14.426 +0000]: Created latest symlink jsonschema/mediawiki/desktop/button/click/latest.yaml -> 1.0.0.yaml.
[2022-03-21 13:57:14.426 +0000]: Created extensionless symlink jsonschema/mediawiki/desktop/button/click/1.0.0 -> 1.0.0.yaml.
[2022-03-21 13:57:14.427 +0000]: Created latest symlink jsonschema/mediawiki/desktop/button/click/latest -> 1.0.0.yaml.

# Git add the new current.yaml schema and the materialized schema files.
git add ./jsonschema/mediawiki/desktop/button/click/*
git commit -m 'Created mediawiki/desktop/button/click 1.0.0 schema'

The version to materialize will be obtained from the value of $id in current.yaml. Both yaml and json (by default) files will be materialized, and the versioned extensionless symlink will point to the versioned yaml file (by default).
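As an illustration of how the version and file names follow from $id, the sketch below reproduces the names reported in the log output above. It only mirrors those log lines and does not call jsonschema-tools; materializedNames is a hypothetical function name.

```javascript
'use strict';

// Sketch: list the files and symlinks that the materialize log reports for
// a given schema base path and $id. Hypothetical helper; it derives names
// only and does not touch the file system.
function materializedNames(schemaBasePath, $id) {
  const segments = $id.split('/').filter(Boolean); // drop the leading ''
  const version = segments.pop();                  // last segment is the version
  const dir = [schemaBasePath.replace(/\/+$/, ''), ...segments].join('/');
  return {
    version,
    files: [`${dir}/${version}.yaml`, `${dir}/${version}.json`],
    // symlink path -> target, as shown in the log output
    symlinks: {
      [`${dir}/${version}`]: `${version}.yaml`,
      [`${dir}/latest`]: `${version}.yaml`,
      [`${dir}/latest.yaml`]: `${version}.yaml`,
      [`${dir}/latest.json`]: `${version}.json`
    }
  };
}

const names = materializedNames('jsonschema', '/mediawiki/desktop/button/click/1.0.0');
console.log(names.version);  // prints 1.0.0
console.log(names.files[0]); // prints jsonschema/mediawiki/desktop/button/click/1.0.0.yaml
```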

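Alternatively, you can manually materialize a schema using the jsonschema-tools CLI. See $(npm bin)/jsonschema-tools --help for more information (on newer npm versions, where npm bin is no longer available, npx jsonschema-tools --help should work instead).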

Modifying schemas

Versioned schemas should be (mostly) immutable. Once committed and merged, they may be used by many active producers and consumers. An existing version should not be changed (if you think you need to do it, get in touch with the Analytics or Core Platform Engineering teams). Instead, to modify a schema you should create a new backwards compatible version.

Let's add a user_id to our event data. Edit jsonschema/mediawiki/desktop/button/click/current.yaml and add the following at the bottom of the schema.

# ...
  user_id:
    type: string
    description: ID of the user

# Add a user_id onto our examples field too:
examples:
  - {"$schema": "/mediawiki/desktop/button/click/1.0.0", "meta": {"dt": "2019-01-01T00:00:00Z", "stream": "mediawiki.desktop.button-click"}, "button_name": "Edit source", "page_title": "Delayed-choice quantum eraser", "user_id": "123"}

Since we've changed the schema, we MUST manually change the version in the schema's $id field. According to semantic versioning, our addition of the user_id field should be a minor version increment. The $schema value in examples should be updated as well, so that it keeps matching $id. So change $id to:

$id: /mediawiki/desktop/button/click/1.1.0
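The version bump itself is plain semantic-versioning arithmetic on the last path segment of $id. The following sketch shows that arithmetic; bumpId is a hypothetical helper, and jsonschema-tools still expects you to edit $id in current.yaml by hand as described above.

```javascript
'use strict';

// Bump the semantic version at the end of a schema $id.
// level is 'major', 'minor' or 'patch'. Hypothetical helper for illustration.
function bumpId($id, level) {
  const match = /^(.*)\/(\d+)\.(\d+)\.(\d+)$/.exec($id);
  if (!match) {
    throw new Error(`$id does not end in a semantic version: ${$id}`);
  }
  const [, prefix, major, minor, patch] = match;
  let next;
  if (level === 'major') {
    next = [Number(major) + 1, 0, 0];
  } else if (level === 'minor') {
    next = [Number(major), Number(minor) + 1, 0];
  } else if (level === 'patch') {
    next = [Number(major), Number(minor), Number(patch) + 1];
  } else {
    throw new Error(`unknown level: ${level}`);
  }
  return `${prefix}/${next.join('.')}`;
}

// Adding an optional field is a minor increment.
console.log(bumpId('/mediawiki/desktop/button/click/1.0.0', 'minor'));
// prints /mediawiki/desktop/button/click/1.1.0
```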

npm run build-modified is able to detect any current.yaml files that have been modified by checking their git status. Before staging the modified schema in Git, run this to materialize all modified current.yaml files:

npm run build-modified
> schemas-event-secondary@1.0.0 build-modified /home/user/my_schema_repository
> jsonschema-tools materialize-modified -G

[2022-03-21 13:59:30.321 +0000]: Looking for modified current.yaml schema files in ./jsonschema/
[2022-03-21 13:59:30.380 +0000]: Materializing /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/current.yaml...
[2022-03-21 13:59:30.385 +0000]: Dereferencing schema with $id /mediawiki/desktop/button/click/1.1.0 using schema base URIs ./jsonschema/,https://schema.wikimedia.org/repositories/primary/jsonschema/
[2022-03-21 13:59:30.405 +0000]: Materialized schema at /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.yaml.
[2022-03-21 13:59:30.407 +0000]: Materialized schema at /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.json.
[2022-03-21 13:59:30.409 +0000]: Created latest symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest.yaml -> 1.1.0.yaml.
[2022-03-21 13:59:30.409 +0000]: Created latest symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest.json -> 1.1.0.json.
[2022-03-21 13:59:30.409 +0000]: Created extensionless symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0 -> 1.1.0.yaml.
[2022-03-21 13:59:30.411 +0000]: Created latest symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest -> 1.1.0.yaml.
[2022-03-21 13:59:30.411 +0000]: New schema files have been materialized. Adding them to git: /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.yaml,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest.yaml,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.json,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/latest.json

git add jsonschema/mediawiki/desktop/button/click/*
git commit -m '1.1.0 version of mediawiki/desktop/button/click'

Including sub schemas

When materializing schemas, jsonschema-tools will dereference any $ref pointers and merge any allOf it finds. This allows us to DRY up common subschemas to avoid copy/paste bugs. It also allows us to standardize and reuse common fields, e.g. these MediaWiki entity fragment schemas.

For WMF, all event schemas should have a $schema event field, as well as use a common event meta sub object. The Wikimedia common schema is in the primary schema repository (https://schema.wikimedia.org/#!/primary/jsonschema) at /fragment/common.

In our example schema repository, assume we have a common schema at jsonschema/fragment/common/2.0.0 as:

title: fragment/common
description: Common schema fields for event schemas
$id: /fragment/common/2.0.0
$schema: 'https://json-schema.org/draft-07/schema#'
type: object
additionalProperties: false
required:
  - $schema
  - meta
  - dt
properties:
  $schema:
    description: >
      A URI identifying the JSONSchema for this event. This should match a
      schema's $id in a schema repository. E.g. /schema/title/1.0.0
    type: string
  dt:
    description: >
      ISO-8601 formatted timestamp of when the event occurred/was generated
      (in UTC), AKA 'event time'. This is different than meta.dt, which is
      used as the time the system received this event.
    type: string
    format: date-time
    maxLength: 128
  meta:
    type: object
    required:
      - stream
    properties:
      domain:
        description: Domain the event or entity pertains to
        type: string
        minLength: 1
      dt:
        description: 'Time the event was received by the system, in UTC ISO-8601 format'
        type: string
        format: date-time
        maxLength: 128
      id:
        description: Unique ID of this event
        type: string
      request_id:
        description: Unique ID of the request that caused the event
        type: string
      stream:
        description: Name of the stream (dataset) that this event belongs in
        type: string
        minLength: 1
      uri:
        description: Unique URI identifying the event or entity
        type: string
        format: uri-reference
        maxLength: 8192

We want to include this schema (including its required properties) in our button/click example schema. Let's make a new version of the button/click schema and include the common schema using $ref. Edit jsonschema/mediawiki/desktop/button/click/current.yaml to:

title: mediawiki/desktop/button/click
description: Mediawiki desktop web button clicked
$id: /mediawiki/desktop/button/click/1.2.0
$schema: https://json-schema.org/draft-07/schema#
type: object
allOf:
- $ref: /fragment/common/2.0.0
properties:
  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked
  user_id:
    type: string
    description: ID of the user

examples:
  - {"$schema": "/mediawiki/desktop/button/click/1.2.0", "dt": "2019-01-01T00:00:00Z", "meta": {"dt": "2019-01-01T00:00:00Z", "stream": "mediawiki.desktop.button-click", "id": "12345678-1234-5678-1234-567812345678"}, "button_name": "Edit source", "page_title": "Delayed-choice quantum eraser", "user_id": "123"}

Notice that we've bumped the version number in $id again to 1.2.0. Materialize and commit this new schema version.

npm run build-modified
# ...

git add ./jsonschema/mediawiki/desktop/button/click/*
git commit -m 'Using $ref to common in new version mediawiki/desktop/button/click 1.2.0'
...

The newly materialized ./jsonschema/mediawiki/desktop/button/click/1.2.0.yaml now has both our schema and the included common schema merged together.

How this works

When jsonschema-tools encounters a $ref, it will attempt to resolve it and then replace it with the resolved content. After dereferencing, anything in allOf is merged together with the top level schema fields to create a fully dereferenced and merged schema without any $ref or allOf keywords.
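The two steps (dereference, then merge) can be sketched on a toy scale. This is not the algorithm jsonschema-tools uses internally: the materialize function below is hypothetical, resolves $ref from an in-memory map rather than from files or URLs, merges only properties and required, and assumes that top-level properties win on a name conflict.

```javascript
'use strict';

// Toy materializer: resolves $ref values from an in-memory map, then merges
// the properties and required lists of everything in allOf into the top
// level. Hypothetical sketch; other keywords are not merged.
function materialize(schema, schemasById) {
  const { allOf = [], ...rest } = schema;
  const merged = { ...rest, properties: { ...(rest.properties || {}) } };
  const required = new Set(rest.required || []);

  for (const part of allOf) {
    // Step 1: replace a $ref with the schema it points to.
    const resolved = part.$ref ? schemasById[part.$ref] : part;
    if (!resolved) {
      throw new Error(`cannot resolve ${part.$ref}`);
    }
    // Recurse so that nested $ref/allOf are handled as well.
    const sub = materialize(resolved, schemasById);
    // Step 2: merge. Top-level properties take precedence (an assumption).
    merged.properties = { ...sub.properties, ...merged.properties };
    (sub.required || []).forEach((name) => required.add(name));
  }
  if (required.size > 0) {
    merged.required = [...required];
  }
  return merged;
}

// Reduced versions of the two schemas from this section.
const common = {
  $id: '/fragment/common/2.0.0',
  type: 'object',
  required: ['$schema', 'meta', 'dt'],
  properties: {
    $schema: { type: 'string' },
    dt: { type: 'string' },
    meta: { type: 'object' }
  }
};
const click = {
  $id: '/mediawiki/desktop/button/click/1.2.0',
  type: 'object',
  allOf: [{ $ref: '/fragment/common/2.0.0' }],
  properties: { button_name: { type: 'string' } }
};

const result = materialize(click, { [common.$id]: common });
console.log(Object.keys(result.properties).sort().join(', '));
// prints $schema, button_name, dt, meta
```

The result keeps the including schema's own $id and has no allOf or $ref left, which is the property the materialized files are described as having.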

Absolute $ref

If the $ref starts with a URI protocol (e.g. http://, https:// or file://), jsonschema-tools will attempt to load it as is.

$ref: https://schema.wikimedia.org/repositories/primary/jsonschema/fragment/common/1.0.0 will load the content at that URL.

Relative to baseSchemaUris

jsonschema-tools can be configured (in .jsonschema-tools.yaml) with multiple baseSchemaUris, the default of which is just the schemaBasePath (in our case, ./jsonschema). When a $ref starts with a slash (/), jsonschema-tools will iterate through each of the configured baseSchemaUris, prepend the base URI to the $ref value, and attempt to resolve it. If your baseSchemaUris is [./jsonschema, https://schema.wikimedia.org/repositories/primary/jsonschema/], jsonschema-tools will look for your $ref path in both of those locations.
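The lookup order described here can be sketched as a function that lists the locations to try. This is an illustration only, with a hypothetical function name (candidateUris); it covers just the two cases described on this page, an absolute $ref and a $ref that starts with a slash.

```javascript
'use strict';

// List the locations that would be tried for a $ref, in order.
// Hypothetical helper; assumes ref is either absolute or starts with '/'.
function candidateUris(ref, baseSchemaUris) {
  // Absolute $ref (has a URI scheme such as https:// or file://): use as is.
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(ref)) {
    return [ref];
  }
  // $ref starting with a slash: prepend each configured base URI,
  // avoiding a doubled slash where the base already ends in one.
  return baseSchemaUris.map((base) => base.replace(/\/+$/, '') + ref);
}

const baseSchemaUris = [
  './jsonschema',
  'https://schema.wikimedia.org/repositories/primary/jsonschema/'
];

console.log(candidateUris('/fragment/common/2.0.0', baseSchemaUris).join('\n'));
// prints
// ./jsonschema/fragment/common/2.0.0
// https://schema.wikimedia.org/repositories/primary/jsonschema/fragment/common/2.0.0
```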

Testing schemas

jsonschema-tools comes with a series of tests that ensure your schema repository is nice and clean. We showed how to install these tests in the section above about Creating a New Schema Repository. These are mocha tests, so all we need to do is run npm test. These tests will ensure that your schema repository structure is correct, that your schemas have required fields, and that schema versions are backwards compatible.
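To give a feel for what a backwards compatibility check between two versions might look at, here is a minimal sketch. The rule it applies (a newer version may add properties, but may not remove one or change its type) is an assumption for illustration; the actual rules are whatever the jsonschema-tools tests implement, and incompatibilities is a hypothetical function name.

```javascript
'use strict';

// Returns a list of incompatibilities between two versions of a schema.
// Assumed rule: a newer version may add properties, but may not remove a
// property or change its type. Nested objects are checked recursively.
function incompatibilities(oldSchema, newSchema, path = '') {
  const problems = [];
  const oldProps = oldSchema.properties || {};
  const newProps = newSchema.properties || {};
  for (const [name, oldProp] of Object.entries(oldProps)) {
    const here = path ? `${path}.${name}` : name;
    const newProp = newProps[name];
    if (newProp === undefined) {
      problems.push(`${here} was removed`);
    } else if (newProp.type !== oldProp.type) {
      problems.push(`${here} changed type from ${oldProp.type} to ${newProp.type}`);
    } else if (oldProp.type === 'object') {
      problems.push(...incompatibilities(oldProp, newProp, here));
    }
  }
  return problems;
}

// Reduced 1.0.0 and 1.1.0 versions of the button click schema.
const v100 = {
  properties: {
    button_name: { type: 'string' },
    page_title: { type: 'string' }
  }
};
const v110 = {
  properties: { ...v100.properties, user_id: { type: 'string' } }
};

console.log(incompatibilities(v100, v110).length); // prints 0
console.log(incompatibilities(v110, v100)[0]);     // prints user_id was removed
```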