Event Platform/EventLogging legacy
EventLogging was Wikimedia's original analytics-focused event data system. It used draft-3 JSON Schemas hosted on meta.wikimedia.org to validate incoming events.
Differences from the legacy EventLogging backend
The EventLogging extension was originally built as an all-in-one system for capturing MediaWiki analytics events. It managed schemas, client-side event submission, server-side event validation, and server-side event ingestion (into e.g. MySQL). The Event Platform program was conceived to unify event collection for production and analytics events. EventLogging's tier-2 status and analytics-only focus made it unsuitable to support this unification. Many features of WMF's Event Platform are the same as in the legacy EventLogging system, but are more modular and scalable. From an instrumentation-only perspective, it may not be clear why things have to be different, but there are good engineering reasons for all of these changes.
The EventLogging extension has been repurposed as a MediaWiki instrumentation event producer library only. On-wiki schemas and backend validation are no longer supported by EventLogging.
| | EventLogging legacy | Event Platform |
|---|---|---|
| Schema repositories | EventLogging schemas were stored as centralized wiki pages on metawiki, and all environments (development, beta, production, etc.) had to use this same schema repository. | Event Platform schemas are in decentralized git repositories. (Analytics instrumentation schemas are in the schemas/event/secondary repository. Schema repositories are also readable at https://schema.wikimedia.org/#!/ ) |
| Streams, not schemas | EventLogging schemas were single-use. Each schema corresponded to only one instrumentation, and eventually only one downstream SQL table. | Event Platform schemas are like data types for a dataset. A realtime event dataset is called an 'event stream' (or just 'stream' for short). Each stream must specify its schema, and a schema may be used by multiple streams. |
| Schema versions | EventLogging schema versions were wiki page revisions. Each event specified its schema name and revision. | Event Platform schemas are semantically versioned, and each event declares its schema and version in a $schema URI. |
| Schema compatibility | Each EventLogging schema revision could change the schema in any way, which led to backwards-incompatible changes. | Event Platform schema versions must be backwards compatible; i.e. only adding new optional fields is allowed. |
| Stream config | None. Changes to the way events were emitted (like sampling rate) required a code deployment. | Streams are declared and configured in mediawiki-config and can be modified via a Backport window deployment. |
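To make the stream and versioning concepts concrete, here is a minimal sketch of a migrated legacy event as it might end up stored (the stream, schema, and `button_id` field here are illustrative, not a real instrumentation; the timestamp fields are explained under #Changes below):

```json
{
    "$schema": "/analytics/legacy/legacyschema/1.0.0",
    "meta": {
        "stream": "eventlogging_LegacySchema",
        "dt": "2021-03-01T00:00:30Z"
    },
    "dt": "2021-03-01T00:00:30Z",
    "client_dt": "2021-03-01T00:00:02Z",
    "event": {
        "button_id": "subscribe"
    }
}
```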
FAQ
If you are used to the old EventLogging system with metawiki schemas, the new system probably feels a little unfamiliar. There's plenty of documentation around Event Platform, but sometimes you just want to get things done.
How do I find schemas?
Instrumentation schemas are stored in the schemas/event/secondary git repository on GitLab; analytics instrumentation schemas live in its jsonschema/analytics directory.
You can browse these on GitLab or at schema.wikimedia.org, which is also a simple HTTP API serving the directory of schema files.
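For example, assuming the service mirrors the git directory layout (this URL pattern is an assumption; follow the links at schema.wikimedia.org if it differs), a schema directory can be fetched directly:

```
https://schema.wikimedia.org/repositories/secondary/jsonschema/analytics/legacy/
```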
How do I edit schemas?
Schemas are now stored in git repositories, just like other code. WMF (as of 2021-03) uses GitLab for hosting git repositories and for code review. If you are new to GitLab, you can learn more at https://www.mediawiki.org/wiki/GitLab.
Schemas are now semantically versioned. To ease the task of creating new versions, we use a library called jsonschema-tools to help automate some tedious schema editing tasks. For the most part, you shouldn't have to worry about this. To edit a schema:
- Edit its current.yaml file, e.g. jsonschema/analytics/mediawiki/session_tick/current.yaml, and bump the version in the schema's `$id` (a sketch follows below).
- `git commit` the current.yaml file. New schema version files will be automatically created for you.
- `git push` your change up to GitLab for review.
Once merged, your schema will be automatically deployed to schema.wikimedia.org.
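As a sketch of the edit step above: adding a new optional field while bumping the version from 1.0.0 to 1.1.0 might look like this in current.yaml (the `tick_count` field is hypothetical, and real schemas include more boilerplate such as $refs to shared common fields):

```yaml
title: analytics/mediawiki/session_tick
description: Fired periodically to measure session length
$id: /analytics/mediawiki/session_tick/1.1.0  # bumped from /analytics/mediawiki/session_tick/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
properties:
  # ... existing fields stay untouched; only add new optional fields ...
  tick_count:
    type: integer
    description: Newly added optional field (hypothetical)
```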
See also Event Platform/Schemas and Event Platform/Instrumentation How To#Evolving your schema
How do I create new schemas?
Create a current.yaml file in a directory path that matches the schema's title. E.g. a schema titled analytics/cool_button_click should live at jsonschema/analytics/cool_button_click/current.yaml.
More detailed instructions are available at Event_Platform/Instrumentation_How_To#Creating_a_new_schema.
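A minimal current.yaml skeleton for that hypothetical analytics/cool_button_click schema might look like the following (a sketch only; consult the linked how-to for the required common fields and conventions):

```yaml
title: analytics/cool_button_click
description: Fired when a user clicks the cool button
$id: /analytics/cool_button_click/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
properties:
  button_id:
    type: string
    description: Which button was clicked (hypothetical field)
```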
How do I produce event data?
To produce data to a stream, you must
- Declare a stream in Event Stream Config, as described in Event_Platform/Instrumentation_How_To#Stream_Configuration.
- To produce with the EventLogging extension, you must also register that stream for use by EventLogging, as described in Event_Platform/Instrumentation_How_To#Register_your_stream_for_use_by_EventLogging.
- To write code that produces the data using the EventLogging extension, you'll want to call the `mw.eventLog.submit()` function, as described in Writing MediaWiki instrumentation code using the EventLogging extension.
- NOTE: If you are producing using a 'legacy' schema, i.e. one in jsonschema/analytics/legacy, you will use the old `mw.eventLog.logEvent()` or `mw.track('event.___', ...)` API instead of `mw.eventLog.submit()`. In this case, you must have an entry for your schema in your extension.json file specifying the versioned `$schema` URI to use. The old EventLogging backend expected this to be a metawiki revision; the new one expects a versioned schema URI. Example:
"attributes": {
"EventLogging": {
"Schemas": {
"LegacySchema": "/analytics/legacy/legacyschema/1.0.0",
}
}
}
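With that registration in place, legacy instrumentation code continues to use the old API, and the extension resolves the schema name to its versioned $schema URI behind the scenes (the event field below is hypothetical):

```javascript
// Legacy API: 'LegacySchema' is looked up in the extension.json
// registration above to find its versioned $schema URI.
mw.eventLog.logEvent( 'LegacySchema', {
    button_id: 'subscribe' // hypothetical instrumentation field
} );
```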
How do I query my data?
No changes here. Event data is available in Hive in the `event` database. However, the tables are no longer named after the schema; they are named after the stream. See also Event_Platform/Instrumentation_How_To#Viewing_and_querying_events.
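For example, a migrated legacy stream named eventlogging_LegacySchema could be queried roughly like this (the stream and field names are illustrative, and the partition column conventions are assumed from other event tables):

```sql
-- Tables live in the Hive `event` database and are named after the stream.
SELECT
    e.dt,              -- server-side receive time
    e.event.button_id  -- legacy instrumentation fields are nested in the `event` struct
FROM event.eventlogging_legacyschema e
WHERE e.year = 2021 AND e.month = 3 AND e.day = 1
LIMIT 10;
```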
Migration to Event Platform
In FY2020-2021, the Analytics/Data Engineering team is collaborating with the Product/Data Infrastructure team to migrate these now-'legacy' EventLogging event streams to Event Platform components.
The Phabricator task tracking this migration is https://phabricator.wikimedia.org/T259163.
What does Event Platform 'migration' mean?
This refers specifically to the process of moving legacy EventLogging schemas off of meta.wikimedia.org and having clients POST events to EventGate. Completing this migration will allow us to decommission the single-point-of-failure EventLogging backend service, a brittle Hive ingestion pipeline, and the reliance on metawiki for schema distribution.
The migration should be mostly transparent to you, unless you need to make schema changes. Read more for details.
How does this relate to the Metrics Platform?
Metrics Platform will provide an abstraction on top of Event Platform components that will standardize the way product teams build instrumentations and collect metrics on product usage. When ready, the Product Data Infrastructure team will want to slowly re-instrument products to use Metrics Platform client libraries and schemas.
This legacy EventLogging -> Event Platform migration is separate from that. The process for Metrics Platform re-instrumentation will be the same whether a schema is 'legacy' or a new Event Platform based schema.
Is my schema legacy or not?
If your schema is, or ever was, stored on meta.wikimedia.org, it is a legacy schema.
Is my schema migrated or not?
If your schema is edit-protected on meta.wikimedia.org, or if it exists in the schemas/event/secondary repository in jsonschema/analytics/legacy, it has been migrated to Event Platform.
Changes
Data
Legacy EventLogging data in Hive is 100% compatible with Event Platform. You shouldn't notice any changes to existing data fields in Hive.
Client IP addresses are no longer collected by default
This means that event data in Hive will not be geocoded. If your instrumentation relies on either client IPs or geocoded data, we need to manually include the client_ip field in the migrated schema. During the migration, an engineer will contact the legacy EventLogging schema owner to see if they need this data.
timestamp field semantics
The now-deprecated EventLogging backend collected only a dt field, which was the time at which the backend received the event. Since the EventLogging extension sends batches of events only every 30 seconds, this timestamp could be up to 30 seconds after the event actually happened.
Once a legacy EventLogging stream has been migrated to Event Platform, it will have the following timestamp fields:
- dt - server-side receive time.
- meta.dt - server-side receive time (unless the client explicitly sets this field). This field is used for Hive hourly partitioning.
- client_dt - client-side event timestamp. Since this can be set arbitrarily by clients, there is no restriction on what this value might be. Usually it will be the time at which the event happened, but a misbehaving client could set it to anything, including timestamps in the future.
NOTE: These timestamp field semantics differ from those of non-legacy Event Platform events, where dt is the client-side event timestamp and meta.dt is the server-side receive timestamp. (As of 2020-11, this is still TODO for EventBus based streams.)
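As an annotated sketch, one migrated event could carry all three timestamps like this (the values are invented; note the batching delay between the client and server times):

```javascript
const exampleMigratedEvent = {
    dt: '2021-03-01T00:00:30Z',        // server-side receive time
    meta: {
        dt: '2021-03-01T00:00:30Z'     // server-side receive time; drives Hive hourly partitioning
    },
    client_dt: '2021-03-01T00:00:02Z'  // client-side event time, set (and spoofable) by the client
};
```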
System
Schema Location
The main visible change here is the schema location. Schemas are no longer stored on meta.wikimedia.org. Instead, they are stored in the schemas/event/secondary repository. Migrated legacy EventLogging schemas are in the jsonschema/analytics/legacy directory.
The schemas will look slightly different from what you are used to seeing on meta.wikimedia.org. The old EventLogging backend wrapped all on-wiki schemas with the EventCapsule schema, and Event Platform has some required fields of its own. The EventCapsule fields and the required Event Platform fields are now included directly in migrated schemas.
If you need to make schema changes, you will now do so in the schemas/event/secondary repository. You can read more about how to do this at Event_Platform/Schemas#Modifying_schemas and Event_Platform/Instrumentation_How_To#Evolving.
Automatically augmented event data
See: Event_Platform/Schemas/Guidelines#Automatically_populated_fields.
Backend
Events are now POSTed to an EventGate instance instead of being sent as a URL-encoded GET query parameter. This means that browser clients that don't support JavaScript will not be able to send events.
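As a rough sketch of what the EventLogging extension now does for you (the intake URL below is an assumption for illustration; the extension chooses the real endpoint from configuration):

```javascript
// Events are batched and POSTed as a JSON array to an EventGate intake service.
navigator.sendBeacon(
    'https://intake-analytics.wikimedia.org/v1/events?hasty=true', // assumed endpoint
    JSON.stringify( [ /* events, each declaring its $schema */ ] )
);
```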
Frontend
The main producer of this legacy data is the MediaWiki EventLogging extension. This extension has been modified so that it can produce the legacy data to EventGate via a config switch. The engineers doing this migration will not modify any frontend instrumentation code. If you'd like your instrumentation to fully move to Event Platform, you'll need to create new schemas and instrumentation code that calls the mw.eventLog.submit() function rather than the now-deprecated mw.eventLog.logEvent() function. However, this will result in a totally new event stream and Hive table, i.e. a brand new instrumentation stream. This is not required of any legacy EventLogging event streams, but it is nice if you want to do this. :)
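For reference, new-style instrumentation code looks roughly like this (the stream, schema, and field names are hypothetical):

```javascript
// Produce one event to a stream, declaring its versioned $schema explicitly.
mw.eventLog.submit( 'analytics.cool_button_click', {
    $schema: '/analytics/cool_button_click/1.0.0',
    button_id: 'subscribe'
} );
```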
Other questions?
If you are still confused or have more questions...then we need to know and do a better job at documentation! Please reach out to Andrew Otto (IRC: ottomata, email: otto@wikimedia.org) and Marcel Forns (IRC: mforns, email: mforns@wikimedia.org) with any questions. We're happy to answer and will use your questions to help make this documentation better.