Data Platform/Systems/DataHub

We run an instance of DataHub which acts as a centralized data catalog, intended to facilitate the following:
- Discovery by potential users of the various data stores operated by WMF.
- Documentation of the data structures, formats, access rights, and other associated details.
- Governance of these data stores, including details of retention and sanitization, and recording changes over time.
Accessing DataHub
Frontend
The URL for the web interface for DataHub is: https://datahub.wikimedia.org
Access to this service requires a Wikimedia developer account and is currently limited to members of the wmf or nda LDAP groups. Authentication is performed by the CAS-SSO single sign-on system.
Generalized Metadata Service
The URL for the DataHub Generalized Metadata Service (GMS) is: https://datahub-gms.svc.eqiad.wmnet:30443
The GMS is not public-facing and is only available from our private networks. Authentication has not yet been enabled on this interface, although it is planned.
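As a quick connectivity check from a host on the private network, the GMS can be queried directly; the /config endpoint below is standard in upstream DataHub, but treat the exact path as an assumption:
# sketch: confirm the GMS is reachable and inspect its reported configuration
curl -s https://datahub-gms.svc.eqiad.wmnet:30443/config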
Via the CLI
The datahub CLI can be used to interact with the DataHub API from one of the stat hosts. To do this, ssh to one of these hosts, such as stat1004.eqiad.wmnet, and run the following commands to install datahub (skip them if you already have it installed):
cat << EOF > ~/.datahubenv
gms:
  server: https://datahub-gms.svc.eqiad.wmnet:30443
  token: ''
EOF
set_proxy  # configure the HTTP proxy so that pip can reach PyPI
source /opt/conda-analytics/etc/profile.d/conda.sh
conda-analytics-clone datahub-env
conda activate datahub-env
pip install acryl-datahub
Once you have acryl-datahub installed in your activated conda environment, run the following commands to use it:
export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
datahub get --urn 'urn:li:dataset:(urn:li:dataPlatform:kafka,MetadataChangeEvent_v4,PROD)' # should work!
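The same URN pattern applies to other platforms; as a hypothetical example (the table name below is only a placeholder), a Hive table could be fetched like this:
# hypothetical example: fetch a Hive table by URN (substitute a real database.table name)
datahub get --urn 'urn:li:dataset:(urn:li:dataPlatform:hive,wmf.webrequest,PROD)'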
Accessing the Staging Instance
The staging instance is accessible via https://datahub.wikimedia.org
Service Overview
DataHub Components
The DataHub instance is composed of several components, each built from the same codebase:
- a metadata server (or GMS)
- a frontend web application
- an MCE consumer (metadata change event)
- an MAE consumer (metadata audit event)
All of these components are stateless and currently run on the Wikikube Kubernetes clusters.
Their containers are built using the Deployment pipeline, and the configuration for this is kept in the wmf branch of our fork of the datahub repository.
Backend Data Tiers
The stateful components of the system are:
- a MariaDB database on the analytics-meta database instance
- an OpenSearch cluster running on three VMs named datahubsearch100[1-3].eqiad.wmnet
- an instance of Karapace, which acts as a schema registry
- a number of Kafka topics
Our OpenSearch cluster fulfils two roles:
- a search index
- a graph database
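For a rough health check of this cluster (port 9200 and plain HTTP are assumptions; the nodes may be configured differently), something like the following can be run from a private-network host:
# sketch: query OpenSearch cluster health on one of the datahubsearch nodes
curl -s 'http://datahubsearch1001.eqiad.wmnet:9200/_cluster/health?pretty'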
The design document for the DataHub service is available (restricted to WMF staff).
We previously carried out a Data Catalog Application Evaluation, after which the decision was taken to use DataHub and to implement an MVP deployment.
Metadata Sources
We have several key sources of metadata.
Ingestion
Currently ingestion can be performed by any machine on our private networks, including the stats servers.
Automated Ingestion
We automatically ingest some metadata using Airflow. This includes Hive (every database is ingested separately), Druid, and Kafka Jumbo. We also ingest the Event Platform as its own data platform.
Manual Ingestion Example
The following procedure should help to get started with manual ingestion.
- Select a stats server for your use.
- Activate a conda environment.
- Configure the HTTP proxy servers in your shell (run set_proxy).
- Install the necessary Python modules:
# it's very important to install the same CLI version as the server that's running,
# otherwise ingestion will not work
pip install acryl-datahub==0.10.4
datahub version
datahub init
# when prompted by datahub init, enter the GMS server:
server: https://datahub-gms.svc.eqiad.wmnet:30443
Then create a recipe file and install any additional plugins required. Run the ingestion with the CA certificate bundle set in the environment:
REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt datahub ingest -c recipe.yaml
Some example recipes, including Hive, Kafka, and Druid, are available on this ticket.
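For orientation only, a minimal recipe might look roughly like the following (the Kafka broker shown is an assumption; prefer the working recipes on the ticket above), written out with the same heredoc pattern used earlier:
cat << EOF > recipe.yaml
source:
  type: kafka
  config:
    connection:
      bootstrap: kafka-jumbo1007.eqiad.wmnet:9092   # assumption: any Kafka Jumbo broker
sink:
  type: datahub-rest
  config:
    server: https://datahub-gms.svc.eqiad.wmnet:30443
EOF
# then run the ingest command shown above with REQUESTS_CA_BUNDLE set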
Data Lineage
We have incomplete lineage for some ingested systems.
Airflow
Airflow DAGs currently do not have lineage.
DatahubLineageEmitterOperator
We have a custom Airflow operator to emit lineage between datasets that are defined in datasets.yaml.
Event Platform
There is partial lineage reflecting the connection between Kafka topics and their associated event streams. This is automatically created when ingesting Kafka Jumbo, by matching each topic with its event stream configuration and JSON schemas.
Spark
We use DataHub's Spark integration to create column-level lineage between hive tables. Since we run Spark jobs via Airflow, we have a parameter on the Spark operator to automatically set the variables needed to enable it. See Airflow's Spark DataHub Lineage section for developer docs.
Configuration is set on an instance-by-instance basis. An example of this can be seen in dag_config.py.
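For illustration, outside of Airflow the same listener can be enabled on a plain spark-submit with settings along the following lines. The package coordinates, version placeholder, and script name are assumptions, and this sketch uses the REST transport for simplicity even though our setup appears to emit via Kafka (see the gotchas below); the Airflow operator parameter sets the real values for you.
# sketch: enable the DataHub lineage listener on an ad-hoc Spark job
spark-submit \
  --packages io.acryl:datahub-spark-lineage:<version> \
  --conf spark.extraListeners=datahub.spark.DatahubSparkListener \
  --conf spark.datahub.rest.server=https://datahub-gms.svc.eqiad.wmnet:30443 \
  my_job.py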
A successful lineage run will generate two entities in DataHub: a Spark pipeline and a Spark task belonging to the pipeline. A Spark pipeline represents the entire Spark job and has an appId. A Spark task is created per unique Spark query execution and has a jobId. Lineage is attached to the task, not the pipeline.
Gotchas to keep in mind
- Because the integration is implemented as a Spark listener, it fails silently: errors in the listener do not fail the Spark job, so missing lineage can go unnoticed.
- Because it is a Spark listener, where it executes depends on the driver. Since it does not know where it is running, it must somehow get the correct Kafka URL from Airflow before the Spark operator is executed. This is currently an open issue.
- Note that this is purely Spark and not aware of Airflow. An Airflow DAG with multiple consecutive Spark operators with lineage enabled will generate separate Spark pipelines in DataHub.
- It assigns Hive-like tables to the Hive data platform. It's not tested, but that might mean that Spark jobs that touch Druid/Cassandra will generate incorrect lineage.
- Until we upgrade Spark/Iceberg, it does not work with Spark jobs that touch Iceberg tables.
Incomplete list of systems with no lineage coverage
- Airflow (the DAGs themselves) (?)
- Gobblin
- Sqoop
- Flink
- Spark with Iceberg tables
Operations
Administration
https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/DataHub/Administration