Analytics/Archive/Oozie
Apache Oozie is a workflow scheduler for Apache Hadoop jobs.
It has some fancy features: most relevantly, jobs may be scheduled based on the existence of data in the cluster. This allows a job to be triggered not only by a timestamp, but also by the availability of the data it needs.
You can access Oozie via the oozie command line client or via Hue at hue.wikimedia.org/hue/jobbrowser/.
As of May 2021, we are running Oozie 4.3.0; the docs for this version are at oozie.apache.org/docs/4.3.0/.
Key terms
- action
- An action generally represents a single step in a job workflow. Examples include pig scripts, failure notifications, map-reduce jobs, etc.
- workflow
- Workflows are used to chain actions together. A workflow is synonymous with a job. Workflows describe which actions should run and how control flows between them. Actions can be chained based on success and failure conditions.
- coordinator
- Coordinators are used to schedule recurring runs of workflows. They can abstractly describe input and output datasets based on periodicity. Coordinators will submit workflow jobs based on the existence of data.
- bundle
- A bundle is a logical grouping of coordinators that share commonalities. You can use bundles to start and stop whole groups of coordinators at once.
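For illustration, a bundle definition is an XML file that looks roughly like the following sketch; the bundle name, coordinator names, app path property and the webrequest_source values here are made-up examples, not an actual refinery job:

<bundle-app name="demo-bundle" xmlns="uri:oozie:bundle:0.2">
    <!-- Each <coordinator> entry points at a coordinator.xml in HDFS;
         starting or killing the bundle starts or kills all of them. -->
    <coordinator name="demo-coord-text">
        <app-path>${coordinator_file}</app-path>
        <configuration>
            <property>
                <name>webrequest_source</name>
                <value>text</value>
            </property>
        </configuration>
    </coordinator>
    <coordinator name="demo-coord-upload">
        <app-path>${coordinator_file}</app-path>
        <configuration>
            <property>
                <name>webrequest_source</name>
                <value>upload</value>
            </property>
        </configuration>
    </coordinator>
</bundle-app>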
Description
Let's run through a simple example of how to set up runs of an Oozie job on the cluster. Keep in mind that this is just an example to get you started, and that more steps than the ones outlined below are needed to get a job running according to our production standards.
In our example we will use Oozie to run a job that executes a parameterized Hive query. Note that we assume you have access to stat1004.eqiad.wmnet, from which we normally access the Analytics Cluster.
Something to know is that Oozie overrides the Hive default and forces it to use only 1 reducer for its map-reduce jobs. If you want to let Hive decide the number of reducers to use (the default behavior), explicitly set mapred.reduce.tasks to -1 (see the workflow.xml file below).
Basic example
The job we are going to set up uses just a workflow; this means that, in the absence of a coordinator, we will run it by hand using the Oozie command line interface (CLI). There is a lot you can do through Oozie's CLI; please take a look at Oozie's CLI docs.
There are three files needed:
- A file with the Hive query.
- A workflow.xml file that tells Oozie what job to run.
- A workflow.properties file that sets concrete values for the properties defined in the workflow.
Both workflow.xml and the Hive query need to be available in HDFS, so we will put them into the HDFS /tmp directory and the Oozie job will run from there.
Hive query
Note the placeholders for parameters, and replace <user> with your username.
DROP VIEW IF EXISTS <user>_oozie_test;

CREATE VIEW <user>_oozie_test AS
SELECT
    CASE WHEN user_agent LIKE('%iPhone%') THEN 'iOS'
        ELSE 'Android' END AS platform,
    parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID') AS uuid
FROM ${source_table}
WHERE year=${year}
    AND month=${month}
    AND day=${day}
    AND hour=${hour};

-- Now get a count of totals that will be inserted in some file
INSERT OVERWRITE DIRECTORY "${destination_directory}"
SELECT platform, COUNT(DISTINCT(uuid))
FROM <user>_oozie_test
GROUP BY platform;
Workflow.xml and workflow.properties
<workflow-app name="cmd-param-demo" xmlns="uri:oozie:workflow:0.4">

    <parameters>
        <property>
            <name>queue_name</name>
            <value>default</value>
        </property>

        <!-- Required properties -->
        <property><name>name_node</name></property>
        <property><name>job_tracker</name></property>
        <property>
            <name>hive_site_xml</name>
            <description>hive-site.xml file path in HDFS</description>
        </property>

        <!-- specifying parameter values in file to test running -->
        <property>
            <name>source_table</name>
            <description>Hive table to read data from.</description>
        </property>
        <property>
            <name>year</name>
            <description>The partition's year</description>
        </property>
        <property>
            <name>month</name>
            <description>The partition's month</description>
        </property>
        <property>
            <name>day</name>
            <description>The partition's day</description>
        </property>
        <property>
            <name>hour</name>
            <description>The partition's hour</description>
        </property>
    </parameters>

    <start to="hive-demo"/>

    <action name="hive-demo">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${job_tracker}</job-tracker>
            <name-node>${name_node}</name-node>
            <job-xml>${hive_site_xml}</job-xml>
            <configuration>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>${queue_name}</value>
                </property>
                <property>
                    <name>hive.exec.scratchdir</name>
                    <value>/tmp/hive-${user}</value>
                </property>
                <!-- Let hive decide on the number of reducers -->
                <property>
                    <name>mapred.reduce.tasks</name>
                    <value>-1</value>
                </property>
            </configuration>
            <script>generate_daily_uniques.hql</script>
            <param>source_table=${source_table}</param>
            <param>destination_directory=/tmp/test-mobile-apps/${wf:id()}</param>
            <param>year=${year}</param>
            <param>month=${month}</param>
            <param>day=${day}</param>
            <param>hour=${hour}</param>
        </hive>
        <ok to="end"/>
        <error to="kill"/>
    </action>

    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name="end"/>

</workflow-app>
workflow.properties:
@stat1004:~/workplace/refinery/oozie/mobile-apps/generate_daily_uniques$ more workflow.properties
name_node       = hdfs://analytics-hadoop
job_tracker     = resourcemanager.analytics.eqiad.wmnet:8032
queue_name      = default
oozie_directory = ${name_node}/wmf/refinery/current/oozie

# for testing locally, this won't work:
# hive_site_xml = ${oozie_directory}/util/hive/hive-site.xml
hive_site_xml   = ${refinery_directory}/oozie/util/hive/hive-site.xml

# Workflow app to run.
oozie.wf.application.path         = hdfs://analytics-hadoop/tmp/tests-<some>/workflow.xml
oozie.use.system.libpath          = true
oozie.action.external.stats.write = true

# parameters
source_table = wmf_raw.webrequest
year  = 2014
month = 11
day   = 20
hour  = 10
user  = <your-user-on-stat1007>
Validating workflow.xml
After creating the files, make sure they are valid according to Oozie's schema:
oozie validate workflow.xml
Moving files to HDFS
The easiest place to put the files is the /tmp directory. Copy both workflow.xml and the Hive query file there.
hdfs dfs -mkdir /tmp/tests-$USER
hdfs dfs -put workflow.xml /tmp/tests-$USER/workflow.xml
hdfs dfs -put generate_daily_uniques.hql /tmp/tests-$USER/generate_daily_uniques.hql
hdfs dfs -cat /tmp/tests-$USER/workflow.xml
Running the Oozie job
From your local directory (where you have workflow.properties) run:
oozie job -config workflow.properties -run
In production, here is an example of a command to launch a job from `an-launcher1002`:
sudo -u analytics kerberos-run-command analytics \
  oozie job --oozie $OOZIE_URL \
  -Drefinery_directory=hdfs://analytics-hadoop$(hdfs dfs -ls -d /wmf/refinery/$(date +%Y)* | tail -n 1 | awk '{print $NF}') \
  -Dqueue_name=production \
  -Dstart_time=2022-02-21T17:00Z \
  -config /srv/deployment/analytics/refinery/oozie/aqs/hourly/coordinator.properties \
  -run
Running your job, say, once a day
In order to use Oozie's crontab-like scheduling you will need a coordinator file; see Oozie's documentation on coordinators, and the sketch below.
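As a minimal sketch (not one of our production jobs), a coordinator that runs the workflow above once a day could look like the following; workflow_file, start_time and stop_time are hypothetical properties you would set in your .properties file, and recent Oozie versions also accept a cron-style expression as the frequency:

<coordinator-app name="cmd-param-demo-coord"
                 frequency="${coord:days(1)}"
                 start="${start_time}" end="${stop_time}" timezone="Universal"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- workflow_file points at the workflow.xml uploaded to HDFS -->
            <app-path>${workflow_file}</app-path>
            <configuration>
                <!-- Derive the partition parameters from the coordinator's nominal run time -->
                <property>
                    <name>year</name>
                    <value>${coord:formatTime(coord:nominalTime(), 'y')}</value>
                </property>
                <property>
                    <name>month</name>
                    <value>${coord:formatTime(coord:nominalTime(), 'M')}</value>
                </property>
                <property>
                    <name>day</name>
                    <value>${coord:formatTime(coord:nominalTime(), 'd')}</value>
                </property>
                <property>
                    <name>hour</name>
                    <value>${coord:formatTime(coord:nominalTime(), 'H')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>

To submit it, point oozie.coord.application.path (instead of oozie.wf.application.path) at this coordinator.xml in your properties file; the remaining workflow properties (name_node, job_tracker, hive_site_xml, source_table) still come from there.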
Advanced example
The main difference from the basic example is that in this case we will likely need to override the oozie directory. This testing needs to be done from stat1004/stat1007. Let's assume that the refinery code you want to test is deployed to ~/workplace/refinery/oozie.
Rsync code to stat1004/stat1007
rsync -rva --delete ./oozie/ stat1007.eqiad.wmnet:~/oozie
Create tables if needed
If your Oozie job accesses new tables you will need to create them in your own database; remember you are working in your user space:
hive -f blah.hql --database nuria
Put your oozie directory on HDFS
Make sure you are not spamming anyone with error e-mails; substitute any mentions of analytics-alerts in the workflow.xml files:
find . -type f -exec sed -i 's/analytics-alerts@wikimedia.org/your-email@wikimedia.org/' {} \;

nuria@stat1004:~/some$ ls oozie
hdfs dfs -rm -r /tmp/oozie-nuria ; hdfs dfs -mkdir /tmp/oozie-nuria; hdfs dfs -put oozie/ /tmp/oozie-nuria
Also, note that it is easy to overwrite Oozie files on HDFS, for two main reasons: put -f overwrites existing files, and if you work from your local refinery folder while copying to an oozie folder in your HDFS home, the relative oozie paths are the same locally and on HDFS, which makes path autocompletion really easy.
- Finally, when you want to test a job:
  - Update your local repo to the patch you want (click download in the upper-right Gerrit window, get the command you need, and copy/paste it while in your local refinery folder).
  - Update your HDFS folder, replacing (or creating) only the oozie subfolder containing the jobs you want to test. In our example, let's imagine it is oozie/pageview/hourly (99% of the time, given our conventions, testing one Oozie job means updates in one and only one oozie subfolder):
    hdfs dfs -put -f oozie/pageview/hourly oozie/pageview
  - When launching your job for a test, add -Doozie_directory=hdfs://analytics-hadoop/user/YOUR_LOGIN/oozie to your command line. Since by convention we define the oozie root folder as oozie_directory in every job we have, Oozie will use your HDFS folder (in which you have put your patch!) as the base for the job :)
Run the Oozie job, overriding properties as needed
Note that we are overriding the refinery_directory variable:
oozie job -run \
  -Duser=nuria \
  -Darchive_directory=hdfs://analytics-hadoop/tmp/nuria \
  -Doozie_directory=/tmp/oozie-nuria/oozie \
  -config ./oozie/pageview/hourly/coordinator.properties \
  -Dstart_time=2015-09-03T00:00Z \
  -Dstop_time=2015-09-03T04:00Z \
  -Drefinery_directory=hdfs://analytics-hadoop$(hdfs dfs -ls -d /wmf/refinery/2015* | tail -n 1 | awk '{print $NF}')
Datasets
When defining datasets, it's important to understand how the initial-instance property works. If a dataset sets it from a coordinator parameter, like ${start_time}, then it won't let you reference data *before* that point in time. So it is better to hard-code the initial-instance to the first date for which the data is available. It is OK if older data is deleted on an ongoing basis: the job should be careful to request only existing data, but the dataset can keep the initial-instance definition to allow going back as far as needed.
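For illustration, a dataset with a hard-coded initial-instance could look like the following sketch; the dataset name and uri-template path are hypothetical, not one of our production datasets:

<datasets>
    <dataset name="demo_hourly_dataset"
             frequency="${coord:hours(1)}"
             initial-instance="2014-04-01T00:00Z"
             timezone="Universal">
        <!-- ${YEAR}/${MONTH}/${DAY}/${HOUR} are resolved by Oozie for each dataset instance -->
        <uri-template>${name_node}/wmf/data/demo/hourly/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
        <done-flag>_SUCCESS</done-flag>
    </dataset>
</datasets>

A coordinator can then reference instances of this dataset as input events, and Oozie will wait for the corresponding directories (and their _SUCCESS done-flag) to exist before launching the workflow.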
Troubleshooting
No logs
Setting up debug mode might work:
https://oozie.apache.org/docs/4.0.1/DG_CommandLineTool.html#Debug_Mode
Checking logs
oozie job -log <job_id>
Seeing the last-run jobs
oozie jobs -localtime -len 2
Investigating a failed job
First you need to get the Hadoop ID of the failed job, which is different from the Oozie ID:
oozie job -info <oozie-job-id>
The output will include an "Ext ID" column, which contains the Hadoop ID (prefixed with job_).
Job ID : 0005783-141210154539499-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : cmd-param-demo
App Path      : hdfs://analytics-hadoop/tmp/tests-mobile-apps/workflow.xml
Status        : KILLED
Run           : 0
.....
CoordAction ID: -

Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                              Status    Ext ID                     Ext Status     Err Code
------------------------------------------------------------------------------------------------------------------------------------
0005783-141210154539499-oozie-oozi-W@:start:    OK        -                          OK             -
------------------------------------------------------------------------------------------------------------------------------------
0005783-141210154539499-oozie-oozi-W@hive-demo  ERROR     job_1415917009743_45854    FAILED/KILLED  40000
------------------------------------------------------------------------------------------------------------------------------------
0005783-141210154539499-oozie-oozi-W@kill       OK        -                          OK             E0729
------------------------------------------------------------------------------------------------------------------------------------
So in this case, the Hadoop job ID is 1415917009743_45854.
You can then run the following command to get the full, extremely detailed job logs:
yarn logs --applicationId application_1415917009743_45854
Also, you can use the web GUI: https://yarn.wikimedia.org/cluster/scheduler
Killing a job
Get job id from following command:
oozie jobs -filter status=RUNNING -jobtype=coord | grep <job-name>
oozie job -kill <id>
Note that the previous command uses job (singular); if you instead run oozie jobs -kill <id>, you will get an error, as the jobs (plural) subcommand is meant for killing multiple jobs at once.
An example in production, on `an-launcher1002`:
oozie jobs -jobtype coord | grep <your-job>
sudo -u analytics kerberos-run-command analytics \
  oozie job --oozie $OOZIE_URL \
  -kill <your-job-id>
Check how the cluster is doing
You can see how the queues are being utilized here:
http://localhost:8088/cluster/scheduler
You'll have to set up the SSH tunnel as specified in Analytics/Cluster/Access.
Check the Oozie version
When consulting Oozie's documentation, it can be helpful to know which version the cluster is using. You can get it by running this command on one of the Analytics clients:
$ oozie admin --version
"Oozie URL is not available"
If you get a "java.lang.IllegalArgumentException: Oozie URL is not available neither in command option or in the environment" error when trying to run an Oozie command, it's probably because your OOZIE_URL
environmental variable is not set (this is likely to be the case in a Jupyter virtual environment). As of January 2021, the URL is http://an-coord1001.eqiad.wmnet:11000/oozie
. You can either provide this when running the Oozie command, using the -oozie
argument, or manually set it as an environmental variable.
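For example, using the URL above:

# One-off: pass the URL explicitly with the -oozie argument
oozie admin -oozie http://an-coord1001.eqiad.wmnet:11000/oozie --version

# Or set it once for the whole shell session
export OOZIE_URL=http://an-coord1001.eqiad.wmnet:11000/oozie
oozie admin --version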
More info
Naming Convention
Refinery Oozie jobs follow a naming convention that allows automating job restarts (see below). The convention is based on the folders in which job configuration files are stored. The base folder for the convention is the oozie folder in the refinery git repo.
- Each refinery top-level job (no parent job) is named after its folder hierarchy, with - instead of /, and with either -bundle or -coord as a suffix. For instance, the job defined by webrequest/load/bundle.properties is named webrequest-load-bundle.
- Child jobs follow the same pattern, except they are suffixed with parameter information. For instance, the coordinator jobs launched by webrequest-load-bundle are webrequest-load-coord-upload, webrequest-load-coord-text, webrequest-load-coord-maps and webrequest-load-coord-misc.
How to see jobs scheduled to run
On stat1004, run

oozie jobs -jobtype bundle -filter status=RUNNING

to see the 100 most recent RUNNING bundles.

On stat1004, run

oozie jobs -jobtype coordinator -filter status=RUNNING

to see the 100 most recent RUNNING coordinators.

On stat1004, run

oozie jobs -jobtype wf -filter status=RUNNING

to see the 100 most recent RUNNING workflows.
To consider more than 100 jobs, add a -len option at the end, like -len 2000 to get the 2000 most recent ones. To not limit the output to RUNNING jobs, drop -filter status=RUNNING from the command.
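For example, to list the 2000 most recent coordinators regardless of status:

oozie jobs -jobtype coordinator -len 2000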
The source for the refinery's Oozie production jobs can be found at https://phabricator.wikimedia.org/diffusion/ANRE/browse/master/oozie.
In the cluster, Oozie's job definitions can be found under /wmf/refinery/....
Never deploy jobs from /wmf/refinery/current/...; always use one of the refinery variants that have a concrete time and commit in the directory name, like /wmf/refinery/2015-01-09T12.39.20Z--2007cb8.
Administration
Documentation intended for analytics team on how to restart jobs in the cluster and such: Analytics/Cluster/Oozie/Administration