
Analytics/Archive/AQS - DataStore

This page contains historical information. It may be outdated or unreliable.

Alpha (?) Version

Analytics/PageviewAPI/RESTBase


Load Test

TL;DR

We dropped hourly resolution and went to daily with Cassandra.

It is simply not possible to store hourly resolution with the number of dimensions we have; it represents too much data (with 3 nodes of 10T each, we would only have had room for between 2 and 3 months' worth of data). In addition, we have been told that Cassandra has serious problems when the storage of one box goes over 800G (we have not actually experienced this ourselves, since we never went over that amount of data). Also, from a load perspective, the parallel loading jobs fail frequently, which makes backfilling very slow.
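As a very rough sketch of that capacity math (illustrative only: the usable-disk fraction below is an assumption, and the ~140G/day figure comes from the first Cassandra trial further down):

 # Back-of-the-envelope retention estimate; the inputs are assumptions and
 # trial observations, not a measured limit.
 nodes = 3
 disk_per_node_tb = 10
 usable_fraction = 0.5           # assume roughly half of each disk kept free for compaction headroom
 hourly_data_per_day_gb = 140    # observed in the first Cassandra trial below
 
 usable_gb = nodes * disk_per_node_tb * 1024 * usable_fraction
 days = usable_gb / hourly_data_per_day_gb
 print("~%d days (~%.1f months) of hourly data" % (days, days / 30.0))
 # ~109 days (~3.7 months) under these assumptions; less headroom per node or a
 # bigger per-day footprint brings this down to the 2-3 months quoted above.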

We tested sizes for hourly resolution in both Druid and Elasticsearch. Neither system requires precomputation of dimensions, so storage is more efficient, and they come out roughly even on space: Druid can store one day of hourly resolution (a 35G file) in 15G, Elasticsearch in 22G. It was noticeably faster to load data into Elasticsearch. Right now, data is loaded into Druid using a local Hadoop cluster; using a real Hadoop cluster, our loading time of 40 hours for 1 day could probably be reduced to 1 hour (estimate by the Druid team).

There are additional considerations when it comes to the backend solution: Druid can be set up out of the box with a UI ready to be queried by internal clients, whereas if we use Elasticsearch we will need to build a query UI. This is a secondary concern for the PageviewAPI, as so far we only provide an endpoint.

Next steps

The PageviewAPI will sit on top of Cassandra for the next few months.

We will be testing a Druid cluster on production hardware and, based on those results, decide whether to migrate the PageviewAPI storage from Cassandra to Druid, and at which resolution.

See Phabricator task https://phabricator.wikimedia.org/T117115 (Cassandra moving to multi-instance, which avoids the 800G limit).

Cassandra

* Boxes: 3 x Dell PowerEdge R720 (12 cores, 48G RAM, 12 x 2T disks, see Analytics/Cluster/Hardware)
* Installation: Puppetization using the original RESTBase setup
Results
  • First trial - Load using Hadoop, hourly and daily precomputed data, 1 row per value set (project, article, timestamp, access, agent, view_count)
    • Data size: ~2.4G (1 hour) * 24 + 8G (1 day) = ~65G daily (24 * 240M (hour) + 800M (day) = 6.5G gzipped)
    • Parallel load - 2 * hourly + 1 daily at a time - Too much for Cassandra; regular job failures and Cassandra restarts needed
    • Data size in Cassandra: ~1.4T after ~10 days - 140G / day - Unexpectedly large, probably due to Cassandra storing column names with every value (even with compression at 28%)
  • Second trial - Load using Hadoop, hourly and daily precomputed data, 1 row per article (project, article, timestamp, [16 columns based on precomputation over the agent and access dimensions; see the sketch after this list])
    • Data size in cassandra: ~70G / day - Better but still too big to keep hourly data.
  • Third trial - Load using Hadoop, daily-only precomputed data, 1 row per article (project, article, timestamp, [16 columns based on precomputation over the agent and access dimensions])
    • Data size in cassandra: ~17G / day
    • Parallel loading - 1 day at a time - No more problems
    • Replication factor downsized from 3 to 2
    • Real size on disk for 27 days: [TBD - Waiting for replication to take effect]
    • This setup buys us [TBD - Waiting for replication to take effect]
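For illustration, the "16 columns" in the second and third trials come from precomputing a view count for every combination of the access and agent breakdowns, including an "all" rollup for each. A minimal sketch of how those column names could be generated (the exact value sets below are assumptions, not the real schema):

 from itertools import product
 
 # Assumed dimension values; the real tables may use different names or values.
 access_values = ["all-access", "desktop", "mobile-app", "mobile-web"]
 agent_values = ["all-agents", "user", "spider", "bot"]
 
 # One precomputed view-count column per (access, agent) combination: 4 x 4 = 16.
 columns = ["{0}_{1}".format(a, g).replace("-", "_")
            for a, g in product(access_values, agent_values)]
 print(len(columns))   # 16
 print(columns[:3])    # e.g. ['all_access_all_agents', 'all_access_user', 'all_access_spider']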
Querying
  • During second trial
    • Requests time out during loading when not using a specific primary key (see the query sketch after this list).
    • Cassandra old hourly: 101ms, 130ms, 355ms
    • Cassandra old daily: 222ms, 102ms
    • Cassandra new hourly: 0.9 seconds on average (when run through cqlsh; the numbers above are probably closer to reality)
    • Cassandra new daily: 0.9 seconds on average (when run through cqlsh; the numbers above are probably closer to reality)
  • During third trial, after backfilling (no load)
    • Queries take about 300ms on average (between 200ms and 400ms, 20+ tests).
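To show what "using a specific primary key" means in practice, here is a minimal read with the DataStax Python driver (host, keyspace, table and column names are hypothetical, not the actual AQS schema):

 from cassandra.cluster import Cluster
 
 # Hypothetical host/keyspace/table names, for illustration only.
 session = Cluster(["aqs-test.eqiad.wmflabs"]).connect("pageviews")
 
 # Fast: restricts on the full partition key, so Cassandra reads a single partition.
 rows = session.execute(
     "SELECT * FROM per_article WHERE project = %s AND article = %s AND timestamp = %s",
     ("en.wikipedia", "Main_Page", "2015101400"))
 
 # Slow / timeout-prone while loading: no partition key restriction means scanning
 # the whole cluster (and would normally require ALLOW FILTERING).
 # rows = session.execute("SELECT * FROM per_article WHERE timestamp = '2015101400' ALLOW FILTERING")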

The test consisted of loading 1 full day of hourly data in JSON format. The data file is about 35G. In order to manipulate the file, we rsynced it from prod to a couple of large-size instances in labs.

See the query we used in Hive below:

 -- Export one day of wmf.pageview_hourly as one JSON object per (project, article, access, agent, hour)
 select concat(
            '{"project":"', project,
            '","article":"', page_title,
            '","access":"', access_method,
            '","agent":"', agent_type,
            '","view_count":"', sum(view_count),
            '","time":"2015-10-14T', LPAD(hour, 2, '0'), ':00:00.000Z"',
            '}'
        )
   from wmf.pageview_hourly
  where year=2015 and month=10 and day=14
  group by project, page_title, access_method, agent_type, hour
;

Druid

   * Box: druid1.eqiad.wmflabs, m1.large with 80GB storage.
   * Requirements: openjdk-1.7.0_79, node-4.1, imply-1.0.0
   * 4 CPUs with 8192 MB of RAM

Druid is installed in /srv/druid/imply-1.0.0. While loading, it stores data in /tmp (it runs a local Hadoop job). After it's done, it stores the segments in /srv/druid/imply-1.0.0/var/druid, but that is configurable to output to HDFS.
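Batch loading works by submitting an indexing task to the Druid overlord. A trimmed-down sketch of such a task follows; the field layout follows Druid's "index_hadoop" spec, but the dataSource, paths, dimensions and metrics are illustrative assumptions only (the real specs are in /srv/druid/imply-1.0.0/quickstart):

 import json
 import requests
 
 # Minimal batch-ingestion task; values below are assumptions, not the specs we used.
 task = {
     "type": "index_hadoop",
     "spec": {
         "dataSchema": {
             "dataSource": "pageviews-hourly",
             "parser": {
                 "type": "hadoopyString",
                 "parseSpec": {
                     "format": "json",
                     "timestampSpec": {"column": "time", "format": "iso"},
                     "dimensionsSpec": {"dimensions": ["project", "article", "access", "agent"]},
                 },
             },
             "metricsSpec": [{"type": "longSum", "name": "view_count", "fieldName": "view_count"}],
             "granularitySpec": {
                 "segmentGranularity": "day",
                 "queryGranularity": "hour",
                 "intervals": ["2015-10-14/2015-10-15"],
             },
         },
         "ioConfig": {
             "type": "hadoop",
             "inputSpec": {"type": "static", "paths": "/path/to/pageviews.2015.10.14.hourly.json"},
         },
         "tuningConfig": {"type": "hadoop"},
     },
 }
 
 # Submit the task to the overlord (host and port are assumptions for a local setup).
 requests.post("http://localhost:8090/druid/indexer/v1/task",
               data=json.dumps(task), headers={"Content-Type": "application/json"})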

Results

  • It took around 40 hours to load 35GB of pageview data (/data/project/milimetric/pageviews.2015.10.14.hourly.json), and the final size of the indexed data ended up around 15GB in /srv/druid/imply-1.0.0/var/druid (there are also 9GB of files in tmp/ there, but I checked and they're not needed for query performance or anything).
  • The queries and indexing tasks are in /srv/druid/imply-1.0.0/quickstart. Hourly data breakdowns return in about 4.8 seconds, daily ones in about 4.3; subsequent runs appear cached and run very fast (see the query sketch after this list).
  • The values shouldn't be compared too closely with other loads; we used "ignoreBadRows", so not everything was loaded.
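The hourly and daily breakdowns mentioned above go against Druid's native query endpoint on the broker. A minimal sketch of such a query (dataSource, port and metric name are assumptions):

 import json
 import requests
 
 # Native timeseries query against the broker; switch "granularity" between
 # "hour" and "day" to reproduce the hourly vs. daily breakdowns above.
 query = {
     "queryType": "timeseries",
     "dataSource": "pageviews-hourly",   # assumed datasource name
     "granularity": "hour",
     "intervals": ["2015-10-14/2015-10-15"],
     "aggregations": [{"type": "longSum", "name": "views", "fieldName": "view_count"}],
 }
 
 resp = requests.post("http://localhost:8082/druid/v2",   # 8082 is the default broker port
                      data=json.dumps(query), headers={"Content-Type": "application/json"})
 print(resp.json())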

Other Notes

Disk usage

   * 1 day of data, October 6, 2015:
       * raw TSV format for 24 individual hours + 1 day aggregate data: 80GB
       * zipped TSVs, same as above: 8GB
       * Druid: 15GB (needs about 80GB of temporary space while crunching down to 15GB)

Response times

   * General
   ** Druid:                    First ever query took 92 seconds; all other queries didn't have that lag
   * 1 day
   ** Druid hourly:             4,810ms first query, 59ms second
   ** Druid daily:              4,260ms first query, 90ms second

Elasticsearch

   * Box: elastic1.eqiad.wmflabs, m1.xlarge with 160GB storage.
   * Requirements: openjdk-1.7.0_79, elasticsearch-1.7.3, python 2.7.9, elasticsearch-py
   * 8 CPUs with 16384 MB of RAM

Elasticsearch is installed in /srv/elasticsearch-1.7.3, and it stores the indexed data in /srv/elasticsearch-1.7.3/data. In /srv/loading there are a couple of files and scripts we used to load and test the data. In /srv/loading/README you can find several handy commands to repeat the test.
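The loading script itself lives in /srv/loading; a minimal sketch of the same idea using elasticsearch-py (index and type names are assumptions):

 import json
 from elasticsearch import Elasticsearch, helpers
 
 es = Elasticsearch(["localhost:9200"])
 
 def actions(path):
     """Yield one bulk-index action per JSON line of the pageview dump."""
     with open(path) as f:
         for line in f:
             # "pageviews" / "hourly" are assumed index and type names.
             yield {"_index": "pageviews", "_type": "hourly", "_source": json.loads(line)}
 
 # Stream the 35G file into Elasticsearch in bulk batches.
 helpers.bulk(es, actions("/data/project/milimetric/pageviews.2015.10.14.hourly.json"))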

Results

  • The script to load the data took around 10 hours to load 35GB of pageview data (/data/project/milimetric/pageviews.2015.10.14.hourly.json), and the final size of the indexed data ended up around 22GB in /srv/elasticsearch-1.7.3/data.
  • The query /srv/loading/pageviews_time_range.json asks for pageviews by project given a time range; it returns in 6-8 seconds for the whole day (see the sketch after this list).
  • The values returned are similar to the ones Pivot returns (https://druid.wmflabs.org/pivot), but not identical: Elasticsearch returns values that are around 5% higher. Could this be because of the bad records that Druid discards and Elasticsearch perhaps swallows?
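For reference, a "pageviews by project given a time range" query in Elasticsearch 1.7 can look like the sketch below (the real query is in /srv/loading/pageviews_time_range.json; index and field names here are assumptions):

 from elasticsearch import Elasticsearch
 
 es = Elasticsearch(["localhost:9200"])
 
 # Filter to a time window, bucket by project, and sum view_count per bucket.
 body = {
     "size": 0,
     "query": {"range": {"time": {"gte": "2015-10-14T00:00:00.000Z",
                                  "lt": "2015-10-15T00:00:00.000Z"}}},
     "aggs": {"by_project": {"terms": {"field": "project", "size": 50},
                             "aggs": {"views": {"sum": {"field": "view_count"}}}}},
 }
 
 result = es.search(index="pageviews", body=body)
 for bucket in result["aggregations"]["by_project"]["buckets"]:
     print("%s %s" % (bucket["key"], bucket["views"]["value"]))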