Data Platform/Systems/Hive/Compression
Uncompressed vs. Snappy compressed Sequence Files
I just ran some rough comparisons of data sizes and Hive queries of webrequest data stored in HDFS uncompressed vs. as Snappy compressed Sequence Files.
Uncompressed
Size
For hours 00 through 08 (nine hourly imports) of January 7th, 2014, uncompressed JSON-formatted mobile webrequest logs imported into HDFS via Kafka totaled 91.1 GB. Each hourly import was between 8 and 12 GB.
hdfs dfs -du -s -h /wmf/data/external/webrequest_mobile/hourly/2014/01/07/{00..08}
10.6 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/00
10.9 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/01
11.3 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/02
11.7 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/03
10.6 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/04
9.9 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/05
9.1 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/06
8.6 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/07
8.4 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/08
91.1 G
Query Time
Running a select count(*) query on a single hour took 44.276 seconds and launched 42 mappers. Running the same query on hours 00 through 08 took 158.627 seconds and launched 343 mappers.
-- select count(*) from webrequest_mobile where year=2014 and month=01 and day=07 and hour=00;
...
MapReduce Total cumulative CPU time: 16 minutes 11 seconds 670 msec
Ended Job = job_1387838787660_0365
MapReduce Jobs Launched:
Job 0: Map: 42 Reduce: 1 Cumulative CPU: 971.67 sec HDFS Read: 11422909543 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 16 minutes 11 seconds 670 msec
OK
_c0
16641115
Time taken: 44.276 seconds

select count(*) from webrequest_mobile where year=2014 and month=01 and day=07 and hour between 00 and 08;
...
MapReduce Total cumulative CPU time: 0 days 4 hours 16 minutes 12 seconds 420 msec
Ended Job = job_1387838787660_0363
MapReduce Jobs Launched:
Job 0: Map: 343 Reduce: 1 Cumulative CPU: 15372.42 sec HDFS Read: 98055786272 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 0 days 4 hours 16 minutes 12 seconds 420 msec
OK
_c0
143199253
Time taken: 158.627 seconds
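The job counters above can be turned into rough per-mapper and throughput figures. The following is a minimal sketch that only restates the numbers printed by Hive; the helper names are made up for illustration, and the rows-per-second figure is based on wall-clock time, which includes job startup overhead, so it should be read as a rough indication only.

```python
# Figures copied from the two "Job 0" lines and result rows above (uncompressed data).
runs = {
    "hour 00": {
        "mappers": 42,
        "hdfs_read_bytes": 11422909543,
        "rows": 16641115,
        "seconds": 44.276,
    },
    "hours 00-08": {
        "mappers": 343,
        "hdfs_read_bytes": 98055786272,
        "rows": 143199253,
        "seconds": 158.627,
    },
}

MIB = 1024 * 1024


def per_mapper_mib(run):
    """Average HDFS input per mapper, in MiB (hypothetical helper)."""
    return run["hdfs_read_bytes"] / run["mappers"] / MIB


def rows_per_second(run):
    """Rows counted per second of wall-clock time (hypothetical helper)."""
    return run["rows"] / run["seconds"]


for name, run in runs.items():
    print(
        f"{name}: {per_mapper_mib(run):.1f} MiB per mapper, "
        f"{rows_per_second(run):,.0f} rows/s"
    )
```

The larger query processes more than twice as many rows per second of wall-clock time, which suggests that a fixed startup cost makes up a noticeable share of the single-hour run.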
Snappy compressed Sequence Files
I recently got SequenceFileRecordWriterProvider.java merged upstream in LinkedIn's Camus. Using this rather than StringRecordWriterProvider.java writes out the same data as Snappy compressed Hadoop Sequence Files.
Size
JSON data imported for the same period (hours 00 through 08) and Snappy compressed totaled 21.9 GB, 24% of the original size.
hdfs dfs -du -s -h /user/otto/data/compressed/webrequest_mobile/hourly/2014/01/07/{00..08}
2.5 G  data/compressed/webrequest_mobile/hourly/2014/01/07/00
2.6 G  data/compressed/webrequest_mobile/hourly/2014/01/07/01
2.7 G  data/compressed/webrequest_mobile/hourly/2014/01/07/02
2.8 G  data/compressed/webrequest_mobile/hourly/2014/01/07/03
2.5 G  data/compressed/webrequest_mobile/hourly/2014/01/07/04
2.4 G  data/compressed/webrequest_mobile/hourly/2014/01/07/05
2.2 G  data/compressed/webrequest_mobile/hourly/2014/01/07/06
2.1 G  data/compressed/webrequest_mobile/hourly/2014/01/07/07
2.1 G  data/compressed/webrequest_mobile/hourly/2014/01/07/08
21.9 G
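The 24% figure can be checked directly from the two directory listings. This is a small sketch using the per-hour sizes as rounded by hdfs dfs -du -h, so the result is only accurate to about one decimal place; the variable names are illustrative.

```python
# Per-hour sizes in GB for hours 00 through 08, copied from the two listings.
uncompressed_gb = [10.6, 10.9, 11.3, 11.7, 10.6, 9.9, 9.1, 8.6, 8.4]
compressed_gb = [2.5, 2.6, 2.7, 2.8, 2.5, 2.4, 2.2, 2.1, 2.1]

total_uncompressed = sum(uncompressed_gb)
total_compressed = sum(compressed_gb)

# Fraction of the original size that remains after Snappy compression.
overall_ratio = total_compressed / total_uncompressed

# The same ratio for each hour, to see how stable the compression is.
hourly_ratios = [c / u for c, u in zip(compressed_gb, uncompressed_gb)]

print(f"uncompressed total: {total_uncompressed:.1f} GB")
print(f"compressed total:   {total_compressed:.1f} GB")
print(f"overall ratio:      {overall_ratio:.1%}")
print(f"hourly ratio range: {min(hourly_ratios):.1%} to {max(hourly_ratios):.1%}")
```

The per-hour ratios stay close to the overall figure, so the compression benefit does not seem to depend much on the hour of day in this sample.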
Query Time
The same select count(*) query on a single hour of compressed data took 86.232 seconds, about twice as long as on uncompressed data. Running the query on hours 00 through 08 of compressed data took only about 8% longer than the 158.627 seconds measured on the uncompressed data. The number of mappers launched was the same as in the uncompressed case.
select count(*) from webrequest_mobile where year=2014 and month=01 and day=07 and hour=00;
...
MapReduce Total cumulative CPU time: 16 minutes 11 seconds 670 msec
Ended Job = job_1387838787660_0365
MapReduce Jobs Launched:
Job 0: Map: 42 Reduce: 1 Cumulative CPU: 971.67 sec HDFS Read: 11422909543 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 16 minutes 11 seconds 670 msec
OK
_c0
16641115
Time taken: 44.276 seconds

select count(*) from webrequest_mobile where year=2014 and month=01 and day=07 and hour between 00 and 08;
...
MapReduce Total cumulative CPU time: 0 days 4 hours 16 minutes 12 seconds 420 msec
Ended Job = job_1387838787660_0363
MapReduce Jobs Launched:
Job 0: Map: 343 Reduce: 1 Cumulative CPU: 15372.42 sec HDFS Read: 98055786272 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 0 days 4 hours 16 minutes 12 seconds 420 msec
OK
_c0
143199253
Time taken: 158.627 seconds
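The "about twice as long" figure for the single-hour query follows from the two wall-clock times quoted in the text. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def slowdown(baseline_seconds, candidate_seconds):
    """How many times longer the candidate run took than the baseline."""
    return candidate_seconds / baseline_seconds


# Single-hour select count(*) wall-clock times quoted in the text.
uncompressed_single_hour = 44.276
compressed_single_hour = 86.232

factor = slowdown(uncompressed_single_hour, compressed_single_hour)
print(f"single hour: compressed run took {factor:.2f}x as long")
```

These are single measurements on a shared cluster, so the factor is a rough indication and not a benchmark result.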
Summary
Using Snappy to compress the JSON webrequest logs yields significant space savings (the compressed data is about 24% of the original size) with only a slight slowdown on large queries. Smaller queries are hit harder: the single-hour count took about twice as long. I will run another test once I have more data to compare (a month), but if the results are approximately the same I will not update this page.
Recommendation: use Snappy compression for all webrequest imports.