Data Platform/Data Lake/Traffic/Caching
This dataset is a restricted public snapshot of the wmf.webrequest
table intended for caching research.
The most recent release is composed of caching data for:
- upload (image) web requests from one CDN cache server serving upload.wikimedia.org (Wikimedia Commons), and
- text (HTML pageview) web requests from one CDN cache server serving Wikipedia.
Data Updates & Format
The data is updated manually upon request.
The current release of this data, released in November 2019, is available at https://analytics.wikimedia.org/published/datasets/caching/2019/. It was released as a set of gzip-compressed TSV files: 42 compressed files in total, 21 upload data files and 21 text data files. The request for the most recent release of this data can be found at phab:T225538.
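For example, one release file can be streamed and inspected with the Python standard library. This is a minimal sketch: the file name used here is a hypothetical placeholder, so check the directory listing at the URL above for the actual names, and note that the files may or may not include a header row.

```python
# Minimal sketch: stream one gzip-compressed TSV file straight from the
# release directory and peek at the first few rows. FILENAME is a
# hypothetical placeholder; consult the directory listing for real names.
import gzip
import urllib.request

BASE = "https://analytics.wikimedia.org/published/datasets/caching/2019/"
FILENAME = "cache-u.tsv.gz"  # hypothetical placeholder

with urllib.request.urlopen(BASE + FILENAME) as resp:
    with gzip.open(resp, mode="rt") as tsv:
        for i, line in enumerate(tsv):
            print(line.rstrip("\n").split("\t"))
            if i == 4:  # the first five rows are enough for a peek
                break
```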
Upload Data
Each upload data file, denoted cache-u, contains exactly 24 hours of consecutive data. Each file is roughly 1.5 GB compressed and holds roughly 4 GB of data when decompressed.
Each decompressed upload data file has the following columns:
| Column Name | Data Type | Notes |
|---|---|---|
| relative_unix | int | Seconds since the start timestamp of the dataset |
| hashed_path_query | bigint | Salted hash of the path and query of the request |
| image_type | string | Image type from the Content-Type header of the response |
| response_size | int | Response size in bytes |
| time_firstbyte | double | Seconds to first byte |
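As a reading aid, here is a minimal sketch that parses a decompressed upload data file using the column order documented above. It assumes tab-separated rows with no header line; the function name and file path are illustrative, not part of the release.

```python
import csv

# Column order as documented in the table above (assumed: no header row).
UPLOAD_COLUMNS = ["relative_unix", "hashed_path_query", "image_type",
                  "response_size", "time_firstbyte"]

def read_upload_rows(path):
    """Yield one typed dict per request from a decompressed cache-u file."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f, fieldnames=UPLOAD_COLUMNS, delimiter="\t"):
            yield {
                "relative_unix": int(row["relative_unix"]),
                "hashed_path_query": int(row["hashed_path_query"]),
                "image_type": row["image_type"],
                "response_size": int(row["response_size"]),
                "time_firstbyte": float(row["time_firstbyte"]),
            }
```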
Text Data
Each text data file, denoted cache-t, contains exactly 24 hours of consecutive data. Each file is roughly 100 MB compressed and holds roughly 300 MB of data when decompressed.
Each decompressed text data file has the following columns:
| Column Name | Data Type | Notes |
|---|---|---|
| relative_unix | int | Seconds since the start timestamp of the dataset |
| hashed_host_path_query | bigint | Salted hash of the host, path, and query of the request |
| response_size | int | Response size in bytes |
| time_firstbyte | double | Seconds to first byte |
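Both file types support the trace-driven experiments this dataset is intended for. As a toy illustration (not a recommended methodology), the sketch below replays (key, size) pairs, such as hashed_path_query and response_size from the upload files, through a byte-bounded LRU cache and reports the object hit ratio.

```python
from collections import OrderedDict

def lru_hit_ratio(requests, capacity_bytes):
    """Replay (key, size) pairs through a byte-bounded LRU cache."""
    cache, used, hits, total = OrderedDict(), 0, 0, 0
    for key, size in requests:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)        # refresh recency on a hit
            continue
        cache[key] = size                 # miss: admit the object
        used += size
        while used > capacity_bytes:      # evict least recently used
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
    return hits / total if total else 0.0
```

Combined with the hypothetical read_upload_rows sketch above, a 10 GiB simulation would look like lru_hit_ratio(((r["hashed_path_query"], r["response_size"]) for r in read_upload_rows("cache-u.tsv")), 10 * 2**30).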
Privacy
Because this data raises significant privacy concerns, this public release applies several changes that make the data reveal less while still providing value for a public audience.
Relative timestamps
This data employs a unique timing paradigm. The relative_unix field is the number of seconds elapsed between the timestamp at which the web request occurred and a fixed, randomly selected start timestamp. This makes it significantly more difficult for bad actors to map this data to other publicly released WMF datasets, while preserving the utility of timestamps for caching research as a means of determining the frequencies of these web requests.
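In effect, only offsets from an undisclosed origin are published. A one-line sketch of the scheme, where START_TS stands in for the private, randomly selected start timestamp (the value here is made up):

```python
import random

START_TS = random.randrange(0, 2**31)  # stand-in; the real value is private

def to_relative_unix(request_unix_ts: int) -> int:
    # Only this offset, never the absolute timestamp, appears in the release.
    return request_unix_ts - START_TS
```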
Hashed host, path, and query
We securely hash the host, path, and query fields of the web requests with a salt to effectively anonymize the content. This both makes it more difficult to map this data to other datasets and minimizes the general privacy concerns related to the combination of content, timestamps, and content sizes.
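The release does not disclose the hash function or how the salt is handled. Purely as an illustration of salted (keyed) hashing, the sketch below maps a host/path/query string to a signed 64-bit value matching the bigint fields above, using HMAC-SHA-256 truncated to 8 bytes; none of these specific choices are confirmed by the release.

```python
import hashlib
import hmac

SALT = b"example-secret-salt"  # hypothetical; the real salt is never published

def hashed_key(host_path_query: str) -> int:
    """Keyed hash of a request string, truncated to a signed 64-bit int."""
    digest = hmac.new(SALT, host_path_query.encode("utf-8"),
                      hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big", signed=True)
```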
Risk Assessment
Initial Risk: Medium
Mitigations: Obfuscation, Hashing
Residual Risk: Low
The Wikimedia Foundation has developed a process for reviewing datasets prior to release in order to determine a privacy risk level, appropriate mitigations, and a residual risk level. WMF takes privacy very seriously, and seeks to be as transparent as possible while still respecting the privacy of our readers and editors.
Our Privacy Risk Review process first documents the anticipated benefits of releasing a dataset. Because we feel transparency is crucial to free information, WMF generally takes a release-by-default approach - that is, release unless there is a compelling reason not to. Often, however, there are additional reasons for releasing a particular dataset, such as supporting research. We want to capture those reasons and account for them.
Second, WMF identifies populations that might possibly be impacted by the release of a dataset. We also specifically identify potential impacts to particularly vulnerable populations, such as political dissidents, ethnic minorities, religious minorities, etc.
Next, we catalog potential threat actors, such as organized crime, data aggregators, or other malicious actors that might potentially seek to violate a user’s privacy. We work to identify the potential motivations of these actors and populations they may target.
Finally, we analyze the Opportunity, Ease, and Probability of action by a threat actor against a potential target, along with the Magnitude of privacy harm to arrive at an initial risk score. Once we have identified our initial risks, we develop a mitigation strategy to minimize the risks we can, resulting in a residual (or post-mitigation) risk level.
WMF does not publicly publish this information because we do not want to motivate threat actors, or give them additional ideas for potential abuse of data. Unlike publishing a security vulnerability for code that could be patched, a publicly released dataset cannot be “patched” - it has already been made public.
Any dataset that contains this notice has been reviewed using this process.
Previous Releases
2016
Release: analytics.wikimedia.org
Request: T128132
Description: A more detailed but more privacy-conscious iteration of the 2007 release, covering July 1st, 2016 through July 10th, 2016
Fields: hashed host and path, uri query, content type, response size, time to first byte, X-Cache
2007
Release: wikibench.eu
Description: A trace of 10% of all user requests issued to Wikipedia in all languages between September 19th, 2007 and January 2nd, 2008
Fields: request count, unix timestamp, request url, database update flag