User:AKhatun/Intro to WMF Search Data
Search Data
The Search Platform team at the Wikimedia Foundation temporarily stores data from searches performed on the various Wikimedia projects. Analyzing this data helps us understand which improvements would benefit users and how to create a better search experience for them. To do this, we first need to understand how search works and what data is stored. This page is intended to help you get started with search and search data through resources, links, and brief explanations. It is not an exhaustive list or a complete explanation of everything related to search.
Where can you search from?
- There is a search bar on every page you visit: the main page, content pages, every page. This is called the GO box.
- The landing pages of most projects have a search bar in the middle of the page, for example www.wikipedia.org and www.wiktionary.org.
- The search special page, like en.wikipedia.org/wiki/Special:Search
How search works
As soon as you start typing in any of the search boxes mentioned above, the search process has started. Every letter or group of letters typed fires a search event; pressing enter or clicking the magnifying glass icon fires an event; and clicking a result on the search results page fires another event. More about events later.
Here are some of the possibilities with searching:
- You start typing in the GO box or any other MediaWiki search bar. After each letter you type, you get a drop-down of title suggestions. These are called autocomplete searches. If you type multiple letters in quick succession, you will sometimes only get these suggestions when you pause.
- You can click one of the title suggestions and go to that page directly.
- Or, you can press enter or select search for pages containing <your text>. This takes you to the search results page.
- On the search results page, you will see your search results, results from other language wikis (if applicable), results from sister projects, and advanced search options. This is also the search special page. You can continue to perform other searches from here or read your results.
- Sometimes the word or phrase you searched for has no results. If the system thinks you meant something else, it will search for that and show those results instead. If you search for azpw, the results are populated for the word aziz and the page says: Showing results for aziz. No results found for azpw.
- Sometimes the word or phrase you searched for has very few results. If the system thinks you meant something else, it will recommend a different (possibly correct) search. If you search for alsha, the page says: Did you mean: alpha. It still shows the few results it found for alsha, but you can click on alpha and view those results instead.
- On the side are results from sister projects.
- At the bottom of the results are results from other language wikis, if applicable. Search for বন্য প্রানি (a non-English query) in the English Wikipedia, for example.
- Some wikis have results from Wikidata at the bottom as well.
Useful resources
- mw:Wikimedia_Discovery/So_Many_Search_Options
- Default MediaWiki search: m:Help:Searching
- CirrusSearch: mw:Help:CirrusSearch
A few blog posts (find more at diff.wikimedia.org):
- wmfblog:2015/12/23/search-and-discovery-on-wikipedia
- wmfblog:2021/02/22/in-search-of-the-perfect-search-for-wikipedia
- wmfblog:2019/03/12/the-anatomy-of-search-a-place-for-my-stuff
Data Sources
Table name | Database | Description | Docs | Code |
---|---|---|---|---|
mediawiki_cirrussearch_request | event | Also known as query logs. Contains all search events including the query, the various hits returned from one or more wiki projects, time taken, and other backend information | Schema | - |
searchsatisfaction | event | Table of various search events such as searchResultPage, click, checkin etc along with the query, number of hits returned and other search specific details. | Schema | Source Code |
query_clicks_hourly | discovery | A cross of mediawiki_cirrussearch_request and searchsatisfaction to list each search query with its list of hits returned and clicks by the user | Schema | Source Code |
query_clicks_daily | discovery | Sessionized version of the discovery.query_clicks_hourly table. Only contains queries with click throughs | Schema | Source Code |
search_satisfaction_daily | discovery | A sessionized daily version of the event.searchsatisfaction table. Each search session and most of its related information are aggregated in individual rows | - | Source Code |
fulltext_head_queries | discovery | Aggregate of queries and their results after making some minor alterations to the query string (e.g. please and PLEASE both become please) | - | Source Code |
Table details
event.mediawiki_cirrussearch_request
This table has a lot of fields but is relatively easy to understand. It contains a good deal of metadata and client information, and most fields are described in the schema. Some additional information that might help:
- The `search_id` set here corresponds to the `searchToken` in the searchsatisfaction table.
- The denser and more important search-related fields are `elasticsearch_requests` and `hits`.
- There are a bunch of "indices" saved: some for the contents of various wiki projects in various languages, others for page titles, for example. The search happens by matching the search query against these indices.
- The `hits` field contains the final list of hits from CirrusSearch that are shown to the user. This includes the page title, page id, score of the result given by Elasticsearch, and the index the result was grabbed from.
- Search occurs in several steps. Elasticsearch performs the search and collects a list of results. It also generates a score for each search result to show how relevant the result is. CirrusSearch sits on top of Elasticsearch and modifies and enhances the search results that are ultimately shown to the user. A single search from the user performs multiple Elasticsearch searches: for the language wiki you searched in, for other wiki projects (that is how we get results from other wiki projects on the side), and for other language projects if relevant for the search. Detecting the language of the query and deciding whether searching wikis in other languages is required is also part of the job.
- Since a single search can have several Elasticsearch requests, each request and its relevant results are listed in the `elasticsearch_requests` field.
  - `hits` contains the list of results returned from each of the individual Elasticsearch requests.
  - `indices` contains the list of all indexes the query was performed against. `hits[].index`, on the other hand, contains the index that a particular result came from.
- When users paginate through search results, i.e. see the "next" set of search results, the offset (number of results already shown) is given in `hits_offset`.
- Every returned search result has a score. `max_score` is the maximum of those, typically the score of the first search result shown.
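To make the relationship between `hits`, `indices`, and `hits[].index` more concrete, here is a minimal Python sketch over a made-up, heavily trimmed record. The field names `search_id`, `hits`, `elasticsearch_requests`, `indices`, `hits_offset`, and `max_score` come from the notes above; the exact nesting, the keys `page_title`, `page_id`, and `score`, and all sample values are assumptions for illustration, so check the schema before relying on them.

```python
# Hypothetical, heavily trimmed record. The nesting and the sample values
# are assumptions for illustration, not a copy of a real row.
request = {
    "search_id": "abc123",
    "hits": [
        {"page_title": "Alpha", "page_id": 1, "score": 12.5, "index": "enwiki_content"},
        {"page_title": "Alpha (disambiguation)", "page_id": 2, "score": 9.1, "index": "enwiki_content"},
        {"page_title": "alpha", "page_id": 3, "score": 4.0, "index": "enwiktionary_content"},
    ],
    "elasticsearch_requests": [
        {"indices": ["enwiki_content"], "hits_offset": 0, "max_score": 12.5},
        {"indices": ["enwiktionary_content"], "hits_offset": 0, "max_score": 4.0},
    ],
}


def hits_per_index(record):
    """Count how many of the final hits came from each index (hits[].index)."""
    counts = {}
    for hit in record["hits"]:
        counts[hit["index"]] = counts.get(hit["index"], 0) + 1
    return counts


def searched_indices(record):
    """List every index queried by any of the individual requests, in order."""
    seen = []
    for es_request in record["elasticsearch_requests"]:
        for index in es_request["indices"]:
            if index not in seen:
                seen.append(index)
    return seen


print(hits_per_index(request))
print(searched_indices(request))
```

The first function answers "where did the results shown to the user come from?", while the second answers "where did the search look?"; the two can differ when an index was queried but contributed no hits.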
event.searchsatisfaction
The table's schema describes most fields. Here are some additional notes to aid understanding:
- Every Wikimedia project has a search bar at the top of each page, including the Special:Search page, which is treated as just another wiki page. Wherever you start typing your search query, that page's id is stored in the `articleId` field. For the Special:Search page it is null.
- Imagine a search session you started from a random content page. As you start typing, the system suggests pages based on title matches. This fires a `searchResultPage` event (set in the `action` field) whose `source` field is `autocomplete`.
- After typing, you select one of the suggested pages. This generates a `click` event (set in the `action` field). The `position` field contains the 0-indexed position of the search result you just clicked. Once you move on to another page, no further information is recorded.
- Let's assume that instead of clicking one of the suggested pages, you press enter or click the magnifying glass button. This takes you to the Special:Search page. You will get:
  - A `click` event. The `position` field in this event is -1 since you did not click any of the autocomplete search results.
  - A `visitPage` event. visitPage means you just visited the Special:Search page.
  - A `searchResultPage` event. This is a `fulltext` search, set in the `source` field.
  - As you browse through the search results, `checkin` events are fired at regular intervals for up to 7 minutes. Since you are browsing the search special page, the `articleId` is null.
- Every event generated from one load of a given page has the same `pageViewId`. So the visitPage, searchResultPage, and checkin events generated from the Special:Search page have the same `pageViewId`. The click event has a different `pageViewId`, though, since it was generated from the page you wrote the query in. So the previous autocomplete events and the click event have the same `pageViewId`.
- Now let's choose one of the search results.
  - This generates a `click` and a `visitPage` event. Both have `position` set to the 0-indexed position of the result you clicked.
  - The visitPage event has the page id of the page you visited in the `articleId` field.
  - Then let's assume you are reading the article you just clicked. As you spend more time on it, `checkin` events are fired at regular intervals, and the `articleId` is the id of the page you are reading.
- If you click the browser's back button, you go back to the fulltext search page, the page from which you selected the search result. This generates a new search with the same query, and therefore a new `searchResultPage` event.
- The search results page typically shows 20 results by default. Now suppose you click the pagination buttons, i.e. choose to see the next 20 results, or choose to see 50 or 100 results on the current page. These actions generate `searchResultPage` events. The `extraParams` field has the offset value in it. So if you chose to see the next set of results, the offset is 20 (the first 20 results were already shown). Or, if you viewed 50 results on the first page and then clicked to see the next set of results, the offset is 50.
- Other params in `extraParams`: the `iw` key in extraParams has a list of sister projects. One result from each of these wikis was shown on the search results page, typically on the side. The name of the wiki is in abbreviated form. See the abbreviations here. Along with the wiki's abbreviated name is the position of the wiki project in the sister projects' search result list. So `{"source":"q", "position":3` means there was a result from Wikiquote in the 3rd position among the results from various wiki projects.
- Clicking any one of the sister project links gives an `ssclick` event. The extraParams of this event has the link of the page you clicked, but no information about dwell time or anything else.
- Normal `searchResultPage` events have an `inputLocation` of header (when you search from the content pages and the GO box in the header) or content (when you search from the search box in the content/body section of the Special:Search page).
- If a search does not produce enough results and the search engine finds another word or phrase that closely matches what we typed, it shows us "did you mean" suggestions. If this happens, the `didYouMeanVisible` field is "yes".
  - If we click the suggestion provided to us, a new search with the suggested query takes place where the `inputLocation` is "dym-suggest".
- If the query you searched for has no results and the search engine finds a close word or phrase, it shows results for that instead. In this case the `didYouMeanVisible` field is "autorewrite".
  - Even though the original query has no results at all, you can still click "search for <original_query> instead". This creates another search event with `inputLocation` as "dym-original".
- hover on, hover off, and esclick are not in use as of Aug 2022.
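The event sequence walked through above can be sketched in Python as follows. This is an illustration under stated assumptions, not the logging code itself: the events are made up and trimmed to the three fields `action`, `pageViewId`, and `position`, and the `pageViewId` values are invented placeholders.

```python
from collections import defaultdict

# Hypothetical events for one search, trimmed to three fields.
# The pageViewId strings are invented placeholders.
events = [
    # Typed in the GO box of an article, then pressed enter.
    {"action": "searchResultPage", "pageViewId": "pv-article", "position": None},
    {"action": "click", "pageViewId": "pv-article", "position": -1},
    # Landed on the search results page and clicked the third result.
    {"action": "visitPage", "pageViewId": "pv-serp", "position": None},
    {"action": "searchResultPage", "pageViewId": "pv-serp", "position": None},
    {"action": "checkin", "pageViewId": "pv-serp", "position": None},
    {"action": "click", "pageViewId": "pv-serp", "position": 2},
]


def actions_by_page_view(events):
    """Group event actions by the page load (pageViewId) that produced them."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["pageViewId"]].append(event["action"])
    return dict(grouped)


def describe_click(event):
    """Interpret the position field of a click event as described above."""
    if event["action"] != "click":
        raise ValueError("not a click event")
    if event["position"] == -1:
        return "query submitted without choosing a suggestion"
    return f"result at 0-indexed position {event['position']} clicked"


print(actions_by_page_view(events))
print(describe_click(events[1]))
print(describe_click(events[5]))
```

Grouping by `pageViewId` is a convenient first step when reading raw events, because it separates what happened on the page where the query was typed from what happened on the search results page.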
Note:
- As of now, opening search results in multiple tabs by double-clicking is not recorded as an event. A double click is not considered a "click", so no visitPage or checkin events are stored for it either.
- One can perform searches and visit a page from the result set with a single click, which loads it in the same tab. searchResultPage, visitPage, and checkin events are stored up to this step. Clicking on links in the content page you loaded from the search results and going deeper down the Wikipedia hole does not store events any longer. Once you click the browser's "back" button and return to the search page, the search action is performed again and you can continue searching while the events get fired.
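Returning to the `extraParams` field described earlier in this section: the sketch below shows one way to read the pagination offset and the sister project list out of it. It assumes, purely for illustration, that `extraParams` is a JSON string with an `offset` key and an `iw` list of objects carrying `source` and `position`; the real serialization may differ, so inspect a few rows first. `parse_extra_params` and `offset_for_page` are hypothetical helper names.

```python
import json


def parse_extra_params(raw):
    """Return (offset, sister_projects) from an extraParams JSON string.

    Assumed layout (not confirmed by the schema): an "offset" key and an
    "iw" list of {"source": ..., "position": ...} objects.
    """
    params = json.loads(raw) if raw else {}
    offset = int(params.get("offset", 0))
    sister_projects = [
        (entry["source"], entry["position"]) for entry in params.get("iw", [])
    ]
    return offset, sister_projects


def offset_for_page(page_number, results_per_page=20):
    """Offset = number of results already shown before this page."""
    return (page_number - 1) * results_per_page


sample = '{"offset": 20, "iw": [{"source": "q", "position": 3}]}'
print(parse_extra_params(sample))
print(offset_for_page(2), offset_for_page(2, results_per_page=50))
```

The second helper mirrors the two examples in the text: the second page has an offset of 20 with the default page size and an offset of 50 when 50 results are shown per page.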
discovery.query_clicks_hourly
The fields of this table are fairly clear from its schema definition. It contains the list of all the search results shown to the users with each full-text search and the list of all the pages the user clicked in each search. See Schema and Source Code for more details.
discovery.query_clicks_daily
Sessionized version of the hourly table. This table contains full-text search sessions with click-throughs. If you want all search sessions, with or without click-throughs, you will have to use the hourly table. Compared to the hourly table, it simply adds a session_id to the queries.
discovery.search_satisfaction_daily
What is a search session?
"A search session identifies a single user performing searches within a limited timespan. If no search is performed within ten minutes of a previous search a new session id is generated." [1] So whatever a user does after searching, such as clicking around, viewing pages, or viewing the next set of results, is given the same sessionID. A new session starts when this session has been idle for 10 minutes.
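The ten-minute rule can be illustrated with a small Python sketch. It is not how the session id is actually generated; it only applies the quoted rule to a list of search timestamps for a single user. The function name, the use of seconds, and the choice to treat a gap of exactly ten minutes as still belonging to the same session are assumptions.

```python
SESSION_TIMEOUT_SECONDS = 10 * 60  # ten minutes, per the definition above


def assign_sessions(search_times):
    """Give each search time (in seconds) a session number for one user.

    A new session starts when more than ten minutes pass between two
    consecutive searches. The boundary case of exactly ten minutes staying
    in the same session is an assumption.
    """
    session_ids = []
    current_session = 0
    previous_time = None
    for search_time in sorted(search_times):
        if previous_time is not None and search_time - previous_time > SESSION_TIMEOUT_SECONDS:
            current_session += 1
        session_ids.append(current_session)
        previous_time = search_time
    return session_ids


# Gaps between searches: 120 s, 580 s, 700 s, 100 s.
print(assign_sessions([0, 120, 700, 1400, 1500]))
```

Note that the timeout is measured from the previous search, not from the start of the session, so a session can last much longer than ten minutes as long as the user keeps searching.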
`discovery.search_satisfaction_daily` is a sessionized daily version of `event.searchsatisfaction`. The `event` table records each event separately, whereas the `daily` table records searches session-wise with separate rows for each full-text search (not autocomplete searches, only the searches done by users by pressing enter or the magnifying glass icon).
Additional explanation of some fields follows. Make sure to run `describe table_name;` in Hive, Spark SQL, or whichever tool you use to access the table, to see the field comments.
- `dym_shown`: Whether the search engine result page (SERP) showed a Did You Mean (dym) suggestion. If the number of results is too low, the search engine tries to identify nearby words or phrases to search with and shows that query as a suggestion to the user. If the number of results is 0, the engine performs the search with the suggested query and shows those results instead. When either situation occurs, `dym_shown` is True.
- `is_autorewrite_dym`: Getting 0 results and therefore showing the results of the suggested query is called `autorewrite`.
- `is_dym`: When the user clicks the dym suggestion, a search is performed with the suggested query. The new result page has `is_dym` set to True, because this is the dym-suggested query search. It is also True for autorewrite queries, since the page is showing results for the suggested query.
- `dym_clicked`: True when a user clicks the suggested query shown at the top of the page, i.e. the Did You Mean query.

N.B.: In the case of an autorewrite, `dym_shown`, `is_autorewrite_dym`, and `is_dym` are all True. For more info about the logic, see the source code: Source Code#L128-L148
discovery.fulltext_head_queries
This table is not used much at present. It contains:
- `norm_query`: The normalized query. Only a few very basic normalizations are performed. See the Source Code docs for more info on which normalizations are done. Queries are normalized and then grouped together based on the normalized version.
- `num_sessions`: The number of sessions across which these queries spanned (and are now grouped together).
- `queries`: The original queries that were normalized to `norm_query`, along with the number of sessions each query was part of.
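The grouping behind these three fields can be sketched in Python as below. The normalization here (lowercasing and trimming whitespace) is a stand-in chosen for illustration and is an assumption; the real normalizations are documented in the source code. The function names and the sample session ids are hypothetical.

```python
from collections import defaultdict


def normalize(query):
    """Stand-in normalization (an assumption): lowercase and trim whitespace."""
    return query.strip().lower()


def head_queries(searches):
    """Group (query, session_id) pairs by normalized query.

    Returns {norm_query: {"num_sessions": int, "queries": {query: sessions}}}.
    """
    sessions_by_norm = defaultdict(set)
    sessions_by_query = defaultdict(lambda: defaultdict(set))
    for query, session_id in searches:
        norm = normalize(query)
        sessions_by_norm[norm].add(session_id)
        sessions_by_query[norm][query].add(session_id)
    return {
        norm: {
            "num_sessions": len(sessions),
            "queries": {q: len(s) for q, s in sessions_by_query[norm].items()},
        }
        for norm, sessions in sessions_by_norm.items()
    }


searches = [("please", "s1"), ("PLEASE", "s2"), ("please", "s2"), ("Thanks", "s3")]
print(head_queries(searches))
```

Because sessions are collected in sets, a session that contains both please and PLEASE counts once towards `num_sessions` but once for each original spelling in `queries`.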
References
Abbreviations
- SERP: Search Engine Result Page
- dym: Did You Mean (the alternate query suggestion that comes after a search that does not have enough results)