User:Oren
- communicate via http://etherpad.wikimedia.org/DeploymentPrep
- http://www.mediawiki.org/wiki/Extension:MWSearch
Some Java-Related Puppet Definitions
Classes
- apt
- apt::clean-cache
apt::clean-cache
Variables
- $apt_clean_minutes*: cronjob minutes - default uses ip_to_cron from module "common"
- $apt_clean_hours*: cronjob hours - default to 0
- $apt_clean_mday*: cronjob monthday - default uses ip_to_cron from module "common"
Require: - module common (http://github.com/camptocamp/puppet-common)
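The point of ip_to_cron is to scatter cron runs across hosts by deriving the schedule from each host's IP, so every machine does not clean its apt cache at the same minute. The real implementation is a Puppet function in the camptocamp "common" module; this is only a rough Python sketch of the hashing idea, and the exact formula is an assumption:

```python
def ip_to_cron_minute(ip):
    """Derive a deterministic cron minute (0-59) from an IPv4 address.

    Illustration only: the real ip_to_cron function may hash the
    address differently, but the idea is the same - each host always
    gets the same minute, and different hosts usually get different ones.
    """
    octet_sum = sum(int(part) for part in ip.split("."))
    return octet_sum % 60

# Two hosts get different minutes, each host always the same one.
print(ip_to_cron_minute("10.4.0.12"))  # → 26
print(ip_to_cron_minute("10.4.0.13"))  # → 27
```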
Definitions
- apt::conf
- apt::key
- apt::sources_list
apt::conf
apt::conf { "99unattended-upgrade":
  ensure  => present,
  content => "APT::Periodic::Unattended-Upgrade \"1\";\n",
}
apt::key
apt::key { "A37E4CF5":
  source => "http://dev.camptocamp.com/packages/debian/pub.key",
}
apt::sources_list
apt::sources_list { "camptocamp":
  ensure  => present,
  content => "deb http://dev.camptocamp.com/packages/ etch puppet",
}
Jenkins
class jenkins {
  include apt

  apt::key { "D50582E6":
    source => "http://pkg.jenkins-ci.org/debian/jenkins-ci.org.key",
  }

  apt::sources_list { "jenkins":
    ensure  => "present",
    content => "deb http://pkg.jenkins-ci.org/debian binary/",
    require => Apt::Key["D50582E6"],
  }

  package { "jenkins":
    ensure  => "installed",
    require => Apt::Sources_list["jenkins"],
  }

  service { "jenkins":
    enable     => true,
    ensure     => "running",
    hasrestart => true,
    require    => Package["jenkins"],
  }
}
Search setup
using Maven + Ant to do the build + sync
- Getting Windows, Eclipse, Ant and Rsync to Play Nicely Together
- append text to file with ANT
- search and replace with ANT
- finding string in file via ANT
client wikis
- MWSearch - needs to be configured with the IPs of all searchers
- OpenSearchXml - not sure if it is significant
both
notpeter can supply Debian packages for:
- java
- ant
SVN checkout of lucenesearch
A script is required that makes a file like file:///home/wikipedia/common/pmtpa.dblist with the list of all the wikis that should be indexed, in the same format
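The dblist format is just one wiki database name per line. A minimal Python sketch of such a generator, with made-up wiki names; on the real cluster the list would come from the wikis that should actually be indexed:

```python
import tempfile

def write_dblist(wikis, path):
    """Write wiki database names to `path`, one per line (dblist format)."""
    with open(path, "w") as f:
        for wiki in wikis:
            f.write(wiki + "\n")

# Example usage; the wiki names here are illustrative only.
path = tempfile.NamedTemporaryFile(suffix=".dblist", delete=False).name
write_dblist(["simplewiki", "commonswiki", "enwiki"], path)
print(open(path).read(), end="")
```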
deployment-searchidx
- currently uses 600 GB of storage and 48 GB of RAM
needs access to
- LocalSettings.php of each wiki being indexed
- scripts - migrate to Puppet (assigned to notpeter)
- https://gerrit.wikimedia.org/r/#patch,sidebyside,1825,1,files/lucene/lucene.jobs.sh - that is a script that I made based on the scripts that run as crons in prod
- there are a couple of others for doing things like building the .jar on the search indexer and pushing it out to the various search boxes
- there are also some start/stop scripts that I want to turn into an init script
deployment-searcher01
search boxes have 16 or 32 GB of RAM, and appear to use 50-100 GB of storage
Local.config
diff --git a/templates/lucene/lsearch.conf.erb b/templates/lucene/lsearch.conf.erb
index 9da2958..ad5b5f2 100644
--- a/templates/lucene/lsearch.conf.erb
+++ b/templates/lucene/lsearch.conf.erb
@@ -1,16 +1,22 @@
+######################################################
+#####    THIS FILE IS MANAGED BY PUPPET          ####
+##### puppet:///templates/search/lsearch.conf.erb ####
+######################################################
+
 # By default, will check /etc/mwsearch.conf
 
 ################################################
 # Global configuration
 ################################################
 
+### TO DO: resturcture so this doesn't depend on nfs
 # URL to global configuration, this is the shared main config file, it can
 # be on a NFS partition or available somewhere on the network
 MWConfig.global=file:///home/wikipedia/conf/lucene/lsearch-global-2.1.conf
 
 # Local path to root directory of indexes
 Indexes.path=/a/search/indexes
 
 # Path to rsync
 Rsync.path=/usr/bin/rsync
 
@@ -28,21 +34,21 @@
 Search.updateinterval=0.5
 
 # In seconds, delay after which the update will be fetched
 # used to scatter the updates around the hour
 Search.updatedelay=0
 
 # In seconds, how frequently the dead search nodes should be checked
 Search.checkinterval=30
 
 # Disable wordnet aliases
-Search.disablewordnet=true
+Search.disablewordnet=<% if scope.lookupvar('search::searchserver::indexer') == "true" then -%>true<% else -%>false<% end -%>
 
 ################################################
 # Indexer related configuration
 ################################################
 
 # In minutes, how frequently is a clean snapshot of index created
 # 2880 = two days
 Index.snapshotinterval=2880
 
 # Daemon type (http is started by default)
@@ -50,41 +56,51 @@
 
 # Port of daemon (default is 8321)
 #Index.port=8080
 
 # Maximal queue size after which index is being updated
 Index.maxqueuecount=5000
 
 # Maximal time an update can remain in queue before being processed (in seconds)
 Index.maxqueuetimeout=120
 
+<% if scope.lookupvar('search::searchserver::indexer') == "true" then -%>
 Index.delsnapshots=true
+<% end -%>
 
 ################################################
 # Log, ganglia, localization
 ################################################
 
 SearcherPool.size=6
 
+### TO DO: resturcture so this doesn't depend on nfs
 # URL to message files, {0} is replaced with language code, i.e. En
 Localization.url=file:///home/wikipedia/common/php/languages/messages
 
+<% if scope.lookupvar('search::searchserver::indexer') == "true" then -%>
 # Pattern for OAI repo. {0} is replaced with dbname, {1} with language
 #OAI.repo=http://{1}.wikipedia.org/wiki/Special:OAIRepository
 OAI.username=lsearch2
 OAI.password=<%= lucene_oai_pass %>
+<% end -%>
 # Max queue size on remote indexer after which we wait a bit
 OAI.maxqueue=5000
 
 # Number of docs to buffer before sending to inc updater
 OAI.bufferdocs=500
 
+<% if scope.lookupvar('search::searchserver::indexer') == "true" then -%>
+# UDP Logger config
+UDPLogger.port=51234
+UDPLogger.host=208.80.152.184
+<% end -%>
 
 # RecentUpdateDaemon udp and tcp ports
 #RecentUpdateDaemon.udp=8111
 #RecentUpdateDaemon.tcp=8112
 # Hot spare
 #RecentUpdateDaemon.hostspareHost=vega
 #RecentUpdateDaemon.hostspareUdpPort=8111
 #RecentUpdateDaemon.hostspareTcpPort=8112
 
 # Log configuration
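lsearch.conf is a flat key=value properties file with '#' comments, so reading it is simple. A hedged Python sketch of a parser (the keys in the sample are taken from the diff above; the lucene-search daemon itself reads this file in Java, this is just to illustrate the format):

```python
def parse_lsearch_conf(text):
    """Parse lsearch.conf-style key=value lines, skipping comments/blanks."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line
        if "=" in line:
            key, _, value = line.partition("=")
            conf[key.strip()] = value.strip()
    return conf

sample = """
# Maximal queue size after which index is being updated
Index.maxqueuecount=5000
Search.updateinterval=0.5
"""
conf = parse_lsearch_conf(sample)
print(conf["Index.maxqueuecount"])  # → 5000
```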
Global.Config
http://noc.wikimedia.org/conf/lsearch-global-2.1.conf
# Logical structure, maps different roles to certain db
[Database]
{file:///home/wikipedia/common/pmtpa.dblist} : (single,true,20,1000) (prefix) (spell,10,3)
enwiki : (nssplit,2)
enwiki : (nspart1,[0],true,20,500,2)
enwiki : (nspart2,[],true,20,500)
enwiki : (spell,40,10) (warmup,500)
mediawikiwiki, metawiki, commonswiki, strategywiki : (language,en)
commonswiki : (nssplit,2) (nspart1,[6]) (nspart2,[])
dewiki, frwiki : (spell,20,5)
dewiki, frwiki, itwiki, ptwiki, jawiki, plwiki, nlwiki, ruwiki, svwiki, zhwiki : (nssplit,2) (nspart1,[0,2,4,12,14]) (nspart2,[]) (warmup,0)

[Database-Group]
<all> : (titles_by_suffix,2) (tspart1,[ wiki|w ]) (tspart2,[ wiktionary|wikt, wikibooks|b, wikinews|n, wikiquote|q, wikisource|s, wikiversity|v])
sv-titles: (titles_by_suffix,2) (tspart1,[ svwiki|w ]) (tspart2,[ svwiktionary|wikt, svwikibooks|b, svwikinews|n, svwikiquote|q, svwikisource|src])
mw-titles: (titles_by_suffix,1) (tspart1, [ mediawikiwiki|mw, metawiki|meta ])

# Search hosts layout
[Search-Group]
# search 1 (enwiki)
search1: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search2: enwiki.nspart1.sub1.hl enwiki.spell #enwiki.nspart1.sub2.hl
search3: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search4: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search5: enwiki.nspart1.sub2.hl enwiki.spell #enwiki.nspart1.sub1.hl
search8: enwiki.prefix #enwiki.spell
search9: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search12: enwiki.spell
search13: enwiki.nspart2*
# disable en-titles using a non-existent hostname ending in "x"
search13x: en-titles*
search14: enwiki.nspart1.sub1.hl
search19: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
search20: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
# search 2 (de,fr,jawiki)
search6: dewiki.nspart1 dewiki.nspart2 frwiki.nspart1 frwiki.nspart2 jawiki.nspart1 jawiki.nspart2
search6: itwiki.nspart1.hl
search15: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
search16: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
search17: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
# search 3 (it,nl,ru,sv,pl,pt,es,zhwiki)
search14: eswiki
#search20: eswiki
search7: itwiki.nspart1 ruwiki.nspart1 nlwiki.nspart1 svwiki.nspart1 plwiki.nspart1 ptwiki.nspart1 zhwiki.nspart1
#search7: itwiki.nspart1 itwiki.nspart2 nlwiki.nspart1 nlwiki.nspart2 ruwiki.nspart1 ruwiki.nspart2 svwiki.nspart1
#search9: svwiki.nspart2 plwiki.nspart1 plwiki.nspart2 ptwiki.nspart1 ptwiki.nspart2 zhwiki.nspart1 zhwiki.nspart2
search15: itwiki.nspart2 nlwiki.nspart2 ruwiki.nspart2 svwiki.nspart2 plwiki.nspart2 ptwiki.nspart2 zhwiki.nspart2
search15: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
#search15: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
search15: ptwiki.nspart1.hl ptwiki.nspart2.hl
search16: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
search16: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
search16: ptwiki.nspart1.hl ptwiki.nspart2.hl
search17: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
search17: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
search17: ptwiki.nspart1.hl ptwiki.nspart2.hl
# search 2-3 interwiki/spellchecks
# disable titles by using a non-existent hostname ending in "x"
search10x: de-titles* ja-titles* it-titles* nl-titles* ru-titles* fr-titles*
search10x: sv-titles* pl-titles* pt-titles* es-titles* zh-titles*
search10: dewiki.spell frwiki.spell itwiki.spell nlwiki.spell ruwiki.spell
search10: svwiki.spell plwiki.spell ptwiki.spell eswiki.spell
# search 4
# disable spell/hl by using a non-existent hostname ending in "x"
search11x: commonswiki.spell commonswiki.nspart1.hl commonswiki.nspart1 commonswiki.nspart2.hl commonswiki.nspart2
search11: commonswiki.nspart1 commonswiki.nspart1.hl commonswiki.nspart2.hl
search11: commonswiki.nspart2
search11: *?
# disable tspart by using a non-existent hostname ending in "x"
search11x: *tspart1 *tspart2
search19: (?!(enwiki.|dewiki.|frwiki.|itwiki.|nlwiki.|ruwiki.|svwiki.|plwiki.|eswiki.|ptwiki.))*.spell
search12: (?!(enwiki.|dewiki.|frwiki.|itwiki.|nlwiki.|ruwiki.|svwiki.|plwiki.|eswiki.|ptwiki.|jawiki.|zhwiki.))*.hl
# prefix stuffs
search18: *.prefix
# stuffs to deploy in future
searchNone: *.related jawiki.nspart1.hl jawiki.nspart2.hl zhwiki.nspart1.hl zhwiki.nspart2.hl
searchNone: enwiki.spell enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl

# Indexers
[Index]
searchidx2: *

# Rsync path where indexes are on hosts, after default value put
# hosts where the location differs
# Syntax: host : <path>
[Index-Path]
<default> : /search

[OAI]
simplewiki : http://simple.wikipedia.org/w/index.php
rswikimedia : http://rs.wikimedia.org/w/index.php
ilwikimedia : http://il.wikimedia.org/w/index.php
nzwikimedia : http://nz.wikimedia.org/w/index.php
sewikimedia : http://se.wikimedia.org/w/index.php
alswiki : http://als.wikipedia.org/w/index.php
alswikibooks : http://als.wikibooks.org/w/index.php
alswikiquote : http://als.wikibooks.org/w/index.php
alswiktionary : http://als.wiktionary.org/w/index.php
chwikimedia : http://www.wikimedia.ch/w/index.php
crhwiki : http://chr.wikipedia.org/w/index.php
roa_rupwiki : http://roa-rup.wikipedia.org/w/index.php
roa_rupwiktionary : http://roa-rup.wiktionary.org/w/index.php
be_x_oldwiki : http://be-x-old.wikipedia.org/w/index.php
ukwikimedia : http://uk.wikimedia.org/w/index.php
brwikimedia : http://br.wikimedia.org/w/index.php
dkwikimedia : http://dk.wikimedia.org/w/index.php
trwikimedia : http://tr.wikimedia.org/w/index.php
arwikimedia : http://ar.wikimedia.org/w/index.php
mxwikimedia : http://mx.wikimedia.org/w/index.php
commonswiki: http://commons.wikimedia.org/w/index.php

[Namespace-Boost]
commonswiki : (0, 1) (6, 4)
<default> : (0, 1) (1, 0.0005) (2, 0.005) (3, 0.001) (4, 0.01), (6, 0.02), (8, 0.005), (10, 0.0005), (12, 0.01), (14, 0.02)

# Global properies
[Properties]
# suffixes to database name, the rest is assumed to be language code
Database.suffix=wiki wiktionary wikiquote wikibooks wikisource wikinews wikiversity wikimedia

# Allow only up to 500 results per page
Search.maxlimit=501

# Age scaling based on last edit, default is no scaling
# Below are suffixes (or whole names) with various scaling strength
AgeScaling.strong=wikinews
AgeScaling.medium=mediawikiwiki metawiki
#AgeScaling.weak=wiki

# Use additional per-article ranking data, more suitable for non-encyclopedias
AdditionalRank.suffix=mediawikiwiki metawiki

# suffix for databases that should also have exact-case index built
# note: this will also turn off stemming!
ExactCase.suffix=wiktionary jbowiki

# wmf-style init file, attempt to read OAI and lang info from it
# for sample see http://noc.wikimedia.org/conf/InitialiseSettings.php.html
#WMF.InitialiseSettings=file:///home/wikipedia/common/php-1.5/InitialiseSettings.php
#WMF.InitialiseSettings=file:///home/wikipedia/common/wmf-deployment/wmf-config/InitialiseSettings.php
WMF.InitialiseSettings=file:///home/wikipedia/common/wmf-config/InitialiseSettings.php

# Where common images are
Commons.wiki=commonswiki.nspart1

# Syntax: <prefix_name> : <coma separated list of namespaces>
# <all> is a special keyword meaning all namespaces
# E.g. all_talk : 1,3,5,7,9,11,13,15
[Namespace-Prefix]
all : <all>
[0] : 0
[1] : 1
[2] : 2
[3] : 3
[4] : 4
[5] : 5
[6] : 6
[7] : 7
[8] : 8
[9] : 9
[10] : 10
[11] : 11
[12] : 12
[13] : 13
[14] : 14
[15] : 15
[100] : 100
[101] : 101
[104] : 104
[105] : 105
[106] : 106
[0,6,12,14,100,106]: 0,6,12,14,100,106
[0,100,104] : 0,100,104
[0,2,4,12,14] : 0,2,4,12,14
[0,14] : 0,14
[4,12] : 4,12
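The [Search-Group] section maps each search host to the index parts it serves; a host can appear on several lines, '#' starts an inline comment, and entries are disabled by pointing them at a non-existent hostname ending in "x". A hedged Python sketch of collecting that mapping (the "ends in x means disabled" skip is a simplification of the convention described in the config comments, not part of the real parser):

```python
from collections import defaultdict

def parse_search_group(lines):
    """Collect host -> index-parts from [Search-Group]-style lines.

    Hosts may repeat across lines; inline '#' starts a comment.
    Entries on hostnames ending in 'x' are treated as disabled,
    following the trick noted in the config comments.
    """
    hosts = defaultdict(list)
    for line in lines:
        line = line.split("#", 1)[0].strip()  # strip inline comments
        if not line or ":" not in line:
            continue
        host, _, parts = line.partition(":")
        host = host.strip()
        if host.endswith("x"):  # disabled via fake hostname
            continue
        hosts[host].extend(parts.split())
    return dict(hosts)

sample = [
    "search1: enwiki.nspart1.sub1 enwiki.nspart1.sub2",
    "search13x: en-titles*   # disabled",
    "search12: enwiki.spell",
]
print(parse_search_group(sample))
```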
todo
- what we would need for the indexer to work for the long term is a script that makes a file like file:///home/wikipedia/common/pmtpa.dblist with the list of all the wikis that should be indexed, in the same format
- push search resources into an artifactory repository
- LocalSettings.php
- dump.bz2
repository layout
in /org/wikimedia/labs/
- bastion
- search
- search-test
- deployment-prep
- deployment-sql
- deployment-squid
- deployment-dbdump
- deployment-nfs-memc
- deployment-web
- deployment-indexer
- deployment-searcher
murder
https://github.com/lg/murder/blob/master/README.md
OAIRepository testing and what it does
here is a transcript of brion on OAIRepository
pop over to say https://en.wikipedia.org/wiki/Special:OAIRepository
at the HTTP auth prompt use user 'testing', pass 'mctest'. (Is this a public login? I think we should suppress it from ep - if it were public, why would there be a login?)
see http://www.openarchives.org/OAI/openarchivesprotocol.html for general protocol documentation. To install locally... in theory:
make sure you've got OAI extension dir in place
and do the usual require "$IP/extensions/OAI/OAIRepo.php";
you'll only need the repository half
run maintenance/update.php to make sure it installs its tables...
which i think should work
as pages get edited/created/deleted, it'll internally record things into its table, and those records can be read out through the Special:OAIRepository interface. iirc it records the page id (?), possibly a rev id, and a created/edited/deleted state flag; the interface then slurps out current page content at request time. it's meant to give you current versions of stuff, rather than to show you every individual change (e.g. potentially multiple changes since your last query will be "rolled up" into one, and you just download the entire page text as of the last change); or if it's deleted, you get a marker indicating the page was deleted. it's relatively straightforward, but doesn't always map to what folks want :) for search index updates it's good enough... as long as you're working with source, all you probably need is the 'ListRecords' verb.
here is a URL to test OAI for search: https://en.wikipedia.org/wiki/Special:OAIRepository?verb=ListRecords&metadataPrefix=lsearch
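The ListRecords request above is plain OAI-PMH query-string construction; a small Python sketch of building it (the verb and metadataPrefix come from the transcript's test URL; per the OAI-PMH protocol, follow-up pages use the resumptionToken from the previous response - actually fetching would also need the HTTP auth credentials above):

```python
from urllib.parse import urlencode

def oai_list_records_url(base, metadata_prefix, resumption_token=None):
    """Build a Special:OAIRepository ListRecords URL.

    The first request carries metadataPrefix; subsequent pages carry
    only the resumptionToken, as OAI-PMH specifies.
    """
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base + "?" + urlencode(params)

base = "https://en.wikipedia.org/wiki/Special:OAIRepository"
print(oai_list_records_url(base, "lsearch"))
```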