User:Razzi
Learning the Wikimedia stack! Trying to be the most versatile SRE at the Wikimedia Foundation! For now I'm learning data engineering tools: Kafka, Spark, Scala, MySQL. See the grand plan here
Tutorials
How to install java 8 on debian (TODO)
Documentation
18 May 2022
- 18:49 (+33, m) Data Platform/Systems/Superset →Deploy to production
17 May 2022
- 20:47 (+134, m) Ganeti →Start the VM
16 May 2022
- 13:34 (+59) Data Platform/Systems/Superset →Test a different role's permissions
12 May 2022
- 20:26 (+22, m) Data Platform/Systems/Superset →Test a different role's permissions
- 20:24 (+1,867) Data Platform/Systems/Superset →Administration
- 15:25 (−6, m) Server Lifecycle →Manual installation: Fix some typos
- 14:42 (+82) User:Razzi →Random notes
Article list
- 2021-04-20
- 2021-05-5
- 2021-06-09
- 2021-06-1
- 2021-06-10
- 2021-06-14
- 2021-06-30
- 2021-07-01
- 2021-07-30
- 2021-08-02
- 2021-09-09
- 2021-09-24
- 2022-02-07
- 2022-03-28
- A week with the search team
- Analytics notes
- Debugging eventlogging to druid network flows internal hourly.service
- Developing cookbook locally
- Experiment: use puppet notice to show variable
- First logical volume resizing
- First pass at understanding T300164 varnishkafka alerts
- Ganeti error: Connection to console of instance datahubsearch1002.eqiad.wmnet failed
- How to depool / pool a host from etcd
- How to run systemd unit of another user
- How to show mysql host from sql query
- How to view pooled services for lvs
- Installing puppet on mac
- Learning about partitions for flerovium/furud
- Looking into The following units failed: wmf auto restart prometheus-mysqld-exporter@matomo.service
- NameNode vs DataNode
- Notes on clouddb views
- Plan to drain hadoop cluster
- Presto query logging: https://phabricator.wikimedia.org/T269832
- Puppetboard
- Set up haproxy on mediawiki-vagrant
- Setting up kerberos locally
- Spicerack python api repl
- Superset 1.3.1 upgrade recap
- T279304
- T280132 disk swap
- Triage Superset Dashboard Timeouts - T294768
- What is conftool
- alertname: Icinga/Check correctness of the icinga configuration
- an-master reimaging
- code search
- common.js
- deployment train 5-18
- firewall audit
- fm/CFSSL
- fm/SCSI
- grand SRE IC plan
- https://phabricator.wikimedia.org/T298505
- learning storage on vagrant
- logs
- new plan for reimaging an-masters
- rebase off of origin in one command
- reimage of db1125
- scratch
- snippets
- ssh config
- ssh single letter domain shortcut
- superset 1.3.1 errors
- vector.css
- working with apache atlas in docker
Questions
What are the "unauthenticated user" entries seen in show processlist, as in:
razzi@clouddb1014:~$ sudo mysql -S /var/run/mysqld/mysqld.s7.sock -e 'show processlist;'
How does refine use salts? https://gerrit.wikimedia.org/r/c/operations/puppet/+/679939
Is /system a default directory for hadoop, or can we remove it?
Is there a place that lists the vlans?
How to check vlan for a host?
Q: Is it expected that when reimaging a host, we see the old name when running homer?
[edit interfaces interface-range disabled]
-    member ge-1/0/13;
[edit interfaces interface-range vlan-analytics1-d-eqiad]
+    member ge-1/0/13;
     member ge-1/0/43 { ... }
[edit interfaces]
+    ge-1/0/13 {
+        description "db1125 {#2221}";
+    }
^ this is while decommissioning db1125
A: No, I skipped some netbox steps; when I fixed them this didn't show up
Q: How to submit a test job to the yarn queue to test if it is accepting jobs?
Q: What to do about this warning on analytics1068?
May 06 21:03:35 analytics1068 systemd[1]: /run/systemd/generator.late/hadoop-yarn-nodemanager.service:18: PIDFile= references path below legacy directory /var/run/, updating /var/run/hadoop-yarn/yarn-yarn-nodemanager.pid → /run/hadoop-yarn/yarn-yarn-nodemanager.pid; please update the unit file accordingly.
Q: For Server Lifecycle#Rename while reimaging, when should the homer patch be merged?
A: The homer patch is for the firewall and has nothing to do with the reimaging process itself. Merge it after the reimage is complete.
Q: What is the order for creating puppet patches during the server lifecycle? Some states that might need to be avoided: having a site.pp entry for a node that is being decommissioned, and having a site.pp entry for a node that doesn't exist yet.
Ideas
Script to show what tickets are currently in progress
Add homer-public to codesearch
Change the boilerplate method
    def get_runner(self, args):
        """As specified by Spicerack API."""
        return UpdateWikireplicaViewsRunner(args, self.spicerack)
to cover the 99% case with a single class attribute on UpdateWikireplicaViews:
    runner = UpdateWikireplicaViewsRunner
with a default get_runner provided once, so each cookbook doesn't have to define it.
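A minimal sketch of the idea. The base-class name and constructors here are made up for illustration; only `UpdateWikireplicaViewsRunner` comes from the snippet above, and the real Spicerack API may differ:

```python
class CookbookBase:
    """Hypothetical shared base class providing a default get_runner."""

    runner = None  # subclasses set this to their Runner class

    def __init__(self, spicerack):
        self.spicerack = spicerack

    def get_runner(self, args):
        """Default implementation covering the common case."""
        return self.runner(args, self.spicerack)


class UpdateWikireplicaViewsRunner:
    """Stand-in runner; the real one does the actual work."""

    def __init__(self, args, spicerack):
        self.args = args
        self.spicerack = spicerack


class UpdateWikireplicaViews(CookbookBase):
    # the 99% case: just point at the runner class
    runner = UpdateWikireplicaViewsRunner
```

With this pattern, a cookbook that needs custom construction can still override get_runner, while the common case is one line.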
Random notes
sudo lsof -Xd DEL
- lists the files that have been deleted but are still held open by a running process
Puppet
Get mysql hostname: show variables where Variable_name='hostname';
Why does sshing into mgmt not accept the password?
Because you forgot the `root@` part!
Instead of `ssh dbstore1007.mgmt.e`
do `ssh root@dbstore1007.mgmt.e`
Or make ssh use the root user in your ~/.ssh/config: https://stackoverflow.com/questions/10197559/ssh-configuration-override-the-default-username
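The override can be sketched in ~/.ssh/config; the Host pattern below is an assumption based on the mgmt hostname style above, so adjust it to the real domains:

```
# assumed pattern for management interfaces; ssh as root by default
Host *.mgmt.*
    User root
```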
refactor this to run automatically
Why no homer diff?
TBD
how to check what vlan a host belongs to?
???
Proposal: stop using conda for infrastructure
Why not use standard pip?
How to apply hadoop config changes?
For example https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194/1/hieradata/common.yaml
linux-host-entries.ttyS0-115200 versus linux-host-entries.ttyS1-115200
a mystery
sudo gnt-instance console an-airflow1002.eqiad.wmnet is stuck, is this normal?
Gotta stop and start, the old reboot trick
sudo gnt-instance stop an-airflow1003.eqiad.wmnet
how to restart services on hadoop coordinator?
for https://phabricator.wikimedia.org/T283067
Want to restart services for an-test-coord1001 and an-coord*
But how to do this safely?
for all things that you need to restart, it is good to make a mental list of services to restart and what impact they have
on an-coord1001 there are: 1) oozie 2) presto coordinator 3) hive server 4) hive metastore, and that's it IIRC
- oozie can be restarted anytime, no issue on that front (all the state is on the db) and we don't really have clients contacting it
- the presto coordinator can be restarted anytime; it is quick but it may impact ongoing queries (if any, say from superset)
- the hive server/coordinator is a bit more complicated: they are quick to restart, but any client that is using them can be impacted (all oozie jobs, timers, etc.), so the safe way is to temporarily stop timers on the launcher, wait for RUNNING jobs to be as few as possible, and then restart server and metastore
we have analytics-hive.eqiad.wmnet that can be used in theory, but when you fail over from say an-coord1001 to 1002 the target service is only the hive server, not the metastore
ah wait, I am saying something silly: per https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Coordinator, on an-coord1002 we have both server and metastore, basically a mirror of 1001
what I was misremembering is that the servers use the "local" metastore, but the metastores use a specific database (in this case, the one on an-coord1001); this is to avoid having a split-brain view, since we cannot use the db replica on 1002 for the metastore, as it doesn't update the master when changed
so for hive, just change the DNS of analytics-hive.eqiad.wmnet to 1002, then wait for the TTL to expire and you can freely restart daemons on 1001
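The advice above can be condensed into a hedged checklist. The unit and timer names below are guesses for illustration (the real names live in puppet), so treat this as a sketch, not a runbook:

```
# 1. on the launcher host, pause the timers that submit jobs (timer glob is hypothetical)
sudo systemctl stop 'analytics-refine-*.timer'

# 2. watch yarn until RUNNING jobs are as few as possible, then on an-coord1001
#    (unit names assumed):
sudo systemctl restart hive-metastore hive-server2

# 3. oozie and the presto coordinator can go anytime (unit names assumed):
sudo systemctl restart oozie
sudo systemctl restart presto-server
```

For a failover instead of an in-place restart, point analytics-hive.eqiad.wmnet at an-coord1002 first and wait out the DNS TTL before touching daemons on 1001.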
Set boot order to disk - "upstream is aware" - any issue to track?
Can we delete the hadoop-analytics grafana section now?
https://grafana.wikimedia.org/d/000000258/analytics-hadoop?orgId=1
DONE
Puppet failure on deployment-logstash03.deployment-prep.eqiad.wmflabs
make it stop!!!
Still happening!!!!!!!
https://phabricator.wikimedia.org/T286567#7214310
"those alerts go to all admins of the deployment-prep cloud vps project, production root@ has nothing to do with them"
what to do about this /mnt/hdfs issue
razzi@an-test-coord1001:~$ sudo lsof -Xd DEL | wc
lsof: WARNING: can't stat() fuse.fuse_dfs file system /mnt/hdfs
      Output information may be incomplete.
     81     649    8819
a lotta deleted files still open on an-test-coord1001...?
razzi@an-test-coord1001:~$ sudo lsof -Xd DEL
lsof: WARNING: can't stat() fuse.fuse_dfs file system /mnt/hdfs
      Output information may be incomplete.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
systemd 1 root DEL REG 253,0 1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
fuse_dfs 813 root DEL REG 253,0 2107668 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libsunec.so
fuse_dfs 813 root DEL REG 253,0 2107706 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jce.jar
fuse_dfs 813 root DEL REG 253,0 2107712 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jsse.jar
fuse_dfs 813 root DEL REG 253,0 2107651 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libjaas_unix.so
fuse_dfs 813 root DEL REG 253,0 2107661 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libmanagement.so
fuse_dfs 813 root DEL REG 253,0 2107663 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libnet.so
fuse_dfs 813 root DEL REG 253,0 2107664 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libnio.so
fuse_dfs 813 root DEL REG 253,0 2107690 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/nashorn.jar
fuse_dfs 813 root DEL REG 253,0 2107685 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/cldrdata.jar
fuse_dfs 813 root DEL REG 253,0 2107692 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/sunjce_provider.jar
fuse_dfs 813 root DEL REG 253,0 2107694 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/zipfs.jar
fuse_dfs 813 root DEL REG 253,0 2107693 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/sunpkcs11.jar
fuse_dfs 813 root DEL REG 253,0 2107689 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/localedata.jar
fuse_dfs 813 root DEL REG 253,0 2107686 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/dnsns.jar
fuse_dfs 813 root DEL REG 253,0 2107687 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/icedtea-sound.jar
fuse_dfs 813 root DEL REG 253,0 2107711 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jfr.jar
fuse_dfs 813 root DEL REG 253,0 2107718 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar
fuse_dfs 813 root DEL REG 253,0 2107688 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/jaccess.jar
fuse_dfs 813 root DEL REG 253,0 2107671 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libzip.so
fuse_dfs 813 root DEL REG 253,0 2107652 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libjava.so
fuse_dfs 813 root DEL REG 253,0 2107670 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libverify.so
fuse_dfs 813 root DEL REG 253,0 2107674 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
fuse_dfs 813 root DEL REG 253,0 2107691 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/sunec.jar
fuse_dfs 813 root DEL REG 253,0 1971364 /tmp/hsperfdata_root/813
systemd-l 1057 root DEL REG 253,0 1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
dbus-daem 1069 messagebus DEL REG 253,0 1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
lldpd 1070 _lldpd DEL REG 253,0 1837030 /usr/lib/x86_64-linux-gnu/libhogweed.so.4.5
lldpd 1070 _lldpd DEL REG 253,0 1837035 /usr/lib/x86_64-linux-gnu/libnettle.so.6.5
lldpd 1077 _lldpd DEL REG 253,0 1837030 /usr/lib/x86_64-linux-gnu/libhogweed.so.4.5
lldpd 1077 _lldpd DEL REG 253,0 1837035 /usr/lib/x86_64-linux-gnu/libnettle.so.6.5
mysqld 6126 mysql DEL REG 0,17 277921 /[aio]
mysqld 6126 mysql DEL REG 0,17 277920 /[aio]
mysqld 6126 mysql DEL REG 0,17 277919 /[aio]
mysqld 6126 mysql DEL REG 0,17 277918 /[aio]
mysqld 6126 mysql DEL REG 0,17 277917 /[aio]
mysqld 6126 mysql DEL REG 0,17 277916 /[aio]
mysqld 6126 mysql DEL REG 0,17 277915 /[aio]
mysqld 6126 mysql DEL REG 0,17 277914 /[aio]
mysqld 6126 mysql DEL REG 0,17 277913 /[aio]
mysqld 6126 mysql DEL REG 253,0 1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
mysqld 6126 mysql DEL REG 0,17 277912 /[aio]
mysqld 6126 mysql DEL REG 0,17 277911 /[aio]
systemd 9708 root DEL REG 253,0 1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
(sd-pam) 9709 root DEL REG 253,0 1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
systemd 10294 oozie DEL REG 253,0 1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
(sd-pam) 10296 oozie DEL REG 253,0 1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
rsyslogd 15874 root DEL REG 253,0 1837030 /usr/lib/x86_64-linux-gnu/libhogweed.so.4.5
rsyslogd 15874 root DEL REG 253,0 1837035 /usr/lib/x86_64-linux-gnu/libnettle.so.6.5
(sd-pam) 29881 razzi DEL REG 253,0 1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
kafkatee 33887 kafkatee DEL REG 253,0 1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
airflow 38223 analytics DEL REG 253,0 1837030 /usr/lib/x86_64-linux-gnu/libhogweed.so.4.5
airflow 38223 analytics DEL REG 253,0 1837035 /usr/lib/x86_64-linux-gnu/libnettle.so.6.5
Ok, unmounting and remounting fixed the jars; the warning is still there
sudo umount /mnt/hdfs
sudo mount -a
homework for you - think about a single cumin command to run the umount/mount commands above on the hosts mounting /mnt/hdfs
Is this going to involve querying for nodes with fuse?
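One hedged attempt at that cumin one-liner, assuming the /mnt/hdfs mount is managed by a puppet class we can query (the class name below is a guess, not checked against the puppet repo):

```
sudo cumin 'P{C:cdh::hadoop::mount}' 'umount /mnt/hdfs && mount -a'
```

This uses the PuppetDB backend's class query, the same P{...} grammar as the druid aliases in modules/profile/templates/cumin/aliases.yaml.erb.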
how to make ssh an-coord1001 work without .e?
tbd
what is the deal with product data engineering? We have product analytics and we are data engineering
??
druid administration improvements
add prometheus restarts to cookbook
razzi: there is an extra caveat that I have never had the time to fix, namely the fact that after restarting the clusters a roll restart of the prometheus-druid-exporter services (one for each node) is needed
10:44 https://github.com/wikimedia/operations-software-druid_exporter#known-limitations
10:44 so the way that we collect metrics is that we force druid to push (via HTTP POST) metrics to a localhost daemon that exposes them as a prometheus exporter
10:45 when a roll restart happens, the overlord and coordinator leaders will likely change
10:45 so the past leaders stop emitting metrics, and due to how prometheus works they keep pushing the last value of their metrics (and not zero or null etc.)
10:45 I believe that some fix to the exporter may resolve this
10:46 but it has been in my backlog for a long time :)
Add druid cluster docs to wikitech
razzi: analytics is the one used by turnilo, superset etc.
10:32 public is the one that gets only one dataset loaded, the mw history snapshot, and it is called by the AQS api
10:32 (and it is also not in the analytics VLAN, and has a load balancer in front of it)
10:32 the cookbook distinguishes between the two especially for the pool/depool actions
10:33 and yes we need to restart both :)
10:33 also zookeeper on both clusters, that can be done running the zookeeper cookbook (there are options for both druid clusters)
Stuff to look into
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Medium-term_plan_2019/Annual_Plan_2021-2022
How to enable yubikey
Run the command here: https://wikitech.wikimedia.org/wiki/CAS-SSO#Enabling_U2F_for_an_account
sudo modify-mfa --enable UID
Then go to idp.wikimedia.org/ and logout
Then when you log in, it'll prompt and your yubikey will blink; touch it and you're good!
More broccoli (yubikey ssh)
https://wikitech.wikimedia.org/wiki/Yubikey-SSH
an-druid* versus druid*
Which one is "druid public"?
Look at modules/profile/templates/cumin/aliases.yaml.erb and your questions will be answered...
druid-analytics: P{O:druid::analytics::worker}
druid-public: P{O:druid::public::worker}
# Class: role::druid::public::worker
# Sets up the Druid public cluster for use with AQS and wikistats 2.0.
druid::analytics::worker
# Class: role::druid::analytics::worker
# Sets up the Druid analytics cluster for internal use.
# This cluster may contain data not suitable for
# use in public APIs.
Sweet
staff meeting to watch
https://www.youtube.com/watch?v=u8nMYjKg9Yg
unfortunate to have it on YouTube: the account switcher is disabled if you're not logged in properly, and there are distracting "related videos"...
Aug 02 00:15:14 an-launcher1002 java[14538]: unable to create directory '/nonexistent/.cache/dconf': Permission denied. dconf will not work properly.
stop this log spam!
email failing to send
razzi@puppetmaster1001:/srv/private$ ack bsitzmann modules/secret/secrets/nagios/contacts.cfg
1115:    email    bsitzmann@wikimedia.org
Should clean this up
release team does training
look into this
could use an article for analytics vlan
todo
Check if a new kernel is installed
ls -l /
razzi@aqs1004:~$ uname -r
4.9.0-13-amd64
razzi@aqs1004:~$ uname -a
Linux aqs1004 4.9.0-13-amd64 #1 SMP Debian 4.9.228-1 (2020-07-05) x86_64 GNU/Linux
razzi@aqs1004:~$ ls -l /
Search all gzipped logs
zgrep linux- /var/log/dpkg.log.*
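A throwaway demo of the zgrep pattern above; the file path and contents are made up for the demo:

```shell
# create a fake gzipped dpkg log and search it, mirroring the command above
printf 'status installed linux-image-4.9-amd64\nstatus installed vim\n' > /tmp/dpkg.log.1
gzip -f /tmp/dpkg.log.1
zgrep linux- /tmp/dpkg.log.1.gz
```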
show http request and response headers
With httpie, `-p Hh` selects request headers (H) and response headers (h):
http get https://dumps.wikimedia.org/other/geoeditors/readme.html -p Hh
show info about a systemctl service
razzi@labstore1006:~$ systemctl status analytics-dumps-fetch-geoeditors_dumps.service
● analytics-dumps-fetch-geoeditors_dumps.service - Copy geoeditors_dumps files from Hadoop HDFS.
   Loaded: loaded (/lib/systemd/system/analytics-dumps-fetch-geoeditors_dumps.service; static; vendor preset: enabled)
   Active: inactive (dead) since Thu 2021-08-19 17:53:32 UTC; 36s ago
  Process: 16187 ExecStart=/usr/local/bin/kerberos-run-command dumpsgen /usr/local/bin/rsync-analytics-geoeditors_dumps (code=exited, status=0/SUCCESS)
 Main PID: 16187 (code=exited, status=0/SUCCESS)
How to get docker to run without sudo
sudo adduser razzi docker
and log out
See something like: https://unix.stackexchange.com/questions/6387/i-added-a-user-to-a-group-but-group-permissions-on-files-still-have-no-effect/613608#613608
How to ensure a user actually signed L3?
Go to https://phabricator.wikimedia.org/legalpad/signatures/