
User:Razzi



Learning the Wikimedia stack! Trying to be the most versatile SRE at the Wikimedia Foundation! For now, learning data engineering tools: Kafka, Spark, Scala, MySQL. See the grand plan here






Tutorials

How to install Java 8 on Debian (TODO)

Documentation

18 May 2022

17 May 2022

16 May 2022

12 May 2022


Article list

Questions

What are the "unauthenticated user" entries seen in show processlist, for example in:

razzi@clouddb1014:~$ sudo mysql -S /var/run/mysqld/mysqld.s7.sock  -e 'show processlist;'
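
(Presumably these are client connections that have been opened but haven't completed the MySQL handshake/authentication yet, e.g. health checks or proxy connections; that's a guess, not confirmed. A hedged way to look at just those rows:)

razzi@clouddb1014:~$ sudo mysql -S /var/run/mysqld/mysqld.s7.sock -e "SELECT * FROM information_schema.processlist WHERE user = 'unauthenticated user'"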

How does refine use salts? https://gerrit.wikimedia.org/r/c/operations/puppet/+/679939

Is /system a default directory for Hadoop, or can we remove it?

Is there a place that lists the VLANs?

How to check the VLAN for a host?

Q: Is it expected that when reimaging a host, we see the old name when running homer?

[edit interfaces interface-range disabled]
-    member ge-1/0/13;
[edit interfaces interface-range vlan-analytics1-d-eqiad]
+    member ge-1/0/13;
     member ge-1/0/43 { ... }
[edit interfaces]
+   ge-1/0/13 {
+       description "db1125 {#2221}";
+   }

^ this is while decommissioning db1125

A: No; I had skipped some Netbox steps, and once I fixed them this didn't show up

Q: How to submit a test job to the YARN queue to check whether it is accepting jobs?
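
One hedged option: run the bundled MapReduce pi example (the jar path below is an assumption and varies by distribution; on a kerberized cluster, kinit first, and -D mapreduce.job.queuename=... can be added after "pi" to target a specific queue):

yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 2 10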

Q: What to do about this warning on analytics1068?

May 06 21:03:35 analytics1068 systemd[1]: /run/systemd/generator.late/hadoop-yarn-nodemanager.service:18: PIDFile= references path below legacy directory /var/run/, updating /var/run/hadoop-yarn/yarn-yarn-nodemanager.pid → /run/hadoop-yarn/yarn-yarn-nodemanager.pid; please update the unit file accordingly.

Q: Server Lifecycle#Rename while reimaging: when should the homer patch be merged?

A: The homer patch is for the firewall and isn't related to the reimaging process; merge it after the reimage is complete.

Q: What is the order for creating Puppet patches during the server lifecycle? Some things that might need to be avoided: having a site.pp entry for a node that is being decommissioned, or a site.pp entry for a node that doesn't exist yet.

Ideas

Script to show what tickets are currently in progress

Add homer-public to codesearch

Change this method:

    def get_runner(self, args):
        """As specified by Spicerack API."""
        return UpdateWikireplicaViewsRunner(args, self.spicerack)

to work for the 99% case:

runner = UpdateWikireplicaViewsRunner

by defining a generic get_runner once (e.g. on the shared cookbook base class) so that UpdateWikireplicaViews only needs to declare its runner; see the sketch below
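
A rough sketch of what that could look like (the shared base class here is hypothetical; CookbookBase, self.spicerack, and the runner constructor signature are taken from the snippet above and spicerack conventions):

    from spicerack.cookbook import CookbookBase

    class RunnerCookbookBase(CookbookBase):
        """Hypothetical shared base: subclasses only declare `runner`."""

        runner = None  # each cookbook sets this to its runner class

        def get_runner(self, args):
            """As specified by the Spicerack API, written once for all cookbooks."""
            return self.runner(args, self.spicerack)

    class UpdateWikireplicaViews(RunnerCookbookBase):
        runner = UpdateWikireplicaViewsRunner  # the existing runner class; the 99% case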

Random notes

sudo lsof -Xd DEL - lists the files that have been deleted but are still held open by a running process

Puppet

https://www.digitalocean.com/community/tutorials/getting-started-with-puppet-code-manifests-and-modules

Get mysql hostname: show variables where Variable_name='hostname';
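
An equivalent: select @@hostname;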

Why does sshing into mgmt not accept the password?

Because you forgot the `root@` part!

Instead of ssh dbstore1007.mgmt.e

do `ssh root@dbstore1007.mgmt.e`

Or make ssh use the root user in your ~/.ssh/config: https://stackoverflow.com/questions/10197559/ssh-configuration-override-the-default-username
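
For example, a stanza like this (the host patterns are an assumption; adjust them to the mgmt domains you actually use):

Host *.mgmt.eqiad.wmnet *.mgmt.codfw.wmnet
    User root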

refactor this to run automatically

https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS#Deploy_new_History_snapshot_for_Wikistats_Backend

Why no homer diff?

TBD

How to check what VLAN a host belongs to?

???

Proposal: stop using conda for infrastructure

Why not use standard pip?

How to apply hadoop config changes?

For example https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194/1/hieradata/common.yaml

linux-host-entries.ttyS0-115200 versus linux-host-entries.ttyS1-115200

a mystery

sudo gnt-instance console an-airflow1002.eqiad.wmnet is stuck, is this normal?

Gotta stop and start, the old reboot trick

sudo gnt-instance stop an-airflow1003.eqiad.wmnet

how to restart services on hadoop coordinator?

for https://phabricator.wikimedia.org/T283067

Want to restart services for an-test-coord1001 and an-coord*

But how to do this safely?

for all things that you need to restart, it is good to make a mental list of services to restart and what impact they have
on an-coord1001 there are
1) oozie
2) presto coordinator
3) hive server
4) hive metastore
and that's it IIRC
oozie can be restarted anytime, no issue on that front (all the state is on the db)
and we don't really have clients contacting it
the presto coordinator can be restarted anytime, it is quick but it may impact ongoing queries (if any, say from superset)
the hive server/coordinator is a bit more complicated
they are quick to restart, but any client that is using them can be impacted (all oozie jobs, timers, etc..)
so the safe way is to temporarily stop timers on launcher, wait for RUNNING jobs to be as few as possible and then restart server and metastore
we have the analytics-hive.eqiad.wmnet that can be used in theory, but when you failover from say an-coord1001 to 1002 the target service is only the hive server
not the metastore
ah wait I am saying something silly
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Coordinator
so on an-coord1002 we have both server and metastore
basically a mirror of 1001
what I was misremembering is that the servers use the "local" metastore, but the metastore uses a specific database (in this case, the one on an-coord1001)
this is to avoid having a split brain view, we cannot use the db replica on 1002 for the metastore since it doesn't update the master when changed
so
for hive, just change the DNS of analytics-hive.eqiad.wmnet
to 1002, then wait for the TTL to expire
and you can freely restart daemons on 1001

Set boot order to disk - "upstream is aware" - any issue to track?

Ganeti#Create a VM

Can we delete the hadoop-analytics grafana section now?

https://grafana.wikimedia.org/d/000000258/analytics-hadoop?orgId=1

DONE

Puppet failure on deployment-logstash03.deployment-prep.eqiad.wmflabs

make it stop!!!

Still happening!!!!!!!

https://phabricator.wikimedia.org/T286567#7214310

"those alerts go to all admins of the deployment-prep cloud vps project, production root@ has nothing to do with them"

what to do about this /mnt/hdfs issue

razzi@an-test-coord1001:~$ sudo lsof -Xd DEL | wc
lsof: WARNING: can't stat() fuse.fuse_dfs file system /mnt/hdfs
      Output information may be incomplete.
     81     649    8819

a lotta deleted files still open on an-test-coord1001...?

razzi@an-test-coord1001:~$ sudo lsof -Xd DEL
lsof: WARNING: can't stat() fuse.fuse_dfs file system /mnt/hdfs
      Output information may be incomplete.
COMMAND     PID       USER  FD   TYPE DEVICE SIZE/OFF    NODE NAME
systemd       1       root DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
fuse_dfs    813       root DEL    REG  253,0          2107668 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libsunec.so
fuse_dfs    813       root DEL    REG  253,0          2107706 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jce.jar
fuse_dfs    813       root DEL    REG  253,0          2107712 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jsse.jar
fuse_dfs    813       root DEL    REG  253,0          2107651 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libjaas_unix.so
fuse_dfs    813       root DEL    REG  253,0          2107661 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libmanagement.so
fuse_dfs    813       root DEL    REG  253,0          2107663 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libnet.so
fuse_dfs    813       root DEL    REG  253,0          2107664 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libnio.so
fuse_dfs    813       root DEL    REG  253,0          2107690 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/nashorn.jar
fuse_dfs    813       root DEL    REG  253,0          2107685 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/cldrdata.jar
fuse_dfs    813       root DEL    REG  253,0          2107692 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/sunjce_provider.jar
fuse_dfs    813       root DEL    REG  253,0          2107694 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/zipfs.jar
fuse_dfs    813       root DEL    REG  253,0          2107693 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/sunpkcs11.jar
fuse_dfs    813       root DEL    REG  253,0          2107689 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/localedata.jar
fuse_dfs    813       root DEL    REG  253,0          2107686 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/dnsns.jar
fuse_dfs    813       root DEL    REG  253,0          2107687 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/icedtea-sound.jar
fuse_dfs    813       root DEL    REG  253,0          2107711 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jfr.jar
fuse_dfs    813       root DEL    REG  253,0          2107718 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar
fuse_dfs    813       root DEL    REG  253,0          2107688 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/jaccess.jar
fuse_dfs    813       root DEL    REG  253,0          2107671 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libzip.so
fuse_dfs    813       root DEL    REG  253,0          2107652 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libjava.so
fuse_dfs    813       root DEL    REG  253,0          2107670 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libverify.so
fuse_dfs    813       root DEL    REG  253,0          2107674 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
fuse_dfs    813       root DEL    REG  253,0          2107691 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/sunec.jar
fuse_dfs    813       root DEL    REG  253,0          1971364 /tmp/hsperfdata_root/813
systemd-l  1057       root DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
dbus-daem  1069 messagebus DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
lldpd      1070     _lldpd DEL    REG  253,0          1837030 /usr/lib/x86_64-linux-gnu/libhogweed.so.4.5
lldpd      1070     _lldpd DEL    REG  253,0          1837035 /usr/lib/x86_64-linux-gnu/libnettle.so.6.5
lldpd      1077     _lldpd DEL    REG  253,0          1837030 /usr/lib/x86_64-linux-gnu/libhogweed.so.4.5
lldpd      1077     _lldpd DEL    REG  253,0          1837035 /usr/lib/x86_64-linux-gnu/libnettle.so.6.5
mysqld     6126      mysql DEL    REG   0,17           277921 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277920 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277919 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277918 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277917 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277916 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277915 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277914 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277913 /[aio]
mysqld     6126      mysql DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
mysqld     6126      mysql DEL    REG   0,17           277912 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277911 /[aio]
systemd    9708       root DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
(sd-pam)   9709       root DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
systemd   10294      oozie DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
(sd-pam)  10296      oozie DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
rsyslogd  15874       root DEL    REG  253,0          1837030 /usr/lib/x86_64-linux-gnu/libhogweed.so.4.5
rsyslogd  15874       root DEL    REG  253,0          1837035 /usr/lib/x86_64-linux-gnu/libnettle.so.6.5
(sd-pam)  29881      razzi DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
kafkatee  33887   kafkatee DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
airflow   38223  analytics DEL    REG  253,0          1837030 /usr/lib/x86_64-linux-gnu/libhogweed.so.4.5
airflow   38223  analytics DEL    REG  253,0          1837035 /usr/lib/x86_64-linux-gnu/libnettle.so.6.5

OK, unmounting and remounting fixed the deleted-but-still-open jars, but the warning is still there

sudo umount /mnt/hdfs
sudo mount -a

homework for you - think about a single cumin command to run the umount/mount commands above on the hosts mounting /mnt/hdfs

Is this going to involve querying for nodes with fuse?
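
Probably, yes; a hedged sketch, where the Puppet class in the query is an assumption (check which class actually manages the fuse mount, and test on one host first):

sudo cumin 'P{C:bigtop::hadoop::mount}' 'umount /mnt/hdfs && mount -a'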

how to make ssh an-coord1001 work without .e?

tbd

What is the deal with Product Data Engineering? We have Product Analytics and we are Data Engineering

??

druid administration improvements

add prometheus restarts to cookbook

razzi: there is an extra caveat that I have never had the time to fix, namely the fact that after restarting the clusters a roll restart of the prometheus-druid-exporter services (one for each node) is needed
https://github.com/wikimedia/operations-software-druid_exporter#known-limitations
so the way that we collect metrics is that we force druid to push (via HTTP POST) metrics to a localhost daemon that exposes them as a prometheus exporter
when a roll restart happens, the overlord and coordinator leaders will likely change
so the past leaders stop emitting metrics, and due to how prometheus works they keep pushing the last value of their metrics (and not zero or null etc..)
I believe that some fix to the exporter may resolve this
but it has been in my backlog for a long time :)

Add druid cluster docs to wikitech

razzi: analytics is the one used by turnilo, superset, etc.
public is the one that gets only one dataset loaded, the mw history snapshot, and it is called by the AQS api
(and it is also not in the analytics VLAN, and has a load balancer in front of it)
the cookbook distinguishes between the two, especially for the pool/depool actions
and yes, we need to restart both :)
also zookeeper on both clusters, which can be done by running the zookeeper cookbook (there are options for both druid clusters)

Stuff to look into

https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Medium-term_plan_2019/Annual_Plan_2021-2022

How to enable yubikey

Run the command here: https://wikitech.wikimedia.org/wiki/CAS-SSO#Enabling_U2F_for_an_account

sudo modify-mfa --enable UID

Then go to https://idp.wikimedia.org/ and log out.

Then when you log in again, it will prompt you and your YubiKey will blink; touch it and you're good!

More broccoli (yubikey ssh)

https://wikitech.wikimedia.org/wiki/Yubikey-SSH

an-druid* versus druid*

Which one is "druid public"?

Look at modules/profile/templates/cumin/aliases.yaml.erb and your questions will be answered...

druid-analytics: P{O:druid::analytics::worker}

druid-public: P{O:druid::public::worker}

# Class: role::druid::public::worker
# Sets up the Druid public cluster for use with AQS and wikistats 2.0.

druid::analytics::worker

# Class: role::druid::analytics::worker
# Sets up the Druid analytics cluster for internal use.
# This cluster may contain data not suitable for
# use in public APIs.

Sweet

staff meeting to watch

https://www.youtube.com/watch?v=u8nMYjKg9Yg

Unfortunate that it's on YouTube: the account switcher is disabled if you're not logged in properly, and there are distracting "related videos"...

Aug 02 00:15:14 an-launcher1002 java[14538]: unable to create directory '/nonexistent/.cache/dconf': Permission denied. dconf will not work properly.

stop this log spam!

email failing to send

razzi@puppetmaster1001:/srv/private$ ack bsitzmann
modules/secret/secrets/nagios/contacts.cfg
1115:        email                           bsitzmann@wikimedia.org

Should clean this up

release team does training

look into this

could use an article for analytics vlan

todo

Check if a new kernel is installed

ls -l /

razzi@aqs1004:~$ uname -r
4.9.0-13-amd64
razzi@aqs1004:~$ uname -a
Linux aqs1004 4.9.0-13-amd64 #1 SMP Debian 4.9.228-1 (2020-07-05) x86_64 GNU/Linux
razzi@aqs1004:~$ ls -l /
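
Another way to compare the running kernel against the installed kernel packages (plain Debian tooling):

dpkg -l 'linux-image-*' | grep '^ii'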

Search all gzipped logs

zgrep linux- /var/log/dpkg.log.*

Show HTTP request and response headers (httpie: -p Hh prints request headers (H) and response headers (h))

http get https://dumps.wikimedia.org/other/geoeditors/readme.html -p Hh

Show info about a systemd service

razzi@labstore1006:~$ systemctl status analytics-dumps-fetch-geoeditors_dumps.service
● analytics-dumps-fetch-geoeditors_dumps.service - Copy geoeditors_dumps files from Hadoop HDFS.
   Loaded: loaded (/lib/systemd/system/analytics-dumps-fetch-geoeditors_dumps.service; static; vendor preset: enabled)
   Active: inactive (dead) since Thu 2021-08-19 17:53:32 UTC; 36s ago
  Process: 16187 ExecStart=/usr/local/bin/kerberos-run-command dumpsgen /usr/local/bin/rsync-analytics-geoeditors_dumps (code=exited, status=0/SUCCESS)
 Main PID: 16187 (code=exited, status=0/SUCCESS)

How to get docker to run without sudo

sudo adduser razzi docker

and log out

See something like: https://unix.stackexchange.com/questions/6387/i-added-a-user-to-a-group-but-group-permissions-on-files-still-have-no-effect/613608#613608
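
(To pick up the new group without logging out, running newgrp docker also works, though it only applies to that shell.)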

How to ensure a user actually signed L3?

Go to https://phabricator.wikimedia.org/legalpad/signatures/