Juniper router upgrade
Preparation
- List the interesting new features on the upgrade task, based on https://apps.juniper.net/feature-explorer/
- Download the proper image to apt1001:/srv/junos/
- We now only use 64-bit vmhost images
- Choose the version based on the upgrade task and Juniper's recommended release
All the steps below should be done with:
cumin1001:~$ sudo cookbook sre.network.prepare-upgrade <image-filename>.tgz <router-fqdn>
- Make room for the image
request system storage cleanup
- If multi-RE, cleanup files on backup RE:
request system storage cleanup re1
- Save rescue config (just in case)
request system configuration rescue save
- Copy image
file copy "https://apt.wikimedia.org/junos/$filename.tgz" /var/tmp/ routing-instance mgmt_junos
- As a data point, this takes ~1h15 from eqiad to ulsfo
- Check checksum
file checksum md5 /var/tmp/$filename.tgz
- Compare with checksum on Juniper's website
- Validate new image against existing config
request vmhost software validate /var/tmp/$filename.tgz
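The checksum comparison above can be scripted. The following is a minimal sketch, assuming the router prints the digest in the form `MD5 (<path>) = <digest>`; the helper names `parse_md5` and `checksum_matches` are hypothetical and not part of any cookbook.

```python
import re

# Assumed output shape of `file checksum md5 <path>` on Junos:
#   MD5 (/var/tmp/image.tgz) = 0123456789abcdef0123456789abcdef
MD5_LINE = re.compile(r"MD5 \((?P<path>[^)]+)\) = (?P<digest>[0-9a-fA-F]{32})")


def parse_md5(cli_output: str) -> str:
    """Return the lowercase digest found in the CLI output."""
    match = MD5_LINE.search(cli_output)
    if match is None:
        raise ValueError("no MD5 line found in output")
    return match.group("digest").lower()


def checksum_matches(cli_output: str, published: str) -> bool:
    """Compare the router-side digest with the one published on Juniper's website."""
    return parse_md5(cli_output) == published.strip().lower()
```

Normalising case and whitespace avoids false mismatches when the published digest is pasted from a web page.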
Upgrade
- Check that the console port(s) are working
- Depool site (optional but recommended)
- The primary core DC can't easily be depooled (it requires a DC switchover). If a router upgrade is needed in an emergency, we have to do a "depool-free" upgrade or get in touch with Service Ops/Traffic.
- Drain traffic away from router
- Set transport links to drained (increase OSPF metrics)
- For each transport link terminating on the router being worked on, set its Netbox custom field state to drained, then run Homer on the router and on the remote-side router of the circuit
- Apply GRACEFUL_SHUTDOWN, then wait ~15 min (or check that the device is not receiving traffic on relevant links, such as from the L3 switches or the other routers) - T320230
set protocols bgp graceful-shutdown sender
- Disable the peers
set protocols bgp group Transit4 shutdown
set protocols bgp group Transit6 shutdown
set protocols bgp group IX4 shutdown
set protocols bgp group IX6 shutdown
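The BGP drain commands above, and their inverse for the cleanup phase, can be generated from one list of group names. This is a sketch only: the function names are hypothetical, and the rollback assumes that deleting the same statements is the intended way to undo the drain.

```python
# Group names are the ones listed in the steps above.
EXTERNAL_GROUPS = ("Transit4", "Transit6", "IX4", "IX6")


def bgp_drain_commands(groups=EXTERNAL_GROUPS):
    """Config-mode commands for the drain: graceful shutdown, then group shutdown."""
    cmds = ["set protocols bgp graceful-shutdown sender"]
    cmds += [f"set protocols bgp group {g} shutdown" for g in groups]
    return cmds


def bgp_rollback_commands(groups=EXTERNAL_GROUPS):
    """Assumed inverse: re-enable the groups first, remove graceful shutdown last.

    In practice the two parts are separate commits, with a wait for route
    convergence in between.
    """
    cmds = [f"delete protocols bgp group {g} shutdown" for g in groups]
    cmds.append("delete protocols bgp graceful-shutdown sender")
    return cmds
```

Keeping the group list in one place means the rollback cannot miss a group that was shut down during the drain.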
- Ensure router is not VRRP master (doesn't apply to codfw)
show vrrp summary
set groups vrrp interfaces <*> unit <*> family inet address <*> vrrp-group <*> priority 70
set groups vrrp interfaces <*> unit <*> family inet6 address <*> vrrp-inet6-group <*> priority 70
- Note: if specific priorities are set on individual VRRP groups, the priority must also be reduced on those groups.
- Downtime host in Icinga and Alertmanager
sudo cookbook sre.hosts.downtime -r 'router upgrade' -t XXX -H 2 --force 'cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt'
- This needs to match the Icinga "hosts"; cr3-ulsfo will match in Alertmanager as well.
- NOTE: For devices with multiple REs you will probably find the mgmt hosts in Icinga named like 're0.cr3-esams.mgmt'
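The host list passed to the downtime cookbook follows a pattern that can be built from the router name. This is a rough sketch with a hypothetical helper, `downtime_hosts`; the `re1.` entry for multi-RE devices is an assumption extrapolated from the `re0.` example above, so check the actual Icinga host names before relying on it.

```python
def downtime_hosts(router: str, multi_re: bool = False) -> str:
    """Build the comma-separated Icinga host list for the downtime cookbook."""
    hosts = [router, f"{router} IPv6"]
    if multi_re:
        # Assumption: one mgmt host per routing engine, named re0./re1.
        hosts += [f"re0.{router}.mgmt", f"re1.{router}.mgmt"]
    else:
        hosts.append(f"{router}.mgmt")
    return ",".join(hosts)
```

For example, `downtime_hosts("cr3-ulsfo")` yields the same string as the cookbook invocation shown above.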
- Double check device has been fully drained of traffic before proceeding:
- Check no traffic to LVS if site depooled: https://grafana-rw.wikimedia.org/d/000000343/load-balancers-lvs
- Check Cloudflare DDoS tunnels are disabled for site:
sudo cookbook sre.network.cf status all
- Check LibreNMS graphs for router in question: https://librenms.wikimedia.org/devices/type=network
- Check that no other CR router prefers routes to private* subnets via the router being upgraded
If Multi RE:
- Remove graceful-switchover
deactivate chassis redundancy graceful-switchover
request system configuration rescue save
(to ensure graceful-switchover is not in the rescue config)
- Install image on backup RE
request vmhost software add /var/tmp/$filename.tgz re1
- Reboot RE1
request vmhost reboot re1
- Once back up (show chassis routing-engine), perform RE switchover (impactful)
request chassis routing-engine master switch
- Once done, repeat the previous 3 steps for re0
- Rollback "Remove graceful-switchover"
If single RE:
- Install image on RE
request vmhost software add /var/tmp/$filename.tgz
- Reboot router
request vmhost reboot
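The multi-RE and single-RE sequences above can be written out as an ordered command list, which helps when reviewing a plan before touching the router. This is a sketch under stated assumptions: `upgrade_commands` is a hypothetical helper, and the final `activate` statement is assumed to be the rollback of the `deactivate` step.

```python
from typing import List


def upgrade_commands(filename: str, multi_re: bool) -> List[str]:
    """Return the commands in the order the runbook applies them.

    The list mixes configuration-mode and operational-mode commands;
    it is meant for review, not for pasting in one go.
    """
    image = f"/var/tmp/{filename}"
    if not multi_re:
        return [f"request vmhost software add {image}", "request vmhost reboot"]

    cmds = [
        "deactivate chassis redundancy graceful-switchover",  # config mode
        "request system configuration rescue save",
    ]
    # Backup RE first, then repeat the same three steps for re0.
    for re_name in ("re1", "re0"):
        cmds += [
            f"request vmhost software add {image} {re_name}",
            f"request vmhost reboot {re_name}",
            "show chassis routing-engine",  # wait until the RE is back up
            "request chassis routing-engine master switch",  # impactful
        ]
    # Assumed rollback of the first step (config mode).
    cmds.append("activate chassis redundancy graceful-switchover")
    return cmds
```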
Both single and dual RE:
- Check if router is healthy
show log messages | last
show system alarms
show ospf interface
show ospf3 interface
show bgp summary
- All green in Icinga and LibreNMS
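Part of the health check above can be automated by scanning the `show system alarms` output. This is a minimal sketch that assumes the usual output shape (a "No alarms currently active" line, or a count line and an "Alarm time" header followed by one line per alarm); `active_alarms` is a hypothetical helper.

```python
from typing import List

NO_ALARMS = "No alarms currently active"


def active_alarms(output: str) -> List[str]:
    """Return the alarm lines from `show system alarms` output, or [] if clear."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if any(NO_ALARMS in line for line in lines):
        return []
    return [
        line
        for line in lines
        # Skip the "<n> alarms currently active" count line and the table header.
        if not line.endswith("currently active") and not line.startswith("Alarm time")
    ]
```

An empty list means the router reports no alarms; anything else should be reviewed before repooling.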
Cleanup
- Remove any leftover upgrade files
request system storage cleanup
- If multi-RE, cleanup files on backup RE:
request system storage cleanup re1
- Remove Icinga and LibreNMS downtimes
- Rollback "Drain traffic away from router" steps
- OSPF via Netbox
- Disabled BGP peers
- To minimise impact, wait for the device to learn all routes from transit and for iBGP to converge before proceeding
- Remove BGP Graceful shutdown
- Rollback VRRP change if any
- Save rescue config (just in case)
request system configuration rescue save
- On vmhost devices, save the disk snapshot to the backup partition
- For single RE devices:
request vmhost snapshot
- For dual RE devices:
request vmhost snapshot routing-engine both
- Verify that traffic flows through the router (only a little is expected if the site is depooled)
- Repool site if depooled