Juniper router upgrade

From Wikitech

Preparation

  1. On the task, list the interesting new features, based on https://apps.juniper.net/feature-explorer/
  2. Download the proper image to apt1001:/srv/junos/

All the steps below should be done with:
cumin1001:~$ sudo cookbook sre.network.prepare-upgrade <image-filename>.tgz <router-fqdn>

  1. Make room for the image
    • request system storage cleanup
    • If multi-RE, clean up files on the backup RE: request system storage cleanup re1
  2. Save rescue config (just in case)
    • request system configuration rescue save
  3. Copy image
  4. Check checksum
    • file checksum md5 /var/tmp/$filename.tgz
    • Compare with checksum on Juniper's website
  5. Validate new image against existing config
    • request vmhost software validate /var/tmp/$filename.tgz
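The checksum comparison in step 4 can also be done on the host holding the image before it is copied. This is a minimal sketch, not part of the cookbook: the function names are hypothetical, and the expected value is whatever Juniper's website lists for the image.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare the file's MD5 with the published one (case and whitespace ignored)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A mismatch means the image should be downloaded again rather than copied to the router.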

Upgrade

  1. Check that the console port(s) are working
  2. Depool site (optional but recommended)
    1. The primary core DC can't easily be depooled (it requires a DC switchover). If a router there needs an emergency upgrade, either do a "depool-free" upgrade or get in touch with Service Ops/Traffic.
  3. Drain traffic away from router
    1. Set transport links to drained (increase OSPF metrics)
      • For each transport link terminating on the router being worked on, set its Netbox custom field state to drained, then run Homer on the router and on the remote-side router of the circuit
    2. Apply GRACEFUL_SHUTDOWN, then wait ~15 min (or check that the device is no longer receiving traffic on the relevant links, such as those from the L3 switches or the other routers) - T320230
      • set protocols bgp graceful-shutdown sender
    3. Disable the peers
      • set protocols bgp group Transit4 shutdown
      • set protocols bgp group Transit6 shutdown
      • set protocols bgp group IX4 shutdown
      • set protocols bgp group IX6 shutdown
  4. Ensure router is not VRRP master (doesn't apply to codfw)
    • show vrrp summary
    • set groups vrrp interfaces <*> unit <*> family inet address <*> vrrp-group <*> priority 70
    • set groups vrrp interfaces <*> unit <*> family inet6 address <*> vrrp-inet6-group <*> priority 70
      • Note: if specific priorities are set on individual VRRP groups, the priority needs to be reduced on those groups as well.
  5. Downtime host in Icinga and Alertmanager
    • sudo cookbook sre.hosts.downtime -r 'router upgrade' -t XXX -H 2 --force 'cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt'
    • This needs to match the Icinga "hosts"; cr3-ulsfo will match in Alertmanager as well.
    • NOTE: For devices with multiple REs you will probably find the mgmt hosts in Icinga named like 're0.cr3-esams.mgmt'
  6. Double check that the device has been fully drained of traffic before proceeding.
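The BGP part of the drain can be written down as two config stages, with the matching rollback used later in Cleanup. This is a sketch for illustration only: the function names are hypothetical, the group list is the one from step 3.3, and the `delete` lines assume the usual Junos inverse of the `set` statements above.

```python
# BGP groups named in the "Disable the peers" step.
PEER_GROUPS = ("Transit4", "Transit6", "IX4", "IX6")


def drain_stages(groups=PEER_GROUPS):
    """Return two lists of config commands, to be committed separately.

    Stage 1 applies graceful shutdown; wait ~15 min (or until traffic
    is gone) before committing stage 2, which disables the peers.
    """
    stage_one = ["set protocols bgp graceful-shutdown sender"]
    stage_two = [f"set protocols bgp group {group} shutdown" for group in groups]
    return [stage_one, stage_two]


def rollback_stages(groups=PEER_GROUPS):
    """Inverse of drain_stages: re-enable the peers first, then remove graceful shutdown."""
    stage_one = [f"delete protocols bgp group {group} shutdown" for group in groups]
    stage_two = ["delete protocols bgp graceful-shutdown sender"]
    return [stage_one, stage_two]
```

Keeping the two stages separate preserves the wait between graceful shutdown and the hard peer shutdown.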

If Multi RE:

  1. Remove graceful-switchover
    • deactivate chassis redundancy graceful-switchover
    • request system configuration rescue save (to ensure graceful-switchover is not in the rescue config)
  2. Install image on backup RE
    • request vmhost software add /var/tmp/$filename.tgz re1
  3. Reboot RE1
    • request vmhost reboot re1
  4. Once back up (show chassis routing-engine), perform RE switchover (impactful)
    • request chassis routing-engine master switch
  5. Once done, repeat the previous 3 steps (install, reboot, switchover) for re0
  6. Roll back "Remove graceful-switchover"
    • activate chassis redundancy graceful-switchover

If single RE:

  1. Install image on RE
    • request vmhost software add /var/tmp/$filename.tgz
  2. Reboot router
    • request vmhost reboot
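For reference, the order of commands in the two procedures above can be sketched as a small generator. The function name is hypothetical and nothing here talks to a router; it only lists the commands from the steps above in order. The final `activate` line is an assumption (the inverse of the `deactivate` in step 1). Note that `deactivate`/`activate` are configuration-mode commands, while the rest are operational-mode.

```python
def upgrade_commands(filename, multi_re):
    """Return the ordered commands for a single-RE or multi-RE upgrade.

    `filename` is the image name including the .tgz extension,
    as copied to /var/tmp/ during preparation.
    """
    image = f"/var/tmp/{filename}"
    if not multi_re:
        return [f"request vmhost software add {image}", "request vmhost reboot"]

    commands = [
        "deactivate chassis redundancy graceful-switchover",  # config mode
        "request system configuration rescue save",
    ]
    # re1 (the backup) goes first; after the switchover re0 is the backup.
    for routing_engine in ("re1", "re0"):
        commands += [
            f"request vmhost software add {image} {routing_engine}",
            f"request vmhost reboot {routing_engine}",
            "show chassis routing-engine",  # wait until the RE is back up
            "request chassis routing-engine master switch",  # impactful
        ]
    # Assumed rollback of the first step (config mode).
    commands.append("activate chassis redundancy graceful-switchover")
    return commands
```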

Both single and dual RE:

  1. Check that the router is healthy
    • show log messages | last
    • show system alarms
    • show ospf interface and show ospf3 interface
    • show bgp summary
    • All green in Icinga and LibreNMS
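If the health checks are scripted, the alarm check is the easiest one to automate. The sketch below assumes that `show system alarms` prints "No alarms currently active" when the device is clean; the function name is hypothetical and the output string should be confirmed on the running Junos version.

```python
# Assumed text printed by "show system alarms" when nothing is raised.
NO_ALARMS_TEXT = "No alarms currently active"


def has_active_alarms(alarm_output):
    """Return True unless the output contains the 'no alarms' line."""
    return NO_ALARMS_TEXT not in alarm_output
```

Anything other than the clean line is treated as an alarm, so an empty or truncated capture fails safe and gets a manual look.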

Cleanup

  1. Remove any leftover upgrade files
    • request system storage cleanup
      • If multi-RE, clean up files on the backup RE: request system storage cleanup re1
  2. Remove Icinga and LibreNMS downtimes
  3. Roll back the "Drain traffic away from router" steps
    1. OSPF via Netbox
    2. Disabled BGP peers
      1. To minimise impact, wait until the device has learned all routes from transit and IBGP has converged before proceeding
    3. Remove BGP graceful shutdown
  4. Roll back the VRRP change, if any
  5. Save rescue config (just in case)
    • request system configuration rescue save
  6. On vmhost devices, save the disk snapshot to the backup partition
    • Single RE devices: request vmhost snapshot
    • Dual RE devices: request vmhost snapshot routing-engine both
  7. Verify that traffic flows through the router (expect little if the site is depooled)
  8. Repool site if depooled