Network telemetry
We are now exporting certain statistics from our network devices using the gNMI protocol, which exposes data based on device YANG models over gRPC. We use the gNMIc tool to connect to routers and "subscribe" to the gNMI paths we are interested in, and expose the collected stats through a Prometheus endpoint.
The short-term goal is to complement LibreNMS for some metrics (whether or not they are exposed via SNMP), as well as to provide more real-time data for critical metrics (LibreNMS has a 5-minute granularity).
The long-term goal is to replace LibreNMS.
Example dashboards
https://grafana-rw.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/network-interface-throughput
https://grafana-rw.wikimedia.org/d/f61a7d56-e132-44dc-b9da-d722b11566cf/network-totals-by-site
https://grafana-rw.wikimedia.org/d/5p97dAASz/network-device-queue-and-error-stats
Infrastructure
Network devices (exporters)
Use the sre.network.tls cookbook to create or update the TLS certificate.
netflow VMs (collectors)
gNMIc is the cornerstone of this pipeline. It connects to the network devices in its area of influence (e.g. the same site), asks them to send it relevant metrics, optionally mangles them, then exposes them on a Prometheus endpoint.
Configuration
The configuration of gnmic is driven by Puppet, using the class profile::gnmi_telemetry, which is currently enabled for the "netinsights" role applied to our netflow VMs.
The gnmic configuration on the VMs themselves is a YAML file, built directly from the data in our Puppet repo at hieradata/common/profile/gnmi_telemetry.yaml. The configuration has four main elements:
Targets
These are the devices to connect to and what subscriptions to enable for each.
The list is generated from the profile::netbox::data::network_devices Hiera key, which is generated from Netbox using the sre.puppet.sync-netbox-hiera cookbook.
If a device is missing, make sure its status is set to ACTIVE in Netbox and that the cookbook has been run. Also make sure you're looking at the collector local to the device you're looking for.
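For illustration only, the targets section of the generated /etc/gnmic.yaml looks roughly like the sketch below. The device name, port, user and CA bundle reuse values shown elsewhere on this page, and the subscription name "interfaces-states" appears in the troubleshooting section; the exact keys emitted by Puppet may differ.

targets:
  # One entry per device local to this collector, generated from the Netbox-derived Hiera data.
  "lsw1-e2-eqiad.mgmt.eqiad.wmnet:32767":
    username: rancid
    password: "<password>"
    # The device cert doesn't include the full chain, so the network root CA bundle is given explicitly.
    tls-ca: /etc/ssl/localcerts/network_devices_bundle.pem
    subscriptions:
      - interfaces-states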
Subscriptions
The subscriptions define the gnmi path(s) to subscribe to on network device targets, and the parameters to use for the connection. Key params we set include:
sample-mode: This can be "sample" or "on-change". "Sample" means the data will be sent periodically by the network device regardless; "on-change" means it will only send data for a particular metric when its value changes. While the latter may be more efficient, it is non-trivial to integrate with Prometheus (see T369384). There is also an argument that can be used with "on-change", called "heartbeat-interval", which requests that stats be sent every N seconds even if they haven't changed. That might be the perfect balance for some cases, however our Juniper devices do not support it, reporting "not supported by system" if it is requested. So we currently use "sample" in all cases.
encoding: This is set to 'proto' to use protobuf encoding.
sampling-interval: We have the sampling-interval set to 60 seconds, which matches how often the Prometheus servers connect to gnmic to pull in the data. In general these two values should always match (no point gnmic getting data more frequently than we will put it in the db or vice versa).
paths: These reference the various metrics and configuration elements on a device, based on the YANG model for it (vendor specific or generic). For instance we subscribe to the "/interfaces/interface/state" path to get statistics on device interfaces. Vendors may provide lists of valid paths in the correct format, or they can be derived with a little effort from the supported YANG definitions.
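For illustration, a subscription definition in gnmic's native YAML format might look like the sketch below. Note that gnmic's upstream config uses the field names stream-mode and sample-interval; the Hiera keys described above may be named slightly differently.

subscriptions:
  interfaces-states:
    paths:
      - /interfaces/interface/state
    mode: stream
    # "sample" pushes data every sample-interval; "on-change" only pushes when a value changes.
    stream-mode: sample
    # Kept equal to the Prometheus scrape interval so data isn't collected more often than it is stored.
    sample-interval: 60s
    encoding: proto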
Outputs
Outputs can be configured to export the data gnmic is collecting to external systems. We have one enabled, the prometheus output. When enabled, this causes gnmic to run a local web server and provide the measurements received from network device subscriptions as Prometheus metrics over HTTP. Some important options we have set here are:
export-timestamps: Setting this to true ensures the Prometheus metrics are exported with the timestamp of when the data was received from the router, so it is entered in the Prometheus database with the correct time rather than the time it was scraped by Prometheus.
timeout: This sets an upper limit on how long the gnmic process will spend returning Prometheus metrics when requested over HTTP. It defaults to 10 seconds, but as the number of metrics grew this was no longer enough in codfw, and we had to increase it to allow all metrics to be sent within the limit. In general this value should always be set to match the scrape_timeout configured on the Prometheus server for collection.
UPDATE: Since disabling gNMIc caching, event-processors are now run when stats are received from routers, rather than when Prometheus makes a scrape request. As a result the scrape time is back well below 10 seconds everywhere (Jan 2025).
num-workers: This controls the number of threads that are used by the output and related processors. Currently set to 8 threads as we experienced some small gaps in data with a lower number. More threads seem to help, even beyond the number of CPU cores on the machine running gnmic.
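Putting those options together, the prometheus output section looks roughly like the sketch below. The output and processor names are made up for illustration, and the listen port matches the curl example later on this page; treat this as a sketch rather than the exact production config.

outputs:
  prometheus-output:
    type: prometheus
    listen: ":9804"
    path: /metrics
    # Export metrics with the timestamp at which the data was received from the router.
    export-timestamps: true
    # Upper limit on how long serving a scrape request may take; keep in sync with the Prometheus scrape_timeout.
    timeout: 10s
    # More workers helped close small gaps in the data.
    num-workers: 8
    event-processors:
      - interface-descriptions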
Processors
Event Processors can be configured on a specific output in order to process data before it is exported. We use a variety of these to normalise data so it is more useful to us in Prometheus.
Event Value Tag v2
This processor can be used to take the value returned at a specific RPC path and add it as a new tag to all other metrics with the same set of tags. For instance, we use it to take an interface description returned on the path '/interfaces/interface/state/description' and make it a Prometheus tag for all the interface metrics. The 'v2' version was added by the gnmic devs to support our use case in a more performant way (see the github issue under External links).
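A minimal sketch of such a processor, reusing the interface-description example above. The processor name is made up, and the option names follow the documented event-value-tag processor; check the gnmic docs for the exact v2 options.

processors:
  interface-descriptions:
    event-value-tag-v2:
      # Take the value named "description" (from /interfaces/interface/state/description)
      # and attach it as a tag to every other metric sharing the same set of tags.
      value-name: description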
Event Strings
This processor allows various string manipulations to be carried out on metric names, value names, etc., before the information is exported. Typical string operations such as replace, trim and split are available.
One place we need to use this is when a device returns a string value for a status which we wish to store in Prometheus. A good example is BGP neighbour state, which is returned as strings like "IDLE" and "ESTABLISHED". Prometheus only supports numeric values for metrics, so these can't be used as-is. Instead we use the "replace" transform to match on each string and substitute a numeric value representing the state. A little messy, but required for some items.
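As a sketch, mapping BGP session states to numbers could look something like the following. The processor name, value name and numeric mapping are illustrative only; see the gnmic event-strings documentation for the exact transform syntax.

processors:
  bgp-state-to-number:
    event-strings:
      value-names:
        # Illustrative; match the value(s) carrying the BGP session state.
        - "session-state"
      transforms:
        - replace:
            apply-on: "value"
            old: "IDLE"
            new: "1"
        - replace:
            apply-on: "value"
            old: "ESTABLISHED"
            new: "6"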
Event Delete
You can also use event-delete to completely remove some data before it is exported. For instance we use it to remove redundant tags gnmic adds otherwise, and to remove metrics which we have converted to tags on others and don't want to export themselves.
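A minimal sketch (the processor name and the tag/value names are examples only):

processors:
  drop-redundant-data:
    event-delete:
      # Tags gnmic adds that we don't want in Prometheus.
      tag-names:
        - "subscription-name"
      # Values already converted into tags on other metrics, which we don't want to export on their own.
      value-names:
        - "/interfaces/interface/state/description"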
Prometheus Configuration
The Prometheus configuration is defined in the same way as any other scraping job within our ops configuration at each site.
A key value we configure here is the scrape_timeout attribute. This needs to be long enough to allow all the metrics to be served by the gnmic output. In general this value should always match the timeout configured for the gnmic prometheus output.
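Stripped of the Puppet machinery, the scrape job amounts to something like the plain Prometheus config below (values are illustrative; the real job is defined per-site in our ops configuration):

scrape_configs:
  - job_name: gnmic
    # Matches the gnmic sample-interval.
    scrape_interval: 60s
    # Matches the timeout set on the gnmic prometheus output.
    scrape_timeout: 10s
    static_configs:
      - targets:
          - netflow1002:9804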
Monitoring the monitoring
Network devices gNMI endpoint monitoring
We use two different kinds of monitoring for the gNMI endpoints.
First, the Prometheus blackbox exporter. Its gRPC check doesn't work with gNMI ("rpc error: code = Unavailable desc = JGrpcServer: Unknown RPC /grpc.health.v1.Health/Check received"), which is why we use a custom TCP check that also verifies the TLS certificate for expiration.
Note that the device's TLS certificate doesn't include the whole chain (see also https://phabricator.wikimedia.org/T375513), so we have to pass the network root CA as a parameter to the check.
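For reference, a blackbox exporter module performing a TCP connect with TLS against a device, trusting the network root CA, would look roughly like this (module name illustrative; the real definition lives in Puppet). Certificate expiry can then be alerted on via the probe_ssl_earliest_cert_expiry metric the probe exposes.

modules:
  gnmi_tls_connect:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
      tls_config:
        # The device cert doesn't include the full chain, so the network root CA is passed explicitly.
        ca_file: /etc/ssl/localcerts/network_devices_bundle.pem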
Probe results are available in https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?orgId=1&var-job=probes%2Fgrpc&var-module=All&var-site=All and automatically benefit from the existing alerting.
If a target on this page (or similar) says "DOWN", it means blackbox can't establish a TCP handshake: https://prometheus-eqiad.wikimedia.org/ops/targets?scrapePool=probes%2Fgrpc&search=
If there is any issue, it's also possible to filter for "service.name:gnmi_connect" in https://logstash.wikimedia.org/app/dashboards#/view/f3e709c0-a5f8-11ec-bf8e-43f1807d5bc2
Second (and more recently), gNMIc exports its subscription status as Prometheus metrics, which can be monitored in the dashboard below.
We should eventually settle on a single monitoring approach to prevent duplicated effort.
gNMIc monitoring
We also collect gNMIc health data from its dedicated (API) Prometheus endpoint.
gNMIc health dashboard: https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?orgId=1&var-site=All
Troubleshooting
Get the currently exposed TLS cert
openssl s_client -showcerts -connect <fqdn>:<port> 2>/dev/null | openssl x509 -text
Validate the currently exposed TLS cert
From the (netflow) host running gnmic:
openssl s_client -showcerts -connect <fqdn>:<port> 2>/dev/null | openssl x509 | tee /tmp/device.pem
openssl verify -CAfile /etc/ssl/localcerts/network_devices_bundle.pem /tmp/device.pem
Get the currently exposed Prometheus metrics
prometheus1005:~$ curl netflow1002:9804/metrics
Run gnmic manually (debug mode)
You can stop the gnmic systemd service and run the service in the foreground based on our current config as follows:
sudo service gnmic stop && sudo -u gnmic /usr/local/bin/gnmic --config /etc/gnmic.yaml subscribe --debug
Using --log instead of --debug will be less verbose.
You can also manually run gnmic to do a one-off connection and request a specific metric. Using "--format event" here is optional but is useful to see the event data that we run processors on:
sudo -u gnmic /usr/local/bin/gnmic sub -d --format event -a "lsw1-e2-eqiad.mgmt.eqiad.wmnet" --port 32767 --tls-ca "/etc/ssl/localcerts/network_devices_bundle.pem" --encoding json --mode once -u rancid --password "<password>" --path "/interfaces/interface[name="et-0/0/55"]/state"
Show the status of Juniper's gRPC daemon
show extension-service request-response servers
Check Juniper "Analytics Agent" is running correctly on a target device
It seems that sometimes gnmic cannot subscribe to stats for a device, with errors like this shown if it's run in debug mode:
subscription interfaces-states rcv error: rpc error: code = Unavailable
This can occur if the JunOS "analytics agent" (agentd) service isn't working correctly. You can see if this is the case by running:
show agent sensors
The system should return a list of sensors and information about them in response to this command; if it doesn't, that is likely the issue. To fix it you can restart that service:
restart analytics-agent gracefully
Current limitations
gNMI is not supported on SRX300 (management routers) and EX4300 (some management switches).
Future improvements
- On Junos, once all devices are running > 22.2, use the device's PKI stack instead of storing the key/cert as a text blob (use-pki in https://www.juniper.net/documentation/us/en/software/junos/interfaces-telemetry/topics/ref/statement/ssl-edit-system-services-grpc-jet.html)
- In the sre.network.tls cookbook, use the timeout parameter once cumin hosts are running Python >= 3.10, to speed things up
- Expose more metrics, like BGP sessions
- Alert on gNMIc's subscriptions once all the virtual-chassis are replaced with devices supporting gNMI
- Build and package gnmic - https://phabricator.wikimedia.org/T347461
History
https://phabricator.wikimedia.org/T369384 - Productionize gnmic network telemetry pipeline (current tracking task)
https://phabricator.wikimedia.org/T326322 - Add per-output queue monitoring for Juniper network devices - (initial task)
https://phabricator.wikimedia.org/T334594 - TLS certificates for network devices
External links
Github issue we raised in which the devs added new processors to support our use case with better performance.
Github issue in which the gnmic maintainers discuss some performance optimisations and how best to structure the config.
https://phabricator.wikimedia.org/phame/post/view/304/multi-platform_network_configuration/