RPKI
What is RPKI
RPKI (Resource Public Key Infrastructure) is a security framework by which network owners can validate and secure the critical route updates, or Border Gateway Protocol (BGP) announcements, exchanged between public Internet networks. (Telia - What is RPKI, Wikipedia)
Wikimedia servers
See the Puppet role "rpkivalidator" applied to rpki* nodes in the operations/puppet repo.
Signing
Prefixes
All our advertised prefixes have matching ROAs for AS14907, each with an exact prefix length.
The full list is visible on https://stat.ripe.net/widget/as-routing-consistency#w.resource=AS14907
They are set up through the RIRs' hosted RPKI platforms.
Monitoring
- BGPmon: Network monitoring#RPKI Validation Failed
- RIPE: Network monitoring#Resource Certification (RPKI) alerts
Validation
Tracking task: https://phabricator.wikimedia.org/T220669
Gerrit changes: https://gerrit.wikimedia.org/r/q/topic:%22rpki%22+(status:open%20OR%20status:merged)
VMs: https://netbox.wikimedia.org/virtualization/virtual-machines/?q=rpki (Routinator requirements)
Grafana: https://grafana.wikimedia.org/d/UwUa77GZk/rpki
Current status
In production: RPKI-invalid prefixes are rejected on all external BGP sessions (transit and peering).
Packaging
Currently we are using the upstream .deb package from the Routinator project, which we import into our local 'thirdparty' repo.
Router config
First we need the routers to talk to the Validators:
routing-options {
    [...]
    validation {
        group rpki {
            session 2620:0:861:103:10:64:32:19 {
                port 3323;
            }
            session 2620:0:860:101:10:192:0:103 {
                port 3323;
            }
        }
    }
}
We set up some named community entries, covering the extended communities defined in RFC8097:
policy-options {
    community RPKI:ALL members "^0x4300:0.0.0.0:[0-9]+$";
    community RPKI:INVALID members 0x4300:0.0.0.0:2;
    community RPKI:UNKNOWN members 0x4300:0.0.0.0:1;
    community RPKI:VALID members 0x4300:0.0.0.0:0;
}
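To illustrate what these community values stand for, here is a minimal Python sketch of the 8-byte origin validation state extended community from RFC 8097 (type 0x43, sub-type 0x00, reserved bytes set to zero, state in the last octet). The function names are hypothetical and only serve the illustration. The decoder is deliberately strict, whereas the RFC asks receivers to ignore the reserved bytes.

```python
import struct

# Validation state values from RFC 8097 (the RFC names state 1 "not found";
# the router config above calls it "unknown").
STATES = {0: "valid", 1: "unknown", 2: "invalid"}


def encode_ovs_community(state: int) -> bytes:
    """Build the 8-byte origin validation state extended community."""
    if state not in STATES:
        raise ValueError(f"unknown validation state: {state}")
    # 0x4300 = type 0x43 + sub-type 0x00, then 4 zero bytes (the 0.0.0.0 part),
    # then 2 bytes whose last octet carries the state.
    return struct.pack("!HIH", 0x4300, 0, state)


def decode_ovs_community(raw: bytes) -> str:
    """Return the state name carried by an encoded community (strict check)."""
    type_subtype, reserved, state = struct.unpack("!HIH", raw)
    if type_subtype != 0x4300 or reserved != 0 or state not in STATES:
        raise ValueError("not an RFC 8097 origin validation state community")
    return STATES[state]
```

For example, `encode_ovs_community(2).hex()` gives `4300000000000002`, which corresponds to the `0x4300:0.0.0.0:2` notation used for RPKI:INVALID.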
We create a policy statement that sets the local validation state of routes based on their status in the validation database, i.e. the data from Routinator. This policy also adds the matching extended community to each route:
policy-statement BGP_rpki {
    term valid {
        from {
            protocol bgp;
            validation-database valid;
        }
        then {
            validation-state valid;
            community add RPKI:VALID;
        }
    }
    term invalid {
        from {
            protocol bgp;
            validation-database invalid;
        }
        then {
            validation-state invalid;
            community add RPKI:INVALID;
        }
    }
    term unknown {
        from {
            protocol bgp;
            validation-database unknown;
        }
        then {
            validation-state unknown;
            community add RPKI:UNKNOWN;
        }
    }
}
This policy is referenced in a term of the various policies we apply to external BGP sessions (transit, peering, etc.). For instance, it is referenced in "BGP_transit_in", causing the validation state and community to be added to those routes inbound:
policy-options {
    [...]
    policy-statement BGP_transit_in {
        [...]
        term rpki-classification {
            from policy BGP_rpki;
        }
        [...]
    }
}
A separate policy then rejects/drops routes with the RPKI:INVALID community:
policy-statement BGP_community_actions {
    term rpki-invalids {
        from community RPKI:INVALID;
        then reject;
    }
    [...]
}
These policies are applied in sequence to BGP groups for external peers. For example:
protocols {
    bgp {
        group Transit4 {
            [...]
            import [ BGP_sanitize_in BGP_transit_in BGP_community_actions ];
        }
    }
}
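The classify-then-act sequence can be pictured with a greatly simplified Python model. It is not how Junos evaluates policies: it ignores BGP_sanitize_in, the other terms of BGP_transit_in, and default policy actions. The route dictionaries and function names are hypothetical and only illustrate the order of operations.

```python
# Community added for each validation state, mirroring the BGP_rpki terms.
COMMUNITY_FOR_STATE = {
    "valid": "RPKI:VALID",
    "invalid": "RPKI:INVALID",
    "unknown": "RPKI:UNKNOWN",
}


def bgp_rpki(route: dict, database_state: str) -> dict:
    """Mimic BGP_rpki: record the validation state and add the community."""
    tagged = dict(route)
    tagged["validation_state"] = database_state
    tagged["communities"] = set(route.get("communities", ())) | {
        COMMUNITY_FOR_STATE[database_state]
    }
    return tagged


def bgp_community_actions(route: dict) -> bool:
    """Mimic the rpki-invalids term: False means the route is rejected."""
    return "RPKI:INVALID" not in route["communities"]


def import_route(route: dict, database_state: str):
    """Run classification first, then the community-based action."""
    tagged = bgp_rpki(route, database_state)
    return tagged if bgp_community_actions(tagged) else None
```

In this model, `import_route({"prefix": "192.0.2.0/24"}, "invalid")` returns `None`, while valid and unknown routes are returned with their state and community attached.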
We also set the validation state for prefixes exchanged on iBGP (internal) sessions:
policy-statement iBGP_rpki {
    term valid {
        from community RPKI:VALID;
        then validation-state valid;
    }
    term invalid {
        from community RPKI:INVALID;
        then validation-state invalid;
    }
    term unknown {
        from community RPKI:UNKNOWN;
        then validation-state unknown;
    }
}
How-to
Identify if an issue is due to invalid RPKI
- Enter the IP of the user reporting an issue in https://stat.ripe.net/widget/prefix-routing-consistency.
- Focus in particular on the rows that have YES in the "In RIS" column, as those are the ones advertised in the DFZ.
- If the emoji is red, then the IP originates from an RPKI-invalid prefix (invalid origin or length). Hover over the face for more details.
- If the prefix or IP is not covered by a less specific prefix, traffic cannot be routed back to the client.
- The content of a specific ROA can be found at https://rpki-validator.ripe.net/roas. Filter for a specific prefix and verify that the ASN matches and that the prefix length is smaller than or equal to the defined max length.
- In that case, reach out to the provider so they fix their ROA, or disable validation (less preferred).
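The ROA check described in the steps above (matching ASN, prefix length within the max length) can be sketched in Python along the lines of RFC 6811 origin validation. This is only an illustration: the function name and the tuple layout of the validated ROA payloads (VRPs) are assumptions, and the real validation is done by Routinator and the routers.

```python
import ipaddress


def validate_origin(prefix: str, origin_asn: int, vrps) -> str:
    """Return 'valid', 'invalid' or 'unknown' for an announced route.

    vrps is an iterable of (vrp_prefix, max_length, asn) tuples.
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for vrp_prefix, max_length, asn in vrps:
        vrp = ipaddress.ip_network(vrp_prefix)
        # A VRP covers the route if the route sits inside the VRP prefix.
        if route.version != vrp.version or not route.subnet_of(vrp):
            continue
        covered = True
        # The route matches if the ASN is the same and it is not too specific.
        if asn == origin_asn and route.prefixlen <= max_length:
            return "valid"
    # Covered by at least one VRP but matched by none: invalid.
    return "invalid" if covered else "unknown"
```

With a single VRP `("10.0.0.0/22", 22, 64496)`, announcing 10.0.0.0/24 from AS64496 is invalid (too specific), announcing 10.0.0.0/22 from AS64497 is invalid (wrong origin), and a prefix outside 10.0.0.0/22 is unknown.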
Perform a manual RPKI validation
- SSH into one of the RPKI servers (rpki[12]001 as of Feb. 2020)
- Query the local daemon for the validity of a prefix for an ASN (replace the values of the parameters):
$ curl "http://localhost:9556/validity?asn=99999999&prefix=10.0.0.0/22"
Example output for a prefixlen mismatch
{
  "validated_route": {
    "route": {
      "origin_asn": "AS99999999",
      "prefix": "10.0.0.0/24"
    },
    "validity": {
      "state": "Invalid",
      "reason": "length",
      "description": "At least one VRP Covers the Route Prefix, but the Route Prefix length is greater than the maximum length allowed by VRP(s) matching this route origin ASN",
      "VRPs": {
        "matched": [],
        "unmatched_as": [],
        "unmatched_length": [
          {
            "asn": "AS99999999",
            "prefix": "10.0.0.0/22",
            "max_length": "22"
          }
        ]
      }
    }
  }
}
In this case it shows that the maximum length allowed for the prefix is 22, but the advertised subnet is a /24, hence invalid.
Example output for an ASN mismatch
{
  "validated_route": {
    "route": {
      "origin_asn": "AS99999999",
      "prefix": "10.0.0.0/22"
    },
    "validity": {
      "state": "Invalid",
      "reason": "as",
      "description": "At least one VRP Covers the Route Prefix, but no VRP ASN matches the route origin ASN",
      "VRPs": {
        "matched": [],
        "unmatched_as": [
          {
            "asn": "AS11111111",
            "prefix": "10.0.0.0/22",
            "max_length": "22"
          }
        ],
        "unmatched_length": []
      }
    }
  }
}
In this case it shows that the ROA specifies AS11111111 as the authorized ASN to advertise the prefix, but the prefix is advertised by AS99999999 (the one passed to the cURL query), hence invalid. The advertising ASN can be taken from the RIPE stat website linked above.
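When checking many prefixes, it can help to condense these responses into one line each. The following Python sketch does this for the two cases shown above. The helper name is hypothetical, and it assumes the response has the same JSON layout as the example outputs.

```python
import json


def summarize_validity(body: str) -> str:
    """Turn a /validity JSON response into a one-line explanation."""
    result = json.loads(body)["validated_route"]
    route = result["route"]
    validity = result["validity"]
    summary = f'{route["prefix"]} from {route["origin_asn"]}: {validity["state"]}'
    vrps = validity.get("VRPs", {})
    if validity.get("reason") == "length":
        # The origin ASN matches a ROA, but the announcement is too specific.
        allowed = ", ".join(
            f'{v["prefix"]} max /{v["max_length"]}'
            for v in vrps.get("unmatched_length", [])
        )
        summary += f" (too specific; ROA allows {allowed})"
    elif validity.get("reason") == "as":
        # A ROA covers the prefix, but it authorizes a different ASN.
        asns = ", ".join(sorted({v["asn"] for v in vrps.get("unmatched_as", [])}))
        summary += f" (wrong origin; ROA authorizes {asns})"
    return summary
```

Fed with the ASN mismatch example, it returns `10.0.0.0/22 from AS99999999: Invalid (wrong origin; ROA authorizes AS11111111)`.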
Disable validation
If validation is causing an issue and must be disabled quickly, stopping Routinator will not work: by default the routers keep the validator data in cache for 1h.
On the router side, you can either (depending on scope):
- Disable all validation:
deactivate routing-options validation
- Set a static override: see below
Set a static override (exception)
- Add the exception to Homer, see example
- Run Homer on target routers
Monitoring
In the unlikely event of both validators not working, there would be NO outage. We would only lose the ability to enforce RPKI.
This is most likely not worth waking people up for. The Prometheus alerts live in the operations/alerts repository.
RPKI to router port
- See below to check if the process is running
- Check if the port (3323) is open in iptables
- Check if routinator listens on the port (sudo netstat -nlpt | grep routinator)
- Test the port from a monitoring host (e.g. nc -zv <hostname> <port>)
- Open a task, cc netops/traffic
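Where nc is not available, the reachability test can be approximated with a few lines of Python. This is a minimal sketch using only the standard library. The function name is hypothetical, and the default port is the RTR port used in the router config.

```python
import socket


def port_is_open(host: str, port: int = 3323, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (like nc -zv)."""
    try:
        # create_connection resolves the host and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and resolution failures.
        return False
```

A `False` result only shows that the TCP handshake failed from that vantage point. It does not say whether the cause is the firewall, the daemon, or the network path.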
Process
Troubleshoot it like most processes:
- Check its status: sudo systemctl status routinator
- Routinator logs to syslog; check logstash or journalctl -u routinator
- Try to restart it: sudo systemctl restart routinator
- Open a task, cc netops/infrastructure foundations
Prometheus alerts
Valid ROAs decreasing
A possible cause is that Routinator can't download the new ROAs from the repositories.
- Check the logs for signs of rsync failure (e.g. rsync rpki.ripe.net/repository: rsync: mkstemp "/var/lib/routinator/repository/rpki.ripe.net[...]CAi" failed: Permission denied (13))
- Try to manually run the rsync from a temporary directory
- Ensure the server has connectivity to the Internet (e.g. check the proxies)
It's also possible that the issue is with the RPKI infrastructure itself, in which case it is not actionable for us.
RSYNC status
This alert will trigger if Routinator can't reach a significant number of RSYNC servers, which indicates an issue reaching the Internet.
Look at the logs for more information on the failure.
Try to run the RSYNC manually, from a host not behind a proxy to rule out the proxies.
RRDP status
This alert will trigger if Routinator can't reach a significant number of RRDP servers, which indicates an issue reaching the Internet.
RRDP uses HTTPS to fetch ROAs, so the error code will be an HTTP error code.
-1 means that the request timed out.
As all Routinator instances fetch from the same sources, you can compare them to tell whether the issue is most likely on our side or on the remote side.
Then try to manually curl to that endpoint from various vantage points to pinpoint where the issue is.
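The comparison between instances can be sketched as follows in Python. The input format (a mapping of instance name to per-URI status code) and the function name are hypothetical. How the status codes are collected from each Routinator instance is left out.

```python
def locate_rrdp_failures(status_by_instance: dict) -> dict:
    """Classify each failing RRDP URI as 'remote' or 'local'.

    status_by_instance maps an instance name to {uri: status_code}.
    A URI failing on every instance that reports it points at the remote
    side; a URI failing on only some instances points at our side.
    """
    verdicts = {}
    uris = {uri for statuses in status_by_instance.values() for uri in statuses}
    for uri in sorted(uris):
        codes = [s[uri] for s in status_by_instance.values() if uri in s]
        # Treat 2xx/3xx as success; -1 (timeout) and 4xx/5xx as failures.
        failing = [c for c in codes if not 200 <= c < 400]
        if not failing:
            continue
        verdicts[uri] = "remote" if len(failing) == len(codes) else "local"
    return verdicts
```

The verdict is only a hint, in particular when a single instance reports a given URI, so the manual curl checks are still needed to confirm it.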
RTR Connections drop
This alert will trigger if Routinator loses connectivity to a significant number of routers.
- Check that Routinator is healthy and reachable (see RPKI to router port above)
Possible future work
- Add monitoring on the router side. Currently only screen scraping/netconf seems doable (no SNMP).
- Encrypt the RTR traffic. Not a blocker as it's not PII and it's not leaving our infrastructure. Not supported on Junos.
- Implement a mechanism to easily add exceptions.
Resources
Routinator's doc: https://rpki.readthedocs.io/en/latest/routinator/index.html
Juniper's doc: https://www.juniper.net/documentation/en_US/junos/topics/topic-map/bgp-origin-as-validation.html
Blog post: https://phabricator.wikimedia.org/phame/post/view/186/rpki_origin_validation/