HTTPS/Unified Certificates
These are the primary multi-wildcard-SAN certificates that not only serve our Traffic clusters but also serve other Wikimedia functions such as Fundraising. They have a number of unique properties operationally:
- Highly important - these certificates terminate the bulk of all of our important live user-facing traffic.
- High SAN counts + Wildcards - We have all canonical domains in these certs as SANs, wildcarded at the domain level and the m-dot level, as well as a few other odds and ends. In total, the current SAN count is 29, and most of those are wildcards.
- Broad deployment - These certs deploy to all Traffic edge nodes in all datacenters, so deployment/synchronization issues are a little trickier than smaller services with one to a handful of hosts.
- Redundancy - Because we use OCSP Stapling, which relies on the near-realtime reliability of the upstream certificate providers' OCSP infrastructure, we purchase and deploy redundant copies of these certificates from two different vendors, and also deploy certificates from LetsEncrypt.
Certificate Vendor Deployment and Switching on Failure
We have had upstream OCSP failures affect us in the past: see Incident_documentation/20150820-OCSP and Incident_documentation/20161013-GlobalSign. Our plan for future OCSP incidents is to switch all datacenters to whichever vendor's certificates are not having OCSP issues.
Our current vendors are Digicert and LetsEncrypt. Our standard deployment today is to use the Digicert certificates in our non-US datacenters and LetsEncrypt in the US datacenters, so that both are known to be good because they serve live user traffic. Certificates from all vendors are deployed to the filesystems of all edge hosts at all datacenters, and OCSP staple-fetching runs for all of them from all hosts at all times as well. Switching which certificate is in active use at a given edge datacenter is just a matter of proxy reconfiguration driven by hieradata:
$ git grep public_tls_unified_cert_vendor
hieradata/codfw.yaml:public_tls_unified_cert_vendor: "lets-encrypt"
hieradata/drmrs.yaml:public_tls_unified_cert_vendor: "digicert-2022"
hieradata/eqiad.yaml:public_tls_unified_cert_vendor: "lets-encrypt"
hieradata/eqsin.yaml:public_tls_unified_cert_vendor: "digicert-2022"
hieradata/esams.yaml:public_tls_unified_cert_vendor: "digicert-2022"
hieradata/ulsfo.yaml:public_tls_unified_cert_vendor: "lets-encrypt"
To switch in an emergency:
- Merge a puppet commit changing all of the above hieradata settings to reference the remaining functional vendor.
- Run the puppet agent on all cacheproxy hosts via cumin, e.g.
sudo cumin A:cp 'run-puppet-agent -q'
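The hieradata change in the first step can be sketched as a small shell helper. This is only an illustration: switch_unified_vendor is a hypothetical function name, it assumes GNU sed (for -i) and that it is run against a checkout of the puppet repository, and the resulting change still has to be reviewed and merged as a normal puppet commit.

```shell
# Hypothetical helper: point every datacenter's hieradata at a single vendor.
# $1: vendor name to set (e.g. lets-encrypt)
# $2: hieradata directory (defaults to ./hieradata)
switch_unified_vendor() {
    vendor="$1"
    dir="${2:-hieradata}"
    for f in "$dir"/*.yaml; do
        [ -f "$f" ] || continue
        # Only touch files that already define the key.
        grep -q '^public_tls_unified_cert_vendor:' "$f" || continue
        # Assumes GNU sed; rewrites the whole line with the new vendor.
        sed -i "s/^public_tls_unified_cert_vendor:.*/public_tls_unified_cert_vendor: \"$vendor\"/" "$f"
    done
}
```

For example, switch_unified_vendor lets-encrypt followed by git diff shows the lines that would change before committing.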
Sometimes OCSP staleness alerts keep firing because of an issue with the certificate vendor's infrastructure that has since been resolved. In this case, for a manually-issued vendor such as Digicert, you can manually trigger an OCSP refresh with:
sudo -i cumin -b1 'A:cp-eqiad' "/usr/local/sbin/update-ocsp-all 2>&1 | logger -t update-ocsp-all"
For LetsEncrypt certificate OCSP issues, see the Acme-chief documentation.
Validation
Wikimedia's domains must be validated by the issuing certificate authority before it will issue a unified certificate. Presently, Wikimedia uses email-based verification.
A bad actor who gained control of Wikimedia's DNS could impersonate WMF and alter TXT records just as easily as they could redirect email. Therefore, email verification isn't particularly harmful in this case. Future use of TXT records will be implemented not for security but for convenience of the renewal process.
To validate the domains for the unified certificate via email:
- Notify the appropriate teams in the appropriate channels of the impending verification emails that will be sent.
- Follow the official documentation for verifying emails, noting:
- Not all DCV administrative email addresses that are suggested should be used (e.g. admin@, webmaster@). Use hostmaster@.
- It's possible that a new domain has not been set up for email routing. If that's the case, either create a patch setting up email routing or create a patch setting the validation TXT record, verify, then revert.
- Once the domains have been validated, renew the certificate using the official documentation as a guideline, noting:
- The CSR will include the CN but not any of the SANs. The SANs will be added automatically via the web interface, which is pre-filled.
- Use the puppet master server to generate a CSR using the existing domain keys that live under /srv/private/modules/secret/secrets/ssl:
# openssl req -new -key <current_key>.key -out server.csr
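Before uploading the CSR, it can be worth confirming that it parses, that its signature verifies, and that the subject carries the expected CN. A minimal sketch, where inspect_csr is a hypothetical helper name and server.csr is the output file from the command above:

```shell
# Sketch: print the subject of a CSR and verify its self-signature.
# inspect_csr is a hypothetical name, not an existing tool.
inspect_csr() {
    csr="${1:-server.csr}"
    [ -f "$csr" ] || { echo "no such CSR: $csr" >&2; return 1; }
    # -verify checks the signature on the request;
    # -subject prints the distinguished name, including the CN.
    openssl req -in "$csr" -noout -verify -subject
}
```

A non-zero exit status means the file is missing, unparsable, or fails signature verification.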
Storage
In the past, certificates and private keys were stored on cache hosts in different /etc/ subfolders (depending on the certificate vendor).
To increase security against private key retrieval while the server is offline, we decided to move both the private keys and the chained certificates used by HAProxy to a dedicated tmpfs storage.
Since this storage is deleted and recreated at every server restart, we must ensure that HAProxy doesn't attempt to restart without valid keys/certificates present. The process, from a high-level perspective, is:
- Cache host boots
- systemd-tmpfiles service creates the custom tmpfs directory
- Puppet agent runs, downloads certificates into the tmpfs directory, and starts the HAProxy service
In case HAProxy attempts to start before the first Puppet run, an ExecStartPre script checks that the certificates are actually valid (and implicitly that the tmpfs directory exists). If this check fails, the script prevents HAProxy from starting.
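The kind of check such an ExecStartPre script performs can be sketched as follows. This is not the production script: check_tls_material is a hypothetical name, and the directory argument and the *.pem naming are assumptions made for illustration.

```shell
# Hypothetical ExecStartPre-style check: succeed only if the directory exists
# and every *.pem file in it holds a parsable, unexpired certificate.
check_tls_material() {
    dir="$1"
    [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
    found=0
    for pem in "$dir"/*.pem; do
        [ -e "$pem" ] || continue   # the glob matched nothing
        found=1
        # -checkend 0 exits non-zero if the certificate has already expired;
        # an unparsable file also yields a non-zero exit status.
        openssl x509 -in "$pem" -noout -checkend 0 >/dev/null 2>&1 || {
            echo "invalid or expired certificate: $pem" >&2
            return 1
        }
    done
    [ "$found" -eq 1 ] || { echo "no certificates in $dir" >&2; return 1; }
}
```

Because systemd refuses to start a unit whose ExecStartPre command exits non-zero, returning 1 here is what would keep HAProxy from starting without usable certificates.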
When Puppet runs, it also attempts to start the HAProxy service, assuming all required dependencies are satisfied.
Since this has been rolled out incrementally, there's a hiera key that controls this behavior: profile::cache::haproxy::use_tls_tmpfiles. To roll back to the previous configuration (certificates stored on a non-volatile filesystem), this key must be set to false AND every certificate under the profile::cache::haproxy::available_unified_certificates data structure must use the correct path (not the volatile one). If these two conditions aren't both met, the puppet module will fail() and refuse to deploy certificates to the host. This also means that it's not possible to have part of the TLS material on volatile storage and the rest on non-volatile storage.
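The all-or-nothing rule can be illustrated with a short shell sketch. It is not the actual Puppet code: tls_paths_consistent is a hypothetical name, and the volatile directory prefix is passed in by the caller because the real tmpfs path is not stated here.

```shell
# Sketch of the consistency rule enforced by the puppet module.
# $1: value of use_tls_tmpfiles ("true" or "false")
# $2: volatile (tmpfs) directory prefix, supplied by the caller (assumption)
# remaining arguments: certificate paths from available_unified_certificates
tls_paths_consistent() {
    use_tmpfiles="$1"
    volatile_prefix="$2"
    shift 2
    for path in "$@"; do
        case "$path" in
            "$volatile_prefix"/*) on_volatile=true ;;
            *) on_volatile=false ;;
        esac
        # Every path must agree with the flag; mixed storage is rejected.
        if [ "$on_volatile" != "$use_tmpfiles" ]; then
            echo "inconsistent: $path (use_tls_tmpfiles=$use_tmpfiles)" >&2
            return 1
        fi
    done
    return 0
}
```

With a placeholder prefix such as /run/tls, the call tls_paths_consistent false /run/tls /etc/ssl/a.pem succeeds, while mixing /run/tls/a.pem and /etc/ssl/b.pem fails for either value of the flag.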
See also
- phab:T230687#5422646 - Context for the existence of the unified certificate
- Wikipedia: Certificate signing request
- phab:T384227 - Use volatile storage for private TLS material