URL path normalization

The Wikimedia CDN performs URL path normalization. This page documents why and how.

Problem

The following two URLs are different from the HTTP point of view, both for the browser and for the CDN software, yet they represent the same page title in MediaWiki:

  • https://en.wikipedia.org/wiki/Steve_Fuller_(sociologist)
  • https://en.wikipedia.org/wiki/Steve_Fuller_%28sociologist%29

Pages that contain parentheses or other special characters in their titles have more than one valid URL:

  • one with literal parentheses,
  • one with parentheses URL-encoded,
  • variations that contain a mix of the two.

After a page is edited, MediaWiki purges the canonical URL, which uses literal characters where possible. If a URL-encoded variant was also cached, it would not get purged and would become stale, unless we do something to prevent this!

In Varnish, before any cache lookup, we convert the incoming URL to a deterministic internal representation. The same conversion must also happen during purge requests, so that if Steve_Fuller_%28sociologist%29 was cached, a purge of Steve_Fuller_(sociologist) invalidates that object.
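
The property we rely on can be illustrated with a short Python sketch. This is not the production code: the cache_key() helper and the blanket percent-decoding are invented here purely to show that, once both the pageview request and the purge are normalized the same way, they map to the same cached object.

from hashlib import sha256
from urllib.parse import unquote

# Hypothetical helper: Varnish hashes requests differently, and the real
# normalization is far more selective than unquote(); this only demonstrates
# that normalizing both paths identically makes the purge match the object.
def cache_key(host: str, path: str) -> str:
    return sha256((host + unquote(path)).encode()).hexdigest()

assert (cache_key("en.wikipedia.org", "/wiki/Steve_Fuller_%28sociologist%29")
        == cache_key("en.wikipedia.org", "/wiki/Steve_Fuller_(sociologist)"))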

The question is: given a URL, which characters should be encoded (e.g. ! → 0x21), which hex escapes should be decoded (e.g. 0x7e → ~), and which characters/hex escapes should be left untouched?

Theory

RFC 3986 section 2 splits the 256 possible byte values completely into 3 sets: Unreserved, Disallowed and Reserved:

  • 66 Unreserved: 0-9 A-Z a-z - . _ ~
  • 172 Disallowed: 0x00-0x20 0x7F-0xFF < > | { } " % \ ^ `
  • 18 Reserved: : / ? # [ ] @ ! $ & ' ( ) * + , ; =

"Unreserved" and "Disallowed" characters do not present any issue from the point of view of choosing an internal representation. We decode unreserved, and encode disallowed.

Troubles begin when "Reserved" characters are used. According to RFC 3986 section 2.2:

URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm.

Application-specific knowledge is required to choose what to do with Reserved characters:

If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

For example, query strings are common in web applications, including MediaWiki. This means the question mark and the ampersand serve as separators, and their percent-encoded forms have a different meaning than the literal characters.

https://example.org/foo?from=Apple&to=Banana
File path: "/foo"
Query string: [ { from: "Apple" }, { to: "Banana" } ]

Compared to:

https://example.org/foo?from=Have_I_Been_Pwned%3F&to=Penn+%26+Teller
File path: "/foo"
Query string: [ { from: "Have I Been Pwned?" }, { to: "Penn & Teller" } ]

Similarly, the forward slash / can appear as a literal, acting as a directory separator, or encoded as %2F to be used as part of a file name. In MediaWiki, both forms are supported in pageview URLs:

$ varnishlog -q 'ReqMethod eq "PURGE" and ReqURL ~ "Profiling_Python"' | grep ReqURL &
$ curl -X PURGE 'http://127.0.0.1:3128/wiki/User:Ema%2fProfiling_Python'
-   ReqURL         /wiki/User:Ema%2fProfiling_Python
-   ReqURL         /wiki/User:Ema/Profiling_Python

This is notably different from the RESTBase application, which must treat the literal slash as a separator in virtual REST routes, and the encoded form %2F as part of a resource or page title that could otherwise not be represented (T127387).

Without application-specific knowledge, the following rules should be followed:

  1. Unreserved hex escapes should always be decoded: 0x7e → ~
  2. Disallowed characters should be encoded to their hex escape representation: > → 0x3e
  3. Reserved characters (and their hex escape representations) should be left as-is
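
A minimal Python sketch of these three generic rules follows; it is illustrative only (the production implementations are the Lua script and the VCL-embedded C function shown under Implementation below), and all names in it are invented:

UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
RESERVED = frozenset(b":/?#[]@!$&'()*+,;=")
HEXDIGITS = frozenset(b"0123456789abcdefABCDEF")

def normalize_generic(path: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(path):
        b = path[i]
        is_escape = (b == ord("%") and i + 2 < len(path)
                     and path[i + 1] in HEXDIGITS and path[i + 2] in HEXDIGITS)
        if is_escape:
            value = int(path[i + 1:i + 3], 16)
            if value in UNRESERVED:
                out.append(value)            # rule 1: e.g. %7E -> ~
            else:
                out += path[i:i + 3]         # rules 2/3: other escapes kept verbatim
            i += 3
        elif b in UNRESERVED or b in RESERVED:
            out.append(b)                    # rule 3: literal reserved chars left as-is
            i += 1
        else:
            out += b"%%%02X" % (b,)          # rule 2: disallowed chars encoded, e.g. > -> %3E
            i += 1
    return bytes(out)

# %7E is decoded (unreserved), "(" is left alone (reserved), ">" is encoded:
assert normalize_generic(b"/wiki/%7Efoo_(bar)>") == b"/wiki/~foo_(bar)%3E"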

With application-specific knowledge, we can carefully normalize Reserved characters too. We operate specifically on the URL path, before any ? or # (which should be left unchanged). The remaining 16 characters in the Reserved set can be encoded/decoded with application-specific knowledge. We'll call this set of characters the Customizable set:

: / [ ] @ ! $ & ' ( ) * + , ; =

When it comes to MediaWiki, each of the 16 characters in the Customizable set can be placed into one of two subsets, always-decoded or always-encoded, giving us "complete" normalization:

mediawiki_decode_set = : / @ ! $ ( ) * ,
mediawiki_encode_set = [ ] & ' + = ;

RESTBase is similar to MediaWiki, but needs to accept MediaWiki titles with slashes in the %2F form while still keeping its own functional path-delimiting slashes unencoded, as mentioned earlier. The slash therefore appears in neither subset and is left untouched:

restbase_decode_set = : @ ! $ ( ) * , ;
restbase_encode_set = [ ] & ' + =

When it comes to upload.wikimedia.org, MediaWiki simply uses PHP's rawurlencode() when generating storage URLs, so all characters in the Customizable set need to be encoded. The two subsets are thus:

upload_decode_set =
upload_encode_set = : / [ ] @ ! $ & ' ( ) * + , ; =
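
Under the same assumptions as the previous sketch (and again with invented names), the per-service subsets can be applied as a second pass over a path whose unreserved escapes are already decoded and whose disallowed bytes are already encoded:

PROFILES = {
    # (decode_set, encode_set) pairs, copied from the subsets listed above.
    "mediawiki": (b":/@!$()*,", b"[]&'+=;"),
    "restbase":  (b":@!$()*,;", b"[]&'+="),
    "upload":    (b"",          b":/[]@!$&'()*+,;="),
}
HEXDIGITS = frozenset(b"0123456789abcdefABCDEF")

def apply_profile(path: bytes, profile: str) -> bytes:
    decode_set, encode_set = (frozenset(s) for s in PROFILES[profile])
    out = bytearray()
    i = 0
    while i < len(path):
        b = path[i]
        if (b == ord("%") and i + 2 < len(path)
                and path[i + 1] in HEXDIGITS and path[i + 2] in HEXDIGITS):
            value = int(path[i + 1:i + 3], 16)
            if value in decode_set:
                out.append(value)        # e.g. %28 -> ( under the MediaWiki profile
            else:
                out += path[i:i + 3]     # any other escape is kept verbatim
            i += 3
        elif b in encode_set:
            out += b"%%%02X" % (b,)      # e.g. ' -> %27 under the MediaWiki profile
            i += 1
        else:
            out.append(b)                # everything else is left untouched
            i += 1
    return bytes(out)

# Both spellings of the purge example collapse to one internal representation:
assert (apply_profile(b"/wiki/Steve_Fuller_%28sociologist%29", "mediawiki")
        == apply_profile(b"/wiki/Steve_Fuller_(sociologist)", "mediawiki"))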

Implementation

The problem has been solved for ATS using a Lua script, normalize-path.lua. Path normalization behavior is configured per remap rule via Hiera, by specifying which characters to decode and which to encode; characters are given by their hex values. For example:

    - type: map
      target: http://upload.wikimedia.org
      replacement: https://swift.discovery.wmnet
      params:
          - '@plugin=/usr/lib/trafficserver/modules/tslua.so'
          - '@pparam=/etc/trafficserver/lua/normalize-path.lua'
          # decode    /
          - '@pparam="2F"'
          # encode    !  $  &  '  (  )  *  +  ,  :  ;  =  @  [  ]
          - '@pparam="21 24 26 27 28 29 2A 2B 2C 3A 3B 3D 40 5B 5D"'
          - '@plugin=/usr/lib/trafficserver/modules/tslua.so'
          - '@pparam=/etc/trafficserver/lua/x-mediawiki-original.lua'

In Varnish, we deal with the issue by using a C function embedded in VCL, normalize_path_encoding(). Behavior can be changed by passing different "decoder rings" to the function. For example:

sub normalize_upload_path { C{
    static const size_t upload_decoder_ring[256] = {
      // 0x00-0x1F (all unprintable)
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      //  ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
        0,0,0,2,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,2,
      //@ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
        0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,
      //` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ <DEL>
        0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,0,
      // 0x80-0xFF (all unprintable)
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    };

    normalize_path_encoding(ctx, upload_decoder_ring);
}C }

Alternatives

Redirect to the canonical URL

Browsers have various heuristics to "help" their users by automatically encoding or normalizing input from the address bar to produce a URL that is compliant with the WHATWG URL standard, with extra changes to correct or prevent common mistakes (such as escaping spaces to %20 or collapsing /foo/../bar to /bar). These heuristics differ between browsers. When a web server becomes very specific about enforcing and redirecting toward a specific encoding, this can create a redirect loop.

The majority of past attempts, issues, research, and future considerations around this are summarised and aggregated in T106793: Pages with single quote in title are inaccessible (redirect loop). See also: T105265: Redirecting "~" to %7E causes a redirect loop in Chrome.

See also