The Apache mod_rewrite Problem

Update 24th August 2009: Apache 2.2.12 includes a B flag to RewriteRule that is meant to be used for this issue. It was introduced in 2.2.7, but broken until 2.2.12, if I read the changelog correctly. Thanks to Michael Stillwell for letting me know.

There is a long-standing problem with using mod_rewrite to give yourself nice URIs by moving data from the path to the query string. On this page, I will outline the problem, some strange encounters along the way, and a number of different solutions, depending on your circumstances.

Problem

If we want the URI: http://www.example.org/blog/what_i_did_today to map internally to: http://www.example.org/entry.php?title=what_i_did_today then we can use the simple RewriteRule:

RewriteRule ^/blog/(.*)$ entry.php?title=$1

and everything works fine.

But by the time the URI gets to the RewriteRules, Apache has already unescaped it, and it doesn't re-escape it for you. So if your URI contains, for example, an ampersand “&” and a plus “+”: http://www.example.org/blog/A+_mum_&_dad it will be rewritten to: http://www.example.org/entry.php?title=A+_mum_&_dad and so, assuming a normal PHP install, PHP will think that the title variable contains “A _mum_” (remember that + is an encoded space in query strings) and that there is a second variable called _dad with an empty value.
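To see the effect without touching the server, here is a minimal sketch that feeds the rewritten query string to parse_str(), which follows the same parsing rules PHP uses to fill $_GET. The helper name parse_query is only an illustration and is not part of the original setup.

```php
<?php
// Hypothetical helper: parse a raw query string the way PHP fills $_GET.
function parse_query(string $query): array
{
    $vars = [];
    parse_str($query, $vars);
    return $vars;
}

// The query string produced by the rewrite in the example above.
$vars = parse_query('title=A+_mum_&_dad');

// "+" has been decoded to a space and "&" has started a second variable:
// $vars['title'] is "A _mum_" and $vars['_dad'] is an empty string.
var_dump($vars);
```

The title has lost both its plus sign and everything after the ampersand, which is exactly the damage described above.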

Even worse, if your URI contains an escaped hash “#” (we know we have to escape it, otherwise your browser will intercept it as a fragment identifier), then: http://www.example.org/blog/F%23_major becomes: http://www.example.org/entry.php?title=F#_major and PHP (like CGI.pm in a Perl script) ignores everything after the #, setting the title variable to simply “F”; even QUERY_STRING loses the remainder.
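As a rough analogy only (this is PHP's own URL parser, not Apache's internal handling), splitting the rewritten URI with parse_url() shows why an unescaped “#” is so destructive: everything after it is treated as a fragment and never reaches the query string.

```php
<?php
// Split the rewritten URI from the example above into its components.
$parts = parse_url('http://www.example.org/entry.php?title=F#_major');

// The query component stops at the "#"; the rest becomes the fragment.
echo $parts['query'], "\n";    // title=F
echo $parts['fragment'], "\n"; // _major
```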

False solutions

Banning odd characters

None of that sort of talk round here! ;-)

Double encoding

If you can engineer your URLs to double encode anything important, then you're okay; the URL: http://www.example.org/blog/A%252b_mum_%2526_dad will become: http://www.example.org/entry.php?title=A%2b_mum_%26_dad which is what we want. But these URIs are yucky; we might as well just use the URI it rewrites to!
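If you did want to generate such links, a minimal sketch might look like the following. The function name double_encode is hypothetical, and rawurlencode() emits uppercase hex digits (%2B rather than %2b), which is equivalent in a URI.

```php
<?php
// Hypothetical helper: percent-encode a path segment twice, so that one
// round of unescaping by Apache still leaves a safely encoded value.
function double_encode(string $segment): string
{
    return rawurlencode(rawurlencode($segment));
}

// Build the double-encoded link for the title used earlier.
echo '/blog/' . double_encode('A+_mum_&_dad'), "\n";
// /blog/A%252B_mum_%2526_dad
```

Decoding once, as Apache does, gives A%2B_mum_%26_dad, which is then safe to place in the query string.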

Using PATH_INFO on its own

Perhaps instead of:

RewriteRule ^/blog/(.*)$ entry.php?title=$1

we could try:

RewriteRule ^/blog/(.*)$ entry.php/title=$1

– simply using PATH_INFO instead of the QUERY_STRING, and splitting it into keys and values ourselves. Unfortunately this doesn't help with the “#” situation, and annoyingly this URI: http://www.example.org/blog/why%3f_who_knows! will become: http://www.example.org/entry.php/title=why?_who_knows! leading to a PATH_INFO of “/title=why” and a QUERY_STRING of “_who_knows!”

RewriteMap

The mod_rewrite documentation mentions an internal escape function that sounds like it helps with this problem. It can be used with the RewriteMap directive, if you can edit your server's httpd.conf file (RewriteMap can't be used in .htaccess files). Unfortunately, this function is for escaping data in a path, not a query string, and so it leaves &, +, and others totally alone. It does help with some characters such as “#” and “%”, if you know that's all you'll be getting.

Actual solutions

SCRIPT_URI

If you're running your PHP (or whatever) script as a CGI, then the SCRIPT_URI environment variable contains exactly the data you want, properly unescaped, whichever RewriteRule you are using. Note that SCRIPT_URI is the full URI, including the scheme and host name, so to get the title we need to remove everything up to and including the /blog/ prefix:

$title = preg_replace('#^.*?/blog/#', '', $_SERVER['SCRIPT_URI']);

RewriteCond with PATH_INFO

Here's the easiest solution, which will work even if you only have access to a .htaccess file:

RewriteCond %{THE_REQUEST} /blog/([^?\ ]+)
RewriteRule ^.*$ index.php/%1

The PATH_INFO will now contain what we want:

$title = substr($_SERVER['PATH_INFO'], 1);

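What's happening here is that THE_REQUEST is an Apache variable containing the entire request line (for example “GET /blog/F%23_major HTTP/1.1”) before Apache has unescaped it. Because the captured text is still escaped when it is appended to the path, we avoid the hash and question mark problems that we hit when using PATH_INFO on its own.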
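To make the capture concrete, here is a small sketch that applies the same pattern to a sample request line in PHP. The helper capture_from_request_line and the sample request lines are assumptions for illustration; in the real setup Apache performs the match, and the rawurldecode() call merely stands in for the unescaping the server does before the value appears in PATH_INFO.

```php
<?php
// Hypothetical helper: mimic the RewriteCond pattern /blog/([^?\ ]+)
// against a raw HTTP request line.
function capture_from_request_line(string $requestLine): ?string
{
    if (preg_match('#/blog/([^? ]+)#', $requestLine, $m)) {
        return $m[1];
    }
    return null; // the request is not for /blog/...
}

// The capture stops at the first "?" or space, and is still escaped.
$raw = capture_from_request_line('GET /blog/F%23_major?x=1 HTTP/1.1');
echo $raw, "\n";               // F%23_major

// Unescaping once yields the title the script should see.
echo rawurldecode($raw), "\n"; // F#_major
```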

Using both PATH_INFO and escape together

Alternatively, as hinted at above, as the escape function is for escaping data in a path, we can combine it with PATH_INFO to get another solution. If you can edit your httpd.conf file, you can also use this:

RewriteMap escape int:escape
RewriteRule ^/blog/(.*)$ index.php/${escape:$1}

Then again the PATH_INFO variable will hold the data:

$title = substr($_SERVER['PATH_INFO'], 1);

Conclusion

I hope I have shown how we can keep nice URLs, containing ampersands and other normal characters, and have that information correctly passed to our script.

In this case, it has been quite simple, as we're only passing one item of data around. If we have more variables, we will have to pass them all in via a RewriteRule (e.g. using the second solution above):

RewriteCond %{THE_REQUEST} /([^?\ ]+)
RewriteRule ^.*$ index.php/%1

And then split it up using a simple loop:

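$extra = explode('/', substr($_SERVER['PATH_INFO'], 1));
foreach ($extra as $e) {
    // Split on the first "=" only, so values may themselves contain "=".
    $e = explode('=', $e, 2);
    if ($e[0] !== '') {
        // A segment with no "=" becomes a variable with an empty value.
        $_GET[$e[0]] = isset($e[1]) ? $e[1] : '';
    }
}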

Then the following URI: http://www.example.org/foo=1/bar=3&4/baz=6 will become: http://www.example.org/index.php/foo=1/bar=3&4/baz=6 and the PHP script can access the variables accordingly.
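As a worked example of that last step, the sketch below runs the splitting logic on the PATH_INFO from the URI above. The function name path_info_to_vars is hypothetical; it returns an array rather than writing to $_GET so that it can be tried in isolation.

```php
<?php
// Hypothetical helper: turn "/key=value/key=value" into an array.
function path_info_to_vars(string $pathInfo): array
{
    $vars = [];
    foreach (explode('/', ltrim($pathInfo, '/')) as $segment) {
        if ($segment === '') {
            continue; // skip empty segments such as a trailing slash
        }
        // Split on the first "=" only; a missing value becomes ''.
        $pair = explode('=', $segment, 2);
        $vars[$pair[0]] = $pair[1] ?? '';
    }
    return $vars;
}

// PATH_INFO for http://www.example.org/foo=1/bar=3&4/baz=6
$vars = path_info_to_vars('/foo=1/bar=3&4/baz=6');
echo $vars['bar'], "\n"; // 3&4
```

The ampersand survives intact because the path is split on “/” and “=” only, never on “&”.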