The Apache mod_rewrite Problem
Update 24th August 2009: Apache 2.2.12 includes a B flag to RewriteRule that is meant to be used for this issue. It was introduced in 2.2.7, but broken until 2.2.12, if I read the changelog correctly. Thanks to Michael Stillwell for letting me know.
There is a long-standing problem with using mod_rewrite to give yourself nice URIs, moving data from the path to the query string in the process. On this page, I will outline the problem, some strange encounters along the way, and a number of different solutions, depending on your circumstances.
Problem
If we want the URI: http://www.example.org/blog/what_i_did_today to map internally to: http://www.example.org/entry.php?title=what_i_did_today then we can use the simple RewriteRule:
RewriteRule ^/blog/(.*)$ entry.php?title=$1
and everything works fine.
But by the time the URI gets to the RewriteRules, Apache has already unescaped it, and doesn't re-escape it for you. So if your URI contains, for example, an ampersand “&” and a plus “+”:
http://www.example.org/blog/A+_mum_&_dad
it will be rewritten to:
http://www.example.org/entry.php?title=A+_mum_&_dad
and so, assuming that your PHP is a normal install, it will think that the title variable contains “A _mum_” (remember that + is an encoded space in query strings) and that there is a second variable called _dad with an empty value.
Even worse, if your URI contains an escaped hash “#” (we know we have to escape it, otherwise your browser will intercept it as a fragment identifier), then: http://www.example.org/blog/F%23_major becomes: http://www.example.org/entry.php?title=F#_major and PHP (and CGI.pm in a Perl script) ignores everything after the #, setting the title variable to simply “F” - even QUERY_STRING loses the remainder.
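To see both failures without setting up a server, here is a minimal sketch using PHP's own parse_str() and parse_url(). It assumes these functions split a query string the same way PHP does when it fills $_GET, and the two strings are simply the rewritten URIs from the examples above.

```php
<?php
// The query string that reaches the script after the rewrite.
parse_str('title=A+_mum_&_dad', $vars);

// "+" decodes to a space and "&" starts a second variable, so
// $vars['title'] is "A _mum_" and $vars['_dad'] is an empty string.
var_dump($vars);

// An unescaped "#" begins a fragment, so the query stops at "F".
$parts = parse_url('http://www.example.org/entry.php?title=F#_major');
var_dump($parts['query'], $parts['fragment']);
```

The second var_dump() shows the query as “title=F”, with “_major” split off as the fragment.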
False solutions
Banning odd characters
None of that sort of talk round here! ;-)
Double encoding
If you can engineer your URLs to double encode anything important, then you're okay; the URL: http://www.example.org/blog/A%252b_mum_%2526_dad will become: http://www.example.org/entry.php?title=A%2b_mum_%26_dad which is what we want. But these URIs are yucky; we might as well just use the URI it rewrites to!
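If you do generate your own links, double encoding might look something like the sketch below. The variable names are only illustrative, and rawurldecode() stands in for the single unescape that Apache performs before the rule runs. Note that rawurlencode() emits upper-case hex digits, which are equivalent to the lower-case ones shown above.

```php
<?php
$title = 'A+_mum_&_dad';

// Encode twice when building the link.
$once  = rawurlencode($title);  // A%2B_mum_%26_dad
$twice = rawurlencode($once);   // A%252B_mum_%2526_dad
$link  = 'http://www.example.org/blog/' . $twice;

// Apache removes one layer of escaping (simulated here), leaving a
// correctly encoded value to be placed in the query string.
$after_apache = rawurldecode($twice);
parse_str('title=' . $after_apache, $vars);

echo $link, "\n";
echo $vars['title'], "\n";
```

The last line prints the original title, ampersand and plus included.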
Using PATH_INFO on its own
Perhaps instead of:
RewriteRule ^/blog/(.*)$ entry.php?title=$1
we could try:
RewriteRule ^/blog/(.*)$ entry.php/title=$1
This keeps “&” and “+” out of the query string, where they are special, but the URI has still been unescaped by the time the rule runs, so an escaped hash “#” or question mark “?” in the title still goes wrong.
RewriteMap
The mod_rewrite documentation mentions an escape function that sounds like it helps with this problem, which can be used with the RewriteMap command, if you can edit your server's httpd.conf file (the RewriteMap command can't be used in .htaccess files). Unfortunately, this function is for escaping data in a path, not a query string, and so it leaves &, +, and others totally alone. It does help with some characters such as “#” and “%”, if you know that's all you'll be getting.
Actual Solutions
SCRIPT_URI
If you're running your PHP (or whatever) script as a CGI, then the SCRIPT_URI environment variable contains exactly the data you want, properly unescaped, whichever RewriteRule you are using. So to get the title, we simply need to remove the initial part of the URI:
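// SCRIPT_URI is the full URI (scheme and host included), so strip everything up to and including /blog/
$title = preg_replace('#^https?://[^/]+/blog/#', '', $_SERVER['SCRIPT_URI']);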
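A slightly more defensive version could be wrapped in a small helper. This is only a sketch: title_from_script_uri() is a hypothetical name, and it assumes SCRIPT_URI holds the full URI including scheme and host.

```php
<?php
// Hypothetical helper: return whatever follows $prefix in the path of a
// full URI such as SCRIPT_URI, or null if the path does not start with it.
function title_from_script_uri($script_uri, $prefix = '/blog/')
{
    // Drop "scheme://host[:port]". A regex is used rather than parse_url()
    // because the unescaped title may itself contain "#" or "?".
    $path = preg_replace('#^[a-z][a-z0-9+.-]*://[^/]*#i', '', $script_uri);

    if (strpos($path, $prefix) !== 0) {
        return null;
    }
    return substr($path, strlen($prefix));
}

echo title_from_script_uri('http://www.example.org/blog/A+_mum_&_dad'), "\n";
echo title_from_script_uri('http://www.example.org/blog/F#_major'), "\n";
```

Anchoring the match at the start of the path also means a title that happens to contain “/blog/” is left alone.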
RewriteCond with PATH_INFO
Here's the easiest solution, which will work even if you only have access to a .htaccess file:
RewriteCond %{THE_REQUEST} /blog/([^?\ ]+)
RewriteRule ^.*$ index.php/%1
and in the script:
$title = substr($_SERVER['PATH_INFO'], 1);
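What's happening here is that THE_REQUEST is an Apache variable containing the entire request line (for example “GET /blog/F%23_major HTTP/1.1”) before Apache has unescaped it. The pattern captures everything after /blog/ up to the first question mark or space, and %1 passes that still-escaped text into PATH_INFO. This avoids the trouble with hashes and question marks that using PATH_INFO on its own runs into.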
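To check what the RewriteCond pattern captures, the same regular expression can be tried in PHP against a sample request line. This is a sketch only: capture_from_request_line() is a hypothetical helper, the request lines are made up from the examples on this page, and rawurldecode() stands in for the unescaping that happens before the script reads PATH_INFO.

```php
<?php
// Hypothetical helper mirroring: RewriteCond %{THE_REQUEST} /blog/([^?\ ]+)
function capture_from_request_line($request_line)
{
    // Capture everything after /blog/ up to the first "?" or space.
    if (preg_match('#/blog/([^?\ ]+)#', $request_line, $m)) {
        return $m[1];
    }
    return null;
}

// The raw request line still has the hash escaped as %23.
$raw = capture_from_request_line('GET /blog/F%23_major?page=2 HTTP/1.1');
echo $raw, "\n";               // still escaped
echo rawurldecode($raw), "\n"; // decoded title
```

The capture stops before “?page=2”, so any real query string on the request is left out of the title.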
Using both PATH_INFO and escape together
Alternatively, as hinted at above, since the escape function is meant for escaping data in a path, we can combine it with PATH_INFO to get another solution. If you can edit your httpd.conf file, you can use this:
RewriteMap escape int:escape
RewriteRule ^/blog/(.*)$ index.php/${escape:$1}
and in the script:
$title = substr($_SERVER['PATH_INFO'], 1);
Conclusion
I hope I have shown how we can keep nice URLs, containing ampersands and other normal characters, and have that information correctly passed to our script.
In this case, it has been quite simple, as we're only passing one item of data around. If we have more variables, we will have to pass them all in via a RewriteRule (e.g. using the second solution above):
RewriteCond %{THE_REQUEST} /([^?\ ]+)
RewriteRule ^.*$ index.php/%1
And then split it up using a simple loop:
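$extra = explode('/', substr($_SERVER['PATH_INFO'], 1));
foreach ($extra as $segment) {
    // Split on the first "=" only, so values may themselves contain "=".
    $pair = explode('=', $segment, 2);
    if ($pair[0] !== '') {
        $_GET[$pair[0]] = isset($pair[1]) ? $pair[1] : '';
    }
}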
Then the following URI: http://www.example.org/foo=1/bar=3&4/baz=6 will become: http://www.example.org/index.php/foo=1/bar=3&4/baz=6 and the PHP script can access the variables accordingly.
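The splitting step can also be written as a function that returns an array rather than writing into $_GET, which makes it easy to try out. This is a minimal sketch: parse_path_vars() is a hypothetical name, and the input is the PATH_INFO value the example URI above would produce.

```php
<?php
// Hypothetical helper: turn "/foo=1/bar=3&4/baz=6" into an associative array.
function parse_path_vars($path_info)
{
    $vars = array();
    foreach (explode('/', trim($path_info, '/')) as $segment) {
        // Split on the first "=" only; "&" has no special meaning here.
        $pair = explode('=', $segment, 2);
        if ($pair[0] === '') {
            continue; // skip empty segments and segments with no name
        }
        $vars[$pair[0]] = isset($pair[1]) ? $pair[1] : '';
    }
    return $vars;
}

$vars = parse_path_vars('/foo=1/bar=3&4/baz=6');
var_dump($vars);
```

Here bar keeps its full value “3&4”, which is the point of avoiding the query string.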