You may have encountered the name "mod_rewrite" before
when surfing the web. For all of our readers who are
not intimately familiar with this nifty Apache Web
Server module - and, of course, for those who don't
know it all - we are presenting this small
introductory tutorial as a multipart serial.
Module mod_rewrite is a package of program routines
which can be added to the Apache Web Server.
(Note that it will not run under other web servers!)
Its primary function is the manipulation of URLs.
The module is very versatile as we are going to
illustrate here with a number of real world examples.
However, be very careful and meticulous when working
with it! Some mistakes you might be liable to make
could generate a logical loop, causing a never-ceasing
100% CPU load.
To steer clear of this, we will start off with some
very simple examples.
Before we can get going, however, you will have to
check whether the module is installed on your web
server at all.
There are several ways to go about this:
1. Ask your system administrator - provided he or she
knows. They really should, but unfortunately some
simply do not ...
Take care, though: if you are sharing your host
server with hundreds of other domains, your inquiry
might rouse some sleeping dogs, as usage of
mod_rewrite will always entail some increased CPU
load.
2. Check your Apache configuration file if you can
access it. One possible standard path might be:
/etc/httpd/httpd.conf
However, your mileage may obviously vary.
3. Check it out with one of the following examples.
If it works fine, mod_rewrite is indeed installed
on your system. If it isn't, you will get the
following message when calling any web page of your
choice: "Internal Server Error"
Also, you will see this entry in file "error.log":
"Invalid command 'RewriteEngine', perhaps mis-spelled
or defined by a module not included in the server
configuration."
If your site generates heavy traffic, this method
is not recommended, as every visitor will receive
this very same error message during your test.
So now let's dig into our first practical example!
We will assume that you will be using mod_rewrite
only for your own web site, i.e. not as a generalized
cross server setup.
To effect this, some entries in file .htaccess are
required.
The .htaccess File
------------------
For this technique to work, you will need to upload
a file named ".htaccess" (please note the period/dot
at the beginning of the file name!) to your server
directory.
This can be done via telnet or ftp.
(Warning! .htaccess should only be uploaded in "ASCII
mode", i.e. not in binary mode!)
If you already have a ".htaccess" file, for example
one with the following entries:
Options +Includes +ExecCGI
AddType text/x-server-parsed-html .html
simply add our code sample to it.
-----------------------------------
IMPORTANT!
----------
ADJUSTMENTS IN FILE ".htaccess":
Edit the file in a plain-text (ASCII)
editor such as Notepad.
-----------------------------------
The first two entries will start the module:
RewriteEngine on
Options +FollowSymlinks
Tip: Entry "RewriteEngine off" will disable all
rewriting rules. This is a very useful feature:
instead of having to comment out all the rules,
all you need to do is set an "off".
If your system administrator does not allow for
implementation of "Options +FollowSymlinks", you will
not be able to restrict usage of mod_rewrite to
your directories but will instead have to apply it
server wide.
The next required entry is this:
RewriteBase /
"/" stands for the base URL. Should you have another
one, you will want to include it. However, "/" is
normally the entry for "http://www.YourDomain.com".
And now to the entries proper!
Let us assume that you want to block unauthorized
access to your file .htaccess. On some servers
you can easily read this file simply by entering a URL
of the following format in your browser's address
field:
http://www.domain.com/.htaccess - a serious
security gap, as your .htaccess file's contents may
reveal more about your site's setup to the educated
eye than you may want others to know.
To block this access, enter the following:
RewriteRule ^\.htaccess$ - [F]
This rule translates to:
If someone tries to access the file .htaccess, the
server shall return HTTP status code 403 (Forbidden).
The file name ^\.htaccess$ is contained in a regular
expression, to wit:
^ Start of line anchor
$ End of line anchor
\. In regular expressions the dot "." denotes a
meta character and must be protected by a
backslash (\) if you want an actual dot (period)
instead.
The file name must be located exactly between start
and end of line anchor. This will ensure that only
this specific file name and no other will generate
the error code.
[F] : special flag "forbidden".
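The effect of the anchors and of the escaped dot can
be tried out in any language that supports regular
expressions. The following is only a sketch: it uses
Python's "re" module as a stand-in for Apache's regex
engine (we assume both agree on a pattern this simple),
and the helper name "is_blocked" as well as the sample
file names are made up for illustration.

```python
import re

# Same pattern as in the RewriteRule above.
PATTERN = re.compile(r"^\.htaccess$")

def is_blocked(path):
    """Return True if the rule's pattern matches the requested path."""
    return PATTERN.search(path) is not None

if __name__ == "__main__":
    # Only the first name lies exactly between the two anchors.
    for path in [".htaccess", "old.htaccess", ".htaccess.bak", "xhtaccess"]:
        print(path, is_blocked(path))
    # Without the backslash, the dot is a wildcard and "xhtaccess" matches too.
    print(re.search(r"^.htaccess$", "xhtaccess") is not None)
```

Only ".htaccess" itself is matched; leaving out the
backslash would let the dot stand for any character.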
In this example, the complete ".htaccess" file will
now consist of these lines:
RewriteEngine on
Options +FollowSymlinks
RewriteBase /
RewriteRule ^\.htaccess$ - [F]
If we add our code to a pre-existing ".htaccess" file,
we might, for example, get the following entries:
Options +Includes +ExecCGI
AddType text/x-server-parsed-html .html
RewriteEngine on
Options +FollowSymlinks
RewriteBase /
RewriteRule ^\.htaccess$ - [F]
This introduction covers the basics required to
operate with mod_rewrite.
In the second part of this tutorial we will explain
the use of conditions in configuring the module.
You may check up general documentation here:
--------------------------------------------
Module mod_rewrite URL Rewriting Engine:
http://www.apache.org/docs/mod/mod_rewrite.html
A Users Guide to URL Rewriting with the
Apache Webserver:
http://www.engelschall.com/pw/apache/rewriteguide/
In this tutorial's last instalment we started off with
a discussion of the basics of Module mod_rewrite. In
the example reviewed there we made use of a rule
which, put in full words, states:
"If access to file .htaccess is attempted, return
an error message stating that access is denied."
This rule is valid globally, i.e. everyone will
receive the specified error message.
We can, however, restrict a rule by what is termed
"rule conditions" - in this case, the rule will only
be executed if the condition set has actually been
met.
Syntax: The condition must precede the rule!
Let us explain this procedure with an example.
(The lines below are entries in file ".htaccess".)
RewriteEngine on
Options +FollowSymlinks
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon
RewriteRule ^.*$ - [F]
The first three lines were covered in detail in Part 1
of this tutorial. Their function is to initialize the
rewriting engine.
The last two lines will refuse access to a spider
carrying UserAgent "EmailSiphon". This specific
spider is an email harvester culling addresses from
web pages.
Our line:
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon
is made up of the following three parts:
Directive: RewriteCond
TestString: %{HTTP_USER_AGENT}
CondPattern: ^EmailSiphon
The TestString is a server variable which
is written in the general form of
"%{NAME_OF_VARIABLE}".
In our example, "NAME_OF_VARIABLE" is
"HTTP_USER_AGENT".
CondPattern is a regular expression.
Before we continue with its specifics, let us take a
look at regular expressions and their function in
general.
Regular expressions
-------------------
Regular expressions are a means of describing text
patterns. They are used to check if a text pattern is
present in any given text. Once determined, this
pattern can then be manipulated.
Regular expressions resemble a small, compact
programming language in their own right.
E.g. the substitution command "s/abc/xyz/g", as used
in Perl or sed, will globally replace the string "abc"
in a text by "xyz".
Here is an overview of the most important elements
with some examples:
.(dot) - wildcard (any single character)
| - alternation (e.g. /abc|def/)
* - quantifier (the preceding element may occur any
number of times, including zero)
^ $ - line anchors
s - substitution operator (string1 to be replaced
by string2)
g - modifier (replace all matches in the text)
Regular expressions are constructed from these
elements and alphanumeric characters.
Regular expressions are not used in isolation;
instead, they are integrated in other
tools, e.g. in languages like Perl or in text editors
such as Emacs.
In connection with Module mod_rewrite they are used in
the directives RewriteRule and RewriteCond.
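The elements listed above can be tried out in a few
lines. This is a sketch in Python, whose "re" module we
assume behaves like the Perl-style expressions
described here for these simple cases; the sample text
and the variable names are invented for the example.

```python
import re

text = "abc def abc"

# Equivalent of s/abc/xyz/g: replace every occurrence of "abc".
replaced = re.sub(r"abc", "xyz", text)

# Alternation: matches either "abc" or "def".
found = re.findall(r"abc|def", text)

# Dot and star: "a", then any number of characters, then "c".
# The star is greedy, so the match runs to the last "c".
greedy = re.search(r"a.*c", text).group(0)

# Anchors: the whole text would have to be exactly "abc".
whole = re.search(r"^abc$", text) is not None

if __name__ == "__main__":
    print(replaced)
    print(found)
    print(greedy)
    print(whole)
```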
"^" represents the beginning of a string. It follows
that the UserAgent must begin with the string
"EmailSiphon". ("NewEmailSiphon", for example, would
not match, so in that case the condition would not
be met.)
But as this particular regular expression doesn't
contain the character "$" (end of line anchor), the
UserAgent could, for example, be "EmailSiphon2".
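To check which UserAgent strings satisfy the condition,
the pattern can be tested outside of Apache. A minimal
sketch, assuming that Python's "re" module treats this
pattern the same way Apache does; the function name
"condition_met" is our own invention.

```python
import re

# CondPattern from the RewriteCond line above.
COND_PATTERN = re.compile(r"^EmailSiphon")

def condition_met(user_agent):
    """Return True if the UserAgent starts with "EmailSiphon"."""
    return COND_PATTERN.search(user_agent) is not None

if __name__ == "__main__":
    for agent in ["EmailSiphon", "EmailSiphon2", "NewEmailSiphon"]:
        print(agent, condition_met(agent))
```

The first two UserAgents meet the condition, the third
one does not.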
The last script line
RewriteRule ^.*$ - [F]
defines what will happen when a spider is requesting
access.
The regular expression "^.*$" matches any requested
path. In other words: if access to any file is
requested, the error message "forbidden" will be
returned.
The dot "." in the regular expression is a meta symbol
(wildcard) and signifies any single character.
"*" signifies that the preceding character may occur
any number of times, including zero. In this case,
regardless of which specific page is called, an error
message will be displayed.
EmailSiphon is, of course, not the only email
harvester. Another famous member of this family is
"ExtractorPro".
So let's say we want to fend off this spider as well.
In this case we will require another condition to be
met.
This gives us the following entries to file ".htaccess":
RewriteEngine on
Options +FollowSymlinks
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro
RewriteRule ^.*$ - [F]
The third argument ([OR]) in line:
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
is termed a "flag". The two flags relevant to
conditions here are:
NC (no case)
OR (or next condition)
Flag "NC" permits case insensitive testing of the
condition pattern.
Example
-------
RewriteCond %{HTTP_USER_AGENT} ^emailsiphon [NC]
This line specifies that both "emailsiphon" and
"EmailSiphon" shall be recognized.
If you wish to use multiple flags, you may delimit
them by commas.
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro
There is no limit to the number of conditions.
Thus, you could block 10, 100, 1000 or more
established email harvesters. Whether to define
1000 conditions is merely a question of server
performance and of ".htaccess" readability.
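How the flags [NC] and [OR] combine in the
two-condition example above can be mimicked in a few
lines. This is only an illustration of the logic, not
of how Apache works internally; the list "CONDITIONS"
and the function "is_forbidden" are hypothetical names,
and Python's "re" module stands in for Apache's regex
engine.

```python
import re

# Each entry: (pattern, case_insensitive). The entries are joined
# with OR, as the [OR] flag does in the .htaccess example.
CONDITIONS = [
    (r"^EmailSiphon", True),    # flags [NC,OR]
    (r"^ExtractorPro", False),  # last condition, no flags
]

def is_forbidden(user_agent):
    """Return True if any of the OR'ed conditions is met."""
    for pattern, no_case in CONDITIONS:
        flags = re.IGNORECASE if no_case else 0
        if re.search(pattern, user_agent, flags):
            return True
    return False

if __name__ == "__main__":
    for agent in ["emailsiphon/1.0", "ExtractorPro", "extractorpro", "Mozilla/4.0"]:
        print(agent, is_forbidden(agent))
```

Note that only the first condition carries "NC", so
"extractorpro" in lower case is not caught.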
In the above example, the string "HTTP_USER_AGENT"
is being used.
Further server variables are:
REMOTE_HOST
REMOTE_ADDR
For example, if you want to block the spider coming
from <http://www.cyveillance.com>, you will use variable
"REMOTE_HOST". Thus:
RewriteCond %{REMOTE_HOST} ^www\.cyveillance\.com$
RewriteRule ^.*$ - [F]
The dot "." in the domain name must be protected
by "\" (backslash), otherwise it would be treated as
a meta character matching any character.
If you want to block any given IP, the condition will
read:
RewriteCond %{REMOTE_ADDR} ^216\.32\.64\.10$
RewriteRule ^.*$ - [F]
In the regular expression, enter the IP in its
entirety, delimited by the line anchors.
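Why both anchors matter for a single IP can be shown
with a short test. The sketch below again uses Python's
"re" module in place of Apache (an assumption); the
pattern without "$" and the helper "matches" are added
purely for comparison.

```python
import re

# Pattern from the condition above, and the same pattern without "$".
EXACT = re.compile(r"^216\.32\.64\.10$")
NO_END_ANCHOR = re.compile(r"^216\.32\.64\.10")

def matches(pattern, addr):
    """Return True if the pattern matches the address string."""
    return pattern.search(addr) is not None

if __name__ == "__main__":
    for addr in ["216.32.64.10", "216.32.64.100", "216.32.64.109"]:
        print(addr, matches(EXACT, addr), matches(NO_END_ANCHOR, addr))
```

Without the end of line anchor, addresses such as
216.32.64.100 would be blocked as well.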
You may even exclude a whole IP range from access:
RewriteCond %{REMOTE_ADDR} ^216\.32\.64\.
RewriteRule ^.*$ - [F]
This example will cover all individual IPs from
"216.32.64.0" through "216.32.64.255".
Here's a little teaser quiz for you to check out your
skills. (The solution will be featured in the next
part of our tutorial.) Enjoy!
RewriteCond %{REMOTE_ADDR} ^216\.32\.64
RewriteRule ^.*$ - [F]
Quiz question:
--------------
If we don't write "^216\.32\.64\." for a regular
expression in the configuration above, but
"^216\.32\.64" instead, will we get the identical
effect, i.e. will this exclude the same IPs?
Up until now we have used a simple RewriteRule
which will generate an error message. In the 3rd part
of our tutorial we will analyze how RewriteRule may be
used to redirect visitors to specific files.
In the two preceding parts of this tutorial we
explained the basics of Rules and Conditions.
We will now follow up with two examples to illustrate
their use for somewhat more complex applications.
The first example deals with dynamically generated
pages while the second example will cover calling up
".txt" files.
For our first example, let's assume that you want to
sell several items of merchandise on your web site.
Your clients are guided to various detailed product
descriptions via a script:
http://www.yoursite.com/cgi-bin/shop.cgi?product1
http://www.yoursite.com/cgi-bin/shop.cgi?product2
http://www.yoursite.com/cgi-bin/shop.cgi?product3
These URLs are included as links on your site.
If you want to submit these dynamic pages to the
search engines, you are confronted with the problem
that most of them will not accept URLs containing
the "?" character.
However, it would be perfectly possible to submit a
URL of the following format:
http://www.yoursite.com/cgi-bin/shop.cgi/product1
Here, the "?" character has been replaced by "/".
Yet more pleasing to the eye would be a URL of this
type:
http://www.yoursite.com/shop/product1
To the search engine, this appears to be just another
acceptable hyperlink, with "shop" representing a
directory containing files "product1", "product2", etc.
If a visitor clicks this link on a search engine's
results page, this URL must be reconverted to make sure
that "shop.cgi?product1" will actually be called.
To this effect we will make use of mod_rewrite with the
following entries:
RewriteEngine on
Options +FollowSymlinks
RewriteBase /
RewriteRule ^(.*)shop/(.*)$ $1cgi-bin/shop.cgi?$2
The variables $1 and $2 constitute so-called
"backreferences". Each refers to a group in
parentheses in the pattern.
Everything in the requested URL that is located
before "shop" is stored in $1, and everything
following "shop/" is stored in $2.
Up to this point our examples made use of rules
such as this one:
RewriteRule ^\.htaccess$ - [F]
However, we did not yet achieve a true rewrite in the
sense that one URL would be switched to another.
For the entry in our current example:
RewriteRule ^(.*)shop/(.*)$ $1cgi-bin/shop.cgi?$2
this general syntax applies:
RewriteRule currentURL rewrittenURL
As you can see, this command executes a real rewrite.
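The way the two groups are captured and reinserted can
be simulated with an ordinary regex substitution. This
is a sketch only: Python's "\1" and "\2" play the part
of $1 and $2, the function name "rewrite" is made up,
and we assume the path is matched without a leading
slash, as in the ".htaccess" examples of this tutorial.

```python
import re

def rewrite(path):
    """Apply the shop rule; a path that does not match is returned unchanged."""
    return re.sub(r"^(.*)shop/(.*)$", r"\1cgi-bin/shop.cgi?\2", path, count=1)

if __name__ == "__main__":
    for path in ["shop/product1", "en/shop/product2", "index.html"]:
        print(path, "->", rewrite(path))
```

For "shop/product1" the first group is empty and the
second one holds "product1", which yields
"cgi-bin/shop.cgi?product1".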
In addition to installing the ".htaccess" file,
all links in your normal HTML pages which follow the
format "cgi-bin/shop.cgi?product" must be changed to:
"shop/product" (without the quotes).
When a spider visits a normal HTML page of this kind
it will also follow or crawl the product links because
there is no question mark contained in the link anymore
to prevent it from doing so.
So employing this method you can convert dynamically
generated product descriptions into seemingly static
web pages and feed them to the search engines.
---------
In our second example we will discuss how to
redirect calls for ".txt" files to a program script.
Many webspace providers running Apache will feature
system log files only in common format. What this means
is that these logs will not store visitor Referrers and
UserAgents.
However, in relation to "robots.txt" calls it is
preferable to have access to this information in order
to learn more about visiting spiders than merely their
IPs.
To effect this, the entries in ".htaccess" should be as
follows:
RewriteEngine on
Options +FollowSymlinks
RewriteBase /
RewriteRule ^robots\.txt$ /text.cgi?%{REQUEST_URI}
Now, when "robots.txt" is called, the applied Rule
will redirect your visitor to the program script
"text.cgi".
Furthermore, a variable is passed to the script, which
will be processed by the program.
"REQUEST_URI" holds the path of the file that was
requested. In our example this is "/robots.txt".
The script will now read the contents of "robots.txt"
and will forward them to the web browser or the search
engine spider.
Finally, the visitor hit is archived in the log file.
To this effect, the script will pull the environmental
variables "$ENV{'HTTP_USER_AGENT'}" etc. This will
provide the required information.
Here is the source code for the cgi script mentioned
above:
<BEGIN SOURCE CODE>
#!/usr/bin/perl
# If required, adjust line above to point to Perl 5.
######################################################
# (c) Copyright 2000 by fantomaster.com #
# All rights reserved. #
######################################################
use strict;
use warnings;

# Log location, relative to the directory the script runs in.
my $stats_dir = "stats";
my $log_file  = "stats.log";

# Visitor details passed in by the web server (empty if not set).
my $remote_host   = $ENV{'REMOTE_HOST'}     || '';
my $remote_addr   = $ENV{'REMOTE_ADDR'}     || '';
my $user_agent    = $ENV{'HTTP_USER_AGENT'} || '';
my $referer       = $ENV{'HTTP_REFERER'}    || '';
# The RewriteRule passes the requested URI as the query string.
my $document_name = $ENV{'QUERY_STRING'}    || '';

# Read the real robots.txt so it can be passed through unchanged.
my @text;
if (open(my $file, '<', 'robots.txt')) {
    @text = <$file>;
    close($file);
}

my $date = get_date();
log_hits("$date $remote_host $remote_addr $user_agent $referer $document_name\n");

print "Content-type: text/plain\n\n";
print @text;
exit;

# Return the current local time as "YYYY-MM-DD, HH:MM:SS".
sub get_date {
    my ($sec, $min, $hour, $mday, $mon, $year) = localtime();
    # localtime() counts months from 0 and years from 1900.
    return sprintf("%04d-%02d-%02d, %02d:%02d:%02d",
        $year + 1900, $mon + 1, $mday, $hour, $min, $sec);
}

# Append one line to the hit log; skip logging if it cannot be opened.
sub log_hits {
    my ($line) = @_;
    open(my $hits, '>>', "$stats_dir/$log_file") or return;
    print $hits $line;
    close($hits);
}
<END SOURCE CODE>
To install the script, upload it to your web site's
main or DocumentRoot directory by ftp and change file
permissions to 755.
Next, create the directory "stats".
If your server's configuration does not permit
execution of Perl or CGI scripts in the main directory
(DocumentRoot), you may wish to try the following
RewriteRule instead:
RewriteRule ^robots\.txt$ /cgi-bin/text.cgi?%{REQUEST_URI}
Note, however, that in this case you will have to
modify the paths accordingly in the program script!
-------
Finally, here's the solution to our quiz from the
previous issue of fantomNews:
=================================================
RewriteCond %{REMOTE_ADDR} ^216\.32\.64
RewriteRule ^.*$ - [F]
Quiz question:
--------------
If we don't write "^216\.32\.64\." for a regular
expression in the configuration above, but
"^216\.32\.64" instead, will we get the identical
effect, i.e. will this exclude the same IPs?
=================================================
The regular expression ^216\.32\.64
will apply e.g. to the following strings:
216.32.64
216.32.640
216.32.641
216.32.64a
216.32.64abc
216.32.64.12
216.32.642.12
Hence, "4" may be followed by any character string.
However, IP addresses can only have the maximal value
255.255.255.255 - which implies that e.g.
216.32.642.12 is not a valid IP.
The only valid IP in the list above is 216.32.64.12!
Although the two regular expressions "^216\.32\.64\."
and "^216\.32\.64" allow for different strings, every
part of an IP address is limited to 0-255, so both
expressions exclude exactly the same range of IPs.
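If you prefer to verify this rather than take our word
for it, the check can be brute-forced. The sketch below
uses Python's "re" module in place of Apache (an
assumption) and tests every valid address of the form
216.32.c.d; addresses with another beginning cannot
match either pattern because of the "^" anchor. The
names used are our own.

```python
import re

WITH_DOT = re.compile(r"^216\.32\.64\.")
WITHOUT_DOT = re.compile(r"^216\.32\.64")

def matching_addresses(pattern):
    """Collect every valid address 216.32.c.d (c, d in 0..255) the pattern matches."""
    return {
        f"216.32.{c}.{d}"
        for c in range(256)
        for d in range(256)
        if pattern.search(f"216.32.{c}.{d}")
    }

blocked_with_dot = matching_addresses(WITH_DOT)
blocked_without_dot = matching_addresses(WITHOUT_DOT)

if __name__ == "__main__":
    print(blocked_with_dot == blocked_without_dot)
    print(len(blocked_with_dot))
```

Both sets are identical and contain the 256 addresses
from 216.32.64.0 through 216.32.64.255.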
Special Directives and Examples
In this final part of our tutorial we will take
a look at those special directives we haven't covered
yet.
These directives cannot be defined on directory level.
This means that you will have to be able to edit the
Apache webserver's configuration file (httpd.conf).
These permissions will usually only be assigned to
users "root" or "admin".
If you wish to log all operations effected by
mod_rewrite you can activate logging with the
following entries:
RewriteLog /usr/local/apache/logs/mod_rewrite_log
RewriteLogLevel 1
These entries are not written into the file
".htaccess" but in "Section 2: 'Main' server
configuration" of file "httpd.conf".
All mod_rewrite manipulations will be logged
in this file. The log file can have any name you
prefer. It can be referenced as an absolute path or
relative to ServerRoot.
If you wish to maintain separate log files for
individual virtual hosts, you will have to place the
pertinent entries in "Section 3: Virtual Hosts",
e.g.:
<VirtualHost 192.168.1.1>
ServerAdmin webmaster@yourdomain.com
DocumentRoot /usr/www/htdocs/yourdomain
ServerName yourdomain.com
RewriteLog /usr/apache/logs/yourdomain_mod_rewrite_log
RewriteLogLevel 1
</VirtualHost>
(Note: If your email reader or browser wraps these
lines take care to enter them unwrapped in your file!)
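The RewriteLogLevel can be set from 0 (no logging) to
9 (practically everything is logged). Normally, 1 will
do fine. Higher levels are only required for debugging
purposes.
Note that Apache 2.4 dropped both directives; there,
rewrite logging is configured via the "LogLevel"
directive, e.g. "LogLevel alert rewrite:trace3".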
--------
Another feature which is very handy for cloaking
purposes is the so-called Rewriting Map. These maps are
files consisting of key/value pairs, e.g. in the
simple format of an ordinary text file:
cde2c920.infoseek.com spider
205.226.201.32 spider
cde2c923.infoseek.com spider
205.226.201.35 spider
cde2c981.infoseek.com spider
205.226.201.129 spider
cde2cb23.infoseek.com spider
205.226.203.35 spider
These keys are, as you can see, hostnames or IPs.
In this simplistic example the value is always the
same, namely "spider".
This directive is entered either in the server
section 2 or in the virtual host section 3 in file
"httpd.conf":
RewriteMap botBase txt:/www/yourdomain/spiderspy.txt
The Rewriting Map will then be available across your
server.
The other directives are entered in file ".htaccess":
RewriteCond ${botBase:%{REMOTE_HOST}} =spider [OR]
RewriteCond ${botBase:%{REMOTE_ADDR}} =spider
RewriteRule ^(.*)\.htm$ $1.htm [L]
RewriteRule ^.*\.htm$ index.html [L]
The conditions will make the system check whether the
request comes from a spider. To this
effect a lookup in file "spiderspy.txt" is triggered.
If the key is found, the value "spider" is returned
and the condition evaluates to true.
Next, the first RewriteRule will be executed. This one
determines that the called for ".htm" page will be fed
to the spider. The variable $1 is equal to the part in
parentheses of "^(.*)\.htm$", i.e. the file name will
remain the same.
If the URL is called by a normal human visitor, rule 2
applies: the user will be redirected to page
"index.html".
As the ".htm" pages will only be read by spiders, they
can be optimized accordingly for the search engines.
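The lookup logic described above can be pictured as a
simple dictionary lookup. The following is a rough
sketch in Python and not a description of how Apache
implements Rewriting Maps; "parse_map", "choose_page"
and the shortened sample map are hypothetical.

```python
# Two sample entries in the key/value text format shown above.
MAP_TEXT = """\
cde2c920.infoseek.com spider
205.226.201.32 spider
"""

def parse_map(text):
    """Turn "key value" lines into a dictionary, skipping blanks and comments."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        table[parts[0]] = parts[1]
    return table

def choose_page(table, remote_host, remote_addr, requested):
    """Mimic the two conditions and the two rules for ".htm" pages."""
    is_spider = (table.get(remote_host) == "spider"
                 or table.get(remote_addr) == "spider")
    if requested.endswith(".htm"):
        return requested if is_spider else "index.html"
    return requested

if __name__ == "__main__":
    table = parse_map(MAP_TEXT)
    print(choose_page(table, "cde2c920.infoseek.com", "205.226.201.32", "offer.htm"))
    print(choose_page(table, "dialup.example.net", "10.0.0.1", "offer.htm"))
```

A listed spider receives the requested ".htm" page,
everybody else ends up on "index.html".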
You may also use a file in dbm format instead of an
ordinary text file. The binary database format helps
accelerate the lookup, which is particularly important
if you are working with very large spider lists.
The example given above offers a simple cloaking
functionality. All ordinary visitors will always be
redirected to the site's "index.html" page and there
is no access logging beyond the mod_rewrite logs.
However, it does go to show how you can effectively
replace several lines of Perl code with just a few
lines of mod_rewrite.
Our last example will illustrate this in some greater
detail.
----
The objective is to present site visitors with your
"Picture of the Day". Visitors will click a link, e.g.:
<http://www.yourdomain.com/pic.html>
which will display a different picture every day.
We will work from these server variables:
TIME_MON
TIME_DAY
In file ".htaccess" we will enter the following
single code line:
RewriteRule ^pic\.html$ pic-%{TIME_MON}-%{TIME_DAY}.html
(Note: If your email reader or browser wraps this line
take care to enter it unwrapped in your file!)
The URL called for will be rewritten, e.g. to:
pic-08-28.html
pic-08-29.html
pic-08-30.html
etc.
So all you have to do is upload the pertinent files
once, after which you won't need to tend to their
daily assignation anymore.
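To find out which file names have to be uploaded, the
naming scheme can be reproduced with a small helper.
This is a sketch in Python; the function name
"picture_for" is made up, and we assume that both time
variables are written with two digits, as in the
examples above.

```python
import datetime

def picture_for(day):
    """Return the file name the rule would produce on the given date."""
    return "pic-{:02d}-{:02d}.html".format(day.month, day.day)

if __name__ == "__main__":
    start = datetime.date(2000, 8, 28)
    for offset in range(3):
        print(picture_for(start + datetime.timedelta(days=offset)))
```

Looping over a whole year in this way gives you the
complete list of pages to prepare.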
Obviously the time variables can also be used for
other periodicities.
------
With this final example our mod_rewrite tutorial has
come to its end.
Of course, we have not tackled each and every
directive, variable, etc. here.
Rather, we suggest you view this tutorial as a general
introduction intended to serve as a starting point
for a more in-depth study of the mod_rewrite
module, enabling you to customize it according to your
specific requirements.
======================================================
This text may freely be republished or distributed
provided the following resource box is included intact
either at the beginning or the end of the article and
a complimentary copy or notice (link) is sent to the
author at the address specified below:
------------------------------------------------------
Dirk Brockhausen is the co-founder and principal of
fantomaster.com Ltd. (UK) and fantomaster.com GmbH
(Belgium), a company specializing in webmasters
software development, industrial-strength cloaking and
search engine positioning services. He holds a
doctorate in physics and has worked as an SAP
consultant and software developer since 1994. He is
also Technical Editor of fantomNews, a free newsletter
focusing on search engine optimization, available at:
<http://fantomaster.com/fantomnews-sub.html>
You can contact him at
mailto:fntecheditor@fantomaster.com
(c) copyright 2000 by fantomaster.com
------------------------------------------------------