Convert email addresses in source HTML to images without modifying the source?

Thursday, August 7th, 2008

Yes, you can with PHP and Apache 2.2.7 and higher!

The goal of this project was to output web pages with images of email addresses, so that the markup sent to the browser doesn’t contain anything a spam bot could readily harvest, while still allowing the source code to contain standard markup like

<a href="mailto:somebody@somewhere.com">somebody@somewhere.com</a>

or even stand-alone email addresses like

somebody@somewhere.com

without any additional modifications.

Sound like a contradiction?  Maybe, but it can be done quite easily!

Email Spam Harvester Background
If you are here reading this post, you probably already know about spam bots and some of the ways they harvest email addresses.  If not, here is a great summary, study and discussion of various methods to obfuscate email addresses.  Unfortunately, the study didn’t cover images of email addresses.  Images are nevertheless very difficult for spam bots to process without OCR (optical character recognition), which slows the bots down considerably.

Images of Email Addresses
Creating images of email addresses to spam-proof web pages is not a new concept, and it has generally worked well.  However, it does have some drawbacks.

  1. Usability - An image of an email address should be just that, an image, not a fully functional mailto link wrapped around the image.  Such a link would contain the email address and, just like that, the spam bots would have it too.  This means that the email address must be retyped in the email client: because it is an image, it cannot even be copied and pasted.
  2. Accessibility - What if the user is vision impaired?  Showing them an image on the screen isn’t going to help!  I work around this by giving the image tag an ALT attribute that contains a munged version of the email address (e.g. somebody AT somewhere DOT com), so that aural browsers can still read the email address out.
  3. Ease of use – The image of the email address must somehow be generated.  Most web authors are not going to want to create images manually, especially when their site may contain hundreds if not thousands of email addresses.

    Scripts can also be used to create images of email addresses dynamically, but how do you relay the email address to the script from the source code without revealing it to spam bots?  There are some methods, one of which is used in the technique described in this post.

    Even if you do use server-side scripting to generate images of email addresses, changes will most likely need to be made at the source level.  That isn’t easy for people who use content management systems and programs like Adobe Contribute for distributed web authoring and who may know little to nothing about HTML markup, let alone a scripting language like PHP or ASP.

In this example, I use a simple PHP script that takes the parts of the email address as URL variables and outputs them as an image.

Apache mod_substitute
Apache’s mod_substitute provides a mechanism to perform both regular expression and fixed string substitutions on response bodies.  This is the key piece of functionality for a seamless, on-the-fly transformation from source code that spam bots could take advantage of into output that is better protected from them.
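
As a rough mental model, a Substitute rule is a sed-style string: an "s", a delimiter, the pattern, the replacement, and optional flags, applied to the response body one line at a time.  The sketch below imitates that in Python purely for illustration.  The helper name apply_rule is hypothetical, it understands only the i (case-insensitive) and n (fixed string) flags, and it assumes the delimiter never appears inside the pattern or the replacement.  It is not how Apache implements the filter.

```python
import re


def apply_rule(rule, body):
    """Apply one sed-style rule such as 's!cat!dog!i' to every line of body."""
    delim = rule[1]  # the character right after the leading "s"
    pattern, replacement, flags = rule[2:].split(delim)
    fixed = "n" in flags  # n: treat the pattern as a fixed string
    if fixed:
        pattern = re.escape(pattern)
    regex = re.compile(pattern, re.IGNORECASE if "i" in flags else 0)

    def expand(match):
        if fixed:
            return replacement  # fixed-string rules have no groups to expand
        # Swap each $N in the replacement for the text captured by group N.
        return re.sub(
            r"\$(\d)",
            lambda ref: match.group(int(ref.group(1))) or "",
            replacement,
        )

    return "\n".join(regex.sub(expand, line) for line in body.split("\n"))


print(apply_rule("s!(\\w+)@(\\w+)!$1 AT $2!i", "Write to bob@example today"))
# prints: Write to bob AT example today
```

The real module supports more flags and has its own rules for escaping, so treat this only as a way to reason about what a rule will do to a line of markup.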

Putting it All Together
So now that we’ve covered the two main pieces of the puzzle (the Apache regular expression substitution and the PHP image generation routine), let’s bring it all together.

  1. Apache – We’ll start with the Apache changes.  You only need to add three configuration lines, but first make sure that mod_substitute is being loaded into Apache:
    LoadModule substitute_module modules/mod_substitute.so

    Otherwise, the next three lines will cause an error when you restart Apache.

    AddOutputFilterByType SUBSTITUTE text/html
    Substitute "s!<a\shref\s?=\s?\"\s?mailto\s?:\s?(.*)@(.*)\.(.*)\"?\b[^>]*>(.*?)</a>!<a href=\"mailto:\"><img src=\"/email_test/email.php?m=$1\&amp;h=$2\&amp;tld=$3\" align=\"middle\" border=\"0\" alt=\"$1 AT $2 DOT $3\" /></a>!i"
    Substitute "s!(\b[A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4}\b)!<a href=\"mailto:\"><img src=\"/email_test/email.php?m=$1\&amp;h=$2\&amp;tld=$3\" align=\"middle\" border=\"0\" alt=\"$1 AT $2 DOT $3\" /></a>!i"

    The first line tells Apache to use this substitution for content-types of text/html.

    The next two lines are where all of the work is done.  They are regular expressions that match a standard mailto link and a stand-alone email address, respectively, and replace each with a call to the PHP script that converts the email address into an image.  The replacement part of the expression swaps the entire <a> tag (if present) for a different <a> tag that contains an <img> tag calling the PHP script.  The regular expression breaks the email address into its various parts, which are passed to the PHP script as URL variables.

    The href attribute of the new <a> tag contains only “mailto:”.  This keeps the image clickable and brings up an empty email window.  Notice that completing the mailto link would be counter-productive, as it would give the email address away to the spam bots (this is one of the usability issues I mentioned above)!  To get around the accessibility issues created by using an image to convey an email address, the ALT attribute of the img tag contains the munged version of the email address (somebody AT somewhere DOT com) so that aural browsers can still read it out.

    You can use the AddOutputFilterByType and Substitute lines above inside a <Directory> or <Location> section in httpd.conf, or inside an .htaccess file, which is great for everyone running their sites on a shared server (again, make sure the hosting provider has mod_substitute loaded in their Apache installation)!

    That’s it for the Apache configuration side of things!

  2. PHP - I’ve developed a small PHP script which rebuilds the email address from the URL variables passed to it and outputs it as an image:
    /email_test/email.php:

    <?php
    //Font size of the image - defaults to 10 when no size is passed; change as desired
    $size = !empty($_GET['size']) ? max(1, (int) $_GET['size']) : 10;
    $font = $_SERVER['DOCUMENT_ROOT'].'/includes/fonts/ttf/arial.ttf';  //Use whatever TTF font you desire
    $email = $_GET['m'].'@'.$_GET['h'].'.'.$_GET['tld']; //Reconstructs the email address from the URL variables
    $bb = imagettfbbox($size, 0, $font, $email); //Bounding box of the rendered text
    $w = $bb[2] - $bb[0];
    $h = $bb[1] - $bb[7];
    $h += 7; //This height adjustment helps center the image of the email address vertically with inline text
    header("Content-type: image/png"); //Outputs as PNG, but can be GIF or JPG also
    $im = imagecreate($w, $h);
    $white = imagecolorallocate($im, 255, 255, 255);  //The first color allocated becomes the background color - change as desired
    $blue = imagecolorallocate($im, 0, 0, 255);  //This will be the text color of the image - change as desired
    imagettftext($im, $size, 0, 0, $size, $blue, $font, $email);
    imagepng($im);
    imagedestroy($im);
    ?>

    The great thing about this method is that, as long as the PHP script is configured to match, you can use whatever URL variable names you like to keep the spam bots from catching on.  I used ‘m’ for mailbox, ‘h’ for host, and ‘tld’ for top-level domain.  You can also obfuscate the parts of the email address however you like by modifying the replacement part of the regular expression in Apache.  Just remember to use an equivalent decoding technique in the PHP script!

Results
So what does that leave us with?  Basically a source HTML file that looks like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>
<p>My email address is <a href="mailto:somebody@somewhere.com">somebody@somewhere.com</a> at my organization.</p>
<p>This is a random email address without any links in the original source code - nobody@nobody.com</p>
</body>
</html>

And an output stream from the web server to the client that looks like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>
<p>My email address is <a href="mailto:"><img src="/email_test/email.php?m=somebody&amp;h=somewhere&amp;tld=com" align="middle" border="0" alt="somebody AT somewhere DOT com" /></a> at my organization.</p>
<p>This is a random email address without any links in the original source code - <a href="mailto:"><img src="/email_test/email.php?m=nobody&amp;h=nobody&amp;tld=com" align="middle" border="0" alt="nobody AT nobody DOT com" /></a></p>
</body>
</html>

Also notice that all of the output validates correctly!
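
One way to sanity-check the two patterns before touching a live server is to exercise them offline.  The sketch below does that in Python rather than in Apache, so it rests on a few assumptions: that Python’s re module treats these particular patterns the same way Apache’s regex engine does, that "$1" maps to "\1", and that the escaped "\&" in the replacement is simply a literal "&".  The names rewrite_line and rewrite_body are hypothetical helpers, not part of the setup described above.

```python
import re

# Python translation of the replacement shared by both Substitute rules.
REPLACEMENT = (
    r'<a href="mailto:"><img src="/email_test/email.php?m=\1&amp;h=\2&amp;tld=\3" '
    r'align="middle" border="0" alt="\1 AT \2 DOT \3" /></a>'
)

# First rule: a complete mailto link.
MAILTO_LINK = re.compile(
    r'<a\shref\s?=\s?"\s?mailto\s?:\s?(.*)@(.*)\.(.*)"?\b[^>]*>(.*?)</a>',
    re.IGNORECASE,
)

# Second rule: a stand-alone email address.
BARE_ADDRESS = re.compile(
    r'(\b[A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4}\b)',
    re.IGNORECASE,
)


def rewrite_line(line):
    """Apply both rules to one line, in the order they appear in the config."""
    line = MAILTO_LINK.sub(REPLACEMENT, line)
    return BARE_ADDRESS.sub(REPLACEMENT, line)


def rewrite_body(html):
    """Rewrite a response body one line at a time."""
    return "\n".join(rewrite_line(line) for line in html.split("\n"))


if __name__ == "__main__":
    sample = (
        '<p>My email address is <a href="mailto:somebody@somewhere.com">'
        'somebody@somewhere.com</a> at my organization.</p>'
    )
    print(rewrite_line(sample))
```

Fed the two sample paragraphs from the source file above, rewrite_line returns the same markup shown in the output stream.  The simulation also makes one limitation visible: the (.*) groups in the mailto pattern are greedy, so two mailto links on the same line are captured as a single match, at least in this Python version.  Keeping each link on its own line, or tightening those groups, would avoid that.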

Working Example
And that’s all there is to it!  You can see this method in action here!

The Implications
For starters, you can use any email image generation script you want; you are not limited to the basic one that I’ve provided.  But more importantly, the replacement argument of the regular expression doesn’t have to use an <img> tag with a link to a PHP image generator.  It can be code for any email address obfuscation method you’d like to use!  Feel free to replace it with JavaScript code, CSS encoding methods, or whatever you can think up.
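
As one illustration of swapping the replacement, the sketch below reuses the stand-alone address pattern but emits a CSS-based obfuscation instead of an image.  It is written in Python only to show what the output would look like.  The replacement string, the "-nospam" decoy text and the helper name obfuscate_with_css are all made up for this example, and whether a given harvester is fooled by it is a separate question.

```python
import re

# Same stand-alone address pattern as the second Substitute rule.
BARE_ADDRESS = re.compile(
    r'(\b[A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4}\b)',
    re.IGNORECASE,
)

# Hypothetical replacement: a decoy span hidden by CSS, plus character
# references in place of "@" and ".".  In an Apache Substitute rule each "&"
# would need the same backslash escaping the original rules use.
CSS_REPLACEMENT = r'\1<span style="display:none">-nospam</span>&#64;\2&#46;\3'


def obfuscate_with_css(line):
    """Replace every stand-alone address on the line with the CSS variant."""
    return BARE_ADDRESS.sub(CSS_REPLACEMENT, line)


if __name__ == "__main__":
    print(obfuscate_with_css("Contact nobody@nobody.com today"))
```

A browser still displays the address as normal text, while the raw markup no longer contains it in one harvestable piece.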

And all of this can be done with absolutely no modification of source code!  How much time will that save you?

I look forward to your comments!