Archive for August, 2008

Convert email addresses in source HTML to images without modifying the source?

Thursday, August 7th, 2008

Yes, you can with PHP and Apache 2.2.7 and higher!

The goal of this project was to serve web pages that show email addresses as images, so that the HTML sent to the client contains nothing a spam bot can harvest, while still allowing the source code on the server to contain standard markup like

<a href="mailto:somebody@somewhere.com">somebody@somewhere.com</a>

or even stand-alone email addresses like

somebody@somewhere.com

without any additional modifications.

Sound like a contradiction?  Maybe, but it can be done quite easily!

Email Spam Harvester Background
If you are reading this post, you probably already know about spam bots and some of the ways they harvest email addresses.  If not, here is a great summary, study and discussion of various methods to obfuscate email addresses.  Unfortunately, the study didn’t cover images of email addresses.  Images are generally very difficult for spam bots to process without OCR (optical character recognition), which slows the bots down considerably.

Images of Email Addresses
Creating images of email addresses to spam-proof web pages is not a new concept.  And it has been proven to be very successful.  However, it does have some drawbacks.

  1. Usability - An image of an email address should be just that: an image, not a fully functional mailto link that wraps the image.  A complete mailto link contains the email address, so the spam bots would have it as well.  This means the email address must be retyped in the email client; because it is an image, it cannot even be copied and pasted.
  2. Accessibility - What if the user is vision impaired?  Showing them an image on the screen isn’t going to help!  I work around this by giving the image tag an ALT attribute that contains a munged version of the email address (i.e. somebody AT somewhere DOT com), so aural browsers can still read out the email address.
  3. Ease of use - The image of the email address must somehow be generated.  Most web authors are not going to want to create images manually, especially when their site may contain hundreds if not thousands of email addresses.  Scripts can create the images dynamically, but how do you relay the email address to the script from the source code without revealing it to spam bots?  There are some methods, one of which is used in the technique described in this post.  Even if you do use server-side scripting to generate the images, changes will most likely need to be made at the source level.  That isn’t easy for people who do distributed web authoring through content management systems or programs like Adobe Contribute and who may know little to nothing about HTML markup, let alone a scripting language like PHP or ASP.

In this example, I use a simple PHP script that receives the parts of the email address as URL variables and outputs the address as an image.

Apache mod_substitute
Apache’s mod_substitute provides a mechanism to perform both regular expression and fixed string substitutions on response bodies.  This is the key piece of functionality for a seamless, on-the-fly transformation: source code that spam bots could take advantage of goes in, and source code that better protects the page from spam bots comes out.

Putting it All Together
So now that we’ve covered the two main pieces of the puzzle (the Apache regular expression substitution and the PHP image generation routine), let’s bring it all together.

  1. Apache - We’ll start with the Apache changes.  You only need to add three configuration lines.  Before adding them, make sure you are loading mod_substitute into Apache:
    LoadModule substitute_module modules/mod_substitute.so

    Otherwise, the next three lines will cause an error when you restart Apache.

    AddOutputFilterByType SUBSTITUTE text/html
    Substitute "s!<a\shref\s?=\s?\"\s?mailto\s?:\s?(.*)@(.*)\.(.*)\"?\b[^>]*>(.*?)</a>!<a href=\"mailto:\"><img src=\"/email_test/email.php?m=$1\&amp;h=$2\&amp;tld=$3\" align=\"middle\" border=\"0\" alt=\"$1 AT $2 DOT $3\" /></a>!i"
    Substitute "s!(\b[A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4}\b)!<a href=\"mailto:\"><img src=\"/email_test/email.php?m=$1\&amp;h=$2\&amp;tld=$3\" align=\"middle\" border=\"0\" alt=\"$1 AT $2 DOT $3\" /></a>!i"

    The first line tells Apache to use this substitution for content-types of text/html.

    The next two lines are where all of the work is done.  They are regular expressions that match a standard mailto link and a stand-alone email address, respectively, and replace each match with a call to the PHP script that converts the email address into an image.  The replacement part of the regular expression swaps the entire <a> tag (if present) for a different <a> tag that contains an <img> tag calling the PHP script.

    The href attribute of the new <a> tag contains only “mailto:”.  This keeps the image clickable and brings up an empty email window.  Notice that completing the mailto link would be counter-productive, as it would give the email address away to the spam bots (this is one of the usability issues I mentioned above)!  However, to get around the accessibility issues created by using an image to convey an email address, the ALT attribute of the img tag contains the munged version of the email address (somebody AT somewhere DOT com) so that aural browsers can still read it out!

    The regular expression breaks the email address into its parts, which are passed as URL variables to the PHP script.  You can use these configuration lines inside a <Directory> or <Location> block in httpd.conf, or inside an .htaccess file, which is great for everyone running their sites on a shared server (again, make sure the hosting provider has mod_substitute loaded in their Apache installation)!

    That’s it for the Apache configuration side of things!

  2. PHP - I’ve developed a small PHP script that reassembles the email address from the URL variables passed to it and outputs it as an image:
    /email_test/email.php:

    <?php
    $size = isset($_GET['size']) ? max(1, (int) $_GET['size']) : 10; //This will be the font size of the image - change as desired
    $font = $_SERVER['DOCUMENT_ROOT'].'/includes/fonts/ttf/arial.ttf';  //Use whatever TTF font you desire
    $email = $_GET['m'].'@'.$_GET['h'].'.'.$_GET['tld']; //Reconstructs the email address from the URL variables
    $bb = imagettfbbox($size, 0, $font, $email);
    $w = $bb[2] - $bb[0];
    $h = $bb[1] - $bb[7];
    $h += 7; //This height adjustment helps put the image of the email address centered vertically with inline text
    header("Content-type: image/png"); //Outputs as PNG, but can be GIF or JPG also
    $im = imagecreate($w, $h);
    $white = imagecolorallocate($im, 255, 255, 255);  //This will be the background color of the image - change as desired
    $blue = imagecolorallocate($im, 0, 0, 255);  //This will be the text color of the image - change as desired
    imagettftext($im, $size, 0, 0, $size, $blue, $font, $email);
    imagepng($im);
    imagedestroy($im);
    ?>

    The great thing about this method is that, as long as the PHP script is configured to match, you can use whatever URL variable names you like to keep the spam bots from catching on.  I used ‘m’ for mailbox, ‘h’ for host, and ‘tld’ for top-level domain.  You can also obfuscate the parts of the email address however you like by modifying the replacement part of the regular expression in Apache.  Just remember to use an equivalent decoding technique in the PHP script!
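
To experiment with the stand-alone address pattern before touching a server, it can be approximated outside Apache.  The following is a minimal JavaScript sketch and not part of the Apache setup: the function name replaceStandaloneEmails is made up for illustration, and JavaScript's regular expression engine only approximates how mod_substitute applies the rule to each line of the response.

```javascript
// Hypothetical test helper (not part of the Apache setup): approximates the
// second Substitute rule, which handles stand-alone email addresses.
function replaceStandaloneEmails(html) {
  // Same pattern as the Apache rule; "g" replaces every match, "i" ignores case.
  var pattern = /(\b[A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4}\b)/gi;
  return html.replace(pattern, function (match, m, h, tld) {
    return '<a href="mailto:"><img src="/email_test/email.php?m=' + m +
      '&amp;h=' + h + '&amp;tld=' + tld +
      '" align="middle" border="0" alt="' + m + ' AT ' + h + ' DOT ' + tld +
      '" /></a>';
  });
}

console.log(replaceStandaloneEmails('<p>Write to nobody@nobody.com</p>'));
```

One thing such a test makes visible: because the pattern ends in [A-Z]{2,4}, an address whose top-level domain is longer than four letters (for example .museum) is left untouched.  The sketch covers only the stand-alone rule; the mailto rule runs first in the Apache configuration and is not reproduced here.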

Results
So what does that leave us with?  Basically a source HTML file that looks like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>
<p>My email address is <a href="mailto:somebody@somewhere.com">somebody@somewhere.com</a> at my organization.</p>
<p>This is a random email address without any links in the original source code - nobody@nobody.com</p>
</body>
</html>

And an output stream from the web server to the client that looks like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>
<p>My email address is <a href="mailto:"><img src="/email_test/email.php?m=somebody&amp;h=somewhere&amp;tld=com" align="middle" border="0" alt="somebody AT somewhere DOT com" /></a> at my organization.</p>
<p>This is a random email address without any links in the original source code - <a href="mailto:"><img src="/email_test/email.php?m=nobody&amp;h=nobody&amp;tld=com" align="middle" border="0" alt="nobody AT nobody DOT com" /></a></p>
</body>
</html>

Also notice that all of the output validates correctly!
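
To get a rough feel for the difference between the two versions, one can run a naive address scan over both.  This is only a sketch under assumptions: harvestAddresses is a hypothetical stand-in for a simple harvester, its pattern is my own guess at what such a bot might use, and real bots vary widely.

```javascript
// Hypothetical stand-in for a naive harvester: collects anything shaped
// like an email address.
function harvestAddresses(html) {
  var matches = html.match(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi);
  return matches || [];
}

// Shortened versions of the source line and the served line shown above.
var source = '<p>My email address is <a href="mailto:somebody@somewhere.com">somebody@somewhere.com</a></p>';
var served = '<p>My email address is <a href="mailto:"><img src="/email_test/email.php?m=somebody&amp;h=somewhere&amp;tld=com" alt="somebody AT somewhere DOT com" /></a></p>';

console.log(harvestAddresses(source).length); // 2
console.log(harvestAddresses(served).length); // 0
```

A bot written specifically for this scheme could still rebuild the address from the URL variables or the ALT text, which is one reason to vary the variable names or obfuscate the parts.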

Working Example
And that’s all there is to it!  You can see this method in action here!

The Implications
For starters, you can use any email image generation script you want; you are not limited to the basic one that I’ve provided.  But more importantly, the replacement argument of the regular expression doesn’t have to use an <img> tag with a link to a PHP image generator.  It can be code for any email address obfuscation method you’d like to use!  Feel free to replace it with JavaScript code, CSS encoding methods, or whatever you can think up.

And all of this can be done with absolutely no modification of source code!  How much time will that save you?
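
As one illustration of swapping in a different obfuscation method, the replacement could emit the address in pieces and let a script reassemble it in the browser.  This is a minimal sketch, assuming the replacement writes hypothetical data-m, data-h and data-tld attributes onto the link; none of these appear in the configuration above, and the function names are made up.

```javascript
// Hypothetical: assumes the Substitute replacement emitted something like
// <a href="mailto:" data-m="somebody" data-h="somewhere" data-tld="com">...</a>
function assembleAddress(parts) {
  return parts.m + '@' + parts.h + '.' + parts.tld;
}

// Fills in the real mailto target; "link" is an <a> element or any object
// that has "dataset" and "href" properties.
function activateLink(link) {
  link.href = 'mailto:' + assembleAddress(link.dataset);
  return link;
}

// In a browser, one might run this after the page loads:
// document.querySelectorAll('a[data-m]').forEach(activateLink);
```

Keep in mind that a bot which executes JavaScript would still see the assembled address, so this trades some protection for a working mailto link.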

I look forward to your comments!

Properizing names on HTML forms

Wednesday, August 6th, 2008

If you’ve ever put a form online that asks the user for their name, you will want to read this post!

How many times have you gotten form data that contains a person’s name, and it isn’t capitalized correctly?  Probably much more often than you would like!  Either the name is in all CAPS, all lowercase, or some mixture that just doesn’t agree with the purpose you are using it for.  I got really tired of having to manually format names from online applications and other various forms.  It takes a good chunk of time to scour that data and manually fix up the capitalization of the names, time that could be better spent on other endeavours.

So I set out determined to find some automated way of fixing the data before it ever gets to me.  Mind you, creating an automated procedure to properly format names is a daunting task.  There are so many name variations and rules involved.  But it has been done before.

It all starts in the genes
John Cardinal makes companion programs for The Master Genealogist (TMG) from Wholly Genes.  He created a function in one of his programs that properly capitalizes names from the TMG database.  He wrote notes on how to do it here.

Then along came Tim Morgan, who took John’s notes and developed a routine in Python to perform the capitalization on names passed to the routine.  Once I found his code snippet, I knew that it was something special.  However, I’m not a Python developer.  In fact, I know absolutely nothing about Python except that it is a scripting language similar to Perl or PHP.

I do, however, know JavaScript.  And I know that the majority of browsers support JavaScript and the majority of web users have JavaScript turned on.  So I decided to undertake the task of porting Tim’s Python routine to JavaScript, where I think it would be more useful.  This post is the fruit of my efforts!

The Good Stuff
So in your form, you will have your name fields, which should pass the value of the field to the JavaScript function.  Here is an example form field:

<input type="text" name="first_name" id="first_name" onchange="this.value = properizeName(this.value);" />

And the text below is the actual JavaScript function:

function properizeName(name) {
	var upperCase = /^[A-Z]/;  //Regexp for words that start with an UPPERCASE letter
	var suffixes = new Array("II", "(II)", "III", "(III)", "IV", "(IV)", "VI", "(VI)", "VII", "(VII)", "2nd", "(2nd)", "3rd", "(3rd)", "4th", "(4th)", "5th", "(5th)");
	var surnames = new Array("ApShaw", "d'Albini", "d'Aubigney", "d'Aubigne", "d'Autry", "d'Entremont", "d'Hurst", "D'ovidio", "da Graca", "DaSilva", "DeAnda", "deAnnethe", "deAubigne", "deAubigny", "DeBardelaben", "DeBardeleben", "DeBaugh", "deBeauford", "DeBerry", "deBethune", "DeBetuile", "DeBoard", "DeBoer", "DeBohun", "DeBord", "DeBose", "DeBrouwer", "DeBroux", "DeBruhl", "deBruijn", "deBrus", "deBruse", "deBrusse", "DeBruyne", "DeBusk", "DeCamp", "deCastilla", "DeCello", "deClare", "DeClark", "DeClerck", "DeCoste", "deCote", "DeCoudres", "DeCoursey", "DeCredico", "deCuire", "DeCuyre", "DeDominicios", "DeDuyster", "DeDuytscher", "DeDuytser", "deFiennes", "DeFord", "DeForest", "DeFrance", "DeFriece", "DeGarmo", "deGraaff", "DeGraff", "DeGraffenreid", "DeGraw", "DeGrenier", "DeGroats", "DeGroft", "DeGrote", "DeHaan", "DeHaas", "DeHaddeclive", "deHannethe", "DeHatclyf", "DeHaven", "DeHeer", "DeJager", "DeJarnette", "DeJean", "DeJong", "deJonge", "deKemmeter", "deKirketon", "DeKroon", "deKype", "del-Rosario", "dela Chamotte", "DeLa Cuadra", "DeLa Force", "dela Fountaine", "dela Grena", "dela Place", "DeLa Ward", "DeLaci", "DeLacy", "DeLaet", "DeLalonde", "DelAmarre", "DeLancey", "DeLascy", "DelAshmutt", "DeLassy", "DeLattre", "DeLaughter", "DeLay", "deLessine", "DelGado", "DelGaudio", "DeLiberti", "DeLoache", "DeLoatch", "DeLoch", "DeLockwood", "DeLong", "DeLozier", "DeLuca", "DeLucenay", "deLucy", "DeMars", "DeMartino", "deMaule", "DeMello", "DeMinck", "DeMink", "DeMoree", "DeMoss", "DeMott", "DeMuynck", "deNiet", "DeNise", "DeNure", "DePalma", "DePasquale", "dePender", "dePercy", "DePoe", "DePriest", "DePu", "DePui", "DePuis", "DeReeper", "deRochette", "deRose", "DeRossett", "DeRover", "deRuggele", "deRuggle", "DeRuyter", "deSaint-Sauveur", "DeSantis", "desCuirs", "DeSentis", "DeShane", "DeSilva", "DesJardins", "DesMarest", "deSoleure", "DeSoto", "DeSpain", "DeStefano", "deSwaert", "deSwart", "DeVall", "DeVane", "DeVasher", "DeVasier", "DeVaughan", "DeVaughn", 
"DeVault", "DeVeau", "DeVeault", "deVilleneuve", "DeVilliers", "DeVinney", "DeVito", "deVogel", "DeVolder", "DeVolld", "DeVore", "deVos", "DeVries", "deVries", "DeWall", "DeWaller", "DeWalt", "deWashington", "deWerly", "deWessyngton", "DeWet", "deWinter", "DeWitt", "DeWolf", "DeWolfe", "DeWolff", "DeWoody", "DeYager", "DeYarmett", "DeYoung", "DiCicco", "DiCredico", "DiFillippi", "DiGiacomo", "DiMarco", "DiMeo", "DiMonte", "DiNonno", "DiPietro", "diPilato", "DiPrima", "DiSalvo", "du Bosc", "du Hurst", "DuFort", "DuMars", "DuPre", "DuPue", "DuPuy", "FitzUryan", "kummel", "LaBarge", "LaBarr", "LaBauve", "LaBean", "LaBelle", "LaBerteaux", "LaBine", "LaBonte", "LaBorde", "LaBounty", "LaBranche", "LaBrash", "LaCaille", "LaCasse", "LaChapelle", "LaClair", "LaComb", "LaCoste", "LaCount", "LaCour", "LaCroix", "LaFarlett", "LaFarlette", "LaFerry", "LaFlamme", "LaFollette", "LaForge", "LaFortune", "LaFoy", "LaFramboise", "LaFrance", "LaFuze", "LaGioia", "LaGrone", "LaLiberte", "LaLonde", "LaLone", "LaMaster", "LaMay", "LaMere", "LaMont", "LaMotte", "LaPeer", "LaPierre", "LaPlante", "LaPoint", "LaPointe", "LaPorte", "LaPrade", "LaRocca", "LaRochelle", "LaRose", "LaRue", "LaVallee", "LaVaque", "LaVeau", "LeBleu", "LeBoeuf", "LeBoiteaux", "LeBoyteulx", "LeCheminant", "LeClair", "LeClerc", "LeCompte", "LeCroy", "LeDuc", "LeFevbre", "LeFever", "LeFevre", "LeFlore", "LeGette", "LeGrand", "LeGrave", "LeGro", "LeGros", "LeJeune", "LeMaistre", "LeMaitre", "LeMaster", "LeMesurier", "LeMieux", "LeMoe", "LeMoigne", "LeMoine", "LeNeve", "LePage", "LeQuire", "LeQuyer", "LeRou", "LeRoy", "LeSuer", "LeSueur", "LeTardif", "LeVally", "LeVert", "LoMonaco", "Macabe", "Macaluso", "MacaTasney", "Macaulay", "Macchitelli", "Maccoone", "Maccurry", "Macdermattroe", "Macdiarmada", "Macelvaine", "Macey", "Macgraugh", "Machan", "Machann", "Machum", "Maciejewski", "Maciel", "Mackaben", "Mackall", "Mackartee", "Mackay", "Macken", "Mackert", "Mackey", "Mackie", "Mackin", "Mackins", "Macklin", "Macko", 
"Macksey", "Mackwilliams", "Maclean", "Maclinden", "Macomb", "Macomber", "Macon", "Macoombs", "Macraw", "Macumber", "Macurdy", "Macwilliams", "MaGuinness", "MakCubyn", "MakCumby", "Mcelvany", "Mcsherry", "Op den Dyck", "Op den Graeff", "regory", "Schweißguth", "StElmo", "StGelais", "StJacques", "te Boveldt", "VanAernam", "VanAken", "VanAlstine", "VanAmersfoort", "VanAntwerp", "VanArlem", "VanArnam", "VanArnem", "VanArnhem", "VanArnon", "VanArsdale", "VanArsdalen", "VanArsdol", "vanAssema", "vanAsten", "VanAuken", "VanAwman", "VanBaucom", "VanBebber", "VanBeber", "VanBenschoten", "VanBibber", "VanBilliard", "vanBlare", "vanBlaricom", "VanBuren", "VanBuskirk", "VanCamp", "VanCampen", "VanCleave", "VanCleef", "VanCleve", "VanCouwenhoven", "VanCovenhoven", "VanCowenhoven", "VanCuren", "VanDalsem", "VanDam", "VanDe Poel", "vanden Dijkgraaf", "vanden Kommer", "VanDer Aar", "vander Gouwe", "VanDer Honing", "VanDer Hooning", "vander Horst", "vander Kroft", "vander Krogt", "VanDer Meer", "vander Meulen", "vander Putte", "vander Schooren", "VanDer Veen", "VanDer Ven", "VanDer Wal", "VanDer Weide", "VanDer Willigen", "vander Wulp", "vander Zanden", "vander Zwan", "VanDer Zweep", "VanDeren", "VanDerlaan", "VanDerveer", "VanderWoude", "VanDeursen", "VanDeusen", "vanDijk", "VanDoren", "VanDorn", "VanDort", "VanDruff", "VanDryer", "VanDusen", "VanDuzee", "VanDuzen", "VanDuzer", "VanDyck", "VanDyke", "VanEman", "VanEmmen", "vanEmmerik", "VanEngen", "vanErp", "vanEssen", "VanFleet", "VanGalder", "VanGelder", "vanGerrevink", "VanGog", "vanGogh", "VanGorder", "VanGordon", "VanGroningen", "VanGuilder", "VanGundy", "VanHaaften", "VanHaute", "VanHees", "vanHeugten", "VanHise", "VanHoeck", "VanHoek", "VanHook", "vanHoorn", "VanHoornbeeck", "VanHoose", "VanHooser", "VanHorn", "VanHorne", "VanHouten", "VanHoye", "VanHuijstee", "VanHuss", "VanImmon", "VanKersschaever", "VanKeuren", "VanKleeck", "VanKoughnet", "VanKouwenhoven", "VanKuykendaal", "vanLeeuwen", "vanLent", "vanLet", "VanLeuven", 
"vanLingen", "VanLoozen", "VanLopik", "VanLuven", "vanMaasdijk", "VanMele", "VanMeter", "vanMoorsel", "VanMoorst", "VanMossevelde", "VanNaarden", "VanNamen", "VanNemon", "VanNess", "VanNest", "VanNimmen", "vanNobelen", "VanNorman", "VanNormon", "VanNostrunt", "VanNote", "VanOker", "vanOosten", "VanOrden", "VanOrder", "VanOrma", "VanOrman", "VanOrnum", "VanOstrander", "VanOvermeire", "VanPelt", "VanPool", "VanPoole", "VanPoorvliet", "VanPutten", "vanRee", "VanRhijn", "vanRijswijk", "VanRotmer", "VanSchaick", "vanSchelt", "VanSchoik", "VanSchoonhoven", "VanSciver", "VanScoy", "VanScoyoc", "vanSeters", "VanSickle", "VanSky", "VanSnellenberg", "vanStaveren", "VanStraten", "VanSuijdam", "VanTassel", "VanTassell", "VanTessel", "VanTexel", "VanTuyl", "VanValckenburgh", "vanValen", "VanValkenburg", "VanVelsor", "VanVelzor", "VanVlack", "VanVleck", "VanVleckeren", "VanWaard", "VanWart", "VanWassenhove", "VanWinkle", "VanWoggelum", "vanWordragen", "VanWormer", "VanZuidam", "VanZuijdam", "VonAdenbach", "vonAllmen", "vonBardeleben", "vonBerckefeldt", "VonBergen", "vonBreyman", "VonCannon", "vonFreymann", "vonHeimburg", "VonHuben", "vonKramer", "vonKruchenburg", "vonPostel", "VonRohr", "VonRohrbach", "VonSass", "VonSasse", "vonSchlotte", "VonSchneider", "VonSeldern", "VonSpringer", "VonVeyelmann", "VonZweidorff");
	surnames = surnames.concat(suffixes); //Append suffixes array to the end of surnames
	var mc = /^Mc(\w)(?=\w)/i; //Regexp for "Mc"
	var mac = /^Mac(\w)(?=\w)/i; //Regexp for "Mac"
	var hyphen_index = new Array();
	var hyphen = false;
	while (name.indexOf('-') > -1) { //Loops to record positions of hyphens (to put them back later) and convert the hyphen to a space (to break up name into individual words)
		var index = name.indexOf('-'); //Declared with var so it does not leak into the global scope
		if (index == 0) { //If name begins with hyphen, just remove the first character from the name and loop again
			name = name.substr(1);
			continue;
		}
		hyphen_index.push(index); //Record hyphen position
		name = name.substring(0, index) + ' ' + name.substr(index+1); //Change hyphen to a space
		hyphen = true;
	}
	var period_index = new Array();
	var period = false;
	while (name.indexOf('.') > -1) { //Loops to record positions of periods (to put them back later) and convert the period to a space (to break up name into individual words)
		var index = name.indexOf('.'); //Declared with var so it does not leak into the global scope
		if (index == 0) { //If name begins with period, just remove the first character from the name and loop again
			name = name.substr(1);
			continue;
		}
		period_index.push(index); //Record period position
		name = name.substring(0, index) + ' ' + name.substr(index+1); //Change period to a space
		period = true;
	}
	var names = name.split(' '); //Put individual words in name into an array
	for (var i = 0; i < names.length; i++) //Loop through words in name; any word that starts with an uppercase letter is made all lowercase
		if (upperCase.test(names[i]))
			names[i] = names[i].toLowerCase();
	for (i = 0; i < names.length; i++) //Loop through words in name and capitalize the first letter
		names[i] = names[i].charAt(0).toUpperCase() + names[i].substr(1); //Change word to capitalized version
	for (i = 0; i < names.length; i++) { //Loop through words in name and check for "mcx" and "macx"
		if (mc.test(names[i])) //Look for "Mc" start of name word
			names[i] = "Mc" + names[i].charAt(2).toUpperCase() + names[i].substr(3); //Change word to capitalized version
//		if (mac.test(names[i])) //Look for "Mac" start of name word
//			names[i] = "Mac" + names[i].charAt(3).toUpperCase() + names[i].substr(4); //Change word to capitalized version
	}
	name = names.join(' '); //Join words of name back together
	if (hyphen) //Add hyphens back if they originally existed
		for (i = 0; i < hyphen_index.length; i++) //Cycle through hyphen index
			name = name.substr(0, hyphen_index[i]) + '-' + name.substr(hyphen_index[i]+1);  //Replace positions in name from hyphen index with hyphens
	if (period) //Add periods back if they originally existed
		for (i = 0; i < period_index.length; i++) //Cycle through period index
			name = name.substr(0, period_index[i]) + '.' + name.substr(period_index[i]+1);  //Replace positions in name from period index with period
	name = name.replace(/ De /gi, ' de '); //Replace ' De ' with ' de '
	name = name.replace(/ Dit /gi, ' dit '); //Replace ' Dit ' with ' dit '
	name = name.replace(/ Van /gi, ' van '); //Replace ' Van ' with ' van '
	var lcName = name.toLowerCase(); //Copy of name in lower-case
	for (i = 0; i < surnames.length; i++) {
		var pos = lcName.indexOf(surnames[i].toLowerCase()); //First position of this surname or suffix, ignoring case
		if (pos > -1) {
			if (((pos == 0) || (pos > 0 && name.charAt(pos-1) == ' ')) && ((name.length == pos+surnames[i].length) || (name.charAt(pos+surnames[i].length) == ' ')))
				name = name.substring(0, pos) + surnames[i] + name.substr(pos+surnames[i].length);
		}
	}
	return name;
}
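
To see the core per-word steps in isolation, here is a stripped-down sketch.  capitalizeWord is a hypothetical helper, not part of the function above; it skips the hyphen, period, and surname-list handling and only mirrors the lowercase, capitalize, and "Mc" steps.

```javascript
// Hypothetical helper: mirrors only the per-word steps of properizeName.
function capitalizeWord(word) {
  // Words that start with an uppercase letter are lowercased first
  if (/^[A-Z]/.test(word)) {
    word = word.toLowerCase();
  }
  // Capitalize the first letter
  word = word.charAt(0).toUpperCase() + word.slice(1);
  // "Mc" followed by at least two more characters: capitalize the third letter
  if (/^Mc(\w)(?=\w)/i.test(word)) {
    word = 'Mc' + word.charAt(2).toUpperCase() + word.slice(3);
  }
  return word;
}

console.log(capitalizeWord('SMITH'));    // Smith
console.log(capitalizeWord('mcdonald')); // McDonald
```

Everything that cannot be derived from simple rules like these, such as "DeVries" or "VanDyke", is what the long surnames list in the full function is for.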

I’d love to hear your comments!