Quality of Regexp

In my last blog-entry you’ve seen that I compared regexp vs nonregexp-techniques and the result was, that there are „bad“ and „good“ regexp; while the good regexp are as fast as non-regexp-techniques. Let’s take a closer look at what „good“ regexp are. Please notice that I’m not a regexp-magician so if you know about a regexp which would do better than the regexp I’ll show, please let me know.

Want to jump directly to the result and skip the boring php examples? 🙂

detecting mime-encoded-words

According to RFC (2047) a mime-encoded-word needs to fulfill these rules (I hope I haven’t forgotten or misinterpreted one):

A mime-encoded-word

  • starts with =? and ends with ?=
  • consists of a maximum (=?…?=) of 75 characters
  • has four (4) question marks (=?..?..?..?=)
  • consists of charset, encoding and the encoded word (=?charset?encoding?encodedword?=)

also:

  • the charset can contain everything EXCEPT controls “ ()<>@,;:/[]?.=“
  • the encoding can contain everything EXCEPT controls “ ()<>@,;:/[]?.=“
  • the encoded-word can contain any printable ASCII character EXCEPT “ ?“

So… If we put that into an „exact“ (php-)regexp, step by step, we’d start with:

    $token = '[^ \(\)<>@,;:\/\[\]\?\.=]+';

That would cover whitespaces and the special characters above. ascii-controls goes from 0 to 31 and 127 is also a control character. Printable characters are 32 to (including) 126. Doing this in regexp is (might) be somewhat more difficult, there are several ways to achieve what we want. For example p{Cc} might do what we want. [:cntrl:] also. [\x00-\x1F\x7F] this one of course also. I’m prefering the last one for this:

    \x00-\x1F\x7F

Putting that into our $token regexp would look like that:

    $token = '[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+';

We’ll use $token for charset and encoding. Thus let’s take a look at „encoded-word“:

    $encword = '[\x21-\x3E\x40-\x7E]+';

You might wonder where i got x21 and x3E etc from and what they mean. I’d suggest you to read this link at wikipedia 🙂 Take a look at „Hex“ in the tables. The „-“ between \x21 and \x3E is used as „range“ (from to).

Alright. Now let’s form our basic regexp. I’m using _ as delimiter:

    $regexp = "_=\?$token\?$token\?$encword\?=_";

without variables:

    $regexp = '_=\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[\x21-\x3E\x40-\x7E]+\?=_';

understood why I use variables to build the regexp? 🙂 However. The regexp isn’t finished yet. We need to make sure that it’s at maximum 75 chars long:

    $regexp = "_(=\?$token\?$token\?$encword\?=){1,75}_";

The above regexp is made to be used with preg_match_all. With preg_match in a foreach (prior splitting of the content by spaces) you’d add ^ at the beginning and $ at the end of the regexp.

However, let’s go back to my test. For my test I use the following regexp:

     // notice, this one is wrong!
     $regexA = '=\?[^ \(\)\<\>\@\,\;\:\/\[\]\?\.\=\x00-\x1F\x7F]+\?[^ \(\)\<\>\@\,\;\:\/\[\]\?\.\=\x00-\x1F\x7F]+\?[\x20-\x7E^ \?]+\?=';
     // notice, this one is not very accurate and uses subpattern ()
     $regexB = '=\?(.*)\?(.*)\?(.*)\?=';
     // notice, this one is not very accurate, _not_ using subpattern
     $regexC = '=\?.*\?.*\?.*\?=';
     // notice, needs modifier U and might not be very accurate
     $regexD = '=\?[[:ascii:]]+\?[[:ascii:]]+\?[[:ascii:]]+\?=';
     // this should be the correct and most accurate regexp
     $regexE = '=\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[\x21-\x3E\x40-\x7E]+\?=';

The string used for testing is:

     $test = 'From: =?US-ASCII?Q?Keith_Moore?= 
   To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= 
   CC: =?ISO-8859-1?Q?Andr=E9?= Pirard 
   Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=';

As you can see, we should get 5 results per regexp. My testscript:


	ob_implicit_flush();

	$input = 'From: =?US-ASCII?Q?Keith_Moore?= 
   To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= 
   CC: =?ISO-8859-1?Q?Andr=E9?= Pirard 
   Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=';

     // notice, this one is wrong!
     $regexps[0] = '_=\?[^ \(\)\<\>\@\,\;\:\/\[\]\?\.\=\x00-\x1F\x7F]+\?[^ \(\)\<\>\@\,\;\:\/\[\]\?\.\=\x00-\x1F\x7F]+\?[\x20-\x7E^ \?]+\?=_';
     // notice, this one is not very accurate and uses subpattern ()
     $regexps[1] = '_=\?(.*)\?(.*)\?(.*)\?=_';
     // notice, this one is not very accurate, _not_ using subpattern
     $regexps[2] = '_=\?.*\?.*\?.*\?=_';
     // notice, needs modifier U and might not be very accurate
     $regexps[3] = '_=\?[[:ascii:]]+\?[[:ascii:]]+\?[[:ascii:]]+\?=_U';
     // this should be the correct and most accurate regexp
     $regexps[4] = '_=\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[\x21-\x3E\x40-\x7E]+\?=_';

	$i = 0;
	foreach($regexps as $regexp)
	{

		echo "Test: #$i\n";
		echo "Run:\n";
		$teststart[$i] = microtime(true);
		for($run = 0; $run < 100000; $run++)
		{
			echo $run;
			$runstart[$i] = microtime(true);
			preg_match_all($regexp, $input, $matches);
			$runcount[$i] = $runcount[$i] + sizeof($matches[0]);
			$runstop[$i] = microtime(true);
			$rundiff[$i] = $rundiff[$i] + ($runstop[$i] - $runstart[$i]);
		}
		$teststop[$i] = microtime(true);
		$testdiff[$i] = $teststop[$i] - $teststart[$i];
		$i++;
		
	}

	echo "testno, testduration, testcount, testmicrotime, testseconds\n";
	for($i = 0; $i < count($regexps); $i++)
	{
		$runmilliseconds = $rundiff[$i] / 1000;
		echo $i.' '.$testdiff[$i].' '.$runcount[$i].' '.$rundiff[$i].' '.$runmilliseconds."\n";
	}

The Result?

As to be expected 100 000 runs:

testno test duration (microseconds) test count avg runtime (microseconds) avg runtime (milliseconds)
_=\?[^ \(\)\<\>\@\,\;\:\/\[\]\?\.\=\x00-\x1F\x7F]+\?[^ \(\)\<\>\@\,\;\:\/\[\]\?\.\=\x00-\x1F\x7F]+\?[\x20-\x7E^ \?]+\?=_
0 2.73657202721 µs 500000 1.63955712318 µs 0.00163955712318 ms
_=\?(.*)\?(.*)\?(.*)\?=_
1 6.10814499855 µs 500000 5.00233960152 µs 0.00500233960152 ms
_=\?.*\?.*\?.*\?=_
2 4.76775789261 µs 500000 3.66750049591 µs 0.00366750049591 ms
_=\?[[:ascii:]]+\?[[:ascii:]]+\?[[:ascii:]]+\?=_U
3 2.82996702194 µs 500000 1.73454999924 µs 0.00173454999924 ms
_=\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[\x21-\x3E\x40-\x7E]+\?=_
4 2.57315015793 µs 500000 1.4738881588 µs 0.0014738881588 ms

Conclusion?

You can see, that all regexp matches (count of 500 000 - we have 100 000 runs and 5 matches - do the maths, luke!). You can also see, that the slowest regex uses (.*). The regex using .* is a bit faster, which shows that the slowness is not caused by the subpattern (). The last regex, is the fastest (and also the most accurate one) - interesting, isn't it?

No Comments

Post a Comment