In my last blog-entry you’ve seen that I compared regexp vs nonregexp-techniques and the result was, that there are „bad“ and „good“ regexp; while the good regexp are as fast as non-regexp-techniques. Let’s take a closer look at what „good“ regexp are. Please notice that I’m not a regexp-magician so if you know about a regexp which would do better than the regexp I’ll show, please let me know.
Want to jump directly to the result and skip the boring php examples? 🙂
detecting mime-encoded-words
According to RFC (2047) a mime-encoded-word needs to fulfill these rules (I hope I haven’t forgotten or misinterpreted one):
A mime-encoded-word
- starts with =? and ends with ?=
- consists of a maximum (=?…?=) of 75 characters
- has four (4) question marks (=?..?..?..?=)
- consists of charset, encoding and the encoded word (=?charset?encoding?encodedword?=)
also:
- the charset can contain everything EXCEPT controls “ ()<>@,;:/[]?.=“
- the encoding can contain everything EXCEPT controls “ ()<>@,;:/[]?.=“
- the encoded-word can contain any printable ASCII character EXCEPT “ ?“
So… If we put that into an „exact“ (php-)regexp, step by step, we’d start with:
$token = '[^ \(\)<>@,;:\/\[\]\?\.=]+';
That would cover whitespaces and the special characters above. ascii-controls goes from 0 to 31 and 127 is also a control character. Printable characters are 32 to (including) 126. Doing this in regexp is (might) be somewhat more difficult, there are several ways to achieve what we want. For example p{Cc} might do what we want. [:cntrl:] also. [\x00-\x1F\x7F] this one of course also. I’m prefering the last one for this:
\x00-\x1F\x7F
Putting that into our $token regexp would look like that:
$token = '[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+';
We’ll use $token for charset and encoding. Thus let’s take a look at „encoded-word“:
$encword = '[\x21-\x3E\x40-\x7E]+';
You might wonder where i got x21 and x3E etc from and what they mean. I’d suggest you to read this link at wikipedia 🙂 Take a look at „Hex“ in the tables. The „-“ between \x21 and \x3E is used as „range“ (from to).
Alright. Now let’s form our basic regexp. I’m using _ as delimiter:
$regexp = "_=\?$token\?$token\?$encword\?=_";
without variables:
$regexp = '_=\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[\x21-\x3E\x40-\x7E]+\?=_';
understood why I use variables to build the regexp? 🙂 However. The regexp isn’t finished yet. We need to make sure that it’s at maximum 75 chars long:
$regexp = "_(=\?$token\?$token\?$encword\?=){1,75}_";
The above regexp is made to be used with preg_match_all. With preg_match in a foreach (prior splitting of the content by spaces) you’d add ^ at the beginning and $ at the end of the regexp.
However, let’s go back to my test. For my test I use the following regexp:
// notice, this one is wrong! $regexA = '=\?[^ \(\)\<\>\@\,\;\:\/\[\]\?\.\=\x00-\x1F\x7F]+\?[^ \(\)\<\>\@\,\;\:\/\[\]\?\.\=\x00-\x1F\x7F]+\?[\x20-\x7E^ \?]+\?='; // notice, this one is not very accurate and uses subpattern () $regexB = '=\?(.*)\?(.*)\?(.*)\?='; // notice, this one is not very accurate, _not_ using subpattern $regexC = '=\?.*\?.*\?.*\?='; // notice, needs modifier U and might not be very accurate $regexD = '=\?[[:ascii:]]+\?[[:ascii:]]+\?[[:ascii:]]+\?='; // this should be the correct and most accurate regexp $regexE = '=\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[\x21-\x3E\x40-\x7E]+\?=';
The string used for testing is:
$test = 'From: =?US-ASCII?Q?Keith_Moore?=To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= CC: =?ISO-8859-1?Q?Andr=E9?= Pirard Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?= =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=';
As you can see, we should get 5 results per regexp. My testscript:
ob_implicit_flush(); $input = 'From: =?US-ASCII?Q?Keith_Moore?=To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= CC: =?ISO-8859-1?Q?Andr=E9?= Pirard Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?= =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?='; // notice, this one is wrong! $regexps[0] = '_=\?[^ \(\)\<\>\@\,\;\:\/\[\]\?\.\=\x00-\x1F\x7F]+\?[^ \(\)\<\>\@\,\;\:\/\[\]\?\.\=\x00-\x1F\x7F]+\?[\x20-\x7E^ \?]+\?=_'; // notice, this one is not very accurate and uses subpattern () $regexps[1] = '_=\?(.*)\?(.*)\?(.*)\?=_'; // notice, this one is not very accurate, _not_ using subpattern $regexps[2] = '_=\?.*\?.*\?.*\?=_'; // notice, needs modifier U and might not be very accurate $regexps[3] = '_=\?[[:ascii:]]+\?[[:ascii:]]+\?[[:ascii:]]+\?=_U'; // this should be the correct and most accurate regexp $regexps[4] = '_=\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[\x21-\x3E\x40-\x7E]+\?=_'; $i = 0; foreach($regexps as $regexp) { echo "Test: #$i\n"; echo "Run:\n"; $teststart[$i] = microtime(true); for($run = 0; $run < 100000; $run++) { echo $run; $runstart[$i] = microtime(true); preg_match_all($regexp, $input, $matches); $runcount[$i] = $runcount[$i] + sizeof($matches[0]); $runstop[$i] = microtime(true); $rundiff[$i] = $rundiff[$i] + ($runstop[$i] - $runstart[$i]); } $teststop[$i] = microtime(true); $testdiff[$i] = $teststop[$i] - $teststart[$i]; $i++; } echo "testno, testduration, testcount, testmicrotime, testseconds\n"; for($i = 0; $i < count($regexps); $i++) { $runmilliseconds = $rundiff[$i] / 1000; echo $i.' '.$testdiff[$i].' '.$runcount[$i].' '.$rundiff[$i].' '.$runmilliseconds."\n"; }
The Result?
As to be expected 100 000 runs:
testno | test duration (microseconds) | test count | avg runtime (microseconds) | avg runtime (milliseconds) |
_=\?[^ \(\)\<\>\@\,\;\:\/\[\]\?\.\=\x00-\x1F\x7F]+\?[^ \(\)\<\>\@\,\;\:\/\[\]\?\.\=\x00-\x1F\x7F]+\?[\x20-\x7E^ \?]+\?=_ | ||||
0 | 2.73657202721 µs | 500000 | 1.63955712318 µs | 0.00163955712318 ms |
_=\?(.*)\?(.*)\?(.*)\?=_ | ||||
1 | 6.10814499855 µs | 500000 | 5.00233960152 µs | 0.00500233960152 ms |
_=\?.*\?.*\?.*\?=_ | ||||
2 | 4.76775789261 µs | 500000 | 3.66750049591 µs | 0.00366750049591 ms |
_=\?[[:ascii:]]+\?[[:ascii:]]+\?[[:ascii:]]+\?=_U | ||||
3 | 2.82996702194 µs | 500000 | 1.73454999924 µs | 0.00173454999924 ms |
_=\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[^ \(\)<>@,;:\/\[\]\?\.=\x00-\x1F\x7F]+\?[\x21-\x3E\x40-\x7E]+\?=_ | ||||
4 | 2.57315015793 µs | 500000 | 1.4738881588 µs | 0.0014738881588 ms |
Conclusion?
You can see, that all regexp matches (count of 500 000 - we have 100 000 runs and 5 matches - do the maths, luke!). You can also see, that the slowest regex uses (.*). The regex using .* is a bit faster, which shows that the slowness is not caused by the subpattern (). The last regex, is the fastest (and also the most accurate one) - interesting, isn't it?
No Comments