Regexp vs Non-Regexp performance

Yesterday I was talking with Sven about my Imap-Class and I told him that I switched regexp to non-regexp to gain performance. He was asking me whether the use of str* is really faster than the use of regexp in PHP and thus he said „you could write an entry about that in your blog, I’d be interested“ – Hey Sven, here you go.

For my tests I used debian lenny’s PHP 5.2.6 (in debian squeeze) using CLI. All tests have been averaged (the first test-case 1000 runs, the second test-case 10 000 runs). If you want to skip the „how I did it“ and just want to read my conclusion, click this link 🙂

Notice: The regexp used here for detecting mime-encoded-words is not 100% correct; if you’re looking for a correct variant, take a look at my next article here.

1st test-case (mime-encoded-words)

For my test case I’m using mime-encoded-words – a few thousand of them:


	$mew = "";
	$word = "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?= =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?= =?US-ASCII?Q?Keith_Moore?=";
	for($i = 0; $i < 80000; $i++)
		$mew .= $word." ";
	$tw = trim($mew);

Now after generating that list, we can start our time-measurements using microtime();

	$start = microtime(true); 

I'm processing that list in a foreach, by exploding our wordlist using a whitespace as delimiter:

	foreach($elements as $element)
	{
		if(
			(substr($element, 0, 2) == '=?') &&
			(substr($element, -2) == '?=') &&
			(substr_count($element, '?') == 4)
		){
			$reAcount++;
		}
	}
	$stop = microtime(true);
	$testcaseA = ($stop - $start)

within that foreach we'll need the str* stuff

		if(
			(substr($element, 0, 2) == '=?') &&
			(substr($element, -2) == '?=') &&
			(substr_count($element, '?') == 4)
		){
			$reAcount++;
		}

Let's calculate the time:

	$stop = microtime(true);
	$testcaseA = ($stop - $start);

Let's do the same with preg now. The content of the foreach:

		if(preg_match('/^=\?(.*)\?(.*)\?(.*)\?=$/is', $element))
		{
			$reBcount++;
		}

Someone might say, my regexp isn't very good and the use of (.*) is causing all this stuff.. Here you go:

	$token = '[^ \?]';
	$especials = '[^ \(\)\<\>\@\,\;\:\/\[\]\.\?\=]';

content of foreach:

		if(preg_match('/^=\?'.$token.'*\?'.$token.'*\?'.$especials.'*\?=$/is', $element))
		{
			$reDcount++;
		}

last but not least there's also preg_match_all. With it we can avoid the foreach:

	$start = microtime(true);
	preg_match_all('/=\?(.*)\?(.*)\?(.*)\?=/U', $tw, $matches);
	$reCcount = sizeof($matches[0]);
	$stop = microtime(true);
	$testcaseC = ($stop - $start);

The averaged result of 1000 test runs is:

Type microseconds milliseconds Count
str* 0.131636329651 0.000132 75000
preg_match #1 0.530133208275 0.00053 75000
preg_match #2 0.172833989382 0.000173 75000
preg_match_all #1 0.420841172218 0.000421 75000
preg_match_all #2 0.144988677502 0.000145 75000

preg_match #1 means (.*) was used
preg_match #2 means [^ ...]* was used
preg_match_all #1 and #2 as above.

Conclusion?

First of all, this test-case is not ideal; I'm using the same mime-encoded-words (3 different ones) and there's apart from the mime-encoded words nothing else in the string. This test-case is still useful tho, because you can see the similarities between the results. Interesting is for example that using [^ ..]* in the regexp is faster than using (.*) and using preg_match_all is as fast as str*. However: As you can see in the results, we're talking about microseconds, it's not like you would notice a difference between 0,0004 and 0,0018, would you? If we put it together: str* is slightly faster than a proper preg_match_all which is slightly faster than a proper preg_match - a general preg_match using (.*) is the slowest approach. I guess preg_match is slower than preg_match_all in this case, due to overhead from using foreach.

2nd test-case (mime-encoded-words out of rfc 2047)

My 2nd test-case parses the whole text of rfc 2047 - I just downloaded the .txt and wrote the following in the top of my php script:

	$mew = trim(str_replace(array('(', ')', '"', '\r\n', '\n', '\0', '\t'), '', implode('', file('rfc2047.txt'))));
	$tw = preg_replace('_\s+_', ' ', $mew);

So basically, I'm just removing a few characters, newlines, tabs ( I know unlikely that \0 and \t will be there.. ) and I converted many spaces to one space. Processing that with my previous script shows the following result:

Type microseconds milliseconds Count
str* 0.00320203952789 0 22
preg_match #1 0.00531774499416 0 22
preg_match #2 0.00729660811424 0 22
preg_match_all #1 0.000311268949509 0 25
preg_match_all #2 0.000101275420189 0 22

Conclusion?

Well. I'm a bit confused and impressed myself. Because I expected str* to be faster here. The opposite is the case. While doing this test I noticed inacurate behaviour of preg*, thus preg_match got 25 as count where it should get 22. I wasn't able to find out what's causing that; however: regexp might be inaccurate and shouldn't be generalized using (.*) if it's possible, avoid (.*). A proper regex and preg_match_all is faster here than str* still str* is faster than a regexp using (.*).

3rd test-case (parsing a log-file / getting the date)

For one of my projects I wrote a log-parser, the first version of it was using a lot of regexp and it was damn slow. I'm not sure whether this was due to the use of regexp (I rewrote the whole thing recently and switched from using preg* to non-regexp - Let's take a look. I'm using a real mail log file:

wdp@lunar ~ $ du -h mail.txt
2.2M    mail.txt

And the following regexp to get the date (1000 runs):

preg_match('/^([A-Za-z]{3}) ([0-9]{1,2}) ([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}) (.*)$/', $logentry, $matches);

For the non-regexp variant I'm using substr, str*, explode.

The result is:

type parsed lines skipped lines microseconds milliseconds
regexp 16180000 0.254695117474 0.000254695117474
non-regexp 16180000 0.267557286263 0.000267557286263

Conclusion

again, not much difference between both variants, however in this case, regexp are faster.

Overall conclusion

The performance of regexp depends on the regexp itself. I noticed that (.*) is slower than for example [^ \?]* or [A-Za-z0-9]. So the general rule should be: try to write your regexp as exact as possible. Generally regexp seems to be slightly slower than str* and similar things. In some scenarios regexp MIGHT be faster. The general rule "try to avoid regexp whenever possible" is wrong, tho. However: Using regexp you might make things worse (in some cases, they might be slower and they might be inaccurate) keeping that in mind, there's some sense in "try to avoid them" 🙂

No Comments

Post a Comment