Fun with mb_strlen

I noticed the fallback implementation for mb_strlen() that we had in GlobalSettings.php sucked:

	function mb_strlen( $str, $enc = "" ) {
		preg_match_all( '/./us', $str, $matches );
		return count($matches);
	}

There are two things to note about this code:

  1. It doesn’t actually work: preg_match_all() fills $matches with a single sub-array holding all of the matches, so count($matches) always returns 1
  2. Even if you fix it to count the matches themselves (count($matches[0])), it’s extremely slow and will eat lots of memory by creating a giant array of every character in the (potentially quite long) string
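
The first point comes down to the shape of the array that preg_match_all() fills in. Here is a minimal sketch of that, using a made-up sample string rather than anything from the original code:

```php
<?php
// Illustrative input: "héllo" is 5 characters but 6 bytes in UTF-8.
$str = "h\xc3\xa9llo";

preg_match_all( '/./us', $str, $matches );

// With the default PREG_PATTERN_ORDER flag, $matches[0] holds every
// full-pattern match. The pattern has no capture groups, so the outer
// array has exactly one element no matter how long the string is.
$outer = count( $matches );      // what the old fallback returned
$inner = count( $matches[0] );   // the number of characters matched

// preg_match_all() also returns the number of full matches directly.
$direct = preg_match_all( '/./us', $str );

printf( "outer=%d inner=%d direct=%d bytes=%d\n", $outer, $inner, $direct, strlen( $str ) );
```

Using the return value avoids the count() call, but the regex engine still has to walk the string one character at a time, so it does not address the speed problem.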

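I’m replacing this with a new version which uses PHP’s count_chars() function to count up the ASCII-compatible bytes and multibyte sequence head bytes. Every UTF-8 character contains exactly one of those (any other bytes in it are continuation bytes), so the sum is the character count. It’s still a smidge slower than mb_strlen but it’s… much better than the old one.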

	/**
	 * Fallback implementation of mb_strlen, hardcoded to UTF-8.
	 * @param string $str
	 * @param string $enc optional encoding; ignored
	 * @return int
	 */
	function new_mb_strlen( $str, $enc="" ) {
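		// In its default mode, count_chars() returns how often each byte value (0..255) occurs
		$counts = count_chars( $str );
		$total = 0;

		// Count ASCII bytes
		for( $i = 0; $i < 0x80; $i++ ) {
			$total += $counts[$i];
		}

		// Count multibyte sequence heads; continuation bytes (0x80..0xbf) are skipped,
		// and 0xff never occurs in valid UTF-8
		for( $i = 0xc0; $i < 0xff; $i++ ) {
			$total += $counts[$i];
		}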
		return $total;
	}
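
Another way to look at the same idea: for valid UTF-8, the character count is the byte count minus the continuation bytes. The sketch below uses a hypothetical helper name (utf8_len_minus_continuations is not part of the real code) and made-up sample strings. It is only meant to show how the byte classes add up, not to replace the function above.

```php
<?php
// Hypothetical helper, for illustration only: count characters as
// total bytes minus UTF-8 continuation bytes (0x80..0xBF).
function utf8_len_minus_continuations( $str ) {
	$counts = count_chars( $str ); // frequencies for all 256 byte values
	$continuations = 0;
	for ( $i = 0x80; $i < 0xc0; $i++ ) {
		$continuations += $counts[$i];
	}
	return strlen( $str ) - $continuations;
}

// Made-up samples covering 1-, 2- and 3-byte characters.
$samples = array(
	'ascii'  => "hello",                     // 5 chars, 5 bytes
	'2-byte' => "h\xc3\xa9llo",              // "héllo": 5 chars, 6 bytes
	'3-byte' => "\xe6\x9d\xb1\xe4\xba\xac",  // "東京": 2 chars, 6 bytes
);
foreach ( $samples as $label => $s ) {
	printf( "%-7s bytes=%d chars=%d\n", $label, strlen( $s ), utf8_len_minus_continuations( $s ) );
}
```

The two approaches only agree on well-formed UTF-8. On input containing stray 0xff bytes they would give different answers.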

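Some quick benchmarks using the UTF-8 normalization benchmark pages. Note that strlen counts bytes rather than characters, which is why its totals are higher for the non-ASCII texts: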

Testing washington.txt:
              strlen      31526 chars    0.007ms
           mb_strlen      31526 chars    0.114ms
       old_mb_strlen      31526 chars 4813.686ms
       new_mb_strlen      31526 chars    0.132ms

Testing berlin.txt:
              strlen      36320 chars    0.001ms
           mb_strlen      35899 chars    0.129ms
       old_mb_strlen      35899 chars 6328.748ms
       new_mb_strlen      35899 chars    0.127ms

Testing bulgakov.txt:
              strlen      36849 chars    0.001ms
           mb_strlen      20418 chars    0.076ms
       old_mb_strlen      20418 chars 3003.042ms
       new_mb_strlen      20418 chars    0.133ms

Testing tokyo.txt:
              strlen      36244 chars    0.001ms
           mb_strlen      19936 chars    0.071ms
       old_mb_strlen      19936 chars 2623.109ms
       new_mb_strlen      19936 chars    0.131ms

Testing young.txt:
              strlen      36694 chars    0.001ms
           mb_strlen      16676 chars    0.063ms
       old_mb_strlen      16676 chars 2246.179ms
       new_mb_strlen      16676 chars    0.125ms
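
The benchmark script itself isn’t reproduced here. As a rough idea of how numbers like these can be collected, here is a sketch of a timing loop. The helper name time_length_function, the run count and the synthetic input are all assumptions for illustration, not the script that produced the figures above.

```php
<?php
// Hypothetical timing helper: calls $fn on $text $runs times and returns
// array( last result, average milliseconds per call ).
function time_length_function( $fn, $text, $runs = 100 ) {
	$result = null;
	$start = microtime( true );
	for ( $i = 0; $i < $runs; $i++ ) {
		$result = call_user_func( $fn, $text );
	}
	$elapsedMs = ( microtime( true ) - $start ) * 1000 / $runs;
	return array( $result, $elapsedMs );
}

// Synthetic input standing in for a sample page: 10000 copies of "é".
$text = str_repeat( "\xc3\xa9", 10000 );

// Only time mb_strlen when the mbstring extension is actually loaded.
$candidates = array( 'strlen' );
if ( function_exists( 'mb_strlen' ) ) {
	$candidates[] = 'mb_strlen';
}
foreach ( $candidates as $fn ) {
	list( $length, $ms ) = time_length_function( $fn, $text );
	printf( "%20s %10d chars %10.3fms\n", $fn, $length, $ms );
}
```

Averaging over many runs matters here, since a single call on an article-size string finishes in a fraction of a millisecond.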

2 thoughts on “Fun with mb_strlen”

  1. The usual hack is to use strlen(utf8_decode($str)) and rely on anything outside ISO 8859-1 being output as a single question mark.

  2. Hm, that’s clever too. :)

    Turns out it’s actually slower than my count_chars() method, though, on article-size strings. (By about a factor of four for primarily ASCII text, and a factor of three or two for text in the 2-byte and 3-byte-per-char ranges.)

    Your method is faster for short strings… but all are well under a millisecond on my 2.33 GHz Core Duo test box for long strings, and under a tenth of a ms for the short strings, so it perhaps gets into splitting hairs. ;)
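
For reference, the strlen(utf8_decode()) trick from the first comment can be sketched as follows. The wrapper name utf8_decode_strlen is made up for the example. utf8_decode() is deprecated as of PHP 8.2, so treat this as an illustration of the idea rather than something to rely on in new code.

```php
<?php
// utf8_decode() is deprecated as of PHP 8.2; silence that notice for the demo.
error_reporting( E_ALL & ~E_DEPRECATED );

// Hypothetical wrapper name for the trick described in the comment.
function utf8_decode_strlen( $str ) {
	// Converting valid UTF-8 to ISO-8859-1 turns each character into one
	// byte ("?" when there is no Latin-1 equivalent), so strlen() on the
	// result counts characters instead of UTF-8 bytes.
	return strlen( utf8_decode( $str ) );
}

// Made-up samples: "héllo" (5 chars, 6 bytes) and "東京" (2 chars, 6 bytes).
$latin = "h\xc3\xa9llo";
$cjk   = "\xe6\x9d\xb1\xe4\xba\xac";

printf( "latin: bytes=%d chars=%d\n", strlen( $latin ), utf8_decode_strlen( $latin ) );
printf( "cjk:   bytes=%d chars=%d\n", strlen( $cjk ), utf8_decode_strlen( $cjk ) );
```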
