I noticed the fallback implementation for mb_strlen() that we had in GlobalSettings.php sucked:
function mb_strlen( $str, $enc = "" ) {
	preg_match_all( '/./us', $str, $matches );
	return count( $matches );
}
There are two things to note about this code:
- It doesn’t actually work: count($matches) counts the outer array of match groups rather than the matches themselves (those are in $matches[0]), so it always returns 1
- Even if you fix it to count $matches[0], it’s extremely slow and will eat lots of memory by creating a giant array of every character in the (potentially quite long) string
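To make the problem concrete, here’s a rough Python translation (my own sketch, not the original PHP) of the “fixed” regex approach. It gives the right answer, but it still materializes one list entry per character, which is where the time and memory go on long strings:

```python
import re

def regex_mb_strlen(s: bytes) -> int:
    """Count characters in a UTF-8 byte string the slow way:
    dot-matches-all regex, one list entry per character."""
    # re.DOTALL makes '.' match newlines too, like PHP's /s modifier.
    matches = re.findall(r'.', s.decode('utf-8'), re.DOTALL)
    return len(matches)

print(regex_mb_strlen('naïve тест'.encode('utf-8')))
```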
I’m replacing this with a new version which uses PHP’s count_chars() function to count up the ASCII-compatible bytes and multibyte sequence head bytes. It’s still a smidge slower than mb_strlen but it’s… much better than the old one.
/**
 * Fallback implementation of mb_strlen, hardcoded to UTF-8.
 * @param string $str
 * @param string $enc optional encoding; ignored
 * @return int
 */
function new_mb_strlen( $str, $enc = "" ) {
	$counts = count_chars( $str );
	$total = 0;

	// Count ASCII bytes
	for( $i = 0; $i < 0x80; $i++ ) {
		$total += $counts[$i];
	}

	// Count multibyte sequence heads
	for( $i = 0xc0; $i < 0xff; $i++ ) {
		$total += $counts[$i];
	}
	return $total;
}
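Why this works: in UTF-8, every character contributes exactly one byte that is not a continuation byte. ASCII characters are a single byte in 0x00–0x7F; multibyte characters start with a head byte in 0xC0–0xFE, followed only by continuation bytes in 0x80–0xBF. Counting heads and ASCII bytes therefore counts characters. A Python sketch of the same trick (mine, assuming well-formed UTF-8 input, just to illustrate the logic):

```python
def lead_byte_strlen(s: bytes) -> int:
    """Count characters in a UTF-8 byte string by tallying bytes
    that begin a character, skipping continuation bytes."""
    # Histogram of byte values, like PHP's count_chars().
    counts = [0] * 256
    for b in s:
        counts[b] += 1
    total = sum(counts[0x00:0x80])   # ASCII bytes
    total += sum(counts[0xC0:0xFF])  # multibyte sequence heads
    # Continuation bytes 0x80-0xBF are deliberately not counted.
    return total

print(lead_byte_strlen('naïve тест'.encode('utf-8')))
```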
Some quick benchmarks using the UTF-8 normalization benchmark pages (code):
Testing washington.txt:
  strlen          31526 chars     0.007ms
  mb_strlen       31526 chars     0.114ms
  old_mb_strlen   31526 chars  4813.686ms
  new_mb_strlen   31526 chars     0.132ms
Testing berlin.txt:
  strlen          36320 chars     0.001ms
  mb_strlen       35899 chars     0.129ms
  old_mb_strlen   35899 chars  6328.748ms
  new_mb_strlen   35899 chars     0.127ms
Testing bulgakov.txt:
  strlen          36849 chars     0.001ms
  mb_strlen       20418 chars     0.076ms
  old_mb_strlen   20418 chars  3003.042ms
  new_mb_strlen   20418 chars     0.133ms
Testing tokyo.txt:
  strlen          36244 chars     0.001ms
  mb_strlen       19936 chars     0.071ms
  old_mb_strlen   19936 chars  2623.109ms
  new_mb_strlen   19936 chars     0.131ms
Testing young.txt:
  strlen          36694 chars     0.001ms
  mb_strlen       16676 chars     0.063ms
  old_mb_strlen   16676 chars  2246.179ms
  new_mb_strlen   16676 chars     0.125ms
The usual hack is to use strlen( utf8_decode( $str ) ) and rely on anything non-8859-1 being output as a single question mark.
Hm, that’s clever too. :)
Turns out it’s actually slower than my count_chars() method, though, on article-size strings (by about a factor of four for primarily-ASCII text, or a factor of three or two for text in the 2-byte and 3-byte-per-char ranges).
Your method is faster for short strings… but all are well under a millisecond on my 2.33 GHz Core Duo test box for long strings, and under a tenth of a ms for the short strings, so it perhaps gets into splitting hairs. ;)
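For reference, the utf8_decode() hack from the comments can be sketched in Python like this (my own translation, not the commenter’s code): converting UTF-8 to Latin-1 collapses every character outside ISO-8859-1 to a single replacement byte, so the plain byte length of the result equals the character count.

```python
def utf8_decode_strlen(s: bytes) -> int:
    """Count characters in a UTF-8 byte string by transcoding to
    Latin-1, where each character becomes exactly one byte
    (non-Latin-1 characters become a one-byte '?')."""
    latin1 = s.decode('utf-8').encode('latin-1', errors='replace')
    return len(latin1)

print(utf8_decode_strlen('naïve тест'.encode('utf-8')))
```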