I noticed the fallback implementation for mb_strlen() that we had in GlobalSettings.php sucked:
    function mb_strlen( $str, $enc = "" ) {
        preg_match_all( '/./us', $str, $matches );
        return count($matches);
    }
There are two things to note about this code:
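- It doesn’t actually work: preg_match_all() does find every character, but count($matches) counts the outer result array, which holds a single entry (the list of full-pattern matches), so the function always returns 1
- Even if you fix it to count the matches themselves (count($matches[0])), it’s extremely slow and will eat lots of memory by creating a giant array of every character in the (potentially quite long) string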
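To make the first point concrete, here is a minimal sketch of what preg_match_all() hands back in its default PREG_PATTERN_ORDER layout. The sample string and variable names are just for illustration, not taken from the original code.

```php
<?php
// Sample UTF-8 string: "héllo" is 5 characters but 6 bytes ("é" takes two).
$str = "h\xc3\xa9llo";

preg_match_all( '/./us', $str, $matches );

// $matches has one entry per pattern group. With no capture groups
// that is only $matches[0], so the outer count is 1 for any non-empty input.
$outer = count( $matches );

// The per-character matches live one level down.
$inner = count( $matches[0] );

echo "count(\$matches)    = $outer\n";   // 1
echo "count(\$matches[0]) = $inner\n";   // 5
echo "strlen(\$str)       = ", strlen( $str ), "\n";   // 6
```

Counting $matches[0] gives the right answer, but the array it counts still holds one element per character, which is where the time and memory go.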
I’m replacing this with a new version which uses PHP’s count_chars() function to count up the ASCII-compatible bytes and multibyte sequence head bytes. It’s still a smidge slower than mb_strlen but it’s… much better than the old one.
    /**
     * Fallback implementation of mb_strlen, hardcoded to UTF-8.
     * @param string $str
     * @param string $enc optional encoding; ignored
     * @return int
     */
    function new_mb_strlen( $str, $enc="" ) {
        $counts = count_chars( $str );
        $total = 0;

        // Count ASCII bytes
        for( $i = 0; $i < 0x80; $i++ ) {
            $total += $counts[$i];
        }

        // Count multibyte sequence heads
        for( $i = 0xc0; $i < 0xff; $i++ ) {
            $total += $counts[$i];
        }
        return $total;
    }
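Why counting only ASCII bytes and head bytes gives the character count may be easier to see byte by byte. The sketch below uses a hypothetical helper, classify_utf8_bytes(), which is not part of the fallback and is written only for illustration. It assumes well-formed UTF-8, where every character contributes exactly one ASCII byte or one head byte, and any further bytes are continuation bytes in the 0x80 to 0xbf range.

```php
<?php
/**
 * Illustration only: tally the bytes of a UTF-8 string by role.
 * classify_utf8_bytes() is a hypothetical helper, not part of the fallback.
 */
function classify_utf8_bytes( $str ) {
    $roles = array( 'ascii' => 0, 'head' => 0, 'tail' => 0 );
    $len = strlen( $str );
    for ( $i = 0; $i < $len; $i++ ) {
        $byte = ord( $str[$i] );
        if ( $byte < 0x80 ) {
            $roles['ascii']++;  // single-byte character (0xxxxxxx)
        } elseif ( $byte < 0xc0 ) {
            $roles['tail']++;   // continuation byte (10xxxxxx), not counted
        } else {
            $roles['head']++;   // first byte of a multibyte sequence
        }
    }
    return $roles;
}

// "Привет!" is six Cyrillic letters (two bytes each) plus one ASCII byte.
$roles = classify_utf8_bytes( "\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82!" );
$chars = $roles['ascii'] + $roles['head'];

echo "ascii={$roles['ascii']} head={$roles['head']} tail={$roles['tail']}\n";
echo "characters=$chars\n";   // 7 characters in 13 bytes
```

The count_chars() version gets the same total without a PHP-level loop over the string: it builds the 256-entry byte histogram in one native call, then sums the relevant slots.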
Some quick benchmarks using the UTF-8 normalization benchmark pages (code):
    Testing washington.txt:
               strlen  31526 chars     0.007ms
            mb_strlen  31526 chars     0.114ms
        old_mb_strlen  31526 chars  4813.686ms
        new_mb_strlen  31526 chars     0.132ms

    Testing berlin.txt:
               strlen  36320 chars     0.001ms
            mb_strlen  35899 chars     0.129ms
        old_mb_strlen  35899 chars  6328.748ms
        new_mb_strlen  35899 chars     0.127ms

    Testing bulgakov.txt:
               strlen  36849 chars     0.001ms
            mb_strlen  20418 chars     0.076ms
        old_mb_strlen  20418 chars  3003.042ms
        new_mb_strlen  20418 chars     0.133ms

    Testing tokyo.txt:
               strlen  36244 chars     0.001ms
            mb_strlen  19936 chars     0.071ms
        old_mb_strlen  19936 chars  2623.109ms
        new_mb_strlen  19936 chars     0.131ms

    Testing young.txt:
               strlen  36694 chars     0.001ms
            mb_strlen  16676 chars     0.063ms
        old_mb_strlen  16676 chars  2246.179ms
        new_mb_strlen  16676 chars     0.125ms