Parallelizing PNG compression, part 1

I’ve noticed that on my Linux workstation (an older CPU with a current graphics card and dual 4K monitors) taking screenshots by pressing “PrintScreen” in the GNOME desktop seems really slow: specifically, there is a long delay of a second or more between tapping the key and the start of the audiovisual “camera flash” feedback.

Profiling with the “perf” tool and checking the actual source for gnome-shell confirmed three things:

  1. gnome-shell doesn’t start the camera flash effect until after the screenshot is saved to disk as a PNG file, meaning the delay is mostly up to the PNG image compression.
  2. PNG compression is slow, and single-threaded! It takes about 1.25s to save a 7680×2160 desktop screenshot PNG on this machine; 0.75s on my faster laptop. A top of the line modern workstation can probably hit 0.5s, but that’s still a longish delay for feedback of a keyboard event.
  3. The two biggest contributors to CPU time are filtering the output pixels before compression (part of libpng), and the actual “deflate” compression (part of zlib).

I’ve divided the problem up into two areas to work on, one of which looks easy and one of which looks tractable but is a little more work to do right:

  1. Patch gnome-shell to split the screenshot provider into separate “capture” and “write” steps, then start the camera flash effect after capture to run in parallel with the compression. On most machines the screenshot will already be compressed before the visual effect finishes, making it seem instantaneous! On corner cases like mine, it’ll still feel fast, but the PNG may not be available until after the flash finishes.
  2. Create a library to write PNGs using multithreading [for the actual filter/compression stages] (either as a helper for libpng, or standalone), and figure out how to get it used in various places. (My screenshot case would be gdk-pixbuf.)

The patch for gnome-shell is mostly written but I haven’t yet figured out how to build and test the latest gnome-shell on Fedora 28. :) Will poke at that in a bit.

For parallelizing the actual compression, I found a number of tools to do multithreaded gzip/deflate wrapping around regular zlib, and one old half-finished experiment at specifically doing so for PNG saving. Learn more about them in our upcoming blog post part 2! (ooh! exciting!)

VP9 decoder hotspots in asm.js: multiplication

When using ogv.js in the old IE 11 browser to play VP9 video, one of the biggest CPU hotspots is the vpx_convolve8_c function, which applies pixel filtering in a target area.

The C looks like:

All the heavy lifting being in those two functions convolve_horiz and convolve_vert:

It looks a little oversimple with tight loops and function calls, but everything is inlined aggressively by the compiler and the inner loop unrolled. The asm.js looks something like:

(Note some long lines are cut off in the unrolled loop.)

This seems fairly optimal, but know that those multiplications are slow — they’ll be floating-point multiplication because the semantics of JavaScript’s floating-point multiply operator don’t lend themselves well to automatic consolidation into integer multiplication. And it’s already an optimization that it’s doing _that_!

Normally emscripten actually doesn’t emit a multiply operator for an integer multiplication like this — it instead emits a call to Math.imul which implements 32-bit integer multiplication correctly and, when implemented, quickly. But in IE 11 there’s no native Math.imul instruction because it’s older than that addition to the JavaScript standard…

The emscripten compiler can provide an emulated replacement for Math.imul when using the LEGACY_VM_SUPPORT option, but it’s very slow — a function call, two multiplications, some bit-shifts, and addition.

Since I know (hope?) the multiplications inside libvpx never overflow, I run a post-processing pass on the JavaScript that replaces the very slow Math.imul calls with only moderately slow floating-point multiplications. This made a significant difference to total speed, something like 10-15%, when I added it to our VP8 and VP9 decoding.

Unfortunately optimizing it further looks tricky without SIMD optimizations. The native builds of these libraries make aggressive use of SIMD (single-instruction-multiple-data) to apply these filtering steps to several pixels at once, and it makes a huge improvement to throughput.

There has been experimentation for some time in SIMD support for asm.js, which seems to be being dropped now in favor of moving it directly into WebAssembly. If/when this eventually arrives in Safari it’ll be a big improvement there — but IE 11 will never update, being frozen in time.

Revisiting AVSampleBufferDisplayLayer on iOS 11

When I tried using iOS’s AVSampleBufferDisplayLayer in OGVKit last year, I had a few problems. Most notably a rendering bug on one device, and inability to display frames with 4:4:4 subsampling (higher chroma quality).

Since the rendering path I used instead is OpenGL-based, and OpenGL is being deprecated in iOS 12… figured it might be worth another look rather than converting the shader and rendering code to Metal.

The rendering bug that was striking at 360p on some devices was fixed by Apple in iOS 11 (thanks all!), so that’s a nice improvement!

Fiddling around with the available pixel formats, I found that while 4:4:4 YCbCr still doesn’t work, 4:4:4:4 AYCbCr does work in iOS 11 and iOS 12 beta!

First I swapped out the pixel format without adjusting my data format conversions, which produced a naturally corrupt image:

Then I switched up my SIMD-accelerated sample format conversion to combine the three planes of data and a fourth fixed alpha value, but messed it up and got this:

Turned out to be that I’d only filled out 4 items of a 16×8 vector literal. Durrrr. :) Fixed that and it’s now preeeeetty:

(The sample video is a conversion of a silly GIF, with lots of colored text and sharp edges that degrade visibly at 4:2:0 and 4:2:2 subsampling. I believe the name of the original file was “yo-im-hacking-furiously-dude.gif”)

A remaining downside is that on my old 32-bit devices stuck on iOS 9 and iOS 10, the 4:2:2 and 4:4:4 samples don’t play. So either I need to keep the OpenGL code path for them, or just document it won’t work, or maybe do a runtime check and downsample to 4:2:0.

Mobile video encoding tradeoffs

I spent a little time yesterday and today poking at an old project to encode video into the WebM format we use at Wikipedia on your iPhone or iPad so you could, potentially, take video and upload it directly.

The default encoding settings were mmmuuuucccchhhh tttoooooo ssslllooowww to be practical, so I had to tune it to use a faster configuration. But to maintain quality, you have to bump up the bitrate (and thus file size) significantly.

This same tradeoff is made in the hardware video encoders in the device, too! When you’re making a regular camera recording the bitrate is actually several times higher than it would be on a typical download/stream of the same video from YouTube/Netflix/etc. You just don’t have the luxury of the extra encoding time on a modest mobile chip, especially not if you’re recording live.

Video files and thumbnails in MediaWiki’s TimedMediaHandler

One of my ongoing tasks is modernizing and fixing up issues with TimedMediaHandler, the MediaWiki extension we use to handle audio and video files on Wikipedia and Wikimedia Commons.

One of many “little cuts” has been that the thumbnail frame of a video file defaults to the halfway point in the file, which is sometimes great and sometimes useless (such as a blank transition frame).  You can override the thumbnail to use for a given invocation in a wiki page, but can’t set a default to be used in places where no parameter is added… such as when a WebM or Ogg video got selected as the default poster image via the PageImages extension.

To resolve this, we’re reviving an old request to make it possible to set the default thumbnail frame time for a video.

I’ve got a patch partially working, using a parser function so you can put something like {{#thumbtime:90}} on the file page to set the default to 90 seconds in, or {{#thumbtime:35:17}} for 35 minutes and 17 seconds in, etc. The expanded time in seconds is saved as a page property on the File: page, then extracted and copied into the extended file metadata by a hook in TimedMediaHandler. The metadata gets queried when we the generate HTML harness for the video/thumbnail, and we pass the time down in the URL of the thumb just as we would if you had specifically requested that time.

Currently however it’s incompatible with CommonsMetadata extension, which we kind of need. ;) So I’ve got to figure that out and either get them to play nice, or store the data in a different way.

(Just using the page props locally isn’t enough because we have to export the data from Wikimedia Commons to other sites.)

emscripten fun: libffi on WebAssembly part 2

A couple weeks ago I started porting libffi to work on emscripten’s WebAssembly output mode, in the hopes of eventually getting glib or even gtk+ working in browser-compiled apps.

I got it back set up on my Linux box (moved from my Windows/WSL environment) and spent a little time this morning fixing a few more things.

First, 64-bit integers are now supported in call arguments and return values. This requires jumping through some hoops because JavaScript doesn’t yet have large ints, and so the C<->JS ABI divides them into two 32-bit words. For args, you get two args in a row, first the low then the high word. For return values, the low word is returned and the high word is set in a module-global variable which can be retrieved through Module.getTemp0().

Second, I fixed a bunch of float tests that failed because currently the test suite is being built and run in asm.js mode instead of Wasm. There’s a slight ABI difference again, in that in asm.js floats are always promoted to doubles in args  and return values. Checking Module.usingWasm and using the appropriate “f” or “d” in the signature resolves this.

Note the tests are in asm.js mode because the DejaGnu test runner seems to want to run the executables directly, so I tricked emcc into thinking it’s being run from emconfigure with EMCONFIGURE_JS=2 to force it to build a executable with a Node shebang instead of a native executable (whaaaaa) or bare JS that’s not executable… and currently emcc forces Wasm off in this mode, though I think it should not need to. Will look further into that.

Still have hundreds of failing tests as I haven’t implemented structs or closure generation, but it’s progress!

JavaScript engine internals part 5: conclusions

Thoughts from the last couple weeks of prototyping a JavaScript-like scripting language engine core

Scripting inside WebAssembly

A JavaScript-like garbage-collected dynamically-typed scripting engine runtime can indeed be implemented in WebAssembly, with relatively sensible performance for ahead-of-time compiled code. (C++/emscripten-based prototype showing within 3-4x of native JS in node for a floating-point math test without too much micro-optimization yet.)

  • Garbage collection keepalive requires explicit handling of a GC-scanable stack. There’s no way to scan the native stack, as not everything spills from locals to the emscripten-managed stack in linear memory.
  • Tricks like NaN boxing work for squishing variant values into a 64-bit word, and can perform reasonably well.
  • C++/emscripten prototype engine works well enough, but the build is large using standard library (for hashmaps, vectors, strings, and some string formatting at present).
  • It may be worth investigating a dynamic linking system where the engine and the compiled code are separate .wasm files, but this may have performance implications on calls into the engine.

Sandboxing

  • Memory usage can be strictly limited at compile or instantiation time, but moderate DoS vectors exist in long loops and deep recursion (can mitigate with Worker thread isolation)
  • Sandboxing the whole runtime so you can swap out just the .wasm file with a user-provided engine would be a lot more work (avoiding emscripten’s JS-side wrappers and functions or reducing them to a bare safe interface implemented for all engines).
  • Using separate .wasm modules for the engine and the program would expose engine internals in linear memory to the program’s module. Be aware that this is not a secure boundary unless the compiler is trusted.

Areas for further work

Adding an actual ahead-of-time compiler instead of hand-translating JS to C++ source would be a lot more work.

  • Would have to decide what tools to use to parse JS source
  • What kind of lifecycle and toolchain to use (does this get compiled server-side? client-side?)
  • What to emit (C++ as input to emscripten, or direct to LLVM, or super direct to WebAssembly?)

An in-engine interpreter for a REPL loop would be a different sort of work, and might or might not be able to share code with compiler for parsing etc.

How would compiled scripts be debugged, other than by slipping in console.log() invocations and watching error messages?

  • Is it possible to emit source maps for use in browser developer tools so can step through watching the original source instead of Wasm disassembly? How workable is this with Wasm at this point?

Need to imagine some example plugin APIs that could be used for scenarios such as:

  • SVG diagram with some event handlers and animations attached to it
  • SVG diagram or Canvas implementing Logo-like turtle graphics with simple input commands
  • Text editor plugin that describes a simple search-and-replace dialog and implementation

Immediate plans

I’ll probably keep tinkering with this because it’s fun as heck to research how other scripting engines work and see how I can make it myself. ;) But it’s a looooong way from being usable, and I wouldn’t likely be able to push it all the way there any time soon myself.

Will probably prototype the “other side” of the plugin scenarios above and see if I can get more folks at Wikimedia excited about sandboxed plugins and widgets with secure, narrow, versionable interfaces. I need to think of a clever name for it! ;)

In the meantime …. other tasks beckon! TimedMediaHandler needs its fixes finished…

JavaScript engine internals part 4: NaN-boxing vs heap boxing in Mandelbrot

One of my favorite programming samples is calculating a visualization of the Mandelbrot set, a famous fractal. I’ve added it to my experimental JS-like scripting engine to exercise the floating-point performance:

Using the tagged-pointer scheme in my earlier post for storage of variant local types, there was the disadvantage that double-precision floating point numbers (64 bits wide) would not fit in the 31 bits of tagged-pointer space, so they had to be “boxed” (allocated on the heap as an object). This made for runtimes on the Mandelbrot set demo approximately 200x slower than the original JS running in node. Ouch!

I decided to reimplement NaN boxing (which allows storing full 64-bit doubles or a pointer or an int32, all in a single 64-bit word), this time following the JavaScriptCore method that’s easier to unbox pointers versus the SpiderMonkey method. Now I can flip my code between NaN-boxing and heap-boxing with a compile-time #define and compare!

original js: 9 ms

native w/ NaN box: 48 ms
wasm with NaN box: 71 ms

native w/ box doubles: 2677 ms
wasm with box doubles: 1794 ms

NaN-boxing *screams* versus heap-boxing. In part it’s the indirection, but I think it’s mostly having to allocate a hojillion objects and then clean them up, as every temporary result in calculations gets turned into a live object and then garbage-collected after a while. (Note that the native build on my MacBook Pro is slower than the wasm build with heap boxing, probably due to a different allocator and 32-bit vs 64-bit pointers!)

But do we really have to do all that temporary allocation? Most math operators only return a specific type (-, *, and / always return doubles; + can return a string depending on its input). For now the JS code is hand-translated to C++ using operator overloads, so having my C++ operator overloads return doubles when possible, and accept them when possible, lets temporaries that can only be doubles skip the type-checking and boxing/unboxing overhead.

original js: 9 ms

native w/ NaN box: 17 ms
wasm with NaN box: 28 ms

native w/ box doubles: 467 ms
wasm with box doubles: 248 ms

Boxing doubles on the heap is still really slow, but a lot less slow thanks to sidestepping it on many intermediate values… And with the 64-bit NaN-boxing we’re only about 3x slower than native node when running in wasm in node, a big improvement that I’m really happy with!

This makes me feel that 64-bit NaN boxing is the way to go; it’s only slightly more complex than tagging a pointer-sized word and it makes floating point math a ton faster when it’s needed, which sometimes it just is.

Floating point fun: iterating the number space

As a joke the other day I posted a code snippet on Mastodon:

for (var i = -Infinity; i < Infinity; i++) {
  /* this should cover all possibilities */
}

Such an infinite loop is not possible, as adding 1 to -Infinity just gives you -Infinity again, and you never get anywhere. ;)

But even if you started from the minimum representable individual number in a double-precision floating point you wouldn’t actually iterate anywhere because there’s not enough precision to represent the added 1.

Which got me thinking, how *would* you iterate over the entire set of floating point numbers between -Infinity to +Infinity, knowing that the intervals between the numbers are variable?

Well if you want to get everything, including non-integral values, you can iterate over the available range of exponents and fractional values, and either bit-shift them together manually (ick!) or multiply/add/pow them together in floating point land.

Something like this:

function iterate_all_floats(callback) {
  // exponent is additively biased in the binary rep,
  // so we go from negative to positive vals.
  let min_exp = -(Math.pow(2, 10) - 2); // all binary 1s are reserved
  let max_exp = Math.pow(2, 10) - 1;

  let min_frac = 0;
  let max_frac = Math.pow(2, 52) - 1;
  let frac_divisor = Math.pow(2, 52);

  let min_sub = 0;
  let max_sub = Math.pow(2, 52) - 1;
  let sub_divisor = Math.pow(2, 52);

  callback(-Infinity);

  // Start from the biggest negative numbers and go up towards -0
  for (let exp = max_exp; exp >= min_exp; exp--) {
    for (let frac = max_frac; frac >= min_frac; frac--) {
      callback(-(1.0 + frac / frac_divisor) * Math.pow(2, exp));
    }
  }

  // Negative subnormals and -0
  for (let sub = max_sub; sub >= min_sub; sub--) {
    callback(-(sub / sub_divisor) * Math.pow(2, -1022));
  }

  // +0 and positive subnormals
  for (let sub = min_sub; sub <= max_sub; sub++) {
    callback((sub / sub_divisor) * Math.pow(2, -1022));
  }

  // Start from just past +0 and move up to the biggest numbers
  for (let exp = max_exp; exp <= max_exp; exp++) {
    for (let frac = min_frac; frac <= max_frac; frac++) {
      callback((1.0 + frac / frac_divisor) * Math.pow(2, exp));
    }
  }

  callback(Infinity);
}

iterate_all_floats(function(f) {
  // Convert to binary for display
  let floats = new Float64Array(1);
  let bytes = new Uint8Array(floats.buffer);
  let str = '';
  floats[0] = f;
  for (let i = 3; i >= 0; i--) {
    let b = bytes[i];
    for (let bit = 7; bit >= 0; bit--) {
      let val = (b >> bit) & 1;
      str += val;
    }
  }
  console.log(str, f);
});

Enjoy!

JavaScript engine internals part 3: garbage collection

Another part of implementing a JavaScript-like engine to run inside a WebAssembly sandbox is memory management: JS requires garbage collection because objects may hold references to each other, creating cycles that simpler schemes like reference counting can’t break.

Mark and sweep

Most JS engines implement some form of mark-and-sweep garbage collection, usually enhanced with various fancy features we won’t get into here (moving/compaction, generational sweeps, etc).

The mark phase involves tracing your way through the object graph. Starting with globals and objects that are held in variables, you mark the object, setting a flag unique to that object. For any references it holds to other objects, you repeat — whenever you reach an object that’s already been marked you can skip over to the next one without re-tracing its references.

Any object that didn’t get marked is considered unreachable, and gets deleted in the sweep phase. This requires having some way to iterate through all allocated objects, which in my experimental engine I’ve temporarily implemented as a C++ std::unordered_set (a hash map with no payload) which new objects are always added to. This should eventually be replaced with a custom allocator that knows how to iterate through its own objects, or something. :)

The stack

Traditional C/C++ programs store some of their local variables (those that aren’t optimized into registers) in a region of memory known as the stack: a pointer to the stack top is incremented and decremented as functions run to provide storage space, and the same stack is used to store return addresses for function calls (a cause of many security problems).

Some garbage collection systems like Boehm libgc take advantage of this, and scan the stack looking for object references to keep alive during the mark phase. Clever!

The WebAssembly environment has stricter memory safety: byte-addressable “linear memory” has a limited range and you can only access local variables in the stack frame through compile-time-addressed, pre-verified indexes. There’s no way to access a calling function’s stack frames to check what it had saved in its local variables…

When using the emscripten compiler, there is a second, more C/C++ like stack maintained inside the linear memory for variables and data structures that can’t fit in WebAssembly locals (for instance structure types larger than a single word, or anything that has its address taken and a pointer passed somewhere). Theoretically we could scan through that looking for our pointer values, but there may be false positives and that could be trouble — you wouldn’t want to accidentally try to traverse the object references inside a pointer that was actually a random integer or something!

Stacks, scopes, and smart pointers

For now I’ve implemented my own secondary stack which holds only encoded JS values, and modeled JS local variable bindings as C++ pointers to the value on the stack. When a garbage collection happens, the known globals and anything referenced on the stack are marked and traversed, so objects considered in use stay alive.

The way the stack is managed in the C++ engine code is inspired by V8’s Handle & HandleScope/EscapableHandleScope system. It requires a little bit of explicit discipline, and could probably be further improved in terms of the classes making usage correct, but seems to work correctly.

At the beginning of each function, a Scope or ScopeRetVal object is instantiated:

void someFunc() {
  Scope scope;
  // do stuff
}

This saves the stack position, and restores it to its original state when the function exits — so you don’t have to manually remember to pop. That’ll free up any of your locally allocated values for garbage collection.

Any given variable binding is either a ‘Binding’ (alias to Val*) to an existing, elsewhere-managed Val, or a ‘Local’ “smart pointer” object when compiles down to a pointer, but with a constructor that pushes a slot onto the stack.

Arguments can be passed as Local instances as well, constructed on the parent frame and eventually destructed with it, or if you know they’re GC-rooted already for sure you can pass a direct binding.

The big trick is return values, which I needed inspiration from V8 to figure out. If you try to return a Local, it’ll be cleaned up from your own scope before you return — and if you try to return a Val directly it won’t be GC-rooted and can disappear unexpectedly.

RetVal somefunc() {
  ScopeRetVal scope;
  // do stuff
  return scope.escape(new String("keep it safe!"));
}

Functions return a RetVal which doesn’t push, instead of a Local which does. A variant of the Scope, here ScopeRetVal (V8 uses the awesomer name EscapableHandleScope!) first reserves space on the parent scope for that return value. When you return, you have the scope insert the return val into the saved slot, and then the destructor restores the stack to the saved spot _after_ the return value, safely still alive.

Without this trick, return values could end up de-allocated during code like:

var retval = a() + b();

translated to C++ as

Local retval;
*retval = *(a->call(Null(), {})) + *(b->call(Null(), {}));

where the return value from a() gets de-allocated during an allocation inside the call to b().