JavaScript engine internals part 4: NaN-boxing vs heap boxing in Mandelbrot

One of my favorite programming samples is calculating a visualization of the Mandelbrot set, a famous fractal. I’ve added it to my experimental JS-like scripting engine to exercise its floating-point performance.

Using the tagged-pointer scheme from my earlier post to store variant local types had one big disadvantage: double-precision floating point numbers (64 bits wide) don’t fit in the 31 bits of tagged-pointer payload, so they had to be “boxed” (allocated on the heap as an object). This made the Mandelbrot demo run approximately 200x slower than the original JS running in node. Ouch!

I decided to reimplement NaN boxing (which stores a full 64-bit double, a pointer, or an int32 in a single 64-bit word), this time following the JavaScriptCore method, which makes unboxing pointers cheaper than the SpiderMonkey method does. Now I can flip my code between NaN-boxing and heap-boxing with a compile-time #define and compare!
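
Conceptually the switch looks something like this — a simplified sketch with hypothetical names, not the engine’s actual code:

#include <cstdint>
#include <cstring>

#define USE_NAN_BOXING 1

#if USE_NAN_BOXING
struct Val {
  uint64_t raw; // a double is stored as its own bit pattern
  static Val fromDouble(double d) {
    Val v;
    std::memcpy(&v.raw, &d, sizeof d); // no allocation needed
    return v;
  }
  double asDouble() const {
    double d;
    std::memcpy(&d, &raw, sizeof d);
    return d;
  }
};
#else
struct BoxedDouble { double value; };
struct Val {
  BoxedDouble* ptr; // every double temporary costs a heap allocation...
  static Val fromDouble(double d) { return Val{new BoxedDouble{d}}; }
  double asDouble() const { return ptr->value; } // ...plus an indirection to read it back
};
#endif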

original js: 9 ms

native w/ NaN box: 48 ms
wasm with NaN box: 71 ms

native w/ box doubles: 2677 ms
wasm with box doubles: 1794 ms

NaN-boxing *screams* versus heap-boxing. In part it’s the indirection, but I think it’s mostly having to allocate a hojillion objects and then clean them up, as every temporary result in calculations gets turned into a live object and then garbage-collected after a while. (Note that the native build on my MacBook Pro is slower than the wasm build with heap boxing, probably due to a different allocator and 32-bit vs 64-bit pointers!)

But do we really have to do all that temporary allocation? Most math operators only return a specific type (-, *, and / always return doubles; + can return a string depending on its inputs). For now the JS code is hand-translated to C++ using operator overloads, so letting those overloads accept and return plain doubles where possible means that temporaries which can only be doubles skip the type-checking and boxing/unboxing overhead entirely.
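
As a sketch (hypothetical names, with Val simplified down to a plain struct): since -, *, and / always produce numbers, the overloads can take and return raw doubles, so chained temporaries never re-enter the boxed representation.

struct Val {
  double num; // stand-in for the real boxed encoding
  double asDouble() const { return num; }
};

// '-', '*', '/' always return numbers in JS, so these can return double:
inline double operator*(const Val& a, const Val& b) { return a.asDouble() * b.asDouble(); }
inline double operator-(const Val& a, const Val& b) { return a.asDouble() - b.asDouble(); }
inline double operator*(double a, const Val& b) { return a * b.asDouble(); }
inline double operator-(double a, const Val& b) { return a - b.asDouble(); }

In an inner-loop expression like zr * zr - zi * zi, only the two multiplies touch Val at all; the subtraction is ordinary built-in double math, and nothing gets boxed until the final result is stored back into a variable. With that in place, the numbers improve across the board: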

original js: 9 ms

native w/ NaN box: 17 ms
wasm with NaN box: 28 ms

native w/ box doubles: 467 ms
wasm with box doubles: 248 ms

Boxing doubles on the heap is still really slow, but a lot less slow thanks to sidestepping it on many intermediate values… And with the 64-bit NaN-boxing we’re only about 3x slower than native node when running in wasm in node, a big improvement that I’m really happy with!

This makes me feel that 64-bit NaN boxing is the way to go; it’s only slightly more complex than tagging a pointer-sized word and it makes floating point math a ton faster when it’s needed, which sometimes it just is.

Floating point fun: iterating the number space

As a joke the other day I posted a code snippet on Mastodon:

for (var i = -Infinity; i < Infinity; i++) {
  /* this should cover all possibilities */
}

Such an infinite loop is not possible, as adding 1 to -Infinity just gives you -Infinity again, and you never get anywhere. ;)

But even if you started from the lowest representable finite double you wouldn’t actually iterate anywhere, because there’s not enough precision to represent the added 1.
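
A quick C++ illustration of that precision limit (not from the original post, just a sanity check):

#include <cassert>
#include <cfloat>

int main() {
  double x = -DBL_MAX;    // the lowest finite double
  assert(x + 1.0 == x);   // the added 1 vanishes in rounding, so i++ never advances
}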

Which got me thinking: how *would* you iterate over the entire set of floating point numbers from -Infinity to +Infinity, knowing that the intervals between the numbers are variable?

Well, if you want to get everything, including non-integral values, you can iterate over the available range of exponents and fractional values, and either bit-shift them together manually (ick!) or multiply/add/pow them together in floating point land.

Something like this:

function iterate_all_floats(callback) {
  // exponent is additively biased in the binary rep,
  // so we go from negative to positive vals.
  let min_exp = -(Math.pow(2, 10) - 2); // all binary 1s are reserved
  let max_exp = Math.pow(2, 10) - 1;

  let min_frac = 0;
  let max_frac = Math.pow(2, 52) - 1;
  let frac_divisor = Math.pow(2, 52);

  let min_sub = 0;
  let max_sub = Math.pow(2, 52) - 1;
  let sub_divisor = Math.pow(2, 52);

  callback(-Infinity);

  // Start from the biggest negative numbers and go up towards -0
  for (let exp = max_exp; exp >= min_exp; exp--) {
    for (let frac = max_frac; frac >= min_frac; frac--) {
      callback(-(1.0 + frac / frac_divisor) * Math.pow(2, exp));
    }
  }

  // Negative subnormals and -0
  for (let sub = max_sub; sub >= min_sub; sub--) {
    callback(-(sub / sub_divisor) * Math.pow(2, -1022));
  }

  // +0 and positive subnormals
  for (let sub = min_sub; sub <= max_sub; sub++) {
    callback((sub / sub_divisor) * Math.pow(2, -1022));
  }

  // Start from just past +0 and move up to the biggest numbers
  for (let exp = min_exp; exp <= max_exp; exp++) {
    for (let frac = min_frac; frac <= max_frac; frac++) {
      callback((1.0 + frac / frac_divisor) * Math.pow(2, exp));
    }
  }

  callback(Infinity);
}

iterate_all_floats(function(f) {
  // Convert to binary for display
  let floats = new Float64Array(1);
  let bytes = new Uint8Array(floats.buffer);
  let str = '';
  floats[0] = f;
  for (let i = 7; i >= 0; i--) { // all 8 bytes, most significant first (little-endian)
    let b = bytes[i];
    for (let bit = 7; bit >= 0; bit--) {
      let val = (b >> bit) & 1;
      str += val;
    }
  }
  console.log(str, f);
});

Enjoy!

JavaScript engine internals part 3: garbage collection

Another part of implementing a JavaScript-like engine to run inside a WebAssembly sandbox is memory management: JS requires garbage collection because objects may hold references to each other, creating cycles that simpler schemes like reference counting can’t break.

Mark and sweep

Most JS engines implement some form of mark-and-sweep garbage collection, usually enhanced with various fancy features we won’t get into here (moving/compaction, generational sweeps, etc).

The mark phase involves tracing your way through the object graph. Starting with globals and objects that are held in variables, you mark the object, setting a flag unique to that object. For any references it holds to other objects, you repeat — whenever you reach an object that’s already been marked you can skip over to the next one without re-tracing its references.

Any object that didn’t get marked is considered unreachable, and gets deleted in the sweep phase. This requires having some way to iterate through all allocated objects, which in my experimental engine I’ve temporarily implemented as a C++ std::unordered_set (a hash map with no payload) which new objects are always added to. This should eventually be replaced with a custom allocator that knows how to iterate through its own objects, or something. :)
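
Here’s a minimal sketch of that mark/sweep shape, assuming a hypothetical GCObject layout with a mark flag and a list of outgoing references (the real engine differs):

#include <unordered_set>
#include <vector>

struct GCObject {
  bool marked = false;
  std::vector<GCObject*> refs; // outgoing references to other objects
};

std::unordered_set<GCObject*> allObjects; // every new object registers here

void mark(GCObject* obj) {
  if (!obj || obj->marked) return; // already traced: skip re-tracing
  obj->marked = true;
  for (GCObject* ref : obj->refs) {
    mark(ref);
  }
}

void sweep() {
  for (auto it = allObjects.begin(); it != allObjects.end();) {
    if (!(*it)->marked) {
      delete *it;                 // unreachable: reclaim it
      it = allObjects.erase(it);
    } else {
      (*it)->marked = false;      // clear the flag for the next cycle
      ++it;
    }
  }
}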

The stack

Traditional C/C++ programs store some of their local variables (those that aren’t optimized into registers) in a region of memory known as the stack: a pointer to the stack top is incremented and decremented as functions run to provide storage space, and the same stack is used to store return addresses for function calls (a cause of many security problems).

Some garbage collection systems like Boehm libgc take advantage of this, and scan the stack looking for object references to keep alive during the mark phase. Clever!

The WebAssembly environment has stricter memory safety: byte-addressable “linear memory” has a limited range and you can only access local variables in the stack frame through compile-time-addressed, pre-verified indexes. There’s no way to access a calling function’s stack frames to check what it had saved in its local variables…

When using the emscripten compiler, there is a second, more C/C++ like stack maintained inside the linear memory for variables and data structures that can’t fit in WebAssembly locals (for instance structure types larger than a single word, or anything that has its address taken and a pointer passed somewhere). Theoretically we could scan through that looking for our pointer values, but there may be false positives and that could be trouble — you wouldn’t want to accidentally try to traverse the object references inside a pointer that was actually a random integer or something!

Stacks, scopes, and smart pointers

For now I’ve implemented my own secondary stack which holds only encoded JS values, and modeled JS local variable bindings as C++ pointers to the value on the stack. When a garbage collection happens, the known globals and anything referenced on the stack are marked and traversed, so objects considered in use stay alive.

The way the stack is managed in the C++ engine code is inspired by V8’s Handle & HandleScope/EscapableHandleScope system. It requires a little bit of explicit discipline, and could probably be further improved in terms of the classes making usage correct, but seems to work correctly.

At the beginning of each function, a Scope or ScopeRetVal object is instantiated:

void someFunc() {
  Scope scope;
  // do stuff
}

This saves the stack position, and restores it to its original state when the function exits — so you don’t have to manually remember to pop. That’ll free up any of your locally allocated values for garbage collection.

Any given variable binding is either a ‘Binding’ (an alias for Val*) to an existing, elsewhere-managed Val, or a ‘Local’ “smart pointer” object, which compiles down to a pointer but has a constructor that pushes a slot onto the stack.
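
A rough sketch of how Scope and Local could fit together (hypothetical layout — indexes instead of the real code’s raw pointers):

#include <cstdint>
#include <vector>

struct Val { uint64_t raw = 0; }; // an encoded JS value

std::vector<Val> gcStack; // the shadow stack the collector scans

class Scope {
  size_t savedDepth;
public:
  Scope() : savedDepth(gcStack.size()) {}
  ~Scope() { gcStack.resize(savedDepth); } // pops everything pushed below it
};

class Local {
  size_t index;
public:
  Local() : index(gcStack.size()) { gcStack.push_back(Val{}); } // push a rooted slot
  Val& operator*() const { return gcStack[index]; }
};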

Arguments can be passed as Local instances as well, constructed on the parent frame and eventually destructed with it, or if you know they’re GC-rooted already for sure you can pass a direct binding.

The big trick is return values, which I needed inspiration from V8 to figure out. If you try to return a Local, it’ll be cleaned up from your own scope before you return — and if you try to return a Val directly it won’t be GC-rooted and can disappear unexpectedly.

RetVal somefunc() {
  ScopeRetVal scope;
  // do stuff
  return scope.escape(new String("keep it safe!"));
}

Functions return a RetVal which doesn’t push, instead of a Local which does. A variant of the Scope, here ScopeRetVal (V8 uses the awesomer name EscapableHandleScope!) first reserves space on the parent scope for that return value. When you return, you have the scope insert the return val into the saved slot, and then the destructor restores the stack to the saved spot _after_ the return value, safely still alive.
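
Continuing the sketch above, the escape trick could look roughly like this (again hypothetical, simplified from the description):

struct RetVal { Val* val; };

class ScopeRetVal {
  size_t slotIndex;
  size_t savedDepth;
public:
  ScopeRetVal() {
    gcStack.push_back(Val{});      // reserve the return-value slot first
    slotIndex = gcStack.size() - 1;
    savedDepth = gcStack.size();   // locals pushed later sit above the slot
  }
  ~ScopeRetVal() { gcStack.resize(savedDepth); } // pops locals, keeps the slot
  RetVal escape(const Val& v) {
    gcStack[slotIndex] = v;        // copy into the protected slot below the save point
    return RetVal{&gcStack[slotIndex]};
  }
};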

Without this trick, return values could end up de-allocated during code like:

var retval = a() + b();

translated to C++ as

Local retval;
*retval = *(a->call(Null(), {})) + *(b->call(Null(), {}));

where the return value from a() gets de-allocated during an allocation inside the call to b().


JavaScript engine internals part 2: tagged pointers

In my last post I explored NaN-boxing in JavaScript engines, showing how Firefox and WebKit manage to squish multiple types, including pointers and floating point, into a single 64-bit word.

In my WebAssembly-based JavaScript-like engine experiment I initially implemented NaN boxing similarly to how Firefox/SpiderMonkey do it. It works nicely, but has some downsides:

  • Extra operations needed to unmask a pointer when dealing with lots of objects
  • In-memory values which are pointers have to be manually masked when peeking in the debugger. Kind of a pain.
  • There’s no equivalent aliasing system available in the proposal for native garbage-collected types for WebAssembly.

Inspired by the GC proposal, I’ve tried switching from the 64-bit NaN boxing to native pointer-size (32-bit for Wasm32) words with a minimal tagging system to store integers, which is equivalent to what is available there.

Tagged pointers

So what’s a pointer anyway? It’s a number that points (ha!) to a location in memory where you store stuff. This has a very important consequence:

Because stuff has to be laid out in memory in a way it can be efficiently accessed, structs and classes containing pointers are themselves always located on a pointer-size-aligned boundary. That is, if pointers are 32 bits wide (4 bytes), the structs have to be at a memory location that’s evenly divisible by 4, or it won’t be efficiently accessible.

This means that every pointer to every garbage-collected object is a number divisible by 4… which means its bottom 2 bits are always 0 in binary representation.

Thus, if your value has the bottom bit set, it’s definitely not a pointer to an object. Knowing that, we can divide the 32-bit number space into 32-bit object pointers on the one hand, and on the other hand either 1 tag bit with a 31-bit payload, or 2 tag bits with a 30-bit payload.

The Wasm GC proposal includes an “int31ref” type which can be stored in an “anyref” slot alongside other GC’d reference types, which is meant to use such a scheme.

Implementing tags

In the Val wrapper class in my engine, I’ve adopted this system now and it’s got some trade-offs, but I kinda like it.

The tag check is now a single bit-mask against 1 rather than a giant constant like 0xffff000000000000, which feels cleaner. Integers can be promoted back to 32 bits by right-shifting one bit, while pointers can be consumed as-is once the test is confirmed (or if you know for sure they’re safe).
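
In sketch form (hypothetical helpers; wasm32 pointers are 32 bits):

#include <cstdint>

struct Object;

inline uint32_t tagInt(int32_t i) { return (uint32_t(i) << 1) | 1; }    // bottom bit 1 = int31
inline bool isInt(uint32_t v) { return (v & 1) != 0; }                  // single bit-mask type check
inline int32_t untagInt(uint32_t v) { return int32_t(v) >> 1; }         // arithmetic shift restores sign

inline uint32_t tagPtr(Object* p) { return uint32_t(uintptr_t(p)); }    // bottom bits already 0
inline Object* untagPtr(uint32_t v) { return reinterpret_cast<Object*>(uintptr_t(v)); } // usable as-is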

The biggest downside is that there’s simply not room in 31 bits for very large integers (abs value > 1 billion-ish) or for floating point numbers (64 bits, ay ay!) so they must be boxed on the heap.

Boxing

“Boxing” is the fun task of taking a nice atomic value type and wrapping it in a giant heap-allocated object. Horribly inefficient if you use a lot of them!

But, it can be automatically done in the accessors, and seems to work so far.

I would prefer to avoid using boxing for floats, but there’s just no room in the int space for them unless I do something crazy like using two tag bits and then storing a 24-bit or 16-bit float when a value can be represented that way… yikes! And you’d still have to box values that are out of range or would lose precision.

For the other JS types — undefined, null, boolean, etc — I’m using boxed types as well, but using a single well-known instance value. So all “undefined” values refer to the same Box<Undefined>* and all “booleans” refer to one of two Box<bool>*s, one for true and one for false. This avoids runtime allocation for those special types, while allowing them to be treated as objects for dynamic dispatch or making the type checks cheap pointer equality comparisons for static dispatch.
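
Sketched out (hypothetical types, mirroring the description above):

template <class T> struct Box { T value; };
struct Undefined {};

static Box<Undefined> theUndefined;
static Box<bool> theTrue{true};
static Box<bool> theFalse{false};

inline void* boxBool(bool b) { return b ? (void*)&theTrue : (void*)&theFalse; } // no allocation
inline bool isUndefined(void* p) { return p == &theUndefined; }                 // cheap pointer equality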

JavaScript engine internals: NaN-boxing

In researching code plugin sandboxing with WebAssembly, my thoughts naturally turned to how to let people actually program for that environment without forcing use of low-level C++ or such… so naturally, now I’m writing a small JavaScript-like runtime engine and compiler that targets WebAssembly. ;)

My explorations so far are themselves written in C++, with tiny JS samples hand-translated into C++ code that calls the runtime’s classes. If I continue the project, I may switch to directly emitting Wasm source from the translator, with an eye towards future Wasm improvements like native garbage-collected reference types. (Right now I have to implement garbage collection myself, which will be the subject of another blog post!)

The first step of implementing a JavaScript engine is to implement a representation for values, which is tricky because JS values can be any of several distinct types:

  • undefined
  • null
  • boolean
  • number (float64)
  • object reference (string, Symbol, Object, etc)

Many other dynamic languages like PHP and Python similarly allow a value to hold any of several competing types. There are two main ways to implement such a value type.

Tagged unions

The first is to use a “tagged union” struct, something like this:

enum TypeTag {
  TAG_UNDEFINED,
  TAG_INT32,
  TAG_DOUBLE,
  ...
  TAG_OBJECT
};

struct Value {
  TypeTag tag;
  union {
    int32_t int32_val;
    double double_val;
    Object* object_val;
  };
};

Then add accessor/mutator functions which check the tag and perform appropriate accesses or conversions. This is simple, but has the downside of requiring 16 bytes per value (to maintain 8-byte alignment for the double-precision float, 64-bit pointer, or int64). That means most reads require two loads, writes require two stores, and you can’t reliably pass a value in a register. (WebAssembly’s local variables and function arguments are sort of like CPU registers in how they’re addressed, so they would be the equivalent.)
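
The accessors on top of that struct would look something like this (a sketch, not any particular engine’s API):

#include <limits>

inline bool isDouble(const Value& v) { return v.tag == TAG_DOUBLE; }

inline double asDouble(const Value& v) {
  if (v.tag == TAG_DOUBLE) return v.double_val;
  if (v.tag == TAG_INT32) return double(v.int32_val); // cheap widening conversion
  // a real engine would run the full ToNumber coercion here:
  return std::numeric_limits<double>::quiet_NaN();
}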

To get around these, a common trick in high-performance dynamic language runtimes like LuaJIT and all major JavaScript engines is known as NaN boxing.

NaN boxing the SpiderMonkey way

Floating point numbers are a complex beast! A 64-bit double-precision float can be broken down into several sets of bits:

  • 1 bit for sign
  • 11 bits for exponent part
  • 52 bits for binary fractional part

If the exponent bits are all 1, then you have a special value:

  • if the fractional bits are all 0, then it represents infinity (positive or negative, depending on the sign bit)
  • if any fractional bit is 1, then it represents “not a number” (NaN), used as a special signal that an operation was invalid or a non-value is being represented.
  • and oh by the way, regular operations only ever produce a canonical NaN value with just the topmost fractional bit set.

The interesting part is that means you have an entire number space within the range of representable NaN values that will never be produced by legit math operations on legit numbers!

  • 1 bit for sign (set to 1 for a signalling NaN that won’t be produced by math ops)
  • 11 bits for exponent (all set to 1)
  • 1 bit to force it to NaN rather than Infinity if rest is 0s
  • 51 bits for tag and payload

Firefox’s SpiderMonkey engine divides this up into 4 bits of tag (16 types) and 47 bits of address, which is enough for x86_64 (where bit 48 will always be 0 in user space and bits 49-64 aren’t needed … yet) but apparently causes problems on AArch64 and Sparc64. (They work around this by deliberately mapping memory in the lower 47 bits of address space with mmap.)

I’ve chosen to experiment with a similar technique, but only using 3 bits of tag (8 types) to leave a clean 48 bits of address. Though when building to WebAssembly, you only need 32 bits of address so it doesn’t make much difference. ;)

When reading a value of as-yet-unknown type, you mask off the top X bits and compare against some constant values (sketched in code after this list):

  • each type other than double-precision has a particular tag value, so you just == for it
  • actual double-precision values can be detected quickly by forcing the sign bit on and then doing a uint64 compare against a cutoff value: any legit double including canonical NaNs will be <= 0b1111111111111000’0000000000000000’0000000000000000’0000000000000000 once sign/signalling bit is set.
  • if there are multiple type tags for pointer types, you can put them at the high end of the tag range and do a uint64 compare against a cutoff value
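
Roughly, the double-detection test from the second bullet looks like this (constants illustrative, matching the 3-bit-tag layout I described rather than SpiderMonkey’s actual headers):

#include <cstdint>

constexpr uint64_t kSignBit = 0x8000000000000000ull;
constexpr uint64_t kSignedCanonicalNaN = 0xfff8000000000000ull;

inline bool isDouble(uint64_t bits) {
  // Force the sign bit on; every legit double, including the canonical
  // NaN, then compares <= the sign-forced canonical NaN pattern, while
  // boxed values carry extra payload bits and land above the cutoff.
  return (bits | kSignBit) <= kSignedCanonicalNaN;
}

inline unsigned tagOf(uint64_t bits) {
  return unsigned(bits >> 48) & 0x7; // 3 tag bits above a 48-bit payload
}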

One downside is that every object dereference has to be masked from the stored value, and typical JavaScript code dereferences a lot of objects. It’s a bit-op so it’s cheap, though… and in WebAssembly you can actually wrap the 64-bit value down to 32 bits without masking, since pointers are 32 bits. :)

JavaScriptCore style

WebKit’s JavaScriptCore uses a different technique on 64-bit native architecture.

When storing a double value, the bit pattern is treated as a uint64 and has a constant added to it to “rotate” the NaN patterns such that the top 1 bits in a range of the signaling NaNs turn into 0s.

  • object references have 0x0000 in the top 16 bits, meaning a 48-bit pointer can be read directly with a 64-bit read — no bit-masking penalty on dereference!
  • int32 values have 0xffff in the top 16 bits, making common integer operations easy with a mask
  • true/false/undefined/null are stored as special fake pointer values in the object pointer space
  • most double-precision float values sit in the 0x0001-0xfffe space, and have to subtract the constant to get back to their original bit pattern.
  • -Infinity has to be specially stored in the object space because the shift puts it in the wrong place

Sounds clever!
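
A rough sketch of that encoding (the offset constant is illustrative; JavaScriptCore’s real headers differ in detail):

#include <cstdint>
#include <cstring>

constexpr uint64_t kDoubleEncodeOffset = 0x0001000000000000ull;

inline uint64_t encodeDouble(double d) {
  uint64_t bits;
  std::memcpy(&bits, &d, sizeof d);
  return bits + kDoubleEncodeOffset; // rotates NaN patterns out of pointer space
}

inline double decodeDouble(uint64_t v) {
  uint64_t bits = v - kDoubleEncodeOffset; // undo the rotation
  double d;
  std::memcpy(&d, &bits, sizeof d);
  return d;
}

inline bool isPointer(uint64_t v) { return (v >> 48) == 0x0000; } // dereference directly
                                                                  // (the special true/false/undefined/null
                                                                  // values live in this space too)
inline bool isInt32(uint64_t v) { return (v >> 48) == 0xffff; }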

I haven’t checked what Chrome’s V8 and Edge’s ChakraCore do, but I believe they are similar to one or the other of the above.

Security considerations

The downside of all this magic bit-fiddling is that you have to be careful at the API boundaries between the JS world and the external world it’s connected to: canonicalize all incoming NaN values in doubles, or it would be possible to create arbitrary object pointer references from within the scripting language — a huge security hole.
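
The guard itself is small — something like this at every boundary where a double enters the engine:

#include <cmath>
#include <cstdint>
#include <cstring>

inline double canonicalizeNaN(double d) {
  if (std::isnan(d)) {
    uint64_t canonical = 0x7ff8000000000000ull; // the one blessed NaN pattern
    std::memcpy(&d, &canonical, sizeof d);      // crafted NaN payloads can't smuggle a pointer
  }
  return d;
}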

So, you know, be careful with that. ;)


Thoughts on WebAssembly as a plugin sandbox

I’ve been thinking about ways to run user-supplied untrusted code in the browser, with an eye towards things like interactive demo programs in Wikipedia articles and user interface & editor extensions. Just running JavaScript provided by the user is wildly unsafe — it can dig into your web page’s UI and submit server requests on your user’s behalf without permission, for instance — and sandboxing things into iframes can be a bit funky and hard to fully lock down.

Current web browsers have another language they can run, though, which is WebAssembly — and WebAssembly makes much stricter sandboxing guarantees:

  • Sandboxed code has literally no way to access memory or objects not provided to it
    • no DOM access
    • no network access
    • no eval()
  • Imported objects can only be numbers (read-only) or functions (call-only)
  • Only numbers can be passed as arguments or returned to/from foreign functions
  • Maximum memory usage can be set at compile time
  • Compiled modules can be cached offline in indexedDB and reloaded safely

This means there’s no way an arbitrary wasm binary can access your document.cookies or submit a form, unless you pass in a function that allows that.

This also means that unless you provide an API to the wasm code, it can’t actually do anything other than calculate numbers. It also means that whatever API you do provide becomes a security boundary — you must make sure that you don’t introduce a function that can be exploited contrary to the security guarantees you want to make!

There’s also at least one thing that wasm by itself doesn’t solve:

  • The halting problem — like JS code, wasm code can loop forever if it wants and there’s no way for the caller to interrupt it.

But wait — you can run a wasm binary in a Web Worker thread. And you can terminate Web Worker threads from the main thread! Depending on your API, if asynchronous messaging around the calls is workable this might be a good way to avoid hanging the main thread in the face of a misbehaving plugin.

So what’s the downside? Well, a few things to consider:

  • wasm APIs must be wrapped to idiomatically transfer strings or data buffers to and from your main JS.
  • Getting the emscripten compiler to produce “bare wasm” from C/C++ without assuming some JS imports are available seems tricky.
  • C and C++ may not be the friendliest languages for plugin writers on text/GUI-heavy systems!
  • Note rustc also has wasm compilation support, but the same concern applies.
  • Any other language runtimes/libraries you use have to be built into the wasm binary… unless you rig up some kind of cross-module linking and ship multiple modules which link together, which seems super hard right now.

Something to consider for later. :D


emscripten fun: porting libffi to WebAssembly part 1

I have a silly dream of seeing graphical Linux/FOSS programs running portably on any browser-based system outside of app store constraints. Two of my weekend side projects are working in this direction: an x86 emulator core to load and run ELF binaries, and better emscripten cross-compilation support for the GTK+ stack.

Emulation ideas

An x86 emulator written in WebAssembly could run pre-built Linux binaries, meaning in theory you could make an automated packager for anything in a popular software repository.

But even if you did all the hard work of making a process-level emulator work, and hooked up the Linux-level libraries to emulated devices for i/o and rendering, there would be some big performance implications, and you’d probably also be bundling lots of library code you don’t need at runtime.

Instruction decoding and dispatch will be slow, much slower than native. And it looks pretty complex to do JIT-ing of traces. While I think it could be made to work in principle, I don’t think it’ll ever give a satisfactory user experience.

Cross-compilation

Since we’ve got the source of Linux/FOSS programs by definition, cross-compiling them directly to WebAssembly will give far better performance!

In theory even something from the GNOME stack would work, given an emscripten-specific gdk backend rendering to a WebGL canvas just like games use SDL2 or similar to wrap i/o.

But long before we can get to that, there are low-level library dependencies.

Let’s start with glib, which implements a bunch of runtime functions and the GObject type system, used throughout the stack.

Glib needs libffi, a library for calling functions with at-runtime call signatures and creating closure functions which enclose a state variable around a callback.

In other words, libffi needs to do things that you cannot do in standard C, because it needs system-specific information about how function arguments and return values are transferred (part of the ABI, application binary interface). And to top it off, in many cases (emscripten included) you still can’t do it in C, because asm.js and WebAssembly provide no way to make a call with an arbitrary argument list. So, like a new binary platform, libffi must be ported…

It seems to be doable by bumping up to JavaScript, where you can construct an array of mixed-type arguments and use Function.prototype.apply to call the target function. Using an EM_ASM_ block in my shiny new wasm32/ffi.c I was able to write a JavaScript implementation of the guts of ffi_call which works for int, float, and double parameters (still have to implement 64-bit ints and structs).
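
The shape of it is something like this toy version — not the real wasm32/ffi.c; it assumes the emscripten glue exposes the indirect function table as wasmTable, and it handles only int32 args:

#include <emscripten.h>

int ffi_call_ints(int fn_index, int* args, int nargs) {
  return EM_ASM_INT({
    var fn = wasmTable.get($0); // resolve the "pointer" to a callable function
    var argv = [];
    for (var i = 0; i < $2; i++) {
      argv.push(HEAP32[($1 >> 2) + i]); // read each int32 arg from linear memory
    }
    return fn.apply(null, argv); // arbitrary arity, impossible in plain C
  }, fn_index, args, nargs);
}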

The second part of libffi is the closure creation API, which I think requires creating a closure function in the JavaScript side, inserting it into the module’s function tables, and then returning the appropriate index as its address. This should be doable, but I haven’t started yet.

Emscripten function pointers

There are two output targets for the emscripten compiler: asm.js JavaScript and WebAssembly. They have similar capabilities and the wrapper JS is much the same in both, but there are some differences in implementation and internals as well as the code format.

One is in function tables for indirect calls. In both cases, the low-level asm.js/WASM code can’t store the actual pointer address of a function, so they use an index into a table of functions. Any function whose address is taken at compile time is added to the table, and its index used as the address. Then, when an indirect call through a “function pointer” is made, the pointer is used as the index into the function table, and an actual function call is made on it. Voila!

In asm.js, there are lots of Weird Things done about type consistency to make the JavaScript compilers fast. One is that the JS compiler gets confused if you have an array of functions that _contain different signatures_, making indirect calls run through slower paths in final compiled code. So for each distinct function signature (“returns void” or “returns int, called with float32” etc) there was a separate array. This also means that function pointers have meaning only with their exact function signature — if you take a pointer to a function with a parameter and call it without one, it could end up calling an entirely different function at runtime because that index has a different meaning in that signature!
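
A tiny example of why that’s dangerous (indexes hypothetical):

int add_one(int x) { return x + 1; }
float halve(float x) { return x * 0.5f; }

int main() {
  int (*ip)(int) = add_one;    // say, index 0 in the "ii" table
  float (*fp)(float) = halve;  // and index 0 in the "ff" table
  // Casting ip to float(*)(float) and calling it wouldn't trap under
  // asm.js rules; it would index a different signature table and could
  // invoke a completely unrelated function.
  return ip(1) + (int)fp(4.0f);
}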

In WebAssembly, this is handled differently. Signatures are encoded at the call site in the call_indirect opcode, so no type inference needs to be done.

But.

At least currently, the asm.js-style table separation is still being used, with the multiple tables encoded into the single WebAssembly table with a compile-time-known constant offset.

In both cases, the JS side can do indirect calls by calculating the signature string (“vf” for “void return, float32 arg” etc) and calling the appropriate “dynCall_vf” etc method, passing first the pointer and then the rest of the argument list. On asm.js this will look up in the tables directly; on WASM it’ll apply the index.


It’s possible that emscripten will change the WASM mode to use a single array without the constant offset indirection. This will simplify lookups, and I think make it easier to add more functions at runtime.

Because you see, if you want to add a callback at runtime, like libffi’s closure API wants to, then you need to add another entry to that table. And in asm.js the table sizes are fixed for asm.js validation rules, and in WASM current mode the sub-tables are definitely fixed at compile time, since those constant offsets are used throughout.

So currently there’s an option you can use at build time to reserve room for runtime function pointers. I think I’ll have to use it, but it only reserves *fixed space* for a given number of pointers.

Next

Coming up next time: int64 and struct in the emscripten ABI, and does the closure API work as expected?

String concatenation garbage collection madness!

We got a report of a bug with the new 3D model viewing extension on Wikimedia Commons, where a particular file wasn’t rendering a thumbnail due to an out-of-memory condition. The file was kind-of big (73 MiB) but not super huge, and should have been well within the memory limits of the renderer process.

On investigation, it turned out to be a problem with how three.js’s STLLoader class was parsing the ASCII variant of the file format:

  • First, the file is loaded as a binary ArrayBuffer
  • Then, the buffer is checked to see whether it contains binary or text-format data
  • If it’s text, the entire buffer is converted to a string for further processing

That conversion step had code that looked roughly like this:

var str = '';
for (var i = 0; i < arr.length; i++) {
    str += String.fromCharCode(arr[i]);
}
return str;

Pretty straightforward code, right? Appends one character to the string until the input binary array is out, then returns it.

Well, JavaScript strings are actually immutable — the “+=” operator is just shorthand for “str = str + …”. This means that on every step through the loop, we create two new strings: one for the character, and a second for the concatenation of the previous string with the new character.

The JavaScript virtual machine’s automatic garbage collection is supposed to magically de-allocate the intermediate strings once they’re no longer referenced (at some point after the next run through the loop) but for some reason this isn’t happening in Node.js. So when we run through this loop 70-some million times, we get a LOT of intermediate strings still in memory and eventually the VM just dies.

Remember this is before any of the 3d processing — we’re just copying bytes from a binary array to a string, and killing the VM with that. What!?

Newer versions of the STLLoader use a more efficient path through the browser’s TextDecoder API, which we can polyfill in node using Buffer, making it blazing fast and memory-efficient… this seems to fix the thumbnailing for this file in my local testing.

Just for fun though I thought, what would it take to get it working in Node or Chrome without the fancy native API helpers? Turns out you can significantly reduce the memory usage of this conversion just by switching the order of operations….

The original append code results in operations like: (((a + b) + c) + d) which increases the size of the left operand linearly as we go along.

If we instead do it like ((a + b) + (c + d)) we’ll increase _both_ sides more slowly, leading to much smaller intermediate strings clogging up the heap.

Something like this, with a sort of binary bisection thingy:

function do_clever(arr, start, end) {
    if (start === end) {
        return '';
    } else if (start + 1 === end) {
        return String.fromCharCode(arr[start]);
    } else {
        var mid = start + Math.floor((end - start) / 2);
        return do_clever(arr, start, mid) +
               do_clever(arr, mid, end);
    }
}

return do_clever(arr, 0, arr.length);

Compared to the naive linear append, I’m able to run through the 73 MiB file in Node, and it’s a bit faster too.

But it turns out there’s not much reason to use that code — most browsers have native TextDecoder (even faster) and Node can fake it with another native API, and those that don’t are Edge and IE, which have a special optimization for appending to strings.

Yes that’s right, Edge 16 and IE 11 actually handle the linear append case significantly faster than the clever version! It’s still not _fast_, with a noticeable delay of a couple seconds on IE especially, but it works.

So once the thumbnail fix goes live, that file should work both in the Node thumbnailer service *and* in browsers with native TextDecoder *and* in Edge and IE 11. Yay!

ogv.js 1.5.7 released

I’ve released ogv.js 1.5.7 with performance boosts for VP8/VP9 video (especially on IE 11) and Opus audio decoding, and a fix for audio/webm seeking.

npm: https://www.npmjs.com/package/ogv
zip: https://github.com/brion/ogv.js/releases/tag/1.5.7