Wasm RPC thoughts

I spent far too much time lately going down a rabbit hole thinking about how to do a safe, reasonably efficient, reasonably sane API for cross-WebAssembly-module function calls and data transfer… this would allow a Wasm or JS “app” to interact with Wasm “plugins” with certain guarantees about the plugin’s ability to access host memory and functions.

Ended up putting together a fake “slide deck” to explain it to myself, which y’all are welcome to enjoy below. :)

I’ll keep this going in the background while I’m catching up on other stuff, then get it back up and running with my plugin research later.

SIMD in WebAssembly – tales from the bleeding edge

While benchmarking the AV1 and VP9 video decoders in ogv.js, I bemoaned the lack of SIMD vector operations in WebAssembly. Native builds of these decoders lean heavily on SIMD (AVX and SSE for x86, Neon for ARM, etc) to perform operations on 8 or 16 pixels at once… Turns out there has been movement on the WebAssembly SIMD proposal after all!

Chrome’s V8 engine has implemented it (warning: somewhat buggy still), and the upstream LLVM Wasm code generator will generate code for it using clang’s vector operations and some intrinsic functions.

emscripten setup

The first step in your SIMD journey is to set up your emscripten development environment for the upstream compiler backend. Install emsdk via git — or update it if you’ve got an old copy.

Be sure to update the tags as well:

./emsdk update-tags

If you’re on Linux you can download a binary installation, but there’s a bug in emsdk that will cause it not to update. (Update: this was fixed a few days ago, so make sure to update your emsdk!)

./emsdk install latest-upstream
./emsdk activate latest-upstream

On Mac or Windows, or if you want the latest upstream source on purpose, you can have it build the various tools from source. There’s not a convenient “sdk” catch-all tag for this that I can see, so you may need to call out all the bits:

./emsdk install emscripten-incoming-64bit
./emsdk activate emscripten-incoming-64bit
./emsdk install upstream-clang-master-64bit
./emsdk activate upstream-clang-master-64bit
./emsdk install binaryen-master-64bit
./emsdk activate binaryen-master-64bit

The first build may take a couple of hours or so, depending on your machine.

Re-running the install steps will update the git checkouts and re-build, which usually doesn’t take as long as a fresh build but can still take some time.

Upstream backend differences

Be warned that with the upstream backend, emscripten cannot build asm.js, only WebAssembly. If you’re making mixed JS & WebAssembly builds this may complicate your build process, because you have to switch back and forth between backends.

You can switch back to the current fastcomp backend at any time by swapping your sdk state back:

./emsdk install latest
./emsdk activate latest

Note that every time you switch, the cached libc will get rebuilt on your next emcc invocation.

Currently there are some code-gen issues where the upstream backend produces more local variables and such than the older fastcomp backend, which can cause a slight slowdown of a few percent (and, for me, a bigger slowdown in Safari, which does particularly well with the old compiler’s output). This is being actively worked on, and is expected to improve significantly soon.

Starting Chrome

Now you’ve got a compiler; you’ll also need a browser to run your code in. Chrome’s V8 includes support behind an experimental runtime flag; currently it’s not exposed to the user interface so you must pass it on the command line.

I recommend using Chrome Canary on Mac or Windows, or a nightly Chromium build on Linux, to make sure you’ve got any fixes that may have come in recently.

On Mac, one can start it like so:

/Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --js-flags="--experimental-wasm-simd"

If you forget the command-line flag or get it wrong, module compilation won’t validate the SIMD instructions and will throw an exception, so you can’t run it by mistake. (This is also apparently how you’re meant to test for SIMD support, AFAIK… compile a module and see if it works?)
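For what it’s worth, a rough sketch of that detection-by-compilation approach might look like this; the file name is hypothetical, and any tiny module containing a v128 instruction would do:

async function detectWasmSimd() {
    try {
        // "simd-test.wasm" is a placeholder: a minimal module using v128.
        const response = await fetch('simd-test.wasm');
        const bytes = await response.arrayBuffer();
        await WebAssembly.compile(bytes);
        return true;
    } catch (e) {
        // Compilation fails validation when the SIMD flag isn't enabled.
        return false;
    }
}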

Beware that there are a few serious bugs in the V8 implementation which may trip you up. In particular, watch out for the broken splat operation, which can produce non-deterministic errors. For this reason I recommend disabling autovectorization for now, since you have no control over workarounds on the clang end. Non-constant bit shifts also fail to validate, requiring a scalar workaround.

Vector ops in clang

If you’re not working in C/C++ but are implementing your own Wasm compiler or hand-writing Wasm source, skip this section! ;) You’ll want to check out the SIMD proposal documentation for a list of available instructions.

First, forget anything you may have heard about emscripten including SSE compatibility headers (xmmintrin.h etc). They’ve been recently removed as they’re broken and misleading.

There’s also emscripten/vector.h, but it seems obsolete as well, with references to functions from the old SIMD.js implementation that no longer exist, so I recommend avoiding it for now.

The good news is, a bunch of vector stuff “just works” using standard clang syntax, and there are a few intrinsic functions for particular operations like the bitselect instruction and vector shuffling.

First, take some plain ol’ code and compile it with SIMD enabled:

emcc -o foo.html -O3 -s SIMD=1 -fno-vectorize foo.c

It’s important for now to disable autovectorization since it tends to break on the V8 splat bug for me. In the future, you’ll want to leave it on to squeeze out the occasional performance increase without manual intervention.

If you’re not using headers which predefine standard vector types for you, you can create some convenient aliases like so (these are just the types I’ve used so far):

typedef int16_t int16x8 __attribute__((vector_size(16)));

typedef uint8_t uint8x16 __attribute__((vector_size(16)));
typedef uint16_t uint16x8 __attribute__((vector_size(16)));
typedef uint32_t uint32x4 __attribute__((vector_size(16)));
typedef uint64_t uint64x2 __attribute__((vector_size(16)));

The expected float and signed/unsigned integer interpretations for 128 bits are available, and you can freely cast between them to reinterpret bit sizes.

To work around bugs in the “splat” operation that expands a scalar to a vector, I made inline helper functions for myself:

static volatile int junk = 0;

static inline int16x8 splat_const(const int16_t val) {
    // Use this only on constants due to the v8 splat bug!
    return (int16x8){
        val, val, val, val,
        val, val, val, val
    };
}

static inline int16x8 splat_vec(const int16_t val) {
    // Try to work around issues with broken splat in V8
    // by forcing the input to be something that won't be reused.
    const int guarded = val + junk;
    return (int16x8){
        guarded, guarded, guarded, guarded,
        guarded, guarded, guarded, guarded
    };
}

Once the bug is fixed, it’ll be safe to remove the ‘junk’ and ‘guarded’ bits and use a single splat helper function. Though surely there’s got to be a cleaner way to do a splat than manually writing out all the lanes and having the compiler coalesce them into a single splat operation? o_O

The bitselect operation is also frequently necessary, especially since convenient operations like min, max, and abs aren’t available on integer vectors. You might or might not be able to do this some cleaner way without the builtin, but this seems to work:

static inline int16x8 select_vec(const int16x8 cond,
                                 const int16x8 a,
                                 const int16x8 b) {
    return (int16x8)__builtin_wasm_bitselect(a, b, cond);
}

Note that the order of parameters on the bitselect instruction and the builtin has the condition last — I found this maddening, so my helper function has the condition first, where my code likes it.

You can now make your own vector abs function:

static inline int16x8 abs_vec(const int16x8 v) {
    return select_vec(v < splat_const(0), -v, v);
}

Note that the < and - operators “just work” on the vectors; we only needed helper functions for the bitselect and the splat. And I’m still not confident I even need those two?

You’ll also likely need the vector shuffle operation, which there’s a standard builtin for. For instance here I’m deinterleaving 16-bit pixels into 8-bit pixels and the extra high bytes:

static inline uint8x16 merge_pixels(const int16x8 work) {
    return (uint8x16)__builtin_shufflevector((uint8x16)work, (uint8x16)work,
        0, 2, 4, 6, 8, 10, 12, 14, // the 8 pixels we worked on
        1, 3, 5, 7, 9, 11, 13, 15  // zeroes we don't need
    );
}

Checking compiler output

To figure out what’s going on it helps a lot to disassemble the WebAssembly output to confirm what instructions are actually emitted — especially if you’re hoping to report a bug. Refer to the SIMD proposal for details of instructions and types used.

If you compile with -g your .wasm output will include function names, which make it much easier to read the disassembly!

Use wasm-dis like so:

wasm-dis foo.wasm > foo.wat

Load up the .wat in your code editor of choice (there are syntax highlighting plugins available for VS Code and I think Atom etc) and search for your function in the mountain of stuff.

Note in particular that bit-shift operations currently can produce big sequences of lane-shuffling and scalar bit-shifts. This is due to the LLVM compiler working around a V8 bug with bit-shifts, and will be fixed soon I hope.

If you wish, you can modify the Wasm source in the .wat file and re-assemble it to test subtle changes — use wasm-as for this.
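For example, re-assembling the edited text back into a module looks like:

wasm-as foo.wat -o foo.wasm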

Reporting bugs

You probably will encounter bugs — this is very bleeding-edge stuff! The folks working on it want your feedback if you’re working in this area, so please make the most of it by providing reproducible test cases for any bugs you encounter that aren’t already chalked up to the existing splat argument corruption and non-constant shift bugs.

And beware that until the splat bug is fixed, non-deterministic problems can pop up really easily.


ogv.js 1.6.0 released with experimental AV1 decoding

After some additional fixes and experiments I’ve tagged ogv.js 1.6.0 and released it. As usual you can use the ‘ogv’ package on npm or fetch the zip manually. This includes various fixes (including for some weird bugs!) and performance improvements on lower-end machines. Internals have been partially refactored to aid future maintenance, and experimental AV1 decoding has been added using VideoLAN’s dav1d decoder.

dav1d and SIMD

The dav1d AV1 decoder is now working pretty solidly, but slowly. I found that my test files were encoded at too high a quality; dialing them back to my actual target bitrate improved performance as a consequence, so hey! Not bad. ;)

I’ve worked around a minor compiler issue in emscripten’s old “fastcomp” asm.js->wasm backend where an inner loop didn’t get unrolled, which improves decode performance by a couple percent. Upstream prefers to let the unroll be implicit, so I’m keeping this patch in a local fork for now.

Some folks working on the WebAssembly SIMD proposal have also reached out to me; it should allow speeding up some of the slow filtering operations with optimized vector code! The only browser implementation of the proposal (which remains a bit controversial) is currently Chrome, behind an experimental command-line flag, and the updated vectorization code is in the new WebAssembly compiler backend that’s integrated with upstream LLVM.

So I spent some time getting up and running on the new LLVM backend for emscripten, and found a few issues:

  • emsdk doesn’t update the LLVM download properly so you can get stuck on an old version and be very confused — this is being fixed shortly!
  • currently it’s hard to use a single sdk installation for both modes at once, and asm.js compilation requires the old backend. So I’ve temporarily disabled the asm.js builds on my simd2 work branch.
  • multithreaded builds are broken at the moment (at least modularized ones, which only just got fixed in the main compiler, so the LLVM backend may need similar fixes)
  • use of atomics intrinsics in a non-multithreaded build results in a validation error, whereas it had been silently turned into something safe in the old backend. I had to patch dav1d with a “fake atomics” option to #define them away.
  • Non-SIMD builds were coming out with data corruption, which I tracked down to an optimizer bug which had just been fixed upstream the day before I reported it. ;)
  • I haven’t gotten as far as working with any of the SIMD intrinsics, because I’m getting a memory access out of bounds issue when engaging the autovectorizer. I narrowed down a test case with the first error and have reported it; not sure yet whether the problem is in the compiler or in Chrome/V8.

In theory autovectorization is likely not to do much, but significant gains could be made using intrinsics… though only so much, as the operations available are limited and it’s hard to tell what will be efficient or not.

Intermittent file breakages

Very rarely, some files would just break at a certain position in the file for reasons I couldn’t explain. I had one turn up during AV1 testing where a particular video packet that contained two frame OBUs had one OBU header appear ok and the second obviously corrupted. I tracked the corruption back from the codec to the demuxer to the demuxer’s input buffer to my StreamFile abstraction used for loading data from a seekable URL.

It turned out that the offending packet straddled a boundary between HTTP requests — between the second and third megabytes of the file, each requested as a separate Range-based XMLHttpRequest and downloaded as a binary string so the data can be accessed during progress events. According to the network panel, the second and third megabytes looked fine… but the *following* request turned up as 512 KiB. …What?

Dumping the binary strings of the second and third megabytes, I immediately realized what was wrong:

Enjoy some tasty binary strings!

The first requests were, as expected, showing 8-bit characters (ASCII and control chars etc). The request with the broken packet was showing CJK characters, indicating the string had probably been misinterpreted as UTF-16.

It didn’t take much longer to confirm that the first two bytes of the broken request were 0xFE 0xFF, a UTF-16 Byte Order Mark. This apparently overrides the “overrideMimeType” method’s x-user-defined charset, and there’s no way to override it back. Hypothetically you could probably detect the case and swap bytes back but I think it’s not actually worth it to do full streaming downloads within chunks for the player — it’s better to buffer ahead so you can play reliably.

For now I’ve switched it to use ArrayBuffer XHRs instead of binary strings, which avoids the encoding problem but means data can’t be accessed until each chunk has finished downloading.
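For illustration, a minimal sketch of what such an ArrayBuffer-based chunk request looks like; names here are illustrative, not the actual StreamFile internals:

function fetchChunk(url, start, end, onDone) {
    const xhr = new XMLHttpRequest();
    xhr.open('GET', url);
    // An arraybuffer response sidesteps the charset sniffing entirely...
    xhr.responseType = 'arraybuffer';
    xhr.setRequestHeader('Range', 'bytes=' + start + '-' + end);
    // ...but the data only becomes available once the whole chunk is done.
    xhr.onload = () => onDone(new Uint8Array(xhr.response));
    xhr.send();
}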

ogv.js experimental AV1 decoding

The upcoming ogv.js 1.6.0 release will be the first to include experimental AV1 support, using the dav1d decoder. Thanks to ePirat for the initial work in emscripten cross-compiling the dav1d codebase!

Performance is not very great, but may improve a bit in future from optimizations and potentially a lot from new platform features that may come to WebAssembly in the future.

In particular, on Internet Explorer, which lacks WebAssembly, performance is very poor, but it does work at very low resolutions on reasonably fast machines.

On my 2015 MacBook Pro (3.1 GHz 5th-gen Core i7), I can get somewhere between 360p and 480p on the “Caminandes – Llamigos” demo in Safari, while the current VP9 codec gives me 720p.

Safari has a great WebAssembly engine, giving 720p for VP9 or a solid 360p for AV1. 480p AV1 would be achievable with threading.

In IE 11, high-motion scenes in AV1 top out the CPU at only 120p, while VP9 gets away with 240p or so.

IE 11 runs several resolution steps lower, limited by its slow JavaScript engine. It will never get faster; we can only hope it will gradually be replaced.

Multithreaded WebAssembly builds are also included, thanks to emscripten fixing support for modularized threaded programs in 1.38.27. These however do not work in Safari because it has not yet added back SharedArrayBuffer support after it was removed as part of Spectre mitigations.

You can test the threaded builds in Chrome and Firefox with suitable flags enabled (“Wasm threading” for Chrome and “shared memory” for Firefox). VP9 scales well to 2 or 4 threads depending on the resolution, and AV1 scales to 2-ish threads. I’ll continue to tune and work on this for the future day when Safari supports threading.

Another area where WebAssembly doesn’t perform well is the lack of SIMD instructions — in many places there are tight loops of arithmetic that can be parallelized with vector computation, and native builds of the decoders make extensive use of SIMD. There is some experimental support in some browsers and emscripten, but I’m not sure how well they talk to each other or how finalized the standard is, so I haven’t tried it.

(It’s conceivable that browser engines could auto-vectorize tight loops in WebAssembly but they would probably be limited to 32-bit arithmetic at best, which wouldn’t parallelize as much as things that can work with 16-bit or 8-bit wide lanes.)

A peek at ogv.js updates

Between other projects, I’ve done a little maintenance on our ogv.js WebM player shim for Safari and IE. This should go out next week as 1.6.0 and should remain compatible with 1.5.x but has a lot of internals refactoring.

Experimental features

This release includes an experimental AV1 video decoder using the dav1d library, based on ePirat’s initial work getting it to build with emscripten. This is an initial test, and needs to be brought back up to date with upstream, optimized, etc.

Also included are multithreaded VP8 and VP9 decoders, brought back from the past thanks to emscripten landing updated support and browsers getting closer to re-enabling SharedArrayBuffer. Performance on 2-core and 4+-core machines is encouraging in Firefox and Chrome with flags enabled, but cannot yet be tested in Safari.

SD videos scale over 2 cores reasonably well, with a solid 50% or more decoding speed boost visible. HD can scale over 4 cores, with about 200% speed boost.

Threading must be manually opted in. The AV1 decoder is not yet threaded.

Low-end performance work

In Internet Explorer 11, performance is weaker across the board versus other browsers because its JS engine is an old, frozen version that’s not been optimized for asm.js or WebAssembly. In addition, machines where people are running IE 11 are probably more likely to be older, slower machines.

I did some optimization work which greatly improves perceived playback performance of a 120p WebM VP9/Opus file on my slowest test machine, a 1.67 GHz Atom netbook tablet from 2012 or so.

  • More aggressively selecting software YCbCr-RGB conversion when WebGL is going to be worse
  • Microoptimizations to conversion and extraction of YCbCr data from heap
  • Replaced timer-based audio packet consolidation with buffer-size-based one
  • Tuned audio communication with Flash to reduce calls, per-call overhead
  • Increased number of buffered audio and video frames to survive short pauses better
  • Fix for dropping of individual frames when an adjacent decoded frame is available

I don’t think there’s a lot left to be done in optimizing VP9 decoding for IE; the largest components in profiling at low resolutions are now things like vpx_convolve8_c, which really would benefit from integer multiplication and just not being so darn slow… I tried a replacement function, micro-optimized to factor out common additions and bit-shifts in the generated emscripten code, but it didn’t seem to make any difference; either that’s the one bit the IE optimizer catches anyway, or the bit-shifts are so cheap compared to the memory accesses and multiplications that it makes no measurable difference. :P

Ah well! This is why I started producing files down to 120p, to handle the super-slow cases.

What is important is making sure that it plays audio smoothly, which it’s now better at than before. And in production we’ll probably add a notice for slow decoding recommending picking another browser…

Refactoring

I’ve also started refactoring work in preparation for future changes to support MSE-style buffering, needed for adaptive bitrate streaming.

Instead of a hodge-podge of closure functions and local variables, I’ve transitioned to using ES6 modules and classes, with a babel transform step for ES5 compatibility.

This helps distinguish retained state vs local variables, makes it easier to find state when debugging, and generally makes things easier to work with in the source. More cleanup still needs to be done in the larger processing functions and the various state vars, but it’s an improvement so far.

EmbedScript 2019

Based on my decision to stick with browser-based sandboxing instead of pushing on with ScriptinScript, I’ve started writing up notes reviving my old idea for an EmbedScript extension for MediaWiki. It’ll use the <iframe> sandbox and Content-Security-Policy to run widgets in article content (with a visible/interactive HTML area) and plugins (headless) for use in UI extensions backed by trusted host APIs which let the plugin perform limited actions with the user’s permission.

There are many details yet to be worked out, and I’ll keep updating that page for a while before I get to coding. In particular I want to confirm things like the proper CSP headers to prevent cross-origin network access (pretty sure I got it, but must test) and how to perform the equivalent sandboxing in a web view on mobile platforms! Ensuring that the sandbox is secure in a browser before loading code is important as well — older browsers may not support all the sandboxing needed.

I expect to iterate further on the widgets/plugins/host APIs model, probably to include a concept of composable libraries and access to data resources (images, fonts, audio/video files, even data from the wiki, etc).

The widget, plugin, and host API definitions will need to be stored and editable on-wiki — like a jsfiddle.net “fiddle”, editing can present them as 4-up windows of HTML, CSS, JS, and output — but with additional metadata for dependencies and localizable strings. I hope to use MediaWiki’s new “multi-content revisions” system to store the multiple components as separate content pieces of a single wiki page, versioned together.

Making sure that content widgets can be fairly easily ported to/from non-MediaWiki platforms would be really wise though. A git repo adapter? A single-file .html exporter? Embedding the content as offsite-embeddable iframes as well, without the host page having API or data access? Many possibilities. Is there prior art in this area?

Also need to work out the best way to instantiate objects in pages from the markup end. I’d like for widgets in article pages to act like media files in that you can create them with the [[File:blah]] syntax, size them, add borders and captions, etc. But it needs to integrate well with the multimedia viewer zoom view, etc (something I still need to fix for video too!)… and exposing them as a file upload seems weird. And what about pushing one-off data parameters in? Desirable or extra complication?

Anyway, check the on-wiki notes and feel free to poke me with questions, concerns, and ideas y’all!

Defender of the Realm

I showed some iterations of ScriptinScript’s proposed object value representation, using native JS objects with a custom prototype chain to isolate the guest world’s JS objects. The more I looked, the more corner cases I saw… I thought of the long list of security issues with the old Caja transpiling embedding system, and decided it would be best to change course.

Not only are there a lot of things to get right to avoid leaking host objects, it’s simply a lot of work to create a mostly spec-compliant JavaScript implementation, and then to maintain it. Instead I plan to let the host JavaScript implementation run almost the entire show, using realms.

What’s a Realm?

Astute readers may have clicked on that link and noticed that the ECMAScript committee’s realms proposal is still experimental, with no real implementations yet… But realms are actually a part of JS already; there’s just no standard way to manipulate them! Every function is associated with a realm that it runs in, which holds the global object and the intrinsic objects we take for granted — say, Object. Each realm has its own instance of each of these intrinsics, so if an object from one realm does make its way to another realm, their prototype chains will compare differently.
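You can see this separation today with a plain same-origin <iframe>, which gets its own realm and its own copies of the intrinsics:

// Quick sketch: each browsing context is a separate realm.
const frame = document.createElement('iframe');
document.body.appendChild(frame);

const FrameObject = frame.contentWindow.Object;
console.log(FrameObject === Object);              // false: different intrinsic
console.log(new FrameObject() instanceof Object); // false: different prototype chain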

That sounds like what we were manually setting up last time, right? The difference is that when native host operations happen (throwing exceptions in a built-in function, auto-boxing a primitive value to an object, etc.), the created Error or String instance will have the realm-specific prototype without us having to guard for it and switch it around.

If we have a separate realm for the guest environment, then there are a lot fewer places we have to guard against getting host objects.

Getting a realm

There are a few possible ways we can manage to get ahold of a separate realm for our guest code:

  • Un-sandboxed <iframe>
  • Sandboxed <iframe>
  • Web Worker thread
  • ‘vm’ module for Node.js

It should be possible to combine some of these techniques, such as using the future-native Realm inside a Worker inside a sandboxed iframe, which can be further locked down with Content-Security-Policy headers!

Note that using sandboxed or cross-origin <iframe>s or Workers requires asynchronous messaging between host and guest, but is much safer than Realm or same-origin <iframe> because they prevent all object leakage.
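As a very rough sketch of the sandboxed-iframe flavor (the URL and message shapes are made up for illustration):

// Host page: spawn a sandboxed iframe to hold the guest realm.
const frame = document.createElement('iframe');
frame.setAttribute('sandbox', 'allow-scripts'); // scripts, but no same-origin powers
frame.src = 'https://sandbox.example/guest-frame.html'; // hypothetical guest loader
document.body.appendChild(frame);

const guestSource = 'parent.postMessage({ type: "hello" }, "*");'; // placeholder guest code

frame.addEventListener('load', () => {
    // Sandboxed frames have an opaque origin, so target '*' here.
    frame.contentWindow.postMessage({ type: 'run', code: guestSource }, '*');
});

// Replies from the guest arrive asynchronously as message events.
window.addEventListener('message', (event) => {
    if (event.source === frame.contentWindow) {
        console.log('guest says:', event.data);
    }
});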

Similar techniques are used in existing projects like Oasis to seemingly good effect.

Keep it secret! Keep it safe!

To keep the internal API for the guest environment clean and prevent surprise leakages to the host realm, it’s probably wise to clean up the global object namespace and the contents of the accessible intrinsics.

This is less important if cross-origin isolation and Content-Security-Policy are locked down carefully, but probably still a good idea.

For instance you probably want to hide some things from guest code:

  • the global message-passing handlers for postMessage to implement host APIs
  • fetch and XMLHttpRequest for network access
  • indexedDB for local-origin info
  • etc

In an <iframe> you would probably want to hide the entire DOM to create a fresh realm… But if it’s same-origin I don’t quite feel confident that intrinsics/globals can be safely cleaned up enough to avoid escapes. I strongly, strongly recommend using cross-origin or sandboxed <iframe> only! And a Worker that’s loaded from an <iframe> might be best.

In principle the realm can be “de-fanged” by walking through the global object graph and removing any property not on an allow list. Often you can also replace a constructor or method with an alternate implementation… as long as its intrinsic version won’t come creeping back somewhere. Engine code may throw exceptions of certain types, for instance, so they may need pruning in their details as well as pruning from the global tree itself.
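A rough sketch of that cleanup pass for a Worker global scope, with an intentionally tiny allow list just for illustration:

const ALLOWED = new Set([
    'Object', 'Array', 'Math', 'JSON',
    'postMessage', 'addEventListener',
    'undefined', 'NaN', 'Infinity',
]);

function defang(globalObj) {
    // Built-ins like fetch often live on the prototype chain
    // (e.g. WorkerGlobalScope.prototype), so walk the whole chain.
    for (let obj = globalObj; obj !== null; obj = Object.getPrototypeOf(obj)) {
        for (const key of Reflect.ownKeys(obj)) {
            if (typeof key === 'string' && !ALLOWED.has(key)) {
                try {
                    delete obj[key];
                } catch (e) {
                    // Non-configurable properties need other handling.
                }
            }
        }
    }
}

defang(self); // in a Worker, 'self' is the global object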

In order to provide host APIs over postMessage, keep local copies of the global’s postMessage and addEventListener in a closure and set them up before cleaning the globals. Be careful in the messaging API to use only local variable references, no globals, to avoid guest code interfering with the messaging code.
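Something along these lines; the message format is made up:

// Capture what we need in local bindings before the cleanup pass runs.
const hostPostMessage = self.postMessage.bind(self);
const hostAddEventListener = self.addEventListener.bind(self);

hostAddEventListener('message', (event) => {
    // Only local variables are referenced here, so guest code that later
    // clobbers globals can't interfere with the messaging path.
    const request = event.data;
    hostPostMessage({ type: 'reply', id: request.id, ok: true });
});

// ... then run the globals cleanup, and finally evaluate the guest code.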

Whither transpiling?

At this point, with the guest environment in a separate realm *and* probably a separate thread *and* with its globals and intrinsics squeaky clean… do we need to do any transpiling still?

It’s actually, I think, safe at that point to just pass JS code for strict mode or non-strict-mode functions in and execute it after the messaging kernel is set up. You should even be able to create runtime code with eval and the Function constructor without leaking anything to/from the host context!

Do we still even need to parse/transpile? Yes!

But the reason isn’t safety; it’s more for API clarity, bundling, and module support… Currently there’s no way to load JS module code (with native import/export syntax) in a Worker, and there’s no way to override module URL-to-code resolution in <script type="module"> in an <iframe>.

So to support modern JS modules for guest code, you’d need some kind of bundling… which is probably desired anyway for fetching common libraries and such… and which may be needed to combine the messaging kernel / globals cleanup bootstrap code with the guest code anyway.

There’s plenty of prior art on JS module -> bundle conversion, so this can either make use of existing tools or be inspired by them.

Debugging

If code is simply executed in the host engine, this means two things:

One, it’s hard to debug from within the web page because there aren’t tools for stopping the other thread and introspecting it.

Two, it’s easy to debug from within the web browser because the host debugger Just Works.

So this is probably good for Tools For Web Developers To Embed Stuff but may be more difficult for Beginner’s Programming Tools (like the BASIC and LOGO environments of my youth) where you want to present a slimmed-down custom interface on the debugger.

Conclusions

Given a modern-browser target that supports workers, sandboxed iframes, etc, using those native host tools to implement sandboxing looks like a much, much better return on investment than continuing to implement a full-on interpreter or transpiler for in-process code.

This in some ways is a return to older plans I had, but the picture’s made a LOT clearer by not worrying about old browsers or in-process execution. Setting a minimal level of ES2017 support is something I’d like to do to expose a module-oriented system for libraries and APIs, async, etc but this isn’t strictly required.

I’m going to re-work ScriptinScript in four directions:

First, the embedding system using <iframe>s and workers for web or ‘vm’ for Node, with a messaging kernel and global rewriter.

Second, a module bundling frontend that produces ready-to-load-in-worker JS, that can be used client-side for interactive editing or server-side for pre-rendering. I would like to get the semantics of native JS modules right, but may approximate them as a simplification measure.

Third, a “Turtle World” demo implementing a much smaller interpreter for a LOGO-like language, connected to a host API implementing turtle graphics in SVG or <canvas>. This will scratch my itch to write an interpreter, but be a lot simpler to create and maintain. ;)

Finally, a MediaWiki extension that allows storing the host API and guest code for Turtle World in a custom wiki namespace and embedding them as media in articles.

I think this is a much more tractable plan, and can be tackled bit by bit.

ScriptinScript value representation

As part of my long-running side quest to make a safe, usable environment for user-contributed scripted widgets for Wikipedia and other web sites, I’ve started working on ScriptinScript, a modern JavaScript interpreter written in modern JavaScript.

It’ll be a while before I have it fully working, as I’m moving from a seat-of-the-pants proof of concept into something actually based on the language spec… After poking a lot at the spec details of how primitives and objects work, I’m pretty sure I have a good idea of how to represent guest JavaScript values using host JavaScript values in a safe, spec-compliant way.

Primitives

JavaScript primitive types — booleans, numbers, strings, symbols, null, and undefined — are suitable to represent themselves; pretty handy! They’re copyable and don’t expose any host environment details.

Note that when you do things like reading str.length or calling str.charCodeAt(index), per spec it actually boxes the primitive value into a String object and then calls a method on that! The primitive string value itself has no properties or methods.
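For example:

const s = 'hello';
typeof s;             // "string": a primitive with no own methods
typeof new String(s); // "object": the wrapper that methods live on
s.toUpperCase();      // works anyway; the spec boxes s into a temporary
                      // String object and calls the method on that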

Objects

Objects, though. Ah now that’s tricky. A JavaScript object is roughly a hash map of properties indexed with string or symbol primitives, plus some internal metadata such as a prototype chain relationship with other objects.

The prototype chain is similar to, but oddly unlike, the class-based inheritance typical of many other languages.

Somehow we need to implement the semantics of JavaScript objects as JavaScript objects, though the actual API visible to other script implementations could be quite different.

First draft: spec-based

My initial design modeled the spec behavior pretty literally, with prototype chains and property descriptors to be followed step by step in the interpreter.

Guest property descriptors live as properties of a this.props sub-object created with a null prototype, so things on the host Object prototype or the custom VMObject wrapper class don’t leak in.

If a property doesn’t exist on this.props when looking it up, the interpreter will follow the chain down through this.Prototype. Once a property descriptor is found, it has to be examined for the value or get/set callables, and handled manually.

// VMObject is a regular class
[VMObject] {
    // "Internal slots" and implementation details
    // as properties directly on the object
    machine: [Machine],
    Prototype: [VMObject] || null,

    // props contains only own properties
    // so prototype lookups must follow this.Prototype
    props: [nullproto] {
        // prop values are virtual property descriptors
        // like you would pass to Object.defineProperty()
        aDataProp: {
            value: [VMObject],
            writable: true,
            enumerable: true,
            configurable: true,
        },
        anAccessorProp: {
            get: [VMFunction],
            set: [VMFunction],
            enumerable: true,
            configurable: true,
        },
    },
}

Prototype chains

Handling of prototype chains in property lookups can be simplified by using native host prototype chains on the props object that holds the property descriptors.

Instead of Object.create(null) to make props, use Object.create(this.Prototype ? this.Prototype.props : null).

The object layout looks about the same as above, except that props itself has a prototype chain.

Property descriptors

We can go a step further, using native property descriptors, which lets us model property accesses as direct loads and stores, etc.

Object.defineProperty can be used directly on this.props to add native property descriptors including support for accessors by using closure functions to wrap calls into the interpreter.
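For illustration, a sketch of what defining a guest accessor could look like under this scheme; machine.callGuest and the guest getter/setter handles are hypothetical names, not actual ScriptinScript API:

// Hypothetical names throughout; a sketch of the approach only.
Object.defineProperty(vmobj.props, 'anAccessorProp', {
    get: function() {
        // Closure wraps a call back into the interpreter for the guest getter.
        return machine.callGuest(guestGetter, vmobj, []);
    },
    set: function(val) {
        machine.callGuest(guestSetter, vmobj, [val]);
    },
    enumerable: true,
    configurable: true,
});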

This should make property gets and sets faster and awesomer!

Proper behavior should be retained as long as operations that can affect property descriptor handling are forwarded to props, such as calling Object.preventExtensions(this.props) when the equivalent guest operation is called on the VMObject.

Native objects

At this point, our inner props object is pretty much the “real” guest object, with all its properties and an inheritance chain.

We could instead have a single object which holds both “internal slots” and the guest properties…

let MachineRef = Symbol('MachineRef');

// VMObject is prototyped on a null-prototype object
// that does not descend from host Object, and which
// is named 'Object' as well from what guest can see.
// Null-proto objects can also be used, as long as
// they have the marker slots.
let VMObject = function Object(val) {
    return VMObject[MachineRef].ToObject(val);
};
VMObject[MachineRef] = machine;
VMObject.prototype = Object.create(null);
VMObject.prototype[MachineRef] = machine;
VMObject.prototype.constructor = VMObject;

[VMObject] || [nullproto] {
    // "Internal slots" and implementation details
    // as properties indexed by special symbols.
    // These will be excluded from enumeration and
    // the guest's view of own properties.
    [MachineRef]: [Machine],

    // prop values are stored directly on the object
    aDataProp: [VMObject],
    // use native prop descriptors, with accessors
    // as closures wrapping the interpreter.
    get anAccessorProp: [Function],
    set anAccessorProp: [Function],
}

The presence of the symbol-indexed [MachineRef] property tells host code in the engine that a given object belongs to the guest and is safe to use — this should be checked at various points in the interpreter like setting properties and making calls, to prevent dangerous scenarios like exposing the native Function constructor to create new host functions, or script injection via DOM innerHTML properties.

Functions

There’s an additional difficulty, which is function objects.

Various properties will want to be host-callable functions — things like valueOf and toString. You may also want to expose guest functions directly to host code… but if we use VMObject instances for guest function objects, then there’s no way to make them directly callable by the host.

Function re-prototyping

One possibility is to outright represent guest function objects using host function objects! They’d be closures wrapping the interpreter, and ‘just work’ from host code (though possibly needing care in how they accept input).

However we’d need a function object that has a custom prototype, and there’s no way to create a function object that way… but you can change the prototype of a function that has already been instantiated.

Everyone says don’t do this, but you can. ;)

let MachineRef = Symbol('MachineRef');

// Create our own prototype chain...
let VMObjectPrototype = Object.create(null);
let VMFunctionPrototype = Object.create(VMObjectPrototype);

function guestFunc(func) {
    // ... and attach it to the given closure function!
    Reflect.setPrototypeOf(func, VMFunctionPrototype);

    // Also save our internal marker property.
    func[MachineRef] = machine;
    return func;
}

// Create our constructors, which do not descend from
// the host Function but rather from VMFunction!
let VMObject = guestFunc(function Object(val) {
    let machine = VMObject[MachineRef];
    return machine.ToObject(val);
});

let VMFunction = guestFunc(function Function(src) {
    throw new Error('Function constructor not yet supported');
});

VMFunction.prototype = VMFunctionPrototype;
VMFunctionPrototype.constructor = VMFunction;

VMObject.prototype = VMObjectPrototype;
VMObjectPrototype.constructor = VMObject;

This seems to work but feels a bit … freaky.

Function proxying

An alternative is to use JavaScript’s Proxy feature to make guest function objects into a composite object that works transparently from the outside:

let MachineRef = Symbol('MachineRef');

// Helper function to create guest objects
function createObj(proto) {
    let obj = Object.create(proto);
    obj[MachineRef] = machine;
    return obj;
}

// We still create our own prototype chain...
let VMObjectPrototype = createObj(null);
let VMFunctionPrototype = createObj(VMObjectPrototype);

// Wrap our host implementation functions...
function guestFunc(func) {
    // Create a separate VMFunction instance instead of
    // modifying the original function.
    //
    // This object is not callable, but will hold the
    // custom prototype chain and non-function properties.
    let obj = createObj(VMFunctionPrototype);

    // ... now wrap the func and the obj together!
    return new Proxy(func, {
        // In order to make the proxy object callable,
        // the proxy target is the native function.
        //
        // The proxy automatically forwards function calls
        // to the target, so there's no need to include an
        // 'apply' or 'construct' handler.
        //
        // However we have to divert everything else to
        // the VMFunction guest object.
        defineProperty: function(target, key, descriptor) {
            if (target.hasOwnProperty(key)) {
                return Reflect.defineProperty(target, key, descriptor);
            }
            return Reflect.defineProperty(obj, key, descriptor);
        },
        deleteProperty: function(target, key) {
            if (target.hasOwnProperty(key)) {
                return Reflect.deleteProperty(target, key);
            }
            return Reflect.deleteProperty(obj, key);
        },
        get: function(target, key) {
            if (target.hasOwnProperty(key)) {
                return Reflect.get(target, key);
            }
            return Reflect.get(obj, key);
        },
        getOwnPropertyDescriptor: function(target, key) {
            if (target.hasOwnProperty(key)) {
                return Reflect.getOwnPropertyDescriptor(target, key);
            }
            return Reflect.getOwnPropertyDescriptor(obj, key);
        },
        getPrototypeOf: function(target) {
            return Reflect.getPrototypeOf(obj);
        },
        has: function(target, key) {
            if (target.hasOwnProperty(key)) {
                return Reflect.has(target, key);
            }
            return Reflect.has(obj, key);
        },
        isExtensible: function(target) {
            return Reflect.isExtensible(obj);
        },
        ownKeys: function(target) {
            return Reflect.ownKeys(target).concat(
                Reflect.ownKeys(obj)
            );
        },
        preventExtensions: function(target) {
            return Reflect.preventExtensions(target) &&
                Reflect.preventExtensions(obj);
        },
        set: function(target, key, val, receiver) {
            if (target.hasOwnProperty(key)) {
                return Reflect.set(target, key, val, receiver);
            }
            return Reflect.set(obj, key, val, receiver);
        },
        setPrototypeOf: function(target, proto) {
            return Reflect.setPrototypeOf(obj, proto);
        },
    });
}

// Create our constructors, which now do not descend from
// the host Function but rather from VMFunction!
let VMObject = guestFunc(function Object(val) {
    // The actual behavior of Object() is more complex ;)
    return VMObject[MachineRef].ToObject(val);
});

let VMFunction = guestFunc(function Function(args, src) {
    // Could have the engine parse and compile a new guest func...
    throw new Error('Function constructor not yet supported');
});

// Set up the circular reference between
// the constructors and protoypes.
VMFunction.prototype = VMFunctionPrototype;
VMFunctionPrototype.constructor = VMFunction;
VMObject.prototype = VMObjectPrototype;
VMObjectPrototype.constructor = VMObject;

There’s more details to work out, like filling out the VMObject and VMFunction prototypes, ensuring that created functions always have a guest prototype property, etc.

Note that implementing the engine in JS’s “strict mode” means we don’t have to worry about bridging the old-fashioned arguments and caller properties, which otherwise couldn’t be replaced by the proxy because they’re non-configurable.

My main worries with this layout are that it’ll be hard to tell host from guest objects in the debugger, since the internal constructor names are the same as the external constructor names… the [MachineRef] marker property should help though.

And secondarily, it’s easier to accidentally inject a host object into a guest object’s properties or a guest function’s arguments…

Blocking host objects

We could protect guest objects from injection of host objects using another Proxy:

function wrapObj(obj) {
    return new Proxy(obj, {
        defineProperty: function(target, key, descriptor) {
            let machine = target[MachineRef];
            if (!machine.isGuestVal(descriptor.value) ||
                !machine.isGuestVal(descriptor.get) ||
                !machine.isGuestVal(descriptor.set)
            ) {
                throw new TypeError('Cannot define property with host object as value or accessors');
            }
            return Reflect.defineProperty(target, key, descriptor);
        },
        set: function(target, key, val, receiver) {
            // invariant: key is a string or symbol
            let machine = target[MachineRef];
            if (!machine.isGuestVal(val)) {
                throw new TypeError('Cannot set property to host object');
            }
            return Reflect.set(target, key, val, receiver);
        },
        setPrototypeOf: function(target, proto) {
            let machine = target[MachineRef];
            if (!machine.isGuestVal(proto)) {
                throw new TypeError('Cannot set prototype to host object');
            }
            return Reflect.setPrototypeOf(target, proto);
        },
    });
}

This may slow down access to the object, however. Need to benchmark and test some more and decide whether it’s worth it.

For functions, we can also include the `apply` and `construct` traps to check for host objects in the arguments:

function guestFunc(func) {
    let obj = createObj(VMFunctionPrototype);
    return new Proxy(func, {
        //
        // ... all the same traps as wrapObj and also:
        //
        apply: function(target, thisValue, args) {
            let machine = target[MachineRef];
            if (!machine.isGuestVal(thisValue)) {
                throw new TypeError('Cannot call with host object as "this" value');
            }
            for (let arg of args) {
                if (!machine.isGuestVal(arg)) {
                    throw new TypeError('Cannot call with host object as argument');
                }
            }
            return Reflect.apply(target, thisValue, args);
        },
        construct: function(target, args, newTarget) {
            let machine = target[MachineRef];
            for (let arg of args) {
                if (!machine.isGuestVal(arg)) {
                    throw new TypeError('Cannot construct with host object as argument');
                }
            }
            if (!machine.isGuestVal(newTarget)) {
                throw new TypeError('Cannot construct with host object as new.target');
            }
            return Reflect.construct(target, args, newTarget);
        },
    });
}

Exotic objects

There are also “exotic objects”, proxies, and other funky things like Arrays that need to handle properties differently from a native object… I’m pretty sure they can all be represented using proxies.

Next steps

I need to flesh out the code a bit more using the new object model, and start on spec-compliant versions of interpreter operations to get through a few simple test functions.

Once that’s done, I’ll start pushing up the working code and keep improving it. :)

Update (benchmarks)

I did some quick benchmarks and found that, at least in Node 11, swapping out the Function prototype doesn’t appear to harm call performance while using a Proxy adds a fair amount of overhead to short calls.

$ node protobench.js 
empty in 22 ms
native in 119 ms
guest in 120 ms

$ node proxybench.js
empty in 18 ms
native in 120 ms
guest in 1075 ms

This may not be significant when functions have to go through the interpreter anyway, but I’ll consider whether the proxy is needed and weigh the options…

Update 2 (benchmarks)

Note that the above benchmarks don’t reflect another issue: de-optimization of call sites that accept user-provided callbacks. If you sometimes pass them regular functions and other times pass them re-prototyped or proxied objects, they can switch optimization modes and end up slightly slower even when passed regular functions.

If you know you’re going to pass a guest object into a separate place that may be interchangeable with a native host function, you can make a native wrapper closure around the guest call and it should avoid this.
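Something like this sketch, where machine.callGuest is again a hypothetical interpreter entry point:

// Wrap a guest callable in a plain host closure so the call site only
// ever sees an ordinary native function and keeps its optimized path.
function wrapGuestCallback(guestFunc) {
    return function(...args) {
        return machine.callGuest(guestFunc, undefined, args);
    };
}

// e.g. someHostApi.addListener(wrapGuestCallback(guestCallback));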

ScriptinScript is coming

Got kinda sidetracked for the last week and ended up with a half-written JavaScript interpreter written in JavaScript, which I’m calling “ScriptinScript”. O_O

There are such things already in existence, but they all seem outdated, incomplete, unsafe, or some combination of those. I’ll keep working on this for my embeddable widgets project but have to get back to other projects for the majority of my work time for now… :)

I’ve gotten it to a stage where I understand more or less how the pieces go together, and have been documenting how the rest of it will be implemented. Most of it for now is a straightforward implementation of the language spec as native modern JS code, but I expect it can be optimized with some fancy tricks later on. I think it’s important to actually implement a real language spec rather than half-assing a custom “JS-like” language, so code behaves as you expect it to … and so we’re not stuck with some totally incompatible custom tool forever if we deploy things using it.
I’ll post the initial code some time in the next week or two, once I’ve got it running again after some major restructuring from the initial proof of concept to proper spec-based behavior.

Parallelizing PNG, part 9: writing a C API in Rust

In my last posts I explored some optimization strategies inside the Rust code for mtpng, a multithreaded PNG encoder I’m creating. Now that it’s fast, and the remaining features are small enough to pick up later, let’s start working on a C API so the library can be used by C/C++-based apps.

If you search the web you’ll find a number of tutorials on the actual FFI-level interactions, so I’ll just cover a few basics and things that stood out. The real trick is in making a good build system; currently Rust’s “Cargo” doesn’t interact well with Meson, for instance, which really wants to build everything out-of-tree. I haven’t even dared touch autoconf. ;)

For now, I’m using a bare Makefile for Unix (Linux/macOS) and a batch file for Windows (ugh!) to drive the building of a C-API-exercising test program. But let’s start with the headers and FFI code!

Contracts first: make a header file

The C header file (“mtpng.h”) defines the types, constants, and functions available to users of the library. It defines the API contract, both in code and in textual comments, because you can’t express things like lifetimes in C code. :)

Enums

Simple enums (“fieldless enums” as they’re called) can be mapped to C enums fairly easily, but be warned the representation may not be compatible. C enums usually (is that the spec or just usually??) map to the C ‘int’ type, which on the Rust side is known as c_int. (Clever!)

//
// Color types for mtpng_encoder_set_color().
//
typedef enum mtpng_color_t {
    MTPNG_COLOR_GREYSCALE = 0,
    MTPNG_COLOR_TRUECOLOR = 2,
    MTPNG_COLOR_INDEXED_COLOR = 3,
    MTPNG_COLOR_GREYSCALE_ALPHA = 4,
    MTPNG_COLOR_TRUECOLOR_ALPHA = 6
} mtpng_color;

So the representation is not memory-compatible with the Rust version, which explicitly specifies to fit in a byte:

#[derive(Copy, Clone)]
#[repr(u8)]
pub enum ColorType {
    Greyscale = 0,
    Truecolor = 2,
    IndexedColor = 3,
    GreyscaleAlpha = 4,
    TruecolorAlpha = 6,
}

If you really need to ship the enums around bit-identical, use #[repr(C)]. Note also that there’s no enforcement that enum values transferred over the FFI boundary will have valid values! So always use a checked transfer function on input, like this:

impl ColorType {
    pub fn from_u8(val: u8) -> io::Result<ColorType> {
        return match val {
            0 => Ok(ColorType::Greyscale),
            2 => Ok(ColorType::Truecolor),
            3 => Ok(ColorType::IndexedColor),
            4 => Ok(ColorType::GreyscaleAlpha),
            6 => Ok(ColorType::TruecolorAlpha),
            _ => Err(other("Invalid color type value")),
        }
    }
}

More complex enums can contain fields, and things get trickier. For my case I found it simplest to map some of my mode-selection enums into a shared namespace, where the “Adaptive” value maps to something that doesn’t fit in a byte and so could not be valid, and the “Fixed” values map to their contained byte values:

#[derive(Copy, Clone)]
pub enum Mode<T> {
    Adaptive,
    Fixed(T),
}

#[repr(u8)]
#[derive(Copy, Clone)]
pub enum Filter {
    None = 0,
    Sub = 1,
    Up = 2,
    Average = 3,
    Paeth = 4,
}

maps to C:

typedef enum mtpng_filter_t {
    MTPNG_FILTER_ADAPTIVE = -1,
    MTPNG_FILTER_NONE = 0,
    MTPNG_FILTER_SUB = 1,
    MTPNG_FILTER_UP = 2,
    MTPNG_FILTER_AVERAGE = 3,
    MTPNG_FILTER_PAETH = 4
} mtpng_filter;

And the FFI wrapper function that takes it maps them to appropriate Rust values.

Callbacks and function pointers

The API for mtpng uses a few callbacks, required for handling output data and optionally as a driver for input data. In Rust, these are handled using the Write and Read traits, and the encoder functions are even generic over them to avoid having to make virtual function calls.

In C, the traditional convention is followed of passing function pointers and a void* as “user data” (which may be NULL, or a pointer to a private state structure, or whatever floats your boat).

In the C header file, the callback types are defined for reference and so they can be validated as parameters:

typedef size_t (*mtpng_read_func)(void* user_data,
                                  uint8_t* p_bytes,
                                  size_t len);

typedef size_t (*mtpng_write_func)(void* user_data,
                                   const uint8_t* p_bytes,
                                   size_t len);

typedef bool (*mtpng_flush_func)(void* user_data);

On the Rust side we must define them as well:

pub type CReadFunc = unsafe extern "C"
    fn(*const c_void, *mut uint8_t, size_t) -> size_t;

pub type CWriteFunc = unsafe extern "C"
    fn(*const c_void, *const uint8_t, size_t) -> size_t;

pub type CFlushFunc = unsafe extern "C"
    fn(*const c_void) -> bool;

Note that the function types are defined as unsafe (so must be called from within an unsafe { … } block or another unsafe function) and extern "C", which defines them as using the platform C ABI. Otherwise the function defs are pretty standard, though they use C-specific types from the libc crate.

Note it’s really important to use the proper C types because different platforms may have different sizes of things. Not only do you have the 32/64-bit split, but 64-bit Windows has a different c_long type (32 bits) than 64-bit Linux or macOS (64 bits)! This way if there’s any surprises, the compiler will catch it when you build.

Let’s look at a function that takes one of those callbacks:

extern mtpng_result
mtpng_encoder_write_image(mtpng_encoder* p_encoder,
                          mtpng_read_func read_func,
                          void* user_data);

#[no_mangle]
pub unsafe extern "C"
fn mtpng_encoder_write_image(p_encoder: PEncoder,
                             read_func: Option<CReadFunc>,
                             user_data: *const c_void)
-> CResult
{
    if p_encoder.is_null() {
        CResult::Err
    } else {
        match read_func {
            Some(rf) => {
                let mut reader = CReader::new(rf, user_data);
                match (*p_encoder).write_image(&mut reader) {
                    Ok(()) => CResult::Ok,
                    Err(_) => CResult::Err,
                }
            },
            _ => {
                CResult::Err
            }
        }
    }
}

Note that in addition to the unsafe extern "C" we saw on the function pointer definitions, the exported function also needs to use #[no_mangle]. This marks it as using a C-compatible function name; otherwise the C linker won’t find it by name! (If it’s an internal function you want to pass by reference to C, but not expose as a symbol, then you don’t need that.)

Notice that we took an Option<CReadFunc> as a parameter value, not just a CReadFunc. This is needed so we can check for NULL input values, which map to None, while valid values map to Some(CReadFunc). (The actual pointer to the struct is more easily checked for NULL, since that’s inherent to pointers.)

The actual function is passed into a CReader, a struct that implements the Read trait by calling the function pointer:

pub struct CReader {
    read_func: CReadFunc,
    user_data: *const c_void,
}

impl CReader {
    fn new(read_func: CReadFunc,
           user_data: *const c_void)
    -> CReader
    {
        CReader {
            read_func: read_func,
            user_data: user_data,
        }
    }
}

impl Read for CReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let ret = unsafe {
            (self.read_func)(self.user_data,
                             &mut buf[0],
                             buf.len())
        };
        if ret == buf.len() {
            Ok(ret)
        } else {
            Err(other("mtpng read callback returned failure"))
        }
    }
}
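
The CWriter used later by the encoder constructor follows the same pattern for the Write trait. Roughly sketched (the real code in the repo may differ in the details), it forwards write() to the write callback and flush() to the flush callback, reusing the same other() error helper:

pub struct CWriter {
    write_func: CWriteFunc,
    flush_func: CFlushFunc,
    user_data: *const c_void,
}

impl CWriter {
    fn new(write_func: CWriteFunc,
           flush_func: CFlushFunc,
           user_data: *const c_void)
    -> CWriter
    {
        CWriter {
            write_func: write_func,
            flush_func: flush_func,
            user_data: user_data,
        }
    }
}

impl Write for CWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let ret = unsafe {
            (self.write_func)(self.user_data,
                              &buf[0],
                              buf.len())
        };
        if ret == buf.len() {
            Ok(ret)
        } else {
            Err(other("mtpng write callback returned failure"))
        }
    }

    fn flush(&mut self) -> io::Result<()> {
        if unsafe { (self.flush_func)(self.user_data) } {
            Ok(())
        } else {
            Err(other("mtpng flush callback returned failure"))
        }
    }
}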

Opaque structs

Since I’m not exposing any structs with public fields, I’ve got a couple of “opaque struct” types on the C side which are used to point at the Rust structs from C. Not a lot of fancy-pants work is needed to marshal them; pointers on the C side pass straight through to pointers on the Rust side and vice versa.

typedef struct mtpng_threadpool_struct
    mtpng_threadpool;

typedef struct mtpng_encoder_struct
    mtpng_encoder;

One downside of opaque structs on the C side is that you cannot allocate them on the stack, because the compiler doesn’t know how big they are — so we must allocate them on the heap and explicitly release them.

In Rust, it’s conventional to give structs an associated “new” method, and/or wrap them in a builder pattern to set up options. Here I wrapped Rayon’s ThreadPool builder with a single function that takes a number of threads, boxes it up on the heap, and returns a pointer to the heap object:

extern mtpng_result
mtpng_threadpool_new(mtpng_threadpool** pp_pool,
                     size_t threads);

pub type PThreadPool = *mut ThreadPool;

#[no_mangle]
pub unsafe extern "C"
fn mtpng_threadpool_new(pp_pool: *mut PThreadPool,
                        threads: size_t)
-> CResult
{
    if pp_pool.is_null() {
        CResult::Err
    } else {
        match ThreadPoolBuilder::new().num_threads(threads).build() {
            Ok(pool) => {
                *pp_pool = Box::into_raw(Box::new(pool));
                CResult::Ok
            },
            Err(_err) => {
                CResult::Err
            }
        }
    }
}

The real magic here is Box::into_raw(), which replaces the Box<ThreadPool> smart pointer with a raw pointer you can pass to C. This means there’s no longer any smart-pointer management or automatic release, so the pool will outlive the function… and we need an explicit release function:

extern mtpng_result
mtpng_threadpool_release(mtpng_threadpool** pp_pool);

#[no_mangle]
pub unsafe extern "C"
fn mtpng_threadpool_release(pp_pool: *mut PThreadPool)
-> CResult
{
    if pp_pool.is_null() {
        CResult::Err
    } else {
        drop(Box::from_raw(*pp_pool));
        *pp_pool = ptr::null_mut();
        CResult::Ok
    }
}

Box::from_raw() turns the pointer back into a Box<ThreadPool>, which de-allocates the ThreadPool at the end of function scope.
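
The same handshake in miniature, outside the mtpng specifics (just a generic illustration of the into_raw/from_raw pairing):

fn main() {
    let boxed: Box<u32> = Box::new(42);

    // Hand ownership off to the C side: no Drop will run for this
    // allocation until we explicitly take it back.
    let raw: *mut u32 = Box::into_raw(boxed);

    // ... later, in the release path ...
    let boxed_again = unsafe { Box::from_raw(raw) };

    // The Box goes out of scope here and the value is freed exactly once.
    drop(boxed_again);
}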

Lifetimes

Annotating object lifetimes in this situation is ….. confusing? I’m not sure I did it right at all. The only lifetime marker I currently use is the one for the thread pool, which must live at least as long as the encoder struct.

As a horrible hack I’ve defined the CEncoder to use static lifetime for the threadpool, which seems …. horribly wrong. I probably don’t need to do it like this. (Guidance and hints welcome! I will update the post and the code! :D)

// Cheat on the lifetimes?
type CEncoder = Encoder<'static, CWriter>;
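
The lifetime shows up because the encoder can borrow a caller-owned thread pool. Very roughly (this is a simplified sketch, not the real mtpng struct), the shape is something like:

use std::io::Write;
use rayon::ThreadPool;

// Simplified sketch: the borrowed pool reference is what drags a lifetime
// parameter into the Encoder type.
pub struct Encoder<'p, W: Write> {
    writer: W,
    pool: Option<&'p ThreadPool>,
}

A raw pointer coming across the C boundary has no lifetime the compiler can check, so 'static is the blunt instrument that makes the alias compile.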

Then the encoder creation, which takes an optional ThreadPool pointer and required callback function pointers, looks like:

extern mtpng_result
mtpng_encoder_new(mtpng_encoder** pp_encoder,
                  mtpng_write_func write_func,
                  mtpng_flush_func flush_func,
                  void* const user_data,
                  mtpng_threadpool* p_pool);

pub type PEncoder = *mut CEncoder;

#[no_mangle]
pub unsafe extern "C"
fn mtpng_encoder_new(pp_encoder: *mut PEncoder,
                     write_func: Option<CWriteFunc>,
                     flush_func: Option<CFlushFunc>,
                     user_data: *const c_void,
                     p_pool: PThreadPool)
-> CResult
{
    if pp_encoder.is_null() {
        CResult::Err
    } else {
        match (write_func, flush_func) {
            (Some(wf), Some(ff)) => {
                let writer = CWriter::new(wf, ff, user_data);
                if p_pool.is_null() {
                    let encoder = Encoder::new(writer);
                    *pp_encoder = Box::into_raw(Box::new(encoder));
                    CResult::Ok
                } else {
                    let encoder = Encoder::with_thread_pool(writer, &*p_pool);
                    *pp_encoder = Box::into_raw(Box::new(encoder));
                    CResult::Ok
                }
            },
            _ => {
                CResult::Err
            }
        }
    }
}

Note how we take the p_pool pointer and turn it into a Rust reference by dereferencing the pointer (*) and then re-referencing it (&). :)

Because we’re passing the thread pool across a safe/unsafe boundary, it’s entirely the caller’s responsibility to uphold the guarantee the compiler would normally enforce: that the pool instance outlives the encoder. There’s literally nothing to stop C code from releasing the pool early.

Calling from C

Pretty much all the externally facing functions return a result status enum type, which I’ve mapped the Rust Result<_, Error> system onto. For now it’s just ok and error states, but I’ll add more detailed error codes later.
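
On the Rust side that status type is just a #[repr(C)] enum whose variants line up with the constants in the C header. Roughly (the explicit values here are illustrative; more variants will come later):

// Status codes shared across the C boundary; #[repr(C)] keeps the
// discriminant layout C-compatible.
#[repr(C)]
pub enum CResult {
    Ok = 0,
    Err = 1,
}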

Since C doesn’t have a convenient “?” syntax or try! macro for trapping those, I wrote a manual TRY macro for my sample’s main(). Ick!

#define TRY(ret) { \
    mtpng_result _ret = (ret); \
    if (_ret != MTPNG_RESULT_OK) { \
        fprintf(stderr, "Error: %d\n", (int)(_ret)); \
        retval = 1; \
        goto cleanup; \
    }\
}

The calls are then wrapped to check for errors:

int main() {

... some state setup ...

    mtpng_threadpool* pool = NULL;
    TRY(mtpng_threadpool_new(&pool, threads));

    mtpng_encoder* encoder = NULL;
    TRY(mtpng_encoder_new(&encoder,
                          write_func,
                          flush_func,
                          (void*)&state,
                          pool));

    TRY(mtpng_encoder_set_chunk_size(encoder, 200000));

    TRY(mtpng_encoder_set_size(encoder, 1024, 768));
    TRY(mtpng_encoder_set_color(encoder, color_type, depth));
    TRY(mtpng_encoder_set_filter(encoder, MTPNG_FILTER_ADAPTIVE));

    TRY(mtpng_encoder_write_header(encoder));
    TRY(mtpng_encoder_write_image(encoder, read_func, (void*)&state));
    TRY(mtpng_encoder_finish(&encoder));

cleanup:
    if (encoder) {
        TRY(mtpng_encoder_release(&encoder));
    }
    if (pool) {
        TRY(mtpng_threadpool_release(&pool));
    }

    printf("goodbye\n");
    return retval;
}

Ok, mostly straightforward right? And if you don’t like the TRY macro you can check error returns manually, or whatever. Just don’t forget to check them! :D

Error state recovery beyond these API boundary checks may or may not be very good right now; I’ll clean that up later.

Now I’m pretty sure some things will still explode if I lean the wrong way on this system. For instance, the cleanup path above only works because the pool and encoder pointers start out as NULL; C won’t zero stack variables for you (even in debug builds), so without those explicit initializers the error path would be testing garbage. :D

Don’t forget the callbacks

Oh right, we needed read and write callbacks! Let’s put those back in. Start with a state structure we can use (it doesn’t have to be the same state struct for both read and write, but it is here, ’cause why not?)

typedef struct main_state_t {
    FILE* out;
    size_t width;
    size_t bpp;
    size_t stride;
    size_t y;
} main_state;

static size_t read_func(void* user_data,
                        uint8_t* bytes,
                        size_t len)
{
    main_state* state = (main_state*)user_data;
    for (size_t x = 0; x < state->width; x++) {
        size_t i = x * state->bpp;
        bytes[i] = (x + state->y) % 256;
        bytes[i + 1] = (2 * x + state->y) % 256;
        bytes[i + 2] = (x + 2 * state->y) % 256;
    }
    state->y++;
    return len;
}

static size_t write_func(void* user_data,
                         const uint8_t* bytes,
                         size_t len)
{
    main_state* state = (main_state*)user_data;
    return fwrite(bytes, 1, len, state->out);
}

static bool flush_func(void* user_data)
{
    main_state* state = (main_state*)user_data;
    if (fflush(state->out) == 0) {
        return true;
    } else {
        return false;
    }
}

A couple of things stand out to me: first, this is a bit verbose for some common cases.

If you’re generating input row by row like in this example, or reading it from another source of data, the read callback works ok though you have to set up some state. If you already have it in a buffer, it’s a lot of extra hoops. I’ve added a convenience function for that, which I’ll describe in more detail in a later post due to some Rust-side oddities. :)

And writing to a stdio FILE* is probably really common too. So maybe I’ll set up a convenience function for that? Don’t know yet.

Building the library

Oh right! We have to build this code, don’t we, or it won’t actually work.

Start with the library itself. Since we’re creating everything but the .h file in Rust-land, we can emit a shared library directly from the Cargo build system by adding a ‘cdylib’ target. In our Cargo.toml:

[lib]
crate-type = ["rlib", "cdylib"]

The “rlib” is a regular Rust library; the “cdylib” is a C-compatible shared library that exports only the C-compatible public symbols (mostly). The parts of the Rust standard library that get used are compiled statically inside the cdylib, so they don’t interfere with other libraries that might have been built the same way.

Note this means that while a Rust app using mtpng and another rlib can share a Rayon ThreadPool instance, a C app using mtpng and another cdylib cannot share ThreadPools, because each cdylib carries its own (possibly different) copy of Rayon.

Be warned that shared libraries are complex beasts, starting with the file naming! On Linux and most other Unix systems, the output file starts with “lib” and ends with “.so” (libmtpng.so). But on macOS, it ends in “.dylib” (libmtpng.dylib). And on Windows, you end up with both “mtpng.dll” which is linked at runtime and an “mtpng.dll.lib” which is linked at compile time (and really should be “mtpng.lib” to follow normal conventions on Windows, I think).

I probably should wrap the C API and the cdylib output in a feature flag, so it’s not added when building pure-Rust apps. Todo!
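
If I do, it would probably look like a Cargo feature gating the FFI module; something like this hypothetical “capi” feature (the name is invented here for illustration):

// In lib.rs: compile the C API surface only when the (hypothetical) "capi"
// feature is enabled, e.g. declared as `capi = []` under [features] in Cargo.toml.
#[cfg(feature = "capi")]
pub mod capi;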

For now, the Makefile and batch file invoke Cargo directly and build in-tree. To build cleanly out of tree there are options on Cargo to specify the target dir and (on the nightly toolchain) the work output dir. This seems to be a work in progress, so I’m not worrying about the details too much yet, but if you need to get something working soon that integrates with autotools, check out GNOME’s librsvg library, which has been migrating code from C to Rust over the last couple of years and has some relevant build hacks. :)

Once you have the output library file(s), you put them in the appropriate place and use the appropriate system-specific magic to link to them in your C program like for any shared library.

The usual gotchas apply:

  • Unix (Linux/macOS) doesn’t like libraries that aren’t in the standard system locations. You have to add an -L param with the path to the library to your linker flags to build against a library that’s in-tree or otherwise in a non-standard location.
  • Oh and Unix also hates to load libraries at runtime for the same reason! Use LD_LIBRARY_PATH or DYLD_LIBRARY_PATH when running your executable. Sigh.
  • But sometimes macOS doesn’t need that because it stores the relative path to your library into your executable at link time. I don’t even understand that Mach-O format, it’s crazy! ;)
  • Windows will just load any library out of the current directory, so putting mtpng.dll in with your executable should work. Allll right!

It should also be possible to build as a C-friendly static library, but I haven’t fiddled with that yet.

Windows fun

As an extra bit of fun on Windows when using the MSVC compiler and libc… Cargo can find the host-native build tools for you, but it sometimes gets really confused when you try to cross-build 32-bit on 64-bit.

And MSVC’s setup batch files are … hard to find reliably.

In the current batch file I’ve had some trouble getting the 32-bit build working, but 64-bit is ok. Yay!

Next steps

There are a lot of little things left to do on mtpng: adding support for more chunks so it can handle indexed color images, color profiles, comments, etc… Ensuring the API is good and error handling is correct… Improving that build system. :)

And I think I can parallelize filtering and deflate better within a chunk, making compression faster for files too small to split into multiple chunks. The same technique should work in reverse for decoding too.

Well, it’s been a fun project and I really think this is going to be useful as well as a great learning experience. :) Expect a couple more update posts in this series in the next couple weeks on those tuning issues, and fun little things I learn about Rust along the way!