Uncategorized – brionv

Flight sim screen recording helper script

I’ve been taking short clips from my Flight Simulator adventures and making small (sub-4 megabyte) .mp4 clips that I can upload to the forum or social media. The recordings are 3440×1440 up to 60fps, in HDR, and are both large and long.

I usually trim out clips using the trim tool in QuickTime Player on a Mac (yay fast networks), then save a shorter file I can work with in ffmpeg.

Ended up writing a script to automate this; it’ll pick a bitrate to fit within 3,500,000 bytes (optionally including audio), scale down and crop or pad to 16:9, and tone-map the HDR down to SDR for common-denominator playback.

Because there has to be a cursed element, the script is in command-line PHP. :D

https://brionv.com/git/brion/pack-vid/src/branch/main/pack-vid

A few examples:

Tanarg 912 trike

Grand Canyon in Kodiak 100

Chicago in Aermacchi MB-339

Chicago in Monster XCub

Atari Mandelbrot fractal: imul16

Another nostalgia+practice project I’m poking at on the Atari 800 XL is a Mandelbrot fractal generator, which I’m still in early stages of work on. This is mostly an exercise in building a 16-bit integer multiplier for the MOS 6502 processor, which has only addition, subtraction, and bit-shift operations.

The Mandelbrot set consists of those complex-plane points which, when iterating over z_i+1 = z_i^2 + c (where c is the input coordinate and z_0 = 0) never escape beyond |z_i| > 2. Surprisingly this creates a really cool shape, which has been the subject of fascination for decades:

https://en.wikipedia.org/wiki/Mandelbrot_set#/media/File:Mandel.png

To implement this requires three multiplications per iteration, to calculate zx^2, zy^2, and zy*zx. Famous PC fractal program Fractint used for low zooms a 16-bit integer size, which is good because anything bigger gets real slow!

For higher zooms Fractint used a 32-bit integer and 29 fractional bits for Mandelbrot and Julia sets, which leaves a range from -4..3.9, plenty big enough. For the smaller 16 bit size that means 3.13 layout, should be plenty for a few zooms in on a 160×192 screen. :D Multiplication creates a 32-bit integer with twice the integer bits, so 6.26 with a larger range which covers the addition results for the new zx and zy values.

These then need to be shifted back up and multiplied to get zx^2, zy^2, and zy*zx for the next iteration; the boundary condition is zx^2 + zy^2 >= 4.

imul16

Integer multiplication when you have binary shifts and addition only is kinda slow and super annoying. Because you have to do several operations for each bit, every cycle adds up — a single 16-bit add is just 18 cycles while a multiply can run several *hundred* cycles, and varies based on input.

Note that a 650-cycle function means a runtime of about a half a millisecond on average (1.79 MHz processor, with about 30% of cycles taken by the display DMA). The whole shebang could easily take 2-3 ms per iteration with three multiplications and a number of additions and shifts.

Basically, for each bit in one operand, you either add, or don’t add, the other operand with the corresponding bitshift to the result. If you’re dealing with signed integers you need to either sign-extend the operands to 32 bits or negate the inputs and keep track of whether you need to negate the output; not extending can be faster because you can assume the top 16 bits are 0 and shortcut some operations. ;)

Status and next steps

imul16 seems to be working, though could maybe use more tuning. I’ve sketched out the mandelbrot iteration function but haven’t written it yet.

Another trick Fractint used was trying to avoid having to go to max iterations within the “Mandelbrot lake” by adding a check for periodic repetition; apparently when working with finite precision often you end up with the operations converging on a repeating sequence of zx & zy values that end up yielding themselves after one or a few iterations; these will never escape the boundary condition, so it’s safe to cut off without going to max iterations. I’ll have to write something up with a little buffer of saved values, perhaps only activated by an adjacent max-iters pixel.

Once the internals are working I’ll wrap a front-end on it: 4-color graphics display, and allow a point-n-zoom with arrow keys or maybe joystick. :D

Atari photo/video viewer project

I recently picked up a vintage Atari 800 XL computer like one I had as a kid in the 1980s, and have been amusing myself learning more the low-level programming in that constrained environment.

The 8-bit Atari graphics are good for 1979 but pretty primitive; some sprite-like overlays (“player/missile graphics”) and a background that can either be character-mapped or a bitmap, trading off resolution for colors: 320×192 at 2 colors, 160×192 at 4 colors, or 80×192 at 9 colors (limited by the number of palette registers handy when they implemented the extended modes).

This not only means you have relatively few colors available for photorealistic images, but a 40 byte * 192 line framebuffer is 7680 bytes, a large amount for a computer with a 64KB address space.

However you have a lot of flexibility too: any scanline can switch modes in the display list so you can mix high-res text or instruments with many-colored playfields, and you can change palette registers between scanlines if you get the timing right.

I wondered whether video would be possible — if you go for the high res mode, and *do* touch every pixel, how long would it take to process a frame? Well, I did the numbers and it’s a *terrible* frame rate. BUT — if you had uncompressed frames ready to go in RAM or ROM, you can easily cycle between frames at 60 Hz, letting the display processor see each fresh frame.

With enough bank-switched ROM to back it, you could stream 60 Hz video at 480 KiB per second. A huge amount of data for its day, but now you could put a processed GIF animation onto a cartridge. ;)

So I’ve got a few things I want to explore on this fun project:

dithering to 4 colors, with per-scanline palettes (working as of December 2022)
can you also embed audio? 4-bit PCM at 15.8 or 7.9 KHz (working at 7.9; 15.8 may require a tweak)
try adding a temporal component to the error-diffusion dithering
add a couple lines of text mode for subtitles/captions

Dithering and palette selection

I’ve got a dither implementation hacked together in JS which reads in an image, sizes it, and then walks through the scanlines doing an error-propagation dither combined with a palette reduction.

To start with, the complete Atari palette is 7 bits (3 bits luminance, 4 bits hue, where 0 is grayscale and 1-15 are various points around the NTSC QI hue wheel). I took an RGB list of the colors from the net and, after gamma adjustment to linear space, perform an error-diffusion dither that looks for the closest color from the available palette then divides up the difference from the original color among neighboring pixels. At the end of the scanline, we count how many colors were used, including black which cannot be changed. If the remaining colors are > 3, they’re ranked based on usage and closeness and the least scoring color is removed. This is continued until the dither selects only colors that fit.

Formatting and playback

Due to a quirk of the Atari’s display processor, a frame buffer can’t cross a 4096-byte boundary, so with a 40-byte screen width you have to divide it into two non-contiguous sections. Selecting a widescreen aspect ratio (also to leave room for captions later) means there’s room enough to fit in arrays for the palettes as well (3 bytes per scanline) an to fit audio (131 or 262 bytes depending on sample rate).

Note that for extra fun, the hardware register that gives you the current scanline number gives you the count *divided by two*. This is because the whole signal has 262 scanlines per frame, which is bigger than 256 and doesn’t fit in a byte! :D

So it makes sense to handle these by waiting until we’re synced up on line 0 and then doing an explicit timing loop with horizontal blanking waits (STA WSYNC). This way we know if we’re on the 0 or the 1 subline, and can use the VCOUNT register (0..130) as an index into arrays of palette or audio entries.

For testing without simulating bank-switching, I’m squishing two frames into RAM and switching between the two by making a complex display list: basically just the same thing twice, but pointing at different frame buffers and looping back around.

It seems to work pretty nice! But the timing is tight and I have to disable interrupts.

Audio

The Atari doesn’t have DMA-based PCM audio where you just slap in some bytes and it plays the audio… you either use the square-wave generators, or you manually set the volume level of the voices for each sample *at the right time*.

Using the scan line frequency is handy since we’re already in there changing palette entries during horizontal blanking. Every freq is about 15.8 KHz, every other line is 7.9 KHz, slightly worse than telephone frequency.

It seems to work at 7.9 at least, and I might be able to do 15.8 with ROM backing (bank-switching every frame makes things easier vs a long buffer in RAM). Note that you only get 4 bits of precision, and unpacking two samples from one byte is annoyingly expensive. ;)

Next steps

The next thing I’ll try is a tweak to the dither algorithm to try to drive a more direct dither pattern between temporally adjacent frames; at least on an LCD, the 60 Hz flip looks great and it should “blend” even better on a classic CRT with longer phosphor retention times.

Then I’ll see if I can make a 1 MiB bank-switched cartridge image from the assembler that I can load in the emulator (and eventually flash onto a cartridge I can get for the physical device) so I can try running some longer animations/videos.

No rush though; I gotta get the flashable cartridge. ;)

Blog blog blog 2023

I resolve this year to publish more long-form blog posts with whatever cool stuff I’m working on, for work or for fun.

I’m trying to treat social media as more ephemeral. I quit Twitter entirely last year, deleting the account; my mastodon.technology account has vanished with the server shutting down, and I’ve even set my new account to delete non-bookmarked posts after two weeks.

It’s fun to talk about my projects a couple hundred characters at a time, but it’s also really nice to put together a bigger post that can take you through something over time and collect all the pieces together.

A long-form blog, with updateable pages, allows for this, and I think makes for a better experience when you really *do* mean to publish something informative or interesting. Let’s bring embloggeration back!

AudioFeeder updates for ogv.js

I’ve taken a break from the blog for too long! Time to update some on current work. We’re doing a final push on the video.js-based frontend media player for MediaWiki’s TimedMediaHandler, with some new user interface bits, better mobile support, and laying the foundation for smoother streaming in the future.

Among other things, I’m doing some cleanup on the AudioFeeder component in the ogv.js codec shim, which is still used in Safari on iOS devices and older Macs.

This abstracts a digital sound output channel with an append-only buffer, which can be stopped/started, the volume changed, and the current playback position queried.

When I was starting on this work in 2014 or so, Internet Explorer 11 was supported so I needed a Flash backend for IE, and a Web Audio backend for Safari… at the time, the only way to create the endless virtual buffer in Web Audio was using a ScriptProcessorNode, which ran its data-manipulation callback on the main thread. This required a fairly large buffer size for each callback to ensure the browser’s audio thread had data available if the main thread was hung up on drawing a video frame or something.

Fast forward to 2022: IE 11 and Flash are EOL and I’ve been able to drop them from our support matrix. Safari and other browsers still support ScriptProcessorNode, but it’s been officially deprecated for a while in favor of AudioWorklets.

I’ve been meaning to look into upgrading with an AudioWorklet backend but hadn’t had need; however I’m seeing some performance problems with the current code on Safari Technology Preview, especially on my slower 2015 MacBook Pro which doesn’t grok VP9 in hardware so needs the shim. :) Figured it’s worth taking a day or two to see if I can avoid a perf regression on old Macs when the next Safari update comes out.

So first — what’s a worklet? This is an interface that’s being adopted by a few web bits (I think some CSS animation bits are using these too) to have a fairly structured way of loading little scripts into a dedicated worker thread (the worklet) to do specific things off-main-thread that are performance critical (audio, layout, animation).

An AudioWorkletNode hooks into the Web Audio graph, giving something similar to a ScriptProcessorNode but where the processing callback runs in the worklet, on an AudioWorkletProcessor subclass. The processor object has audio-specific stuff like the media time at the start of the buffer, and is given input/output channels to read/write.

For an ever-growing output, we use 0 inputs and 1 output; ideally I can support multichannel audio as well, which I never bothered to do in the old code (for simplicity it downmixed everything to stereo). Because the worklet processors run on a dedicated thread, the data comes in small chunks — by default something like 128 samples — whereas I’d been using like 8192-sample buffers on the main thread! This allows you to have low latency, if you prefer it over a comfy buffer.

Communicating with worker threads traditionally sucks in JavaScript — unless you opt into the new secure stuff for shared memory buffers you have to send asynchronous messages; however those messages can include structured data so it’s easy to send Float32Arrays full of samples around.

The AudioWorkletNode on the main thread gets its own MessagePort, which connects to a fellow MessagePort on the AudioWorkletProcessor in the audio thread, and you can post JS objects back and forth, using the standard “structured clone” algorithm for stripping out local state.

I haven’t quite got it running yet but I think I’m close. ;) On node creation, an initial set of queued buffers are sent in with the setup parameters. When audio starts playing, after the first callback copies out its data it posts some state info back to the main thread, with the audio chunk’s timestamp and the number of samples output so far.

The main thread’s AudioFeeder abstraction can then pair those up to report what timestamp within the data feed is being played now, with compensation for any surprises (like the audio thread missing a callback itself, or a buffer underrun from the main thread causing a delay).

When stopping, instead of just removing the node from the audio graph, I’ve got the main thread sending down a message that notifies the worklet code that it can safely stop, and asking for any remaining data back. This is important if we’ve maintained a healthy buffer of decoded audio; in case we continue playback from the same position, we can pass the buffers back into the new worklet node.

I kinda like the interface now that I’m digging in it. Should work… Either tonight or tomorrow hope to sort that out and get ogv.js updated in-tree again in TMH.

HDR to SDR tone-mapping

I’ve been playing a lot of Flight Simulator lately, and when I acquired a monitor with basic high dynamic range (HDR) capability, thought it might be fun to try out. Little did I know it would launch me into a world of image processing and color spaces…

First, what is HDR? And what is SDR? Standard dynamic range images are optimized for a fairly small dynamic range of minimum to maximum brightness. The sRGB color space, standard for most computer stuff these days, is specified in optimal conditions for a maximum screen brightness of 80 nits (1 nit == 1 candela per square meter) in a darkened room, though most people’s desktops are much brighter for daylight conditions.

High dynamic range images can have much higher brightnesses, while (hopefully) still maintaining good detail in darker regions of the image. The common HDR10 pixel format used for HDR video allows for a maximum luminance of 10,000 nits — 125 times the brightness of a standard-calibrated SDR signal! Common displays may be much more limited though — my monitor is rated as DisplayHDR 400, which provides a maximum brightness of just 400 nits (5 times the SDR standard). This is still plenty to show brighter whites and colors, and is actually really nice for Flight Simulator where bright daylight and dark shadows and interiors coexist all the time.

However now that I’m flying in HDR and taking screenshots of my simulated adventures, how do I share those photos with everyone with a normal monitor, in file formats that social media platforms support?

Naturally, I decided that converting files one-off with a viewer app I found wasn’t good enough, and wrote my own utility I can use for batch-processing. ;) Once cleaned up, this can also become useful for Wikipedia to render SDR thumbnails of HDR images (once we confirm which formats we can support without problems).

To illustrate how tone-mapping and mapping of out of gamut colors affects the rendering, I’ve taken a particularly dramatic screenshot from an early morning flight, at sunrise. See also original file in JPEG XR format.

If we just clip the brighter colors into SDR range, the entire sky is completely blown out:

Or if we drop the exposure a few stops to optimize for the brightest colors, we can’t see anything but the sunrise:

To map the wide range of input into the [0, 1] range, we need some non-linear operator that preserves most of the detail in the low end untouched, then squishes brighter stuff into the top end with some loss of contrast.

A common HDR to SDR tone-mapping operator is the Reinhard algorithm; where C is the input value and C_white is the maximum value to be preserved:

TMO(C) = C(1 + C/C_white²)/(1 + C)
Reinhard et al, 2009

If you apply this separately to the input Red, Green, and Blue channels, you end up with a result that isn’t displeasing, but causes a lot of color shifts as the color elements don’t scale at the same rates… in this case, the orange areas of the sky become much more yellow than they should be. There’s also a lot of desaturation of brighter areas, much more than I like personally:

If instead we apply the operator in the luminance domain, we can preserve colors more exactly. However there’s a big problem, which is that a pixel’s luminance (brightness) may be much lower than the maximum of its components! For instance a deep orange will have a very high red, a more modest green, and a much more modest blue. When we map the resulting colors into the output, the red clips at maximum before the green does, causing bright sky oranges to shift towards yellow and lose contrast:

One possibility is to map those too-bright colors back into gamut by progressively desaturating them. For both luminance and saturation changes I’m using the Oklab color space, which is similar to CIELUV and is designed to make it easy to scale and transition colors maintaining perceptual qualities. If I apply just enough desaturation to keep every pixel’s Red, Green, and Blue elements in gamut, I lose some color in the brightest parts of the image but it packs the full punch of the brightness of the sunrise:

Which one’s right? There’s no one right answer. But when you’re batch processing you gotta pick a default, and I kinda like this last one. ;) It maintains the luminance data, which is most important to the human visual system, and though it loses the pure color of the sun and immediate area of the sunrise, it keeps the surrounding area much better than my other versions so far.

So what would we need to support these sorts of images on Wikipedia? A few things to consider:

First, actual file formats are important!

My screenshots are saved by the NVIDIA game capture tool in JPEG XR (a Microsoft flavored standard, which may or may not have patent issues but should be covered by their open source patent license covenant because they released a codec library for it). If patents aren’t a problem, it’s easy enough to use that library directly or indirectly.
I assume HDR can be done in HEIC/HEIF which is based on HEVC, the codec my NVIDIA tool captures videos in.
AVIF is the open media / Google-flavored variant of HEIF based on AV1 codec instead of HEVC. Should be no problems for patents from our perspective. I hear there may be browser support in Chrome at least, but have not tested this either yet.
OpenEXR is a more classic HDR file format for photography and cinema production usage. I don’t know the patent state, but it’s implemented by widely used open source tools.
For video, VP9 should be fine and AV1 will work later, but we’ll need more complications in the pipeline to deal with transcoding SDR and HDR variants!

Second, rendering regular SDR thumbnails for browsers that don’t grok them natively or don’t know how to tone-map well: we could probably adapt the utility I wrote to do this as a filter that we can plug into Thumbor. The code’s written in Rust as a CLI utility, and runs cross-platform. Could be adapted to take raw data on stdin/stdout or call as a library.

Third, interactive browser display. Whether on an SDR or HDR monitor it would often be nice to be able to adjust exposure in the viewer, which necessitates being able to do the tone-mapping in real-time; this would be best done in WebGL with a shader, rather than something silly like compiling my Rust code to WebAssembly. :)

And then that would have to get integrated into MediaViewer, with suitable mobile and desktop interfaces if necessary.

If we actually want to display HDR thumbnails inline — well that’s another fun thing! AVIF would be the main target, I think, but I don’t know what the status of support is in browsers yet (both for the format, and for HDR specifically).

We might also want the thumbnail & initial display on zoom to be able to set an exposure multiplier, or even specify whether to use tone-mapping or clip the range, as image parameters in the wiki page.

All fun possibilities that need to be decided on and taken into account some time. :)

Modularity and cross-language projects: a nostalgic look

Before I was paid to work on code for a living, it was my hobby. My favorite project from when I was young, before The Internet came to the masses, was a support library for my other programs (little widgets, games, and utilities for myself): a loadable graphics driver system for VGA and SVGA cards, *just* before Windows became popular and provided all this infrastructure for you. ;)

The basics

I used a combination of Pascal, C, and assembly language to create host programs (mainly in Pascal) and loadable modules (linked together from C and asm code). I used C for higher-level parts of the drivers like drawing lines and circles, because I could express the code more easily than in asm yet I could still create a tiny linkable bit of code that was self-sufficient and didn’t need a runtime library.

High performance loop optimizations and BIOS calls were done in assembly language, directly invoking processor features like interrupt calls and manually unrolling and optimizing tight loops for blits, fills, and horizontal lines.

Driver model

A driver would be compiled with C’s “tiny” memory model and the C and asm code linked together into a DOS “.com” executable, which was the simplest executable format devisable — it’s simply loaded into memory at the start of a 64-KiB “segment”, with a little space at the top for command line args. Your code could safely assume the pointer value of the start of the executable within that segment, so you could use absolute pointers for branches and local memory storage.

I kept the same model, but loaded it within the host program’s memory and added one more convention: an address table at the start of the driver, pointing to the start of the various standard functions, which was a list roughly like this:

set mode
clear screen
set palette
set pixel
get pixel
draw horizontal line
draw vertical line
draw arbitrary line
draw circle
blit/copy

Optimizations

IIRC, a driver could choose to implement only a few base functions like set mode & set/get pixel and the rest would be emulated in generic C or Pascal code that might be slower than an optimized version.

The main custom optimizations (rather than generic “make code go fast”) were around horizontal lines & fills, where you could sometimes make use of a feature of the graphics card — for instance in the “Mode X” variants of VGA’s 256-color mode used by many games of the era, the “planar” memory mode of the VGA could be invoked to write four same-color pixels simultaneously in a horiz line or solid box. You only had to go pixel-by-pixel at the left and right edges if they didn’t end on a 4-pixel boundary!

SVGA stuff sometimes also had special abilities you could invoke, though I’m not sure how far I ever got on that. (Mostly I remember using the VESA mode-setting and doing some generic fiddling at 640×480, 800×600, and maybe even the exotic promise of 1024×768!)

High-level GUI

I built a minimal high-level Pascal GUI on top of this driver which could do some very simple window & widget drawing & respond to mouse and keyboard events, using the low-level graphics driver to pick a suitable 256-color mode and draw stuff. If it’s the same project I’m thinking of, my dad actually paid me a token amount as a “subcontractor” to use my GUI library in a small program for a side consulting gig.

So that’s the story of my first paying job as a programmer! :)

Even more nostalgia

None of this was new or groundbreaking when I did it; most of it would’ve been old hat to anyone working in the graphics & GUI program library industries I’m sure! But it was really exciting to me to work out how the pieces went together with the tools available to me at the time, with only a room full of Byte! Magazine and Dr Dobb’s Journal to connect me to the outside world of programming.

I’m really glad that kids (and adults!) learning programming today have access to more people and more resources, but I worry they’re also flooded with a world of “everything’s already been done, so why do anything from scratch?” Well it’s fun to bake a cake from scratch too, even if you don’t *have* to because you can just buy a whole cake or a cake mix!

The loadable drivers and asymmetric use of programming languages to target specific areas of work are *useful*. The old Portland Pattern Repository Wiki called it “alternating hard and soft layers”. 8-bit programmers called it “doing awesome stuff with machine language embedded in my BASIC programs”. Embedded machine code in BASIC programs you typed in from magazines? That was how I lived in the late 1980s / early 1990s my folks!

Future

I *might* still have the source code for some of this on an old backup CD-ROM. If I find it I’ll stick this stuff up on GitHub for the amusement of my fellow programmers. :)

Windows ARM64 Visual Studio Code Insiders now available

Just want to give a shout-out to the wonderful folks at Microsoft and elsewhere who have gotten a Visual Studio Code Insiders build created for Windows on ARM64, which runs natively on the Surface Pro X and other ARM64 machines.

It’s still not listed in the regular downloads but it works for me when installed directly, and should auto-update with further Insiders releases. :)

The x86 build ran acceptably for some light development on the Surface Pro X in emulation, but the native build feels a *lot* faster. Starts up instantly, no longer so sluggish to scroll or wait for linter updates.

Now all I need is Docker for Win10/ARM64 and for WSL2 to fix the ARM64 performance problems with Hyper-V. :)

Surface Pro X thoughts after a few weeks

After a few weeks using the fancy new Windows 10 ARM64 tablet, the Surface Pro X, I’ve got a few thoughts. Mostly good so far, but it remains an early adopter device with a few rough edges (virtually — the physical edges are smooth and beautiful!) Note that my use cases are not everyone’s use cases, so some people will have even more luck, or even less luck, getting things working. :) Your mileage can and will vary.

Hardware

It’s just gorgeous. Too gorgeous. It’s all black-on-black labeled with black type. Mostly this is fine, but I find it hard to find the USB-C ports on the left side when I’ve got it propped up on its stand. :)

Seriously though, my biggest hardware complaint is that the bezels are too small for its size when holding as a tablet in the hands — I keep hitting the corners with my fat hands and opening the start menu or closing an app. I’m still not 100% sold on the idea of tablets-with-keyboards at this size (13″ diagonal or so).

But for watching stuff, the screen is *fantastic*. The 3:2 aspect ratio is also much better for anything that’s not video, while still not feeling like I’ve wasted much space on a 16:9 letterbox.

The keyboard attachment is pretty good. Get it. GET IT. I got the one that also has the cradle for the pen, which I never use but felt like I had to try out. If I did more art I would probably use it.

Performance and emulation

The CPU is really good. It’s got a huge speed boost over the Snapdragon 835 and 850 in older ARM64 Windows machines, and feels very snappy in native apps like Firefox or the new Edge. With 4 high-power CPU cores and 4 low-power cores, it handles multithreaded workloads fairly well unless they get confused by the scheduler… I’ve sometimes seen things have background threads get pushed to the low-power cores where they take a long time to run.

(In Task Manager, you can see the first 4 cores are the low-power cores, the next 4 are high-power.)

x86 Windows software is supported via emulation, both for store apps and regular win32 apps you find anywhere. But not everything works. I’ve generally had good luck with tools and applications – Visual Studio, VS Code, Chrome, Git for Windows, Krita, Inkscape all run. But about 1/2 of the Steam games I tried failed to run, maybe more. And software that’s x64-only won’t run at all, as there’s no emulator support for 64-bit code.

Emulated code in my unscientific testing runs 2-3 times slower than native code on sustained loops, but you can expect loading-time stuff to be slower because things have to get traced/compiled on the first run through or when code is modified in memory.

Nonetheless, 2-3 times slower than really-fast is still not-bad, and for UI-heavy or i/o-heavy applications it’s not too significant. I’ve had no real complaints using the x86 VS Code front-end, but more complaints with, say, compiling things in Visual Studio. :)

Web use case

Most of what I use a computer for these days is in a web browser environment, so “using the web” is big. Firefox has an optmized, native ARM64 build. Works great. ’nuff said.

Oh also Edge preview builds in the Dev and Canary channel are ARM64 native and run great, if you like that sort of thing.

Chrome, however, has not released a native build and will run in x86 emulation. If you need Chrome specifically it *will install and run* but it will be slow. Do not grab custom Chromium builds unless you’re using them only for testing, as they will not be secure or get updated!

Developer use case

I’m a software developer, so in addition to “everything that goes in a web browser” I need to use tools to work on a combination of stuff, mostly:

PHP and client-side JavaScript code (MediaWiki, a few other bits)
weird science C / JavaScript / emscripten / WebAssembly stuff (ogv.js, which plugs into MediaWiki’s video player extension)
research work in Rust (mtpng threaded PNG compressor)

LAMP stuff

I’m used to working in either a macOS or Linux environment, with Unix-like command line tools and usually a separate GUI text editor like Visual Studio Code, and never had good experiences trying to run the MediaWiki LAMP-stack tools on a Windows environment in years past. Even with Vagrant managing a VM, it had proved more fragile on Windows for me than on Mac or Linux.

WSL (Windows Subsystem for Linux) has changed that. I can run a Debian or Ubuntu system with less overhead and better integration to the host system than running in a traditional VM like VirtualBox or Hyper-V. On the Surface Pro X, you get the aarch64 distribution of Ubuntu or Debian (or whatever other supporting distro you choose to install) so it runs full speed, with no emulation overhead.

I’ve been using a MediaWiki git checkout in an Ubuntu setup, using the standard PHP/Apache/MySQL/whatevers and manually running git & composer updates. The main downside to using WSL here is that services don’t get started automatically because it doesn’t run the traditional init process, but “service mysql start” etc works as expected and gets you working.

For editing, I use Visual Studio Code. This is not yet available as an ARM64 optimized build (the x86 frontend runs in emulation), but does in 1.41 now include ARM64 support for WSL integration — which means you can run the PHP linter on your code running inside the Linux environment while your editor frontend is a native Windows GUI app. No wacky X11 hacks required.

emscripten stuff

The emscripten compiler for WebAssembly stuff works great, but doesn’t ship ARM or ARM64 builds for any platform yet in the emsdk tool.

You can build manually from source for now, and hopefully I can get builds working from the emsdk installer too (though you still would have to run the build yourself).

The main annoyance I had was that Ubuntu LTS currently ships an old node.js, which I had to supplement with a newer build to get my environment the way I wanted it for my scripts. :) This was pretty straightforward.

Rust stuff

Rust includes support for building code for Windows ARM64 — it has to to support things like Firefox! — but the compiler & tools distribution comes as x86. I’m sure this will eventually get worked out, but for now if you install Rust on Windows you’ll get the x86 build and may have to manually add the aarch64 target. But it does work — I can compile and run my mtpng project for Windows 10 ARM64 on the device.

Within a WSL environment, you can install Rust for Linux aarch64 and it “just works” as you’d expect, as well.

Final notes

All in all, pretty happy with it. I might have preferred a Surface Laptop X with similar specs but a built-in keyboard, but at a desk or other …. “surface” … it works fine for typey things like programming.

Certainly I prefer the keyboard to the keyboard on my 2018 MacBook Pro. ;)

Overflowing stacks in WebAssembly, whoops!

Native C-like programming environments tend to divide memory into several regions:

static data contains predefined constants loaded from the binary, and space for global variables (“data”)
your actual code has to go in memory too! (“text”)
space for dynamic memory allocation (“heap”), which may grow “up” as more memory is needed
space for temporary data and return addresses for function calls (“stack”), which may grow “down” as more memory is needed

Usually this is laid out with the stack high in the address space and the heap lower in the address space, if I recall correctly? Allocating more heap is done when you need it via malloc, and the stack can either grow or warn of an overflow by using the CPU’s memory manager to detect use of data pages incremented beyond the edge of the stack.

In emscripten’s WebAssembly porting environment, things are similar but a little different:

code doesn’t live in linear memory, so functions don’t have memory addresses
because code return addresses and small local variables also live separately, only arrays/structs and vars with address taken must be on stack.
usable memory is continguous; you can’t have a sparse address space where stack and heap can both grow.

As a result, the stack is fixed size and there’s some fragility, but the stack uses less space usually.

Currently the memory layout is to start with static data, follow with the stack, and then the heap. The stack grows “down”, meaning when you reach the end of the stack you end up in static data territory and can overwrite global variables. This leads to weird, hard to detect errors.

When making debug builds with assertions there are some “cookie” checks to ensure that some specific locations at the edge of the stack have not been overwritten at various times, but this doesn’t always catch things if you only overwrote the beginning of a buffer in static data and not the part that had the cookie. :) It also doesn’t seem to trigger on library workflows where you’re not running through the emscripten HTML5/SDL/WebGL runtime.

There’s currently a PR open to reduce the default stack size from 5 MiB to 0.5 MiB, which reduces the amount of memory needed for small modules significantly, and we’re chatting a bit about detecting errors in the case that codebases have regressions…

One thing that’s come up is the possibility of moving the stack to before static data, so you’d have: stack, static data, heap.

This has two consequences:

any memory access beyond the stack end will wrap around 0 into unallocated memory at the top of the address space, causing an immediate trap — this is done by the memory manager for free, with no false positives or negatives
literal references to addresses in static data will be larger numbers, thus may take more bytes to encode in the binary (variable-length encoding is used in WebAssembly for constants)

Probably the safety and debugging win would be a bigger benefit than the size savings, though potentially that could be a win for size-conscience optimizations when not debugging.