WebM seeking coming in ogv.js 1.1.2

Seeking in WebM playback will finally be supported in ogv.js 1.1.2. Yay! Try it out!

I seeked in this WebM file! Yeah really!

This took some fancy footwork, as the demuxer library I’m using (nestegg) only provides a synchronous i/o-using interface for seeking: on the first seek, it needed to be able to first do a seek to the location of the cues in the file (they’re usually at the end in WebM, not at the beginning!), then read in the cue data, and then run a second seek to the start of a cluster.

On examining the innards of the library I found that it’s fairly safe to ‘restart’ the operation directly after the seek attempts, which saved me the trouble of patching the library code; though I may come up with a patch to more cleanly & reliably do this.

For the initial hack, I have my i/o callbacks detect that an attempt was made to seek outside the buffer range, and then when the nestegg library function fails out, the demuxer sees the seek attempt and passes it back to the caller (OGVPlayer) which is able to trigger a seek on the actual, asynchronous i/o layer. Once the new data starts arriving, we call back into the demuxer to read the cues or confirm that we’ve seeked to the right place and can continue decoding. As an additional protection against the library freaking out, I make sure that the entire cues element has been buffered before attempting to read it.

I’ve also fixed a bug that caused WebM playback to occasionally die with an out of memory error. This one was caused by not freeing packet data in the demuxer. *headdesk* At least that bug was easier. 😉

This gets WebM VP8 playback working about as well as Ogg Theora playback, except for the decoding being about 5x slower! Well, I’ve got plans for that. 😀

VP8 parallelization continued

Following up on earlier musings… Found an interesting analysis by someone who did some work in this area a few years ago.

image

Based on the way data flows from  macroblocks adjacent above or to the left, it is safe to parallelize multiple super blocks in diagonal lines running from top-right towards the bottom-left. This means you start with a batch of one macroblock, then a batch of two, then three…. Up to the maximum diagonal dimension (30 blocks at 480p or 68 at 1080p) then back down again to one at the far bottom right corner. Hmm, not exactly diagonal, actually half diagonal?

image

(Based on my reading the filter stage doesn’t need to access the above-right block which simplifies things, but I could be wrong… Intraframe prediction looks scarier though and may need the different angle cut. Anyway the half diagonal order should work for the entire set of all operations…).

This breakdown should be suitable both for CPU worker-based threading (breaking the batches into smaller sub-batches per core) and for WebGL shader based GPU work (where each large batch would issue several draw calls, each processing up to the full batch size of macroblocks).

For CPU work there could also be pipelining between the data stages, though if doing full batching that may not be necessary.

Data locality and latency

On modern computers, accessing memory is often the slowest part of a naively written calculation. If the memory you’re working with isn’t in cache, fetching it can be verrrry slow. And if you need to transfer data between the CPU and GPU things get even nastier.

Unpacking the entropy-coded data structures for an entire frames worth of macroblocks then processing them in a different order sounds like it might be expensive. But it shouldn’t be insanely so; most processing will be local to each block.

For GPU usage, data also flows mostly one way from the CPU into textures and arrays uploaded into the GPU, where random texel access should be fast. We shouldn’t need to read any of that data back to the CPU until the end of the loop filter, when we read the YUV buffers back to feed into the next frame’s predictions and output to the player.

What we do need for the GPU is to read the results of each batch’s computations back into the next batch. As long as I can render into a texture and use that texture as a source for the next call I think that all works as I need it.

Loop filter madness

The loop filter stage reduces artifacting at the edges of macroblocks and sub-blocks from motion prediction and DCT fun. Because it’s very precisely specced, and the output feeds back into the next frame’s predictions, it’s important to get this right.

per spec, each macroblock may apply up to four filter passes:

left edge

subblock left edges

top edge

subblock top edges

Along each edge, for each pixel along the edge the boundaries are tested for a threshold and a filter may or may not be applied, variously to one, two, or three pixels deep from the edge. For the macroblock edges, when the filter applies it also modifies the adjacent block data!

It’s all pretty funky, but it looks parallelizable within limits.

Ah, now to find time to research this further while still getting other stuff done. 😉

Peeking into VP8 video decoding performance

The ogv.js distribution includes Ogg Theora video decoding, which we use on Wikipedia to play back our media files in Safari, IE, and Edge, but also has an experimental mode for WebM files with VP8 video.

WebM/VP8 is better supported by many tools, and provides higher quality video at lower bandwidth usage than Ogg Theora… There are two major reasons I haven’t taken WebM out of “experimental” status:

  1. The demuxer library (nestegg) was hacked in quickly and I need to add support for seeking, “not crashing”, etc
  2. Decoding VP8 video is several times slower than decoding Theora at the same resolution.

The first is a simple engineering problem; I can either hack up nestegg or replace it with something that fits the i/o model I’m working with better.

The second is an intrinsic complexity problem: although the two formats are broadly similar technologies, VP8 requires more computation to decode a frame  than Theora does.

Add to this the fact that we have some environmental limitations on common CPU optimizations for parallelizable code in the C libraries:

  • JavaScript “Worker” threads are different from the low-level pthreads threading model used by C code, and the interfaces required to more closely emulate it are not yet available in Safari or Edge.
  • SIMD (“Same Instruction Multiple Data”) processing is not available in Safari, and not yet production-enabled in Edge.

So, if we can’t use SIMD instructions to parallelize tiny bits of processing, and we can’t simply crank up multithreading to use a now-ubiquitous second CPU core, how can we split up the work?

The first step, I think, is to determine some data boundaries where we might hope to be able to export data from the libvpx library before it’s fully done processing.

What’s what

The VP8 decoder & bitstream format is defined in RFC 6386, from which we can roughly divide decoding into four stages:

  1. …decode input stream entropy encoding…
  2. reconstruct base signal by applying motion vectors against a reference frame
  3. reconstruct residual signal by applying inverse DCT
  4. apply loop filter

The entropy encoding isn’t really a separate stage, but happens as other data gets read in and then fed into each other stage. It’s not really parallelizable itself either.

Aiming high

So before I go researching more on trying to optimize some of these steps, what’s actually the biggest, slowest stuff to work on?

I did a quick profiling run playing back some 480p WebM in Chrome (pretty good compiler, and the profiling tools support profiling just one worker thread which isolates the VP8 decoder nicely). Broke down the resulting function self-time list by rough category and found:

Filter: 54.40%
Motion: 22.75%
IDCT: 10.55%
Other: 12.31%

(“Other” included any function I didn’t tag, as well as general overhead.)

Ouch! Filtering looks like a good first application of a separate step.

Possible directions – staying on CPU

If we stick with the CPU, we can create further Worker threads and send blocks of data to be filtered. In many cases, even when processing of one macroblock to another is hard to parallelize because of data dependencies, there is natural parallelism in the color system — the Y (“luma”, or brightness) plane can be processed by one thread while the U and V (“chroma”, or color) planes can be processed independently by another.

However splitting processing between luma and chroma has a limited benefit. There’s twice as much luma data as chroma, so you save at most 33% of the runtime here.

Macroblocks and subblocks

VP8 divides most processing & data into “macroblocks” of 16×16 pixels, composed of 24 “subblocks” of 4×4 pixels (that’s 16 subblocks of luma data, plus 4 each from the two lower-resolution chroma planes).

In many cases we can parallelize at the subblock level, which divides 24 evenly into 2 cores, or even 4 or 8 cores! But the most performance-sensitive devices will usually only have 2 CPU cores available, giving us only a 50% potential speedup.

Going crazy – GPU time?

Graphics processors are much more aggressively multithreaded, with multiple cores and ability to handle many more simultaneous operations.

If we run more operations on the GPU, we might actually have a chance to run 24 subblocks all in parallel!

However there’s a lot more fuzziness involved in getting there:

  • may have to rewrite code from C into GLSL
  • have to figure out how to operate as fragment or vertex shaders
  • have to figure out how to efficiently get data in/out
  • oh and you can’t use WebGL from a Worker thread, so have to bounce data up to the main thread and back down to the codec
  • once all that’s done, what’s the actual performance of the GPU vs the CPU per-operation? no idea 😀

So, there would seem to be great scaling potential on the GPU side, but there are a lot of unknowns.

Worth investigating but we’ll see what’s really feasible…

 

1000fps no more

Different media file formats encode things like time and frame rates differently… or not at all.
 
WebM doesn’t list a frame rate; each frame is simply given a position in time. Meanwhile the older Ogg Theora codec defines a consistent, pre-defined frame rate for a stream, but allows frames to be declared as duplicates of the previous frame as an optimization.
 
At the intersection of these two, some files auto-converted from WebM to Ogg on Wikimedia’s servers end up claiming to encode a “1000 fps” video stream, where nearly all the frames are dupes and there’s actually ~25-30 or at most 60 actual frames per second.
 
I had to put a hack into my ogv.js player to handle these, because actually trying to draw 1000 frames per second was kind of slow. 😉
https://github.com/brion/ogv.js/issues/349

ogv.js 1.1.0 alpha now on npm

ogv.js 1.1.0-alpha.0 is now available for download:

Big thanks to Stephan Hesse who retooled large chunks of the build system using webpack, which brought us a lot closer to the npm package release.

ogv.js 1.1.0 is a drop-in update to 1.0; many internal classes are no longer leaked to global namespace, but the public OGV* classes remain as they are.

The internal AudioFeeder class is also available as a separate npm audio-feeder package; more internal classes will follow including the streaming URL reader and the WebGL-accelerated YUV canvas.

In addition to internal/build changes, this release has major fixes for seeking in ogg files, implements the volume property, and adds support for more properties and events (not yet 100% up to spec, but closer).

After a few more days shaking this out I’ll push it up to MediaWiki’s TimedMediaHandler extension, where it’ll make it to Wikipedia and Wikimedia Commons.

Windows 10 Insider build 14316 supports VP9 & Opus in WebM (sorta)

On the ‘Insider’ build 14316 of Windows 10, WebM VP9+Opus video files can be played in Microsoft Edge! Well, indirectly via Media Source Extensions, but still. 🙂

Try it out! https://brionv.com/misc/msetest/ (primitive demo)

You may have to manually enable VP9 in about:flags however, as it defaults to on only if hardware-accelerated, and I have no idea which GPU drivers support that on Windows yet.

flags

It’d be great if HTMLMediaElement and MediaSource’s methods for checking playback support could also allow checking the hw-acceleration status — especially on mobile, it’s often preferable to use a codec that’s hardware-accelerated (like H.264) to use less battery life even if you would get better visual quality from a more advanced codec (like VP9).

On the other hand in Wikipedia’s case, we don’t use H.264 for patent license issues, so our alternative to software VP9 decoding is JavaScript Theora or VP8 decoding, which is going to be harder on the CPU than nicely tuned native code for VP9 decoding.

Alliance for Open Media code drop & more hardware partners

Very exciting! The new video codec so far is mostly based on Google’s in-development VP10 (next gen of the VP8/VP9 used in WebM) but is being co-developed and supported with a number of other orgs.

  • CPU/GPU/SoC makers: Intel, AMD, ARM, NVidia
  • OS & machine makers: Google, Microsoft, Cisco
  • Browser makers: Mozilla, Google, Microsoft
  • Content farms: Netflix, Google (YouTube)

Microsoft is also actively working on VP8/VP9 support for Windows 10, with some limited compatibility in preview releases.

As always, Apple remains conspicuously absent. 🙁

Like the earlier VP8/VP9 the patent licenses are open and don’t have the kind of weird clauses that have tripped up MPEGLA’s H.264 and HEVC/H.265 in some quarters. (*cough* Linux *cough* Wikipedia)

Totally trying to figure out how we can get involved at this stage; making sure I can build the codec in my iOS app and JavaScript shim environments will be a great start!

Web Modules for JS + asset files: why don’t you exist yet?

Ugh, there really needs to be a sane way to package a “web” module that includes both JavaScript code and other assets (images, fonts, styles, Flash shims, and other horrors).

I’m cleaning up some code from my ogv.js JavaScript audio/video player, including breaking out the audio output and streaming URL input as separate modules for reuse… The audio output class is mostly a bit of JavaScript, but includes a Flash shim for IE 10/11 which requires bundling a .swf file and making it web-accessible.

Browserify can package up a virtual filesystem into your JS code — ok for loading a WebGL shader maybe — but can’t expose it to the HTML side of things where you want to reference a file by URL, especially if you don’t want it preloaded.

Bower will fetch everything from the modules you specify, but doesn’t distinguish clearly between files that should be included in the output (or what paths will be used to reference them) and random stuff in your repo. They kinda recommend using further tools to distill down an actual set of output…

Dear LazyWeb: Is there anything else in the state of the art that can make it a little easier to bundle together this sort of module into larger library or app distributions?

Windows RT is dead, long live Windows 10

Some readers with niche interests may remember Windows RT, the unpopular now-discontinued flavor of Windows 8/8.1 for ARM-based tablets.

Windows RT shipped on the first two generations of Microsoft Surface tablets (the lower-end Surface RT and Surface 2, not the Intel-powered Surface Pro line), and a tiny handful of third-party ARM tablets.

Even though it exclusively ran on touchscreen tablets and couldn’t run any of the massive library of 3rd-party Win32 software, Windows RT had the same widely-hated hybrid touch/desktop interface that laptop and desktop users encountered on Windows 8. Microsoft’s inability to port its Office suite fully to the touchscreen environment in time for Windows 8/RT’s release particularly highlighted this divide, as you had to bounce into the desktop every time you pulled up Word or PowerPoint.

It was pretty frustrating to use, and pretty frustrating to develop for — since the desktop was included, there was no technical reason why Win32 application developers couldn’t recompile their apps for ARM processors (maybe with some touch-friendly UI enhancements) like Microsoft did with Office… but Windows RT, unlike Windows 8, was locked down to require code signing approval from Microsoft on both touch-style apps and Win32 apps.

So developers had to port their apps completely to the new sandboxed ‘Modern’ API subsets and package format in order to run on Windows RT at all… and on Windows 8/8.1 desktops, those apps would only run fullscreen, not in… you know… windows. Overwhelmingly developers chose not to jump through those hoops, and everybody lost.

When it was announced that there would be no upgrade path for the ARM-based Windows RT devices to Windows 10, that was the final nail in the coffin.

Or was it?

The true innovation of Windows RT is that it was Windows 8’s ARM flavor. The API surfaces were the same, the source code was the same, the security updates came out together, and the major 8.1 upgrade came out together. If you were one of the few who developed ‘Modern’ apps for Windows 8/8.1, you only had to not uncheck the ‘ARM’ box when building and deploying to the Windows Store to reach Windows RT users as well.

That it was exposed to end-users in the Windows 8 ‘weird hybrid interface’ stage, and without Win32 access for 3rd-party developers, is perhaps unfortunate. But then we can say that of Windows 8 as a whole. 😉

But what this experiment enabled was the replacement of the old Windows CE + weird .NET-based Windows Phone 7 with the Windows NT + weird .NET-based Windows Phone 8, followed finally by the Windows NT-native Windows 10 Mobile and Windows 10 IoT… and even the Xbox One, which is picking up more user-facing Windows 10 features with each major update.

Like Windows RT  & Windows 8, Windows 10’s Mobile/IoT editions are built from the same base as Windows 10 for Intel-based desktops/laptops/servers, just in ARM flavor. Application developers targeting the Windows 10 ‘Universal Windows Platform‘ can build and deploy the exact same binary package on all flavors, with compilation targeting the CPU architectures rather than the device types.

And unlike Windows RT & Windows 8, the various Windows 10 flavors actually tailor their user interfaces to their environments. 🙂

Windows 10 has also reduced the lock-in requirements a bit, making it possible for any end-user to relax the code-signing requirements for Universal Windows apps to install custom apps.

Now, it may be that Microsoft has dropped the ball and will never recover in the phone market. But having a common platform base between desktop/laptop, tablet, phone, TV/game, and ‘other device’ seems like it will be an advantage in enabling cross-development now that a subset of Surface-branded tablets isn’t the only reason to touch the ‘Modern’ APIs.

The real question is going to be how hard it is to get into the Universal Windows Platform ecosystem at all — once you’re in, handling multiple devices gets easier. Rumored tools to make it easier to port Win32 apps to run within UWP should help with this, as may things like the Objective-C bridge for devs with iOS codebases.

Google seems to be doing much the same with Android, trying to expand its Linux + weird Java system from phones & tablets to smartwatches and TVs/game consoles. It remains to be seen whether Android will also push out or merge with Chrome OS in Google’s laptop/desktop market.

In comparison, Apple is pursuing a slightly different strategy where various Darwin flavors (Mac OS X, iOS, tvOS, watchOS) are closely related but slightly different, requiring separate compilation and deployment for every flavor. You can usually build from the same source using the same tools, but it feels a lot more fiddly to me.

 

 

Windows 10 Objective-C bridge

While I was waiting for updates to download I finally took a look at Microsoft’s Objective-C ‘bridge’ for porting iOS apps to the Windows 10 platform.

It’s technically a pretty nice system; they’re using clang to build your Objective-C code natively to the Universal Windows Platform as native x86 or ARM code, with an open-source implementation of (large parts of) Apple’s Foundation and UIKit APIs on top of UWP. Since you’re building native code, you also have full access to the Windows 10 APIs if you want to throw in some ‪#‎ifdefs‬.

I suspect there are a lot fewer ‘moving parts’ in it than were in the ‘Project Astoria’ Android bridge (which would have had to deal with Java etc), so in retrospect it’s not super surprising that they kept this one while canceling the Android bridge.

I don’t know if it’ll get used much other than for games that targeted iOS first and want to port to Xbox One and Windows tablets easily, but it’s a neat idea.

Probably tricky to get something like the Wikipedia app working since that app does lots of WebView magic etc that’s probably a bad API fit… but might be fun to play with!