Acko.net

Occlusion with Bells On

2025-03-24T00:00:00+01:00

Modern SSAO in a modern run-time

Use.GPU 0.14 is out, so here's an update on my declarative/reactive rendering efforts.

The highlights in this release are:

dramatic inspector viewing upgrades
a modern ambient-occlusion (SSAO/GTAO) implementation
newly revised render pass infrastructure
expanded shader generation for bind groups
more use of generated WGSL struct types

SSAO with Image-Based Lighting

The main effect is that out-of-the-box, without any textures, Use.GPU no longer looks like early 2000s OpenGL. This is a problem every home-grown 3D effort runs into: how to make things look good without premium, high-quality models and pre-baking all the lights.

Use.GPU's reactive run-time continues to purr along well. Its main role is to enable doing at run-time what normally only happens at build time: dealing with shader permutations, assigning bindings, and so on. I'm quite proud of the line up of demos Use.GPU has now, for the sheer diversity of rendering techniques on display, including an example path tracer. The new inspector is the cherry on top.

A lot of the effort continues to revolve around mitigating flaws in GPU API design, and offering something simpler. As such, the challenge here wasn't just implementing SSAO: the basic effect is pretty easy. Rather, it brings with it a few new requirements, such as temporal accumulation and reprojection, that put new demands on the rendering pipeline, which I still want to expose in a modular and flexible way. This refines the efforts I detailed previously for 0.8.

Good SSAO also requires deep integration in the lighting pipeline. Here there is tension between modularizing and ease-of-use. If there is only one way to assemble a particular set of components, then it should probably be provided as a prefab. As such, occlusion has to remain a first class concept, tho it can be provided in several ways. It's a good case study of pragmatism over purity.

In case you're wondering: WebGPU is still not readily available on every device, so Use.GPU remains niche, tho it already excels at in-house use for adventurous clients. At this point you can imagine me and the browser GPU teams eyeing each other awkwardly from across the room: I certainly do.

Inspector Gadget

The first thing to mention is the upgraded the Use.GPU inspector. It already had a lot of quality-of-life features like highlighting, but the main issue was finding your way around the giant trees that Use.GPU now expands into.

Old

New

Highlights show data dependencies

The fix was filtering by type. This is very simple as a component already advertises its inspectability in a few pragmatic ways. Additionally, it uses the data dependency graph between components to identify relevant parents. This shows a surprisingly tidy overview with no additional manual tagging. For each demo, it really does show you the major parts first now.

If you've checked it out before, give it another try. The layered structure is now clearly visible, and often fits in one screen. The main split is how Live is used to reconcile different levels of representation: from data, to geometry, to renders, to dispatches. These points appear as different reconciler nodes, and can be toggled as a filter.

It's still the best way to see Live and Use.GPU in action. It can be tricky to grok that each line in the tree is really a plain function, calling other functions, as it's an execution trace you can inspect. It will now point you more in the right way, and auto-select the most useful tabs by default.

The inspector is unfortunately far heavier than the GPU rendering itself, as it all relies on HTML and React to do its thing. At some point it's probably worth to remake it into a Live-native version, maybe as a 2D canvas with some virtualization. But in the mean time it's a dev tool, so the important thing is that it still works when nothing else does.

Most of the images of buffers in this post can be viewed live in the inspector, if you have a WebGPU capable browser.

SSAO

Screen-space AO is common now: using the rendered depth buffer, you estimate occlusion in a hemisphere around every point. I opted for Ground Truth AO (GTAO) as it estimates the correct visibility integral, as opposed to a more empirical 'crease darkening' technique. It also allows me to estimate bent normals along the way, i.e. the average unoccluded direction, for better environment lighting.

Hemisphere sampling

This image shows the debug viz in the demo. Each frame will sample one green ring around a hemisphere, spinning rapidly, and you can hold ALT to capture the sampling process for the pixel you're pointing at. It was invaluable to find sampling issues, and also makes it trivial to verify alignment in 3D. The shader calls printPoint(…) and printLine(…) in WGSL, which are provided by a print helper, and linked in the same way it links any other shader functions.

Bent normal and occlusion samples

SSAO is expensive, and typically done at half-res, with heavy blurring to hide the sampling noise. Mine is no different, though I did take care to handle odd-sized framebuffers correctly, with no unexpected sample misalignments.

It also has accumulation over time, as the shadows change slowly from frame to frame. This is done with temporal reprojection and motion vectors, at the cost of a little bit of ghosting. Moving the camera doesn't reset the ambient occlusion, as long as it's moving smoothly.

Motion vectors example

Accumulated samples

As Use.GPU doesn't render continuously, you can now use to decide how many extra frames you want to render after every visual change.

Reprojection requires access to the last frame's depth, normal and samples, and this is trivial to provide. Use.GPU has built-in transparent history for render targets and buffers. This allows for a classic front/back buffer flipping arrangement with zero effort (also, n > 2).

Depth history

You bind this as virtual sources, each accessing a fixed slot history[i], which will transparently cycle whenever you render to its target. Any reimagined GPU API should seriously consider buffer history as a first-class concept. All the modern techniques require it.

IGN

Rather than use e.g. blue noise and hope the statistics work out, I chose a very precise sampling and blurring scheme. This uses interleaved gradient noise (IGN), and pre-filters samples in alternating 2x2 quads to help diffuse the speckles as quickly as possible. IGN is designed for 3x3 filters, so a more specifically tuned noise generator may work even better, but it's a decent V1.

Reprojection often doubles as a cheap blur filter, creating free anti-aliasing under motion or jitter. I avoided this however, as the data being sampled includes the bent normals, and this would cause all edges to become rounded. Instead I use a precise bilateral filter based on depth and normal, aided by 3D motion vectors. This means it knows exactly what depth to expect in the last frame, and the reprojected samples remain fully aliased, which is a good thing here. The choice of 3D motion vectors is mainly a fun experiment, it may be an unnecessary luxury.

Detail of accumulated samples

The motion vectors are based only on the camera motion for now, though there is already the option of implementing custom motion shaders similar to e.g. Unity. For live data viz and procedural geometry, motion vectors may not even be well-defined. Luckily it doesn't matter much: it converges fast enough that artifacts are hard to spot.

The final resolve can then do a bilateral upsample of these accumulated samples, using the original high-res normal and depth buffer:

Upscaled and resolved samples, with overscan trimmed off

Because it's screen-space, the shadows disappear at the screen edges. To remedy this, I implemented a very precise form of overscan. It expands the framebuffer by a constant amount of pixels, and expands the projectionMatrix to match. This border is then trimmed off when doing the final resolve. In principle this is pixel-exact, barring GPU quirks. These extra pixels don't go to waste either: they can get reprojected into the frame under motion, reducing visible noise significantly.

In theory this is very simple, as it's a direct scaling of [-1..1] XY clip space. In practice you have to make sure absolutely nothing visual depends on the exact X/Y range of your projectionMatrix, either its aspect ratio or in screen-space units. This required some cleanup on the inside, as Use.GPU has some pretty subtle scaling shaders for 2.5D and 3D points and lines. I imagine this is also why I haven't seen more people do this. But it's definitely worth it.

Overall I'm very satisfied with this. Improvements and tweaks can be made aplenty, some performance tuning needs to happen, but it looks great already. It also works in both forward and deferred mode. The shader source is here.

Render Buffers & Passes

The rendering API for passes reflects the way a user wants to think about it, as 1 logical step in producing a final image. Sub-passes such as shadows or SSAO aren't really separate here, as the correct render cannot be finished without it.

The main entry point here is the component, representing such a logical render pass. It sits inside a view, like an , and has some kind of pre-existing render context, like the visible canvas.

...

You can sequence multiple logical passes to add overlays with overlay: true, or even merge two scenes in 3D using the same Z-buffer.

Inside it's a declarative recipe that turns a few flags and options into the necessary arrangement of buffers and passes required. This uses the alt-Live syntax use(…) but you can pretend that's JSX:

const resources = [
  use(ViewBuffer, options),
  lights ? use(LightBuffer, options) : null,
  shadows ? use(ShadowBuffer, options) : null,
  picking ? use(PickingBuffer, options) : null,
  overscan ? use(OverscanBuffer, options) : null,
  ...(ssao ? [
    use(NormalBuffer, options),
    use(MotionBuffer, options),
  ] : []),
  ssao ? use(SSAOBuffer, options) : null,
];

const resolved = passes ?? [
  normals ? use(NormalPass, options) : null,
  motion ? use(MotionPass, options) : null,
  ssao ? use(SSAOPass, options) : null,
  shadows ? use(ShadowPass, options) : null,
  use(DEFAULT_PASS[viewType], options),
  picking ? use(PickingPass, options) : null,
  debug ? use(DebugPass, options) : null,
]

e.g. The will spawn all the buffers necessary to do SSAO.

Notice what is absent here: the inputs and outputs. The render passes are wired up implicitly, because if you had to do it manually, there would only be one correct way. This is the purpose of separating the resources from the passes: it allows everything to be allocated once, up front, so that then the render passes can connect them into a suitable graph with a non-trivial but generally expected topology. They find each other using 'well-known names' like normal and motion, which is how it's done in practice anyway.

Render passes in the inspector

This reflects what I am starting to run into more and more: that decomposed systems have little value if everyone has to use it the same way. It can lead to a lot of code noise, and also tie users to unimportant details of the existing implementation. Hence the simple recipe.

But, if you want to sequence your own render exactly, nothing prevents you from using the render components à la carte: the main method of composition is mounting reactive components in Live, like everything else. Your passes work exactly the same as the built-in ones.

I make use of the dynamicism of JS to e.g. not care what options are passed to the buffers and passes. The convention is that each should be namespaced so they don't collide. This provides real extensibility for custom use, while paving the cow paths that exist.

It's typical that buffers and passes come in matching pairs. However, one could swap out one variation of a for another, while reusing the same buffer type. Most implementations are themselves declarative recipes, with e.g. a or two, and perhaps an associated data binding. All the meat—i.e. the dispatches—is in the passes.

It's so declarative that there isn't much left inside itself. It maps logical calls into concrete ones by leveraging Live, and that's reflected entirely in what's there. It only gathers up some data it doesn't know details about, and helps ensure the sequence of compute before render before readback. This is a big clue that renderers really want to be reactive run-times instead.

Bind Group Soup

Use.GPU's initial design goal was "a unique shader for every draw call". This means its data binding fu has mostly been applied to local shader bindings. These apply only to one particular draw, and you bind the data to the shader at the same time as creating it.

This is the useShader hook. There is no separation where you first prepare the binding layout, and as such, you use it like a deferred function call, just like JSX.

// Prepare to call surfaceShader(matrix, ray, normal, size, ...)
const getSurface = useShader(surfaceShader, [
  matrix, ray, normal, size, insideRef, originRef,
  sdf, palette, pbr, ...sources
], defs);

Shader and pipeline reuse is handled via structural hashing behind the scenes: it's merely a happy benefit if two draw calls can reuse the same shader and pipeline, but absolutely not a problem if they don't. As batching is highly encouraged, and large data sets can be rendered as one, the number of draw calls tends to be low.

All local bindings are grouped in two bind groups, static and volatile. The latter allows for the transparent history feature, as well as just-in-time allocated atlases. Static bindings don't need to be 100% static, they just can't change during dispatch or rendering.

WebGPU only has four bind groups total. I previously used the other two for respectively the global view, and the concrete render pass, using up all the bind groups. This was wasteful but an unfortunate necessity, without an easy way to compose them at run-time.

Bind Group:	#0	#1	#2	#3
Use.GPU 0.13	View	Pass	Static	Volatile
Use.GPU 0.14	Pass	Static	Volatile	Free

This has been fixed in 0.14, which frees up a bind group. It also means every render pass fully owns its own view. It can pick from a set of pre-provided ones (e.g. overscanned or not), or set a custom one, the same way it finds buffers and other bindings.

Having bind group 3 free also opens up the possibility of a more traditional sub-pipeline, as seen in a traditional scene graph renderer. These can handle larger amounts of individual draw calls, all sharing the same shader template, but with different textures and parameters. My goal however is to avoid monomorphizing to this degree, unless it's absolutely necessary (e.g. with the lighting).

This required upgrading the shader linker. Given e.g. a static binding snippet such as:

use '@use-gpu/wgsl/use/types'::{ Light };

@export struct LightUniforms {
  count: u32,
  lights: array,
};

@group(PASS) @binding(1) var lightUniforms: LightUniforms;

...you can import it in Typescript like any other shader module, with the @binding as an attribute to be linked. The shader linker will understand struct types like LightUniforms with array fully now, and is able to produce e.g. a correct minimum binding size for types that cross module boundaries.

The ergonomics of useShader have been replicated here, so that useBindGroupLayout takes a set of these and prepares them into a single static bind group, managing e.g. the shader stages for you. To bind data to the bind group, a render pass delegates via useApplyPassBindGroup: this allows the source of the data to be modularized, instead of requiring every pass to know about every possible binding (e.g. lighting, shadows, SSAO, etc.). That is, while there is a separation between bind group layout and data binding, it's lazy: both are still defined in the same place.

The binding system is flexible enough end-to-end that the SSAO can e.g. be applied to the voxel raytracer from @use-gpu/voxel with zero effort required, as it also uses the shaded technique (with per fragment depth). It has a getSurface(...) shader function that raytraces and returns a surface fragment. The SSAO sampler can just attach its occlusion information to it, by decorating it in WGSL.

WGSL Types

Worth noting, this all derives from previous work on auto-generated structs for data aggregation.

It's cool tech, but it's hard to show off, because it's completely invisible on the outside, and the shader code is all ugly autogenerated glue. There's a presentation up on the site that details it at the lower level, if you're curious.

The main reason I had aggregation initially was to work around the 8 storage buffers limit in WebGPU. The Plot API needed to auto-aggregate all the different attributes of shapes, with their given spread policies, based on what the user supplied.

This allows me to offer e.g. a bulk line drawing primitive where attributes don't waste precious bandwidth on repeated data. Each ends up grouped in structs, taking up only 1 storage buffer, depending on whether it is constant or varying, per instance or per vertex:

This involves a comprehensive buffer interleaving and copying mechanism, that has to satisfy all the alignment constraints. This then leverages @use-gpu/shader's structType(…) API to generate WGSL struct types at run-time. Given a list of attributes, it returns a virtual shader module with a real symbol table. This is materialized into shader code on demand, and can be exploded into individual accessor functions as well.

Hence data sources in Use.GPU can now have a format of T or array with a WGSL shader module as the type parameter. I already had most of the pieces in place for this, but hadn't quite put it all together everywhere.

Using shader modules as the representation of types is very natural, as they carry all the WGSL attributes and GPU-only concepts. It goes far beyond what I had initially scoped for the linker, as it's all source-code-level, but it was worth it. The main limitation is that type inference only happens at link time, as binding shader modules together has to remain a fast and lazy op.

Native WGSL types are somewhat poorly aligned with the WebGPU API on the CPU side. A good chunk of @use-gpu/core is lookup tables with info about formats and types, as well as alignment and size, so it can all be resolved at run-time. There's something similar for bind group creation, where it has to translate between a few different ways of saying the same thing.

The types I expose instead are simple: TextureSource, StorageSource and LambdaSource. Everything you bind to a shader is either one of these, or a constant (by reference). They carry all the necessary metadata to derive a suitable binding and accessor.

That said, I cannot shield you from the limitations underneath. Texture formats can e.g. be renderable or not, filterable or not, writeable or not, and the specific mechanisms available to you vary. If this involves native depth buffers, you may need to use a full-screen render pass to copy data, instead of just calling copyTextureToTexture. I run into this too, and can only provide a few more convenience hooks.

I did come up with a neat way to genericize these copy shaders, using the existing WGSL type inference I had, souped up a bit. This uses simple selector functions to serve the role of reassembling types. It's finally given me a concrete way to make 'root shaders' (i.e. the entry points) generic enough to support all use. I may end up using something similar to handle the ordinary vertex and fragment entry points, which still have to be provided in various permutations.

* * *

Phew. Use.GPU is always a lot to go over. But its à la carte nature remains and that's great.

For in-house use it's already useful, especially if you need a decent GPU on a desktop anyway. I have been using it for some client work, and it seems to be making people happy. If you want to go off-road from there, you can.

It delivers on combining low-level shader code with its own stock components, without making you reinvent a lot of the wheels.

Visit usegpu.live for more and to view demos in a WebGPU capable browser.

PS: I upgraded the aging build of Jekyll that was driving this blog, so if you see anything out of the ordinary, please let me know.

The Bouquet Residence

2024-07-24T00:00:00+02:00

Keeping up appearances in tech

The word "rant" is used far too often, and in various ways.
It's meant to imply aimless, angry venting.

But often it means:

Naming problems without proposing solutions,
this makes me feel confused.

Naming problems and assigning blame,
this makes me feel bad.

I saw a remarkable pair of tweets the other day.

In the wake of the outage, the CEO of CrowdStrike sent out a public announcement. It's purely factual. The scope of the problem is identified, the known facts are stated, and the logistics of disaster relief are set in motion.

Millions of computers were affected. This is the equivalent of a frazzled official giving a brief statement in the aftermath of an earthquake, directing people to the Red Cross.

Everything is basically on fire for everyone involved. Systems are failing everywhere, some critical, and quite likely people are panicking. The important thing is to give the technicians the information and tools to fix it, and for everyone else to do what they can, and stay out of the way.

In response, a communication professional posted an 'improved' version:

Credit where credit is due, she nailed the style. 10/10. It seems unobjectionable, at first. Let's go through, shall we?

Opposite Day

First is that the CEO is "devastated." A feeling. And they are personally going to ensure it's fixed for every single user.

This focuses on the individual who is inconvenienced. Not the disaster. They take a moment out of their time to say they are so, so sorry a mistake was made. They have let you and everyone else down, and that shouldn't happen. That's their responsibility.

By this point, the original statement had already told everyone the relevant facts. Here the technical details are left to the imagination. The writer's self-assigned job is to wrap the message in a more palatable envelope.

Everyone will be working "all day, all night, all weekends," indeed, "however long it takes," to avoid it happening again.

I imagine this is meant to be inspiring and reassuring. But if I was a CrowdStrike technician or engineer, I would find it demoralizing: the boss, who will actually be personally fixing diddly-squat, is saying that the long hours of others are a sacrifice they're willing to make.

Plus, CrowdStrike's customers are in the same boat: their technicians get volunteered too. They can't magically unbrick PCs from a distance, so "until it's fully fixed for every single user" would be a promise outsiders will have to keep. Lovely.

There's even a punch line: an invitation to go contact them, the quickest way linked directly. It thanks people for reaching out.

If everything is on fire, that includes the phone lines, the inboxes, and so on. The most stupid thing you could do in such a situation is to tell more people to contact you, right away. Don't encourage it! That's why the original statement refers to pre-existing lines of communication, internal representatives, and so on. The Support department would hate the CEO too.

Root Cause

If you're wondering about the pictures, it's Hyacinth Bucket, from 90s UK sitcom Keeping Up Appearances, who would always insist "it's pronounced Bouquet."

Hyacinth's ambitions always landed her out of her depth, surrounded by upper-class people she's trying to impress, in the midst of an embarrassing disaster. Her increasingly desperate attempts to save face, which invariably made things worse, are the main source of comedy.

Try reading that second statement in her voice.

I’m devastated to see the scale of today’s outage and will be personally working on it together with our team until it’s fully fixed for every single user.

But I wanted to take a moment to come here and tell you that I am sorry. People around the world rely on us, and incidents like this can’t happen. This came from an error that ultimately is my responsibility.

I can hear it perfectly, telegraphing Britishness to restore dignity for all. If she were in tech she would give that statement.

It's about reputation management first, projecting the image of competence and accountability. But she's giving the speech in front of a burning building, not realizing the entire exercise is futile. Worse, she thinks she's nailing it.

If CrowdStrike had sent this out, some would've applauded and called it an admirable example of wise and empathetic communication. Real leadership qualities.

But it's the exact opposite. It focuses on the wrong things, it alienates the staff, and it definitely amplifies the chaos. It's Monty Python-esque.

Apologizing is pointless here, the damage is already done. What matters is how severe it is and whether it could've been avoided. This requires a detailed root-cause analysis and remedy. Otherwise you only have their word. Why would that re-assure you?

The original restated the company's mission: security and stability. Those are the stakes to regain a modicum of confidence.

You may think that I'm reading too much into this. But I know the exact vibe on an engineering floor when the shit hits the fan. I also know how executives and staff without that experience end up missing the point entirely. I once worked for a Hyacinth Bucket. It's not an anecdote, it's allegory.

They simply don't get the engineering mindset, and confuse authority with ownership. They step on everyone's toes without realizing, because they're constantly wearing clown shoes. Nobody tells them.

Softness as a Service

The change in style between #1 and #2 is really a microcosm of the conflict that has been broiling in tech for ~15 years now. I don't mean the politics, but the shifting of norms, of language and behavior.

It's framed as a matter of interpersonal style, which needs to be welcoming and inclusive. In practice this means they assert or demand that style #2 be the norm, even when #1 is advisable or required.

Factuality is seen as deficient, improper and primitive. It's a form of doublethink: everyone's preference is equally valid, except yours, specifically.

But the difference is not a preference. It's about what actually works and what doesn't. Style #1 is aimed at the people who have to fix it. Style #2 is aimed at the people who can't do anything until it's fixed. Who should they be reaching out to?

In #2, communication becomes an end in itself, not a means of conveying information. It's about being seen saying the words, not living them. Poking at the statement makes it fall apart.

When this becomes the norm in a technical field, it has deep consequences:

Critique must be gift-wrapped in flattery, and is not allowed to actually land.
Mistakes are not corrected, and sentiment takes precedence over effectiveness.
Leaders speak lofty words far from the trenches to save face.
The people they thank the loudest are the ones they pay the least.

Inevitably, quiet competence is replaced with gaudy chaos. Everyone says they're sorry and responsible, but nobody actually is. Nobody wants to resign either. Sound familiar?

Cope and Soothe

The elephant in the room is that #1 is very masculine, while #2 is more feminine. When you hear "women are more empathetic communicators", this is what it means. They tend to focus on the individual and their relation to them, not the team as a whole and its mission.

Complaints that tech is too "male dominated" and "notoriously hostile to women" are often just this. Tech was always full of types who won't preface their proposals and criticisms with fluff, and instead lean into autism. When you're used to being pandered to, neutrality feels like vulgarity.

The notable exceptions are rare and usually have an exasperating lead up. Tech is actually one of the most accepting and egalitarian fields around. The maintainers do a mostly thankless job.

"Oh so you're saying there's no misogyny in tech?" No I'm just saying misogyny doesn't mean "something 1 woman hates".

The tone is really a distraction. If someone drops an analysis, saying shit or get off the pot, even very kindly and patiently, some will still run away screaming. Like an octopus spraying ink, they'll deploy a nasty form of #2 as a distraction. That's the real issue.

Many techies, in their naiveté, believed the cultural reformers when they showed up to gentrify them. They obediently branded heretics like James Damore, and burned witches like Richard Stallman. Thanks to racism, words like 'master' and 'slave' are now off-limits as technical terms. Ironic, because millions of computers just crashed because they worked exactly like that.

The cope is to pretend that nothing has truly changed yet, and more reform is needed. In fact, everything has already changed. Tech forums used to be crucibles for distilling insight, but now they are guarded jealously by people more likely to flag and ban than strongly disagree.

I once got flagged on HN because I pointed out Twitter's mass lay-offs were a response to overhiring, and that people were rooting for the site to fail after Musk bought it. It suggested what we all know now: that the company would not implode after trimming the dead weight, and that they'd never forgive him for it.

Diversity is now associated with incompetence, because incompetent people have spent over a decade reaching for it as an excuse. In their attempts to fight stereotypes, they ensured the stereotypes came true.

Bait and Snitch

The outcry tends to be: "We do all the same things you do, but still we get treated differently!" But they start from the conclusion and work their way backwards. This is what the rewritten statement does: it tries to fix the relationship before fixing the problem.

The average woman and man actually do things very differently in the first place. Individual men and women choose. And others respond accordingly. The people who build and maintain the world's infrastructure prefer the masculine style for a reason: it keeps civilization running, and helps restore it when it breaks. A disaster announcement does not need to be relatable, it needs to be effective.

Furthermore, if the job of shoveling shit falls on you, no amount of flattery or oversight will make that more pleasant. It really won't. Such commentary is purely for the benefit of the ones watching and trying to look busy. It makes it worse, stop pretending otherwise.

There's little loyalty in tech companies nowadays, and it's no surprise. Project and product managers are acting more like demanding clients to their own team, than leaders. "As a user, I want..." Yes, but what are you going to do about it? Do you even know where to start?

What's perceived as a lack of sensitivity is actually the presence of sensibility. It's what connects the words to the reality on the ground. It does not need to be improved or corrected, it just needs to be respected. And yes it's a matter of gender, because bashing men and masculine norms has become a jolly recreational sport in the overculture. Mature women know it.

It seems impossible to admit. The entire edifice of gender equality depends on there not being a single thing men are actually better at, even just on average. Where men and women's instincts differ, women must be right.

It's childish, and not harmless either. It dares you to call it out, so they can then play the wounded victim, and paint you as the unreasonable asshole who is mean. This is supposed to invalidate the argument.

* * *

This post is of course a giant cannon pointing in the opposite direction, sitting on top of a wall. Its message will likely fly over the reformers' heads.

If they read it at all, they'll selectively quote or paraphrase, call me a tech-bro, and spool off some sentences they overheard, like an LLM. It's why they adore AI, and want it to be exactly as sycophantic as them. They don't care that it makes stuff up wholesale, because it makes them look and feel competent. It will never tell them to just fuck off already.

Think less about what is said, more about what is being done. Otherwise the next CrowdStrike will probably be worse.

I is for Intent

2024-02-05T00:00:00+01:00

Why your app turned into spaghetti

"I do not like your software sir,
your architecture's poor.

Your users can't do anything,
unless you code some more.

This isn't how it used to be,
we had this figured out.

But then you had to mess it up
by moving into clouds."

There's a certain kind of programmer. Let's call him Stanley.

Stanley has been around for a while, and has his fair share of war stories. The common thread is that poorly conceived and specced solutions lead to disaster, or at least, ongoing misery. As a result, he has adopted a firm belief: it should be impossible for his program to reach an invalid state.

Stanley loves strong and static typing. He's a big fan of pattern matching, and enums, and discriminated unions, which allow correctness to be verified at compile time. He also has strong opinions on errors, which must be caught, logged and prevented. He uses only ACID-compliant databases, wants foreign keys and triggers to be enforced, and wraps everything in atomic transactions.

He hates any source of uncertainty or ambiguity, like untyped JSON or plain-text markup. His APIs will accept data only in normalized and validated form. When you use a Stanley lib, and it doesn't work, the answer will probably be: "you're using it wrong."

Stanley is most likely a back-end or systems-level developer. Because nirvana in front-end development is reached when you understand that this view of software is not just wrong, but fundamentally incompatible with the real world.

I will prove it.

State Your Intent

Take a text editor. What happens if you press the up and down arrows?

The keyboard cursor (aka caret) moves up and down. Duh. Except it also moves left and right.

The editor state at the start has the caret on line 1 column 6. Pressing down will move it to line 2 column 6. But line 2 is too short, so the caret is forcibly moved left to column 1. Then, pressing down again will move it back to column 6.

It should be obvious that any editor that didn't remember which column you were actually on would be a nightmare to use. You know it in your bones. Yet this only works because the editor allows the caret to be placed on a position that "does not exist." What is the caret state in the middle? It is both column 1 and column 6.

To accommodate this, you need more than just a View that is a pure function of a State, as is now commonly taught. Rather, you need an Intent, which is the source of truth that you mutate... and which is then parsed and validated into a State. Only then can it be used by the View to render the caret in the right place.

To edit the intent, aka what a classic Controller does, is a bit tricky. When you press left/right, it should determine the new Intent.column based on the validated State.column +/- 1. But when you press up/down, it should keep the Intent.column you had before and instead change only Intent.line. New intent is a mixed function of both previous intent and previous state.

The general pattern is that you reuse Intent if it doesn't change, but that new computed Intent should be derived from State. Note that you should still enforce normal validation of Intent.column when editing too: you don't allow a user to go past the end of a line. Any new intent should be as valid as possible, but old intent should be preserved as is, even if non-sensical or inapplicable.

Functionally, for most of the code, it really does look and feel as if the state is just State, which is valid. It's just that when you make 1 state change, the app may decide to jump into a different State than one would think. When this happens, it means some old intent first became invalid, but then became valid again due to a subsequent intent/state change.

This is how applications actually work IRL. FYI.

Knives and Forks

I chose a text editor as an example because Stanley can't dismiss this as just frivolous UI polish for limp wristed homosexuals. It's essential that editors work like this.

The pattern is far more common than most devs realize:

A tree view remembers the expand/collapse state for rows that are hidden.
Inspector tabs remember the tab you were on, even if currently disabled or inapplicable.
Toggling a widget between type A/B/C should remember all the A, B and C options, even if mutually exclusive.

All of these involve storing and preserving something unknown, invalid or unused, and bringing it back into play later.

More so, if software matches your expected intent, it's a complete non-event. What looks like a "surprise hidden state transition" to a programmer is actually the exact opposite. It would be an unpleasant surprise if that extra state transition didn't occur. It would only annoy users: they already told the software what they wanted, but it keeps forgetting.

The ur-example is how nested popup menus should work: good implementations track the motion of the cursor so you can move diagonally from parent to child, without falsely losing focus:

This is an instance of the goalkeeper's curse: people rarely compliment or notice the goalkeeper if they do their job, only if they mess up. Successful applications of this principle are doomed to remain unnoticed and unstudied.

Validation is not something you do once, discarding the messy input and only preserving the normalized output. It's something you do continuously and non-destructively, preserving the mess as much as possible. It's UI etiquette: the unspoken rules that everyone expects but which are mostly undocumented folklore.

This poses a problem for most SaaS in the wild, both architectural and existential. Most APIs will only accept mutations that are valid. The goal is for the database to be a sequence of fully valid states:

The smallest possible operation in the system is a fully consistent transaction. This flattens any prior intent.

In practice, many software deviates from this ad-hoc. For example, spreadsheets let you create cyclic references, which is by definition invalid. The reason it must let you do this is because fixing one side of a cyclic reference also fixes the other side. A user wants and needs to do these operations in any order. So you must allow a state transition through an invalid state:

This requires an effective Intent/State split, whether formal or informal.

Because cyclic references can go several levels deep, identifying one cyclic reference may require you to spider out the entire dependency graph. This is functionally equivalent to identifying all cyclic references—dixit Dijkstra. Plus, you need to produce sensible, specific error messages. Many "clever" algorithmic tricks fail this test.

Now imagine a spreadsheet API that doesn't allow for any cyclic references ever. This still requires you to validate the entire resulting model, just to determine if 1 change is allowed. It still requires a general validate(Intent). In short, it means your POST and PUT request handlers need to potentially call all your business logic.

That seems overkill, so the usual solution is bespoke validators for every single op. If the business logic changes, there is a risk your API will now accept invalid intent. And the app was not built for that.

If you flip it around and assume intent will go out-of-bounds as a normal matter, then you never have this risk. You can write the validation in one place, and you reuse it for every change as a normal matter of data flow.

Note that this is not cowboy coding. Records and state should not get irreversibly corrupted, because you only ever use valid inputs in computations. If the system is multiplayer, distributed changes should still be well-ordered and/or convergent. But the data structures you're encoding should be, essentially, entirely liberal to your user's needs.

Consider git. Here, a "unit of intent" is just a diff applied to a known revision ID. When something's wrong with a merge, it doesn't crash, or panic, or refuse to work. It just enters a conflict state. This state is computed by merging two incompatible intents.

It's a dirty state that can't be turned into a clean commit without human intervention. This means git must continue to work, because you need to use git to clean it up. So git is fully aware when a conflict is being resolved.

As a general rule, the cases where you actually need to forbid a mutation which satisfies all the type and access constraints are small. A good example is trying to move a folder inside itself: the file system has to remain a sensibly connected tree. Enforcing the uniqueness of names is similar, but also comes with a caution: falsehoods programmers believe about names. Adding (Copy) to a duplicate name is usually better than refusing to accept it, and most names in real life aren't unique at all. Having user-facing names actually requires creating tools and affordances for search, renaming references, resolving duplicates, and so on.

Even among front-end developers, few people actually grok this mental model of a user. It's why most React(-like) apps in the wild are spaghetti, and why most blog posts about React gripes continue to miss the bigger picture. Doing React (and UI) well requires you to unlearn old habits and actually design your types and data flow so it uses potentially invalid input as its single source of truth. That way, a one-way data flow can enforce the necessary constraints on the fly.

The way Stanley likes to encode and mutate his data is how programmers think about their own program: it should be bug-free and not crash. The mistake is to think that this should also apply to any sort of creative process that program is meant to enable. It would be like making an IDE that only allows you to save a file if the code compiles and passes all the tests.

Trigger vs Memo

Coding around intent is a very hard thing to teach, because it can seem overwhelming. But what's overwhelming is not doing this. It leads to codebases where every new feature makes ongoing development harder, because no part of the codebase is ever finished. You will sprinkle copies of your business logic all over the place, in the form of request validation, optimistic local updaters, and guess-based cache invalidation.

If this is your baseline experience, your estimate of what is needed to pull this off will simply be wrong.

In the traditional MVC model, intent is only handled at the individual input widget or form level. e.g. While typing a number, the intermediate representation is a string. This may be empty, incomplete or not a number, but you temporarily allow that.

I've never seen people formally separate Intent from State in an entire front-end. Often their state is just an adhoc mix of both, where validation constraints are relaxed in the places where it was most needed. They might just duplicate certain fields to keep a validated and unvalidated variant side by side.

There is one common exception. In a React-like, when you do a useMemo with a derived computation of some state, this is actually a perfect fit. The eponymous useState actually describes Intent, not State, because the derived state is ephemeral. This is why so many devs get lost here.

const state = useMemo(
  () => validate(intent),
  [intent]
);

Their usual instinct is that every action that has knock-on effects should be immediately and fully realized, as part of one transaction. Only, they discover some of those knock-on effects need to be re-evaluated if certain circumstances change. Often to do so, they need to undo and remember what it was before. This is then triggered anew via a bespoke effect, which requires a custom trigger and mutation. If they'd instead deferred the computation, it could have auto-updated itself, and they would've still had the original data to work with.

e.g. In a WYSIWYG scenario, you often want to preview an operation as part of mouse hovering or dragging. It should look like the final result. You don't need to implement custom previewing and rewinding code for this. You just need the ability to layer on some additional ephemeral intent on top of the intent that is currently committed. Rewinding just means resetting that extra intent back to empty.

You can make this easy to use by treating previews as a special kind of transaction: now you can make preview states with the same code you use to apply the final change. You can also auto-tag the created objects as being preview-only, which is very useful. That is: you can auto-translate editing intent into preview intent, by messing with the contents of a transaction. Sounds bad, is actually good.

The same applies to any other temporary state, for example, highlighting of elements. Instead of manually changing colors, and creating/deleting labels to pull this off, derive the resolved style just-in-time. This is vastly simpler than doing it all on 1 classic retained model. There, you run the risk of highlights incorrectly becoming sticky, or shapes reverting to the wrong style when un-highlighted. You can architect it so this is simply impossible.

The trigger vs memo problem also happens on the back-end, when you have derived collections. Each object of type A must have an associated type B, created on-demand for each A. What happens if you delete an A? Do you delete the B? Do you turn the B into a tombstone? What if the relationship is 1-to-N, do you need to garbage collect?

If you create invisible objects behind the scenes as a user edits, and you never tell them, expect to see a giant mess as a result. It's crazy how often I've heard engineers suggest a user should only be allowed to create something, but then never delete it, as a "solution" to this problem. Everyday undo/redo precludes it. Don't be ridiculous.

The problem is having an additional layer of bookkeeping you didn't need. The source of truth was collection A, but you created a permanent derived collection B. If you instead make B ephemeral, derived via a stateless computation, then the problem goes away. You can still associate data with B records, but you don't treat B as the authoritative source for itself. This is basically what a WeakMap is.

In database land this can be realized with a materialized view, which can be incremental and subscribed to. Taken to its extreme, this turns into event-based sourcing, which might seem like a panacea for this mindset. But in most cases, the latter is still a system by and for Stanley. The event-based nature of those systems exists to support housekeeping tasks like migration, backup and recovery. Users are not supposed to be aware that this is happening. They do not have any view into the event log, and cannot branch and merge it. The exceptions are extremely rare.

It's not a system for working with user intent, only for flattening it, because it's append-only. It has a lot of the necessary basic building blocks, but substitutes programmer intent for user intent.

What's most nefarious is that the resulting tech stacks are often quite big and intricate, involving job queues, multi-layered caches, distribution networks, and more. It's a bunch of stuff that Stanley can take joy and pride in, far away from users, with "hard" engineering challenges. Unlike all this *ugh* JavaScript, which is always broken and unreliable and uninteresting.

Except it's only needed because Stanley only solved half the problem, badly.

Patch or Bust

When factored in from the start, it's actually quite practical to split Intent from State, and it has lots of benefits. Especially if State is just a more constrained version of the same data structures as Intent. This doesn't need to be fully global either, but it needs to encompass a meaningful document or workspace to be useful.

It does create an additional problem: you now have two kinds of data in circulation. If reading or writing requires you to be aware of both Intent and State, you've made your code more complicated and harder to reason about.

More so, making a new Intent requires a copy of the old Intent, which you mutate or clone. But you want to avoid passing Intent around in general, because it's fishy data. It may have the right types, but the constraints and referential integrity aren't guaranteed. It's a magnet for the kind of bugs a type-checker won't catch.

I've published my common solution before: turn changes into first-class values, and make a generic update of type Update be the basic unit of change. As a first approximation, consider a shallow merge {...value, ...update}. This allows you to make an updateIntent(update) function where update only specifies the fields that are changing.

In other words, Update looks just like Update and can be derived 100% from State, without Intent. Only one place needs to have access to the old Intent, all other code can just call that. You can make an app intent-aware without complicating all the code.

If your state is cleaved along orthogonal lines, then this is all you need. i.e. If column and line are two separate fields, then you can selectively change only one of them. If they are stored as an XY tuple or vector, now you need to be able to describe a change that only affects either the X or Y component.

const value = {
  hello: 'text',
  foo: { bar: 2, baz: 4 },
};

const update = {
  hello: 'world',
  foo: { baz: 50 },
};

expect(
  patch(value, update)
).toEqual({
  hello: 'world',
  foo: { bar: 2, baz: 50 },
});

So in practice I have a function patch(value, update) which implements a comprehensive superset of a deep recursive merge, with full immutability. It doesn't try to do anything fancy with arrays or strings, they're just treated as atomic values. But it allows for precise overriding of merging behavior at every level, as well as custom lambda-based updates. You can patch tuples by index, but this is risky for dynamic lists. So instead you can express e.g. "append item to list" without the entire list, as a lambda.

I've been using patch for years now, and the uses are myriad. To overlay a set of overrides onto a base template, patch(base, overrides) is all you need. It's the most effective way I know to erase a metric ton of {...splats} and ?? defaultValues and != null from entire swathes of code. This is a real problem.

You could also view this as a "poor man's OT", with the main distinction being that a patch update only describes the new state, not the old state. Such updates are not reversible on their own. But they are far simpler to make and apply.

It can still power a global undo/redo system, in combination with its complement diff(A, B): you can reverse an update by diffing in the opposite direction. This is an operation which is formalized and streamlined into revise(…), so that it retains the exact shape of the original update, and doesn't require B at all. The structure of the update is sufficient information: it too encodes some intent behind the change.

With patch you also have a natural way to work with changes and conflicts as values. The earlier WYSIWIG scenario is just patch(commited, ephemeral) with bells on.

The net result is that mutating my intent or state is as easy as doing a {...value, ...update} splat, but I'm not falsely incentivized to flatten my data structures.

Instead it frees you up to think about what the most practical schema actually is from the data's point of view. This is driven by how the user wishes to edit it, because that's what you will connect it to. It makes you think about what a user's workspace actually is, and lets you align boundaries in UX and process with boundaries in data structure.

Remember: most classic "data structures" are not about the structure of data at all. They serve as acceleration tools to speed up specific operations you need on that data. Having the reads and writes drive the data design was always part of the job. What's weird is that people don't apply that idea end-to-end, from database to UI and back.

SQL tables are shaped the way they are because it enables complex filters and joins. However, I find this pretty counterproductive: it produces derived query results that are difficult to keep up to date on a client. They also don't look like any of the data structures I actually want to use in my code.

A Bike Shed of Schemas

This points to a very under-appreciated problem: it is completely pointless to argue about schemas and data types without citing specific domain logic and code that will be used to produce, edit and consume it. Because that code determines which structures you are incentivized to use, and which structures will require bespoke extra work.

From afar, column and line are just XY coordinates. Just use a 2-vector. But once you factor in the domain logic and etiquette, you realize that the horizontal and vertical directions have vastly different rules applied to them, and splitting might be better. Which one do you pick?

This applies to all data. Whether you should put items in a List or a Map largely depends on whether the consuming code will loop over it, or need random access. If an API only provides one, consumers will just build the missing Map or List as a first step. This is O(n log n) either way, because of sorting.

The method you use to read or write your data shouldn't limit use of everyday structure. Not unless you have a very good reason. But this is exactly what happens.

A lot of bad choices in data design come down to picking the "wrong" data type simply because the most appropriate one is inconvenient in some cases. This then leads to Conway's law, where one team picks the types that are most convenient only for them. The other teams are stuck with it, and end up writing bidirectional conversion code around their part, which will never be removed. The software will now always have this shape, reflecting which concerns were considered essential. What color are your types?

{
  order: [4, 11, 9, 5, 15, 43],
  values: {
    4: {...},
    5: {...},
    9: {...},
    11: {...},
    15: {...},
    43: {...},
  },
);

For List vs Map, you can have this particular cake and eat it too. Just provide a List for the order and a Map for the values. If you structure a list or tree this way, then you can do both iteration and ID-based traversal in the most natural and efficient way. Don't underestimate how convenient this can be.

This also has the benefit that "re-ordering items" and "editing items" are fully orthogonal operations. It decomposes the problem of "patching a list of objects" into "patching a list of IDs" and "patching N separate objects". It makes code for manipulating lists and trees universal. It lets you to decide on a case by case basis whether you need to garbage collect the map, or whether preserving unused records is actually desirable.

Limiting it to ordinary JSON or JS types, rather than going full-blown OT or CRDT, is a useful baseline. With sensible schema design, at ordinary editing rates, CRDTs are overkill compared to the ability to just replay edits, or notify conflicts. This only requires version numbers and retries.

Users need those things anyway: just because a CRDT converges when two people edit, doesn't mean the result is what either person wants. The only case where OTs/CRDTs are absolutely necessary is rich-text editing, and you need bespoke UI solutions for that anyway. For simple text fields, last-write-wins is perfectly fine, and also far superior to what 99% of RESTy APIs do.

A CRDT is just a mechanism that translates partially ordered intents into a single state. Like, it's cool that you can make CRDT counters and CRDT lists and whatnot... but each CRDT implements only one particular resolution strategy. If it doesn't produce the desired result, you've created invalid intent no user expected. With last-write-wins, you at least have something 1 user did intend. Whether this is actually destructive or corrective is mostly a matter of schema design and minimal surface area, not math.

The main thing that OTs and CRDTs do well is resolve edits on ordered sequences, like strings. If two users are typing text in the same doc, edits higher-up will shift edits down below, which means the indices change when rebased. But if you are editing structured data, you can avoid referring to indices entirely, and just use IDs instead. This sidesteps the issue, like splitting order from values.

For the order, there is a simple solution: a map with a fractional index, effectively a dead-simple list CRDT. It just comes with some overhead.

Using a CRDT for string editing might not even be enough. Consider Google Docs-style comments anchored to that text: their indices also need to shift on every edit. Now you need a bespoke domain-aware CRDT. Or you work around it by injecting magic markers into the text. Either way, it seems non-trivial to decouple a CRDT from the specific target domain of the data inside. The constraints get mixed in.

If you ask me, this is why the field of real-time web apps is still in somewhat of a rut. It's mainly viewed as a high-end technical problem: how do we synchronize data structures over a P2P network without any data conflicts? What they should be asking is: what is the minimal amount of structure we need to reliably synchronize, so that users can have a shared workspace where intent is preserved, and conflicts are clearly signposted. And how should we design our schemas, so that our code can manipulate the data in a straightforward and reliable way? Fixing non-trivial user conflicts is simply not your job.

Most SaaS out there doesn't need any of this technical complexity. Consider that a good multiplayer app requires user presence and broadcast anyway. The simplest solution is just a persistent process on a single server coordinating this, one per live workspace. It's what most MMOs do. In fast-paced video games, this even involves lag compensation. Reliable ordering is not the big problem.

The situations where this doesn't scale, or where you absolutely must be P2P, are a minority. If you run into them, you must be doing very well. The solution that I've sketched out here is explicitly designed so it can comfortably be done by small teams, or even just 1 person.

The (private) CAD app I showed glimpses of above is entirely built this way. It's patch all the way down and it's had undo/redo from day 1. It also has a developer mode where you can just edit the user-space part of the data model, and save/load it.

When the in-house designers come to me with new UX requests, they often ask: "Is it possible to do ____?" The answer is never a laborious sigh from a front-end dev with too much on their plate. It's "sure, and we can do more."

If you're not actively aware the design of schemas and code is tightly coupled, your codebase will explode, and the bulk of it will be glue. Much of it just serves to translate generalized intent into concrete state or commands. Arguments about schemas are usually just hidden debates about whose job it is to translate, split or join something. This isn't just an irrelevant matter of "wire formats" because changing the structure and format of data also changes how you address specific parts of it.

In an interactive UI, you also need a reverse path, to apply edits. What I hope you are starting to realize is that this is really just the forward path in reverse, on so many levels. The result of a basic query is just the ordered IDs of the records that it matched. A join returns a tuple of record IDs per row. If you pre-assemble the associated record data for me, you actually make my job as a front-end dev harder, because there are multiple forward paths for the exact same data, in subtly different forms. What I want is to query and mutate the same damn store you do, and be told when what changes. It's table-stakes now.

With well-architected data, this can be wired up mostly automatically, without any scaffolding. The implementations you encounter in the wild just obfuscate this, because they don't distinguish between the data store and the model it holds. The fact that the data store should not be corruptible, and should enforce permissions and quotas, is incorrectly extended to the entire model stored inside. But that model doesn't belong to Stanley, it belongs to the user. This is why desktop applications didn't have a "Data Export". It was just called Load and Save, and what you saved was the intent, in a file.

Having a universal query or update mechanism doesn't absolve you from thinking about this either, which is why I think the patch approach is so rare: it looks like cowboy coding if you don't have the right boundaries in place. Patch is mainly for user-space mutations, not kernel-space, a concept that applies to more than just OS kernels. User-space must be very forgiving.

If you avoid it, you end up with something like GraphQL, a good example of solving only half the problem badly. Its getter assembles data for consumption by laboriously repeating it in dozens of partial variations. And it turns the setter part into an unsavory mix of lasagna and spaghetti. No wonder, it was designed for a platform that owns and hoards all your data.

* * *

Viewed narrowly, Intent is just a useful concept to rethink how you enforce validation and constraints in a front-end app. Viewed broadly, it completely changes how you build back-ends and data flows to support that. It will also teach you how adding new aspects to your software can reduce complexity, not increase it, if done right.

A good metric is to judge implementation choices by how many other places of the code need to care about them. If a proposed change requires adjustments literally everywhere else, it's probably a bad idea, unless the net effect is to remove code rather than add.

I believe reconcilers like React or tree-sitter are a major guide stone here. What they do is apply structure-preserving transforms on data structures, and incrementally. They actually do the annoying part for you. I based Use.GPU on the same principles, and use it to drive CPU canvases too. The tree-based structure reflects that one function's state just might be the next function's intent, all the way down. This is a compelling argument that the data and the code should have roughly the same shape.

You will also conclude there is nothing more nefarious than a hard split between back-end and front-end. You know, coded by different people, where each side is only half-aware of the other's needs, but one sits squarely in front of the other. Well-intentioned guesses about what the other end needs will often be wrong. You will end up with data types and query models that cannot answer questions concisely and efficiently, and which must be babysat to not go stale.

In the last 20 years, little has changed here in the wild. On the back-end, it still looks mostly the same. Even when modern storage solutions are deployed, people end up putting SQL- and ORM-like layers on top, because that's what's familiar. The split between back-end and database has the exact same malaise.

None of this work actually helps make the app more reliable, it's the opposite: every new feature makes on-going development harder. Many "solutions" in this space are not solutions, they are copes. Maybe we're overdue for a NoSQL-revival, this time with a focus on practical schema design and mutation? SQL was designed to model administrative business processes, not live interaction. I happen to believe a front-end should sit next to the back-end, not in front of it, with only a thin proxy as a broker.

What I can tell you for sure is: it's so much better when intent is a first-class concept. You don't need nor want to treat user data as something to pussy-foot around, or handle like it's radioactive. You can manipulate and transport it without a care. You can build rich, comfy functionality on top. Once implemented, you may find yourself not touching your network code for a very long time. It's the opposite of overwhelming, it's lovely. You can focus on building the tools your users need.

This can pave the way for more advanced concepts like OT and CRDT, but will show you that neither of them is a substitute for getting your application fundamentals right.

In doing so, you reach a synthesis of Dijkstra and anti-Dijkstra: your program should be provably correct in its data flow, which means it can safely break in completely arbitrary ways.

Because the I in UI meant "intent" all along.

More:

Stable Fiddusion

2023-10-02T00:00:00+02:00

Frequency-domain blue noise generator

In computer graphics, stochastic methods are so hot right now. All rendering turns into calculus, except you solve the integrals by numerically sampling them.

As I showed with Teardown, this is all based on random noise, hidden with a ton of spatial and temporal smoothing. For this, you need a good source of high quality noise. There have been a few interesting developments in this area, such as Alan Wolfe et al.'s Spatio-Temporal Blue Noise.

This post is about how I designed noise in frequency space. I will cover:

What is blue noise?
Designing indigo noise
How swap works in the frequency domain
Heuristics and analysis to speed up search
Implementing it in WebGPU

Along the way I will also show you some "street" DSP math. This illustrates how getting comfy in this requires you to develop deep intuition about complex numbers. But complex doesn't mean complicated. It can all be done on a paper napkin.

The WebGPU interface I built

What I'm going to make is this:

If properly displayed, this image should look eerily even. But if your browser is rescaling it incorrectly, it may not be exactly right.

Colorless Blue Ideas

I will start by just recapping the essentials. If you're familiar, skip to the next section.

Ordinary random generators produce uniform white noise: every value is equally likely, and the average frequency spectrum is flat.

Time domain

Frequency domain

To a person, this doesn't actually seem fully 'random', because it has clusters and voids. Similarly, a uniformly random list of coin flips will still have long runs of heads or tails in it occasionally.

What a person would consider evenly random is usually blue noise: it prefers to alternate between heads and tails, and avoids long runs entirely. It is 'more random than random', biased towards the upper frequencies, i.e. the blue part of the spectrum.

Time domain

Frequency domain

Blue noise is great for e.g. dithering, because when viewed from afar, or blurred, it tends to disappear. With white noise, clumps remain after blurring:

Blurred white noise

Blurred blue noise

Blueness is a delicate property. If you have e.g. 3D blue noise in a volume XYZ, then a single 2D XY slice is not blue at all:

XYZ spectrum

XY slice

XY spectrum

The samples are only evenly distributed in 3D, i.e. when you consider each slice in front and behind it too.

Blue noise being delicate means that nobody really knows of a way to generate it statelessly, i.e. as a pure function f(x,y,z). Algorithms to generate it must factor in the whole, as noise is only blue if every single sample is evenly spaced. You can make blue noise images that tile, and sample those, but the resulting repetition may be noticeable.

Because blue noise is constructed, you can make special variants.

Uniform Blue Noise has a uniform distribution of values, with each value equally likely. An 8-bit 256x256 UBN image will have each unique byte appear exactly 256 times.
Projective Blue Noise can be projected down, so that a 3D volume XYZ flattened into either XY, YZ or ZX is still blue in 2D, and same for X, Y and Z in 1D.
Spatio-Temporal Blue Noise (STBN) is 3D blue noise created specifically for use in real-time rendering:
- Every 2D slice XY is 2D blue noise
- Every Z row is 1D blue noise

This means XZ or YZ slices of STBN are not blue. Instead, it's designed so that when you average out all the XY slices over Z, the result is uniform gray, again without clusters or voids. This requires the noise in all the slices to perfectly complement each other, a bit like overlapping slices of translucent swiss cheese.

This is the sort of noise I want to generate.

Indigo STBN 64x64x16

XYZ spectrum

Sleep Furiously

A blur filter's spectrum is the opposite of blue noise: it's concentrated in the lowest frequencies, with a bump in the middle.

If you blur the noise, you multiply the two spectra. Very little is left: only the ring-shaped overlap, creating a band-pass area.

This is why blue noise looks good when smoothed, and is used in rendering, with both spatial (2D) and temporal smoothing (1D) applied.

Blur filters can be designed. If a blur filter is perfectly low-pass, i.e. ~zero amplitude for all frequencies > $ f_{\rm{lowpass}} $ , then nothing is left of the upper frequencies past a point.

If the noise is shaped to minimize any overlap, then the result is actually noise free. The dark part of the noise spectrum should be large and pitch black. The spectrum shouldn't just be blue, it should be indigo.

When people say you can't design noise in frequency space, what they mean is that you can't merely apply an inverse FFT to a given target spectrum. The resulting noise is gaussian, not uniform. The missing ingredient is the phase: all the frequencies need to be precisely aligned to have the right value distribution.

This is why you need a specialized algorithm.

The STBN paper describes two: void-and-cluster, and swap. Both of these are driven by an energy function. It works in the spatial/time domain, based on the distances between pairs of samples. It uses a "fall-off parameter" sigma to control the effective radius of each sample, with a gaussian kernel.

$$ E(M) = \sum E(p,q) = \sum \exp \left( - \frac{||\mathbf{p} - \mathbf{q}||^2}{\sigma^2_i}-\frac{||\mathbf{V_p} - \mathbf{V_q}||^{d/2}}{\sigma^2_s} \right) $$

STBN (Wolfe et al.)

The swap algorithm is trivially simple. It starts from white noise and shapes it:

Start with e.g. 256x256 pixels initialized with the bytes 0-255 repeated 256 times in order
Permute all the pixels into ~white noise using a random order
Now iterate: randomly try swapping two pixels, check if the result is "more blue"

This is guaranteed to preserve the uniform input distribution perfectly.

The resulting noise patterns are blue, but they still have some noise in all the lower frequencies. The only blur filter that could get rid of it all, is one that blurs away all the signal too. My 'simple' fix is just to score swaps in the frequency domain instead.

If this seems too good to be true, you should know that a permutation search space is catastrophically huge. If any pixel can be swapped with any other pixel, the number of possible swaps at any given step is O(N²). In a 256x256 image, it's ~2 billion.

The goal is to find a sequence of thousands, millions of random swaps, to turn the white noise into blue noise. This is basically stochastic bootstrapping. It's the bulk of good old fashioned AI, using simple heuristics, queues and other tools to dig around large search spaces. If there are local minima in the way, you usually need more noise and simulated annealing to tunnel over those. Usually.

This set up is somewhat simplified by the fact that swaps are symmetric (i.e. (A,B) = (B,A)), but also that applying swaps S1 and S2 is the same as applying swaps S2 and S1 as long as they don't overlap.

Good Energy

Let's take it one hurdle at a time.

It's not obvious that you can change a signal's spectrum just by re-ordering its values over space/time, but this is easy to illustrate.

Take any finite 1D signal, and order its values from lowest to highest. You will get some kind of ramp, approximating a sawtooth wave. This concentrates most of the energy in the first non-DC frequency:

Now split the odd values from the even values, and concatenate them. You will now have two ramps, with twice the frequency:

You can repeat this to double the frequency all the way up to Nyquist. So you have a lot of design freedom to transfer energy from one frequency to another.

In fact the Fourier transform has the property that energy in the time and frequency domain is conserved:

$$ \int_{-\infty}^\infty |f(x)|^2 \, dx = \int_{-\infty}^\infty |\widehat{f}(\xi)|^2 \, d\xi $$

This means the sum of $ |\mathrm{spectrum}_k|^2 $ remains constant over pixel swaps. We then design a target curve, e.g. a high-pass cosine:

$$ \mathrm{target}_k = \frac{1 - \cos \frac{k \pi}{n} }{2} $$

This can be fit and compared to the current noise spectrum to get the error to minimize.

However, I don't measure the error in energy $ |\mathrm{spectrum}_k|^2 $ but in amplitude $ |\mathrm{spectrum}_k| $. I normalize the spectrum and the target into distributions, and take the L2 norm of the difference, i.e. a sqrt of the sum of squared errors:

$$ \mathrm{error}_k = \frac{\mathrm{target}_k}{||\mathbf{target}||} - \frac{|\mathrm{spectrum}_k|}{||\mathbf{spectrum}||} $$ $$ \mathrm{loss}^2 = ||\mathbf{error}||^2 $$

This keeps the math simple, but also helps target the noise in the ~zero part of the spectrum. Otherwise, deviations near zero would count for less than deviations around one.

Go Banana

So I tried it.

With a lot of patience, you can make 2D blue noise images up to 256x256 on a single thread. A naive random search with an FFT for every iteration is not fast, but computers are.

Making a 64x64x16 with this is possible, but it's certainly like watching paint dry. It's the same number of pixels as 256x256, but with an extra dimension worth of FFTs that need to be churned.

Still, it works and you can also make 3D STBN with the spatial and temporal curves controlled independently:

Converged spectra

I built command-line scripts for this, with a bunch of quality of life things. If you're going to sit around waiting for numbers to go by, you have a lot of time for this...

Save and load byte/float-exact state to a .png, save parameters to .json
Save a bunch of debug viz as extra .pngs with every snapshot
Auto-save state periodically during runs
Measure and show rate of convergence every N seconds, with smoothing
Validate the histogram before saving to detect bugs and avoid corrupting expensive runs

I could fire up a couple of workers to start churning, while continuing to develop the code liberally with new variations. I could also stop and restart workers with new heuristics, continuing where it left off.

Protip: you can write C in JavaScript

Drunk with power, I tried various sizes and curves, which created... okay noise. Each has the exact same uniform distribution so it's difficult to judge other than comparing to other output, or earlier versions of itself.

To address this, I visualized the blurred result, using a [1 4 6 4 1] kernel as my base line. After adjusting levels, structure was visible:

Semi-converged

Blurred

The resulting spectra show what's actually going on:

Semi-converged

Blurred

The main component is the expected ring of bandpass noise, the 2D equivalent of ringing. But in between there is also a ton of redder noise, in the lower frequencies, which all remains after a blur. This noise is as strong as the ring.

So while it's easy to make a blue-ish noise pattern that looks okay at first glance, there is a vast gap between having a noise floor and not having one. So I kept iterating:

It takes a very long time, but if you wait, all those little specks will slowly twinkle out, until quantization starts to get in the way, with a loss of about 1/255 per pixel (0.0039).

Semi converged

Fully converged

The effect on the blurred output is remarkable. All the large scale structure disappears, as you'd expect from spectra, leaving only the bandpass ringing. That goes away with a strong enough blur, or a large enough dark zone.

The visual difference between the two is slight, but nevertheless, the difference is significant and pervasive when amplified:

Semi converged

Fully converged

Difference

Final spectrum

I tried a few indigo noise curves, with different % of the curve zero. The resulting noise is all extremely equal, even after a blur and amplify. The only visible noise left is bandpass, and the noise floor is so low it may as well not be there.

As you make the black exclusion zone bigger, the noise gets concentrated in the edges and corners. It becomes a bit more linear and squarish, a contender for violet noise. This is basically a smooth evolution towards a pure pixel checkboard in the limit. Using more than 50% zero seems inadvisable for this reason:

Time domain

Frequency domain

At this point the idea was validated, but it was dog slow. Can it be done faster?

Spatially Sparse

An FFT scales like O(N log N). When you are dealing with images and volumes, that N is actually an N² or N³ in practice.

The early phase of the search is the easiest to speed up, because you can find a good swap for any pixel with barely any tries. There is no point in being clever. Each sub-region is very non-uniform, and its spectrum nearly white. Placing pixels roughly by the right location is plenty good enough.

You might try splitting a large volume into separate blocks, and optimize each block separately. That wouldn't work, because all the boundaries remain fully white. Overlapping doesn't fix this, because they will actively create new seams. I tried it.

What does work is a windowed scoring strategy. It avoids a full FFT for the entire volume, and only scores each NxN or NxNxN region around each swapped point, with N-sized FFTs in each dimension. This is enormously faster and can rapidly dissolve larger volumes of white noise into approximate blue even with e.g. N = 8 or N = 16. Eventually it stops improving and you need to bump the range or switch to a global optimizer.

Here's the progression from white noise, to when sparse 16x16 gives up, followed by some additional 64x64:

Time domain

Frequency domain

Time domain

Frequency domain

Time domain

Frequency domain

A naive solution does not work well however. This is because the spectrum of a subregion does not match the spectrum of the whole.

The Fourier transform assumes each signal is periodic. If you take a random subregion and forcibly repeat it, its new spectrum will have aliasing artifacts. This would cause you to consistently misjudge swaps.

To fix this, you need to window the signal in the space/time-domain. This forces it to start and end at 0, and eliminates the effect of non-matching boundaries on the scoring. I used a smoothStep window because it's cheap, and haven't needed to try anything else:

16x16 windowed data

$$ w(t) = 1 - (3|t|^2 - 2|t|^3) , t=-1..1 $$

This still alters the spectrum, but in a predictable way. A time-domain window is a convolution in the frequency domain. You don't actually have a choice here: not using a window is mathematically equivalent to using a very bad window. It's effectively a box filter covering the cut-out area inside the larger volume, which causes spectral ringing.

The effect of the chosen window on the target spectrum can be modeled via convolution of their spectral magnitudes:

$$ \mathbf{target}' = |\mathbf{target}| \circledast |\mathcal{F}(\mathbf{window})| $$

This can be done via the time domain as:

$$ \mathbf{target}' = \mathcal{F}(\mathcal{F}^{-1}(|\mathbf{target}|) \cdot \mathcal{F}^{-1}(|\mathcal{F}(\mathbf{window})|)) $$

Note that the forward/inverse Fourier pairs are not redundant, as there is an absolute value operator in between. This discards the phase component of the window, which is irrelevant.

Curiously, while it is important to window the noise data, it isn't very important to window the target. The effect of the spectral convolution is small, amounting to a small blur, and the extra error is random and absorbed by the smooth scoring function.

The resulting local loss tracks the global loss function pretty closely. It massively speeds up the search in larger volumes, because the large FFT is the bottleneck. But it stops working well before anything resembling convergence in the frequency-domain. It does not make true blue noise, only a lookalike.

The overall problem is still that we can't tell good swaps from bad swaps without trying them and verifying.

Sleight of Frequency

So, let's characterize the effect of a pixel swap.

Given a signal [A B C D E F G H], let's swap C and F.

Swapping the two values is the same as adding F - C = Δ to C, and subtracting that same delta from F. That is, you add the vector:

V = [0 0 Δ 0 0 -Δ 0 0]

This remains true if you apply a Fourier transform and do it in the frequency domain.

To best understand this, you need to develop some intuition around FFTs of Dirac deltas.

Consider the short filter kernel [1 4 6 4 1]. It's a little known fact, but you can actually sight-read its frequency spectrum directly off the coefficients, because the filter is symmetrical. I will teach you.

The extremes are easy:

The DC amplitude is the sum 1 + 4 + 6 + 4 + 1 = 16
The Nyquist amplitude is the modulated sum 1 - 4 + 6 - 4 + 1 = 0

So we already know it's an 'ideal' lowpass filter, which reduces the Nyquist signal +1, -1, +1, -1, ... to exactly zero. It also has 16x DC gain.

Now all the other frequencies.

First, remember the Fourier transform works in symmetric ways. Every statement "____ in the time domain = ____ in the frequency domain" is still true if you swap the words time and frequency. This has lead to the grotesquely named sub-field of cepstral processing where you have quefrencies and vawes, and it kinda feels like you're having a stroke. The cepstral convolution filter from earlier is called a lifter.

Usually cepstral processing is applied to the real magnitude of the spectrum, i.e. $ |\mathrm{spectrum}| $, instead of its true complex value. This is a coward move.

So, decompose the kernel into symmetric pairs:

[· · 6 · ·]
[· 4 · 4 ·]
[1 · · · 1]

All but the first row is a pair of real Dirac deltas in the time domain. Such a row is normally what you get when you Fourier transform a cosine, i.e.:

$$ \cos \omega = \frac{\mathrm{e}^{i\omega} + \mathrm{e}^{-i\omega}}{2} $$

A cosine in time is a pair of Dirac deltas in the frequency domain. The phase of a (real) cosine is zero, so both its deltas are real.

Now flip it around. The Fourier transform of a pair [x 0 0 ... 0 0 x] is a real cosine in frequency space. Must be true. Each new pair adds a new higher cosine on top of the existing spectrum. For the central [... 0 0 x 0 0 ...] we add a DC term. It's just a Fourier transform in the other direction:

|FFT([1 4 6 4 1])| =

  [· · 6 · ·] => 6 
  [· 4 · 4 ·] => 8 cos(ɷ)
  [1 · · · 1] => 2 cos(2ɷ)
  
 = |6 + 8 cos(ɷ) + 2 cos(2ɷ)|

Normally you have to use the z-transform to analyze a digital filter. But the above is a shortcut. FFTs and inverse FFTs do have opposite phase, but that doesn't matter here because cos(ɷ) = cos(-ɷ).

This works for the symmetric-even case too: you offset the frequencies by half a band, ɷ/2, and there is no DC term in the middle:

|FFT([1 3 3 1])| =

  [· 3 3 ·] => 6 cos(ɷ/2)
  [1 · · 1] => 2 cos(3ɷ/2)

 = |6 cos(ɷ/2) + 2 cos(3ɷ/2)|

So, symmetric filters have spectra that are made up of regular cosines. Now you know.

For the purpose of this trick, we centered the filter around $ t = 0 $. FFTs are typically aligned to array index 0. The difference between the two is however just phase, so it can be disregarded.

What about the delta vector [0 0 Δ 0 0 -Δ 0 0]? It's not symmetric, so we have to decompose it:

V1 = [· · · · · Δ · ·]
V2 = [· · Δ · · · · ·]

V = V2 - V1

Each is now an unpaired Dirac delta. Each vector's Fourier transform is a complex wave $ Δ \cdot \mathrm{e}^{-i \omega k} $ in the frequency domain (the k'th quefrency). It lacks the usual complementary oppositely twisting wave $ Δ \cdot \mathrm{e}^{i \omega k} $, so it's not real-valued. It has constant magnitude Δ and varying phase:

FFT(V1) = [Δ
Δ
Δ
Δ
Δ
Δ
Δ
Δ]
FFT(V2) = [Δ
Δ
Δ
Δ
Δ
Δ
Δ
Δ
]

These are vawes.

The effect of a swap is still just to add FFT(V), aka FFT(V2) - FFT(V1) to the (complex) spectrum. The effect is to transfer energy between all the bands simultaneously. Hence, FFT(V1) and FFT(V2) function as a source and destination mask for the transfer.

However, 'mask' is the wrong word, because the magnitude of $ \mathrm{e}^{i \omega k} $ is always 1. It doesn't have varying amplitude, only varying phase. -FFT(V1) and FFT(V2) define the complex direction in which to add/subtract energy.

When added together their phases interfere constructively or destructively, resulting in an amplitude that varies between 0 and 2Δ: an actual mask. The resulting phase will be halfway between the two, as it's the sum of two equal-length complex numbers.

FFT(V) = [·
Δ
Δ
Δ
Δ
Δ
Δ
Δ
]

For any given pixel A and its delta FFT(V1), it can pair up with other pixels B to form N-1 different interference masks FFT(V2) - FFT(V1). There are N(N-1)/2 unique interference masks, if you account for (A,B) (B,A) symmetry.

Worth pointing out, the FFT of the first index:

FFT([Δ 0 0 0 0 0 0 0]) = [Δ Δ Δ Δ Δ Δ Δ Δ]

This is the DC quefrency, and the fourier symmetry continues to work. Moving values in time causes the vawe's quefrency to change in the frequency domain. This is the upside-down version of how moving energy to another frequency band causes the wave's frequency to change in the time domain.

What's the Gradient, Kenneth?

Using vectors as masks... shifting energy in directions... this means gradient descent, no?

Well.

It's indeed possible to calculate the derivative of your loss function as a function of input pixel brightness, with the usual bag of automatic differentiation/backprop tricks. You can also do it numerically.

But, this doesn't help you directly because the only way you can act on that per-pixel gradient is by swapping a pair of pixels. You need to find two quefrencies FFT(V1) and FFT(V2) which interfere in exactly the right way to decrease the loss function across all bad frequencies simultaneously, while leaving the good ones untouched. Even if the gradient were to help you pick a good starting pixel, that still leaves the problem of finding a good partner.

There are still O(N²) possible pairs to choose from, and the entire spectrum changes a little bit on every swap. Which means new FFTs to analyze it.

Random greedy search is actually tricky to beat in practice. Whatever extra work you spend on getting better samples translates into less samples tried per second. e.g. Taking a best-of-3 approach is worse than just trying 3 swaps in a row. Swaps are almost always orthogonal.

But random() still samples unevenly because it's white noise. If only we had.... oh wait. Indeed if you already have blue noise of the right size, you can use that to mildly speed up the search for more. Use it as a random permutation to drive sampling, with some inevitable whitening over time to keep it fresh. You can't however use the noise you're generating to accelerate its own search, because the two are highly correlated.

What's really going on is all a consequence of the loss function.

Given any particular frequency band, the loss function is only affected when its magnitude changes. Its phase can change arbitrarily, rolling around without friction. The complex gradient must point in the radial direction. In the tangential direction, the partial derivative is zero.

The value of a given interference mask FFT(V1) - FFT(V2) for a given frequency is also complex-valued. It can be projected onto the current phase, and split into its radial and tangential component with a dot product.

The interference mask has a dual action. As we saw, its magnitude varies between 0 and 2Δ, as a function of the two indices k1 and k2. This creates a window that is independent of the specific state or data. It defines a smooth 'hash' from the interference of two quefrency bands.

But its phase adds an additional selection effect: whether the interference in the mask is aligned with the current band's phase: this determines the split between radial and tangential. This defines a smooth phase 'hash' on top. It cycles at the average of the two quefrencies, i.e. a different, third one.

Energy is only added/subtracted if both hashes are non-zero. If the phase hash is zero, the frequency band only turns. This does not affect loss, but changes how each mask will affect it in the future. This then determines how it is coupled to other bands when you perform a particular swap.

Note that this is only true differentially: for a finite swap, the curvature of the complex domain comes into play.

The loss function is actually a hyper-cylindrical skate bowl you can ride around. Just the movement of all the bands is tied together.

Frequency bands with significant error may 'random walk' freely clockwise or counterclockwise when subjected to swaps. A band can therefor drift until it gets a turn where its phase is in alignment with enough similar bands, where the swap makes them all descend along the local gradient, enough to counter any negative effects elsewhere.

In the time domain, each frequency band is a wave that oscillates between -1...1: it 'owns' some of the value of each pixel, but there are places where its weight is ~zero (the knots).

So when a band shifts phase, it changes how much of the energy of each pixel it 'owns'. This allows each band to 'scan' different parts of the noise in the time domain. In order to fix a particular peak or dent in the frequency spectrum, the search must rotate that band's phase so it strongly owns any defect in the noise, and then perform a swap to fix that defect.

Thus, my mental model of this is not actually disconnected pixel swapping.

It's more like one of those Myst puzzles where flipping a switch flips some of the neighbors too. You press one pair of buttons at a time. It's a giant haunted dimmer switch.

We're dealing with complex amplitudes, not real values, so the light also has a color. Mechanically it's like a slot machine, with dials that can rotate to display different sides. The cherries and bells are the color: they determine how the light gets brighter or darker. If a dial is set just right, you can use it as a /dev/null to 'dump' changes.

That's what theory predicts, but does it work? Well, here is a (blurred noise) spectrum being late-optimized. The search is trying to eliminate the left-over lower frequency noise in the middle:

Semi converged

Here's the phase difference from the late stages of search, each a good swap. Left to right shows 4 different value scales:

x16

x256

x4096

x16

x256

x4096

At first it looks like just a few phases are changing, but amplification reveals it's the opposite. There are several plateaus. Strongest are the bands being actively modified. Then there's the circular error area around it, where other bands are still swirling into phase. Then there's a sharp drop-off to a much weaker noise floor, present everywhere. These are the bands that are already converged.

Compare to a random bad swap:

x16

x256

x4096

Now there is strong noise all over the center, and the loss immediately gets worse, as a bunch of amplitudes start shifting in the wrong direction randomly.

So it's true. Applying the swap algorithm with a spectral target naturally cycles through focusing on different parts of the target spectrum as it makes progress. This information is positionally encoded in the phases of the bands and can be 'queried' by attempting a swap.

This means the constraint of a fixed target spectrum is actually a constantly moving target in the complex domain.

Frequency bands that reach the target are locked in. Neither their magnitude nor phase changes in aggregate. The random walks of such bands must have no DC component... they must be complex-valued blue noise with a tiny amplitude.

Knowing this doesn't help directly, but it does explain why the search is so hard. Because the interference masks function like hashes, there is no simple pattern to how positions map to errors in the spectrum. And once you get close to the target, finding new good swaps is equivalent to digging out information encoded deep in the phase domain, with O(N²) interference masks to choose from.

Gradient Sampling

As I was trying to optimize for evenness after blur, it occurred to me to simply try selecting bright or dark spots in the blurred after-image.

This is the situation where frequency bands are in coupled alignment: the error in the spectrum has a relatively concentrated footprint in the time domain. But, this heuristic merely picks out good swaps that are already 'lined up' so to speak. It only works as a low-hanging fruit sampler, with rapidly diminishing returns.

Next I used the gradient in the frequency domain.

The gradient points towards increasing loss, which is the sum of squared distance $ (…)^2 $. So the slope is $ 2(…) $, proportional to distance to the goal:

$$ |\mathrm{gradient}_k| = 2 \cdot \left( \frac{|\mathrm{spectrum}_k|}{||\mathbf{spectrum}||} - \frac{\mathrm{target}_k}{||\mathbf{target}||} \right) $$

It's radial, so its phase matches the spectrum itself:

$$ \mathrm{gradient}_k = \mathrm{|gradient_k|} \cdot \left(1 ∠ \mathrm{arg}(\mathrm{spectrum}_k) \right) $$

Eagle-eyed readers may notice the sqrt part of the L2 norm is missing here. It's only there for normalization, and in fact, you generally want a gradient that decreases the closer you get to the target. It acts as a natural stabilizer, forming a convex optimization problem.

You can transport this gradient backwards by applying an inverse FFT. Usually derivatives and FFTs don't commute, but that's only when you are deriving in the same dimension as the FFT. The partial derivative here is neither over time nor frequency, but by signal value.

The resulting time-domain gradient tells you how fast the (squared) loss would change if a given pixel changed. The sign tells you whether it needs to become lighter or darker. In theory, a pixel with a large gradient can enable larger score improvements per step.

It says little about what's a suitable pixel to pair with though. You can infer that a pixel needs to be paired with one that is brighter or darker, but not how much. The gradient only applies differentially. It involves two pixels, so it will cause interference between the two deltas, and also with the signal's own phase.

The time-domain gradient does change slowly after every swap—mainly the swapping pixels—so this only needs to add an extra IFFT every N swap attempts, reusing it in between.

I tried this in two ways. One was to bias random sampling towards points with the largest gradients. This barely did anything, when applied to one or both pixels.

Then I tried going down the list in order, and this worked better. I tried a bunch of heuristics here, like adding a retry until paired, and a 'dud' tracker to reject known unpairable pixels. It did lead to some minor gains in successful sample selection. But beating random was still not a sure bet in all cases, because it comes at the cost of ordering and tracking all pixels to sample them.

All in all, it was quite mystifying.

Pair Analysis

Hence I analyzed all possible swaps (A,B) inside one 64x64 image at different stages of convergence, for 1024 pixels A (25% of total).

The result was quite illuminating. There are 2 indicators of a pixel's suitability for swapping:

% of all possible swaps (A,_) that are good
score improvement of best possible swap (A,B)

They are highly correlated, and you can take the geometric average to get a single quality score to order by:

The curve shows that the best possible candidates are rare, with a sharp drop-off at the start. Here the average candidate is ~1/3rd as good as the best, though every pixel is pairable. This represents the typical situation when you have unconverged blue-ish noise.

Order all pixels by their (signed) gradient, and plot the quality:

The distribution seems biased towards the ends. A larger absolute gradient at A can indeed lead to both better scores and higher % of good swaps.

Notice that it's also noisier at the ends, where it dips below the middle. If you order pixels by their quality, and then plot the absolute gradient, you see:

Selecting for large gradient at A will select both the best and the worst possible pixels A. This implies that there are pixels in the noise that are very significant, but are nevertheless currently 'unfixable'. This corresponds to the 'focus' described earlier.

By drawing from the 'top', I was mining the imbalance between the good/left and bad/right distribution. Selecting for a vanishing gradient would instead select the average-to-bad pixels A.

I investigated one instance of each: very good, average or very bad pixel A. I tried every possible swap (A, B) and plotted the curve again. Here the quality is just the actual score improvement:

The three scenarios have similar curves, with the bulk of swaps being negative. Only a tiny bit of the curve is sticking out positive, even in the best case. The potential benefit of a good swap is dwarfed by the potential harm of bad swaps. The main difference is just how many positive swaps there are, if any.

So let's focus on the positive case, where you can see best.

You can order by score, and plot the gradient of all the pixels B, to see another correlation.

It looks kinda promising. Here the sign matters, with left and right being different. If the gradient of pixel A is the opposite sign, then this graph is mirrored.

But if you order by (signed) gradient and plot the score, you see the real problem, caused by the noise:

The good samples are mixed freely among the bad ones, with only a very weak trend downward. This explains why sampling improvements based purely on gradient for pixel B are impossible.

You can see what's going on if you plot Δv, the difference in value between A and B:

For a given pixel A, all the good swaps have a similar value for B, which is not unexpected. Its mean is the ideal value for A, but there is a lot of variance. In this case pixel A is nearly white, so it is brighter than almost every other pixel B.

If you now plot Δv * -gradient, you see a clue on the left:

Almost all of the successful swaps have a small but positive value.

This represents what we already knew: the gradient's sign tells you if a pixel should be brighter or darker. If Δv has the opposite sign, the chances of a successful swap are slim.

Ideally both pixels 'face' the right way, so the swap is beneficial on both ends. But only the combined effect on the loss matters: i.e. Δv * Δgradient < 0.

It's only true differentially so it can misfire. But compared to blind sampling of pairs, it's easily 5-10x better and faster, racing towards the tougher parts of the search.

What's more... while this test is just binary, I found that any effort spent on trying to further prioritize swaps by the magnitude of the gradient is entirely wasted. Maximizing Δv * Δgradient by repeated sampling is counterproductive, because it selects more bad candidates on the right. Minimizing Δv * Δgradient creates more successful swaps on the left, but lowers the average improvement per step so the convergence is net slower. Anything more sophisticated incurs too much computation to be worth it.

It does have a limit. This is what it looks like when an image is practically fully converged:

Eventually you reach the point where there are only a handful of swaps with any real benefit, while the rest is just shaving off a few bits of loss at a time. It devolves back to pure random selection, only skipping the coin flip for the gradient. It is likely that more targeted heuristics can still work here.

The gradient also works in the early stages. As it barely changes over successive swaps, this leads to a different kind of sparse mode. Instead of scoring only a subset of pixels, simply score multiple swaps as a group over time, without re-scoring intermediate states. This lowers the success rate roughly by a power (e.g. 0.8 -> 0.64), but cuts the number of FFTs by a constant factor (e.g. 1/2). Early on this trade-off can be worth it.

Even faster: don't score steps at all. In the very early stage, you can easily get up to 80-90% successful swaps just by filtering on values and gradients. If you just swap a bunch in a row, there is a very good chance you will still end up better than before.

It works better than sparse scoring: using the gradient of your true objective approximately works better than using an approximate objective exactly.

The latter will miss the true goal by design, while the former continually re-aims itself to the destination despite inaccuracy.

Obviously you can mix and match techniques, and gradient + sparse is actually a winning combo. I've only scratched the surface here.

Warp Speed

Time to address the elephant in the room. If the main bottleneck is an FFT, wouldn't this work better on a GPU?

The answer to that is an unsurprising yes, at least for large sizes where the overhead of async dispatch is negligible. However, it would have been endlessly more cumbersome to discover all of the above based on a GPU implementation, where I can't just log intermediate values to a console.

After checking everything, I pulled out my bag of tricks and ported it to Use.GPU. As a result, the algorithm runs entirely on the GPU, and provides live visualization of the entire process. It requires a WebGPU-enabled browser, which in practice means Chrome on Windows or Mac, or a dev build elsewhere.

I haven't particularly optimized this—the FFT is vanilla textbook—but it works. It provides an easy ~8x speed up on an M1 Mac on beefier target sizes. With a desktop GPU, 128x128x32 and larger become very feasible.

It lacks a few goodies from the scripts, and only does gradient + optional sparse. You can however freely exchange PNGs between the CPU and GPU version via drag and drop, as long as the settings match.

Layout components

Compute components

Worth pointing out: this visualization is built using Use.GPU's HTML-like layout system. I can put div-like blocks inside a flex box wrapper, and put text beside it... while at the same time using raw WGSL shaders as the contents of those divs. These visualization shaders sample and colorize the algorithm state on the fly, with no CPU-side involvement other than a static dispatch. The only GPU -> CPU readback is for the stats in the corner, which are classic React and real HTML, along with the rest of the controls.

I can then build an component and drop it inside an async , and it does exactly what it should. The rest is just a handful of elements and the ordinary headache of writing compute shaders. ensures all the shaders are compiled before dispatching.

While running, the bulk of the tree is inert, with only a handful of reducers triggering on a loop, causing a mere 7 live components to update per frame. The compute dispatch fights with the normal rendering for GPU resources, so there is an auto-batching mechanism that aims for approximately 30-60 FPS.

The display is fully anti-aliased, including the pixelized data. I'm using the usual per-pixel SDF trickery to do this... it's applied as a generic wrapper shader for any UV-based sampler.

It's a good showcase that Use.GPU really is React-for-GPUs with less hassle, but still with all the goodies. It bypasses most of the browser once the canvas gets going, and it isn't just for UI: you can express async compute just fine with the right component design. The robust layout and vector plotting capabilities are just extra on top.

I won't claim it's the world's most elegant abstraction, because it's far too pragmatic for that. But I simply don't know any other programming environment where I could even try something like this and not get bogged down in GPU binding hell, or have to round-trip everything back to the CPU.

* * *

So there you have it: blue and indigo noise à la carte.

What I find most interesting is that the problem of generating noise in the time domain has been recast into shaping and denoising a spectrum in the frequency domain. It starts as white noise, and gets turned into a pre-designed picture. You do so by swapping pixels in the other domain. The state for this process is kept in the phase channel, which is not directly relevant to the problem, but drifts into alignment over time.

Hence I called it Stable Fiddusion. If you swap the two domains, you're turning noise into a picture by swapping frequency bands without changing their values. It would result in a complex-valued picture, whose magnitude is the target, and whose phase encodes the progress of the convergence process.

This is approximately what you get when you add a hidden layer to a diffusion model.

What I also find interesting is that the notion of swaps naturally creates a space that is O(N²) big with only N samples of actual data. Viewed from the perspective of a single step, every pair (A,B) corresponds to a unique information mask in the frequency domain that extracts a unique delta from the same data. There is redundancy, of course, but the nature of the Fourier transform smears it out into one big superposition. When you do multiple swaps, the space grows, but not quite that fast: any permutation of the same non-overlapping swaps is equivalent. There is also a notion of entanglement: frequency bands / pixels are linked together to move as a whole by default, but parts will diffuse into being locked in place.

Phase is kind of the bugbear of the DSP world. Everyone knows it's there, but they prefer not to talk about it unless its content is neat and simple. Hopefully by now you have a better appreciation of the true nature of a Fourier transform. Not just as a spectrum for a real-valued signal, but as a complex-valued transform of a complex-valued input.

During a swap run, the phase channel continuously looks like noise, but is actually highly structured when queried with the right quefrency hashes. I wonder what other things look like that, when you flip them around.

More:

Sub-pixel Distance Transform

2023-07-17T00:00:00+02:00

High quality font rendering for WebGPU

This page includes diagrams in WebGPU, which has limited browser support. For the full experience, use Chrome on Windows or Mac, or a developer build on other platforms.

In this post I will describe Use.GPU's text rendering, which uses a bespoke approach to Signed Distance Fields (SDFs). This was borne out of necessity: while SDF text is pretty common on GPUs, some of the established practice on generating SDFs from masks is incorrect, and some libraries get it right only by accident. So this will be a deep dive from first principles, about the nuances of subpixels.

SDFs

The idea behind SDFs is quite simple. To draw a crisp, anti-aliased shape at any size, you start from a field or image that records the distance to the shape's edge at every point, as a gradient. Lighter grays are inside, darker grays are outside. This can be a lower resolution than the target.

Then you increase the contrast until the gradient is exactly 1 pixel wide at the target size. You can sample it to get a perfectly anti-aliased opacity mask:

This works fine for text at typical sizes, and handles fractional shifts and scales perfectly with zero shimmering. It's also reasonably correct from a signal processing math point-of-view: it closely approximates averaging over a pixel-sized circular window, i.e. a low-pass convolution.

Crucially, it takes a rendered glyph as input, which means I can remain blissfully unaware of TrueType font specifics, and bezier rasterization, and just offload that to an existing library.

To generate an SDF, I started with MapBox's TinySDF library. Except, what comes out of it is wrong:

The contours are noticeably wobbly and pixelated. The only reason the glyph itself looks okay is because the errors around the zero-level are symmetrical and cancel out. If you try to dilate or contract the outline, which is supposed to be one of SDF's killer features, you get ugly gunk.

Compare to:

The original Valve paper glosses over this aspect and uses high resolution inputs (4k) for a highly downscaled result (64). That is not an option for me because it's too slow. But I did get it to work. As a result Use.GPU has a novel subpixel-accurate distance transform (ESDT), which even does emoji. It's a combination CPU/GPU approach, with the CPU generating SDFs and the GPU rendering them, including all the debug viz.

The Classic EDT

The common solution is a Euclidean Distance Transform. Given a binary mask, it will produce an unsigned distance field. This holds the squared distance d² for either the inside or outside area, which you can sqrt.

Like a Fourier Transform, you can apply it to 2D images by applying it horizontally on each row X, then vertically on each column Y (or vice versa). To make a signed distance field, you do this for both the inside and outside separately, and then combine the two as inside – outside or vice versa.

The algorithm is one of those clever bits of 80s-style C code which is O(N), has lots of 1-letter variable names, and is very CPU cache friendly. Often copy/pasted, but rarely understood. In TypeScript it looks like this, where array is modified in-place and f, v and z are temporary buffers up to 1 row/column long. The arguments offset and stride allow the code to be used in either the X or Y direction in a flattened 2D array.

for (let q = 1, k = 0, s = 0; q < length; q++) {
  f[q] = array[offset + q * stride];

  do {
    let r = v[k];
    s = (f[q] - f[r] + q * q - r * r) / (q - r) / 2;
  } while (s <= z[k] && --k > -1);

  k++;
  v[k] = q;
  z[k] = s;
  z[k + 1] = INF;
}

for (let q = 0, k = 0; q < length; q++) {
  while (z[k + 1] < q) k++;
  let r = v[k];
  let d = q - r;
  array[offset + q * stride] = f[r] + d * d;
}

To explain what this code does, let's start with a naive version instead.

Given a 1D input array of zeroes (filled), with an area masked out with infinity (empty):

O = [·, ·, ·, 0, 0, 0, 0, 0, ·, 0, 0, 0, ·, ·, ·]

Make a matching sequence … 3 2 1 0 1 2 3 … for each element, centering the 0 at each index:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14] + ∞
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13] + ∞
[2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12] + ∞
[3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11] + 0
[4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10] + 0
[5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9] + 0
[6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8] + 0
[7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7] + 0
[8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6] + ∞
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5] + 0
[10,9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4] + 0
[11,10,9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3] + 0
[12,11,10,9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2] + ∞
[13,12,11,10,9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1] + ∞
[14,13,12,11,10,9, 8, 7, 6, 5, 4, 3, 2, 1, 0] + ∞

You then add the value from the array to each element in the row:

[∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞]
[∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞]
[∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞]
[3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11]
[4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10]
[5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8]
[7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7]
[∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞]
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5]
[10,9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4]
[11,10,9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3]
[∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞]
[∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞]
[∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞]

And then take the minimum of each column:

P = [3, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 2, 3]

This sequence counts up inside the masked out area, away from the zeroes. This is the positive distance field P.

You can do the same for the inverted mask:

I = [0, 0, 0, ·, ·, ·, ·, ·, 0, ·, ·, ·, 0, 0, 0]

to get the complementary area, i.e. the negative distance field N:

N = [0, 0, 0, 1, 2, 3, 2, 1, 0, 1, 2, 1, 0, 0, 0]

That's what the EDT does, except it uses square distance … 9 4 1 0 1 4 9 …:

When you apply it a second time in the second dimension, these outputs are the new input, i.e. values other than 0 or ∞. It still works because of Pythagoras' rule: d² = x² + y². This wouldn't be true if it used linear distance instead. The net effect is that you end up intersecting a series of parabolas, somewhat like a 1D slice of a Voronoi diagram:

I' = [0, 0, 1, 4, 9, 4, 4, 4, 1, 1, 4, 9, 4, 9, 9]

Each parabola sitting above zero is the 'shadow' of a zero-level paraboloid located some distance in a perpendicular dimension:

The code is just a more clever way to do that, without generating the entire N² grid per row/column. It instead scans through the array left to right, building up a list v[k] of significant minima, with thresholds s[k] where two parabolas intersect. It adds them as candidates (k++) and discards them (--k) if they are eclipsed by a newer value. This is the first for/while loop:

for (let q = 1, k = 0, s = 0; q < length; q++) {
  f[q] = array[offset + q * stride];

  do {
    let r = v[k];
    s = (f[q] - f[r] + q * q - r * r) / (q - r) / 2;
  } while (s <= z[k] && --k > -1);

  k++;
  v[k] = q;
  z[k] = s;
  z[k + 1] = INF;
}

Then it goes left to right again (for), and fills out the values, skipping ahead to the right minimum (k++). This is the squared distance from the current index q to the nearest minimum r, plus the minimum's value f[r] itself. The paper has more details:

for (let q = 0, k = 0; q < length; q++) {
  while (z[k + 1] < q) k++;
  let r = v[k];
  let d = q - r;
  array[offset + q * stride] = f[r] + d * d;
}

The Broken EDT

So what's the catch? The above assumes a binary mask.

As it happens, if you try to subtract a binary N from P, you have an off-by-one error:

    O = [·, ·, ·, 0, 0, 0, 0, 0, ·, 0, 0, 0, ·, ·, ·]
    I = [0, 0, 0, ·, ·, ·, ·, ·, 0, ·, ·, ·, 0, 0, 0]

    P = [3, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 2, 3]
    N = [0, 0, 0, 1, 2, 3, 2, 1, 0, 1, 2, 1, 0, 0, 0]

P - N = [3, 2, 1,-1,-2,-3,-2,-1, 1,-1,-2,-1, 1, 2, 3]

It goes directly from 1 to -1 and back. You could add +/- 0.5 to fix that.

But if there is a gray pixel in between each white and black, which we classify as both inside (0) and outside (0), it seems to work out just fine:

    O = [·, ·, ·, 0, 0, 0, 0, 0, ·, 0, 0, 0, ·, ·, ·]
    I = [0, 0, 0, 0, ·, ·, ·, 0, 0, 0, ·, 0, 0, 0, 0]

    P = [3, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 2, 3]
    N = [0, 0, 0, 0, 1, 2, 1, 0, 0, 0, 1, 0, 0, 0, 0]

P - N = [3, 2, 1, 0,-1,-2,-1, 0, 1, 0,-1, 0, 1, 2, 3]

This is a realization that somebody must've had, and they reasoned on: "The above is correct for a 50% opaque pixel, where the edge between inside and outside falls exactly in the middle of a pixel."

"Lighter grays are more inside, and darker grays are more outside. So all we need to do is treat l = level - 0.5 as a signed distance, and use l² for the initial inside or outside value for gray pixels. This will cause either the positive or negative distance field to shift by a subpixel amount l. And then the EDT will propagate this in both X and Y directions."

The initial idea is correct, because this is just running SDF rendering in reverse. A gray pixel in an opacity mask is what you get when you contrast adjust an SDF and do not blow it out into pure black or white. The information inside the gray pixels is "correct", up to rounding.

But there are two mistakes here.

The first is that even in an anti-aliased image, you can have white pixels right next to black ones. Especially with fonts, which are pixel-hinted. So the SDF is wrong there, because it goes directly from -1 to 1. This causes the contours to double up, e.g. around this bottom edge:

To solve this, you can eliminate the crisp case by deliberately making those edges very dark or light gray.

But the second mistake is more subtle. The EDT works in 2D because you can feed the output of X in as the input of Y. But that means that any non-zero input to X represents another dimension Z, separate from X and Y. The resulting squared distance will be x² + y² + z². This is a 3D distance, not 2D.

If an edge is shifted by 0.5 pixels in X, you would expect a 1D SDF like:

  […, 0.5, 1.5, 2.5, 3.5, …]
= […, 0.5, 1 + 0.5, 2 + 0.5, 3 + 0.5, …]

Instead, because of the squaring + square root, you will get:

  […, 0.5, 1.12, 2.06, 3.04, …]
= […, sqrt(0.25), sqrt(1 + 0.25), sqrt(4 + 0.25), sqrt(9 + 0.25), …]

The effect of l² = 0.25 rapidly diminishes as you get away from the edge, and is significantly wrong even just one pixel over.

The correct shift would need to be folded into (x + …)² + (y + …)² and depends on the direction. e.g. If an edge is shifted horizontally, it ought to be (x + l)² + y², which means there is a term of 2*x*l missing. If the shift is vertical, it's 2*y*l instead. This is also a signed value, not positive/unsigned.

Given all this, it's a miracle this worked at all. The only reason this isn't more visible in the final glyph is because the positive and negative fields contains the same but opposite errors around their respective gray pixels.

The Not-Subpixel EDT

As mentioned before, the EDT algorithm is essentially making a 1D Voronoi diagram every time. It finds the distance to the nearest minimum for every array index. But there is no reason for those minima themselves to lie at integer offsets, because the second for loop effectively resamples the data.

So you can take an input mask, and tag each index with a horizontal offset Δ:

O = [·, ·, ·, 0, 0, 0, 0, 0, ·, ·, ·]
Δ = [A, B, C, D, E, F, G, H, I, J, K]

As long as the offsets are small, no two indices will swap order, and the code still works. You then build the Voronoi diagram out of the shifted parabolas, but sample the result at unshifted indices.

Problem 1 - Opposing Shifts

This lead me down the first rabbit hole, which was an attempt to make the EDT subpixel capable without losing its appealing simplicity. I started by investigating the nuances of subpixel EDT in 1D. This was a bit of a mirage, because most real problems only pop up in 2D. Though there was one important insight here.

O = [·, ·, ·, 0, 0, 0, 0, 0, ·, ·, ·]
Δ = [·, ·, ·, A, ·, ·, ·, B, ·, ·, ·]

Given a mask of zeroes and infinities, you can only shift the first and last point of each segment. Infinities don't do anything, while middle zeroes should remain zero.

Using an offset A works sort of as expected: this will increase or decrease the values filled in by a fractional pixel, calculating a squared distance (d + A)² where A can be positive or negative. But the value at the shifted index itself is always (0 + A)² (positive). This means it is always outside, regardless of whether it is moving left or right.

If A is moving left (–), the point is inside, and the (unsigned) distance should be 0. At B the situation is reversed: the distance should be 0 if B is moving right (+). It might seem like this is annoying but fixable, because the zeroes can be filled in by the opposite signed field. But this is only looking at the binary 1D case, where there are only zeroes and infinities.

In 2D, a second pass has non-zero distances, so every index can be shifted:

O = [a, b, c, d, e, f, g, h, i, j, k]
Δ = [A, B, C, D, E, F, G, H, I, J, K]

Now, resolving every subpixel unambiguously is harder than you might think:

It's important to notice that the function being sampled by an EDT is not actually smooth: it is the minimum of a discrete set of parabolas, which cross at an angle. The square root of the output only produces a smooth linear gradient because it samples each parabola at integer offsets. Each center only shifts upward by the square of an integer in every pass, so the crossings are predictable. You never sample the 'wrong' side of (d + ...)². A subpixel EDT does not have this luxury.

Subpixel EDTs are not irreparably broken though. Rather, they are only valid if they cause the unsigned distance field to increase, i.e. if they dilate the empty space. This is a problem: any shift that dilates the positive field contracts the negative, and vice versa.

To fix this, you need to get out of the handwaving stage and actually understand P and N as continuous 2D fields.

Problem 2 - Diagonals

Consider an aliased, sloped edge. To understand how the classic EDT resolves it, we can turn it into a voronoi diagram for all the white pixel centers:

Near the bottom, the field is dominated by the white pixels on the corners: they form diagonal sections downwards. Near the edge itself, the field runs perfectly vertical inside a roughly triangular section. In both cases, an arrow pointing back towards the cell center is only vaguely perpendicular to the true diagonal edge.

Near perfect diagonals, the edge distances are just wrong. The distance of edge pixels goes up or right (1), rather than the more logical diagonal 0.707…. The true closest point on the edge is not part of the grid.

These fields don't really resolve properly until 6-7 pixels out. You could hide these flaws with e.g. an 8x downscale, but that's 64x more pixels. Either way, you shouldn't expect perfect numerical accuracy from an EDT. Just because it's mathematically separable doesn't mean it's particularly good.

In fact, it's only separable because it isn't very good at all.

Problem 3 - Gradients

In 2D, there is also only one correct answer to the gray case. Consider a diagonal edge, anti-aliased:

Thresholding it into black, grey or white, you get:

If you now classify the grays as both inside and outside, then the highlighted pixels will be part of both masks. Both the positive and negative field will be exactly zero there, and so will the SDF (P - N):

This creates a phantom vertical edge that pushes apart P and N, and causes the average slope to be less than 45º. The field simply has the wrong shape, because gray pixels can be surrounded by other gray pixels.

This also explains why TinySDF magically seemed to work despite being so pixelized. The l² gray correction fills in exactly the gaps in the bad (P - N) field where it is zero, and it interpolates towards a symmetrically wrong P and N field on each side.

If we instead classify grays as neither inside nor outside, then P and N overlap in the boundary, and it is possible to resolve them into a coherent SDF with a clean 45 degree slope, if you do it right:

What seemed like an off-by-one error is actually the right approach in 2D or higher. The subpixel SDF will then be a modified version of this field, where the P and N sides are changed in lock-step to remain mutually consistent.

Though we will get there in a roundabout way.

Problem 4 - Commuting

It's worth pointing out: a subpixel EDT simply cannot commute in 2D.

First, consider the data flow of an ordinary EDT:

Information from a corner pixel can flow through empty space both when doing X-then-Y and Y-then-X. But information from the horizontal edge pixels can only flow vertically then horizontally. This is okay because the separating lines between adjacent pixels are purely vertical too: the red arrows never 'win'.

But if you introduce subpixel shifts, the separating lines can turn:

The data flow is still limited to the original EDT pattern, so the edge pixels at the top can only propagate by starting downward. They can only influence adjacent columns if the order is Y-then-X. For vertical edges it's the opposite.

That said, this is only a problem on shallow concave curves, where there aren't any corner pixels nearby. The error is that it 'snaps' to the wrong edge point, but only when it is already several pixels away from the edge. In that case, the smaller x² term is dwarfed by the much larger y² term, so the absolute error is small after sqrt.

The ESDT

Knowing all this, here's how I assembled a "true" Euclidean Subpixel Distance Transform.

Subpixel offsets

To start we need to determine the subpixel offsets. We can still treat level - 0.5 as the signed distance for any gray pixel, and ignore all white and black for now.

The tricky part is determining the exact direction of that distance. As an approximation, we can examine the 3x3 neighborhood around each gray pixel and do a least-squares fit of a plane. As long as there is at least one white and one black pixel in this neighborhood, we get a vector pointing towards where the actual edge is. In practice I apply some horizontal/vertical smoothing here using a simple [1 2 1] kernel.

The result is numerically very stable, because the originally rasterized image is visually consistent.

This logic is disabled for thin creases and spikes, where it doesn't work. Such points are treated as fully masked out, so that neighboring distances propagate there instead. This is needed e.g. for the pointy negative space of a W to come out right.

I also implemented a relaxation step that will smooth neighboring vectors if they point in similar directions. However, the effect is quite minimal, and it rounds very sharp corners, so I ended up disabling it by default.

The goal is then to do an ESDT that uses these shifted positions for the minima, to get a subpixel accurate distance field.

P and N junction

We saw earlier that only non-masked pixels can have offsets that influence the output (#1). We only have offsets for gray pixels, yet we concluded that gray pixels should be masked out, to form a connected SDF with the right shape (#3). This can't work.

SDFs are both the problem and the solution here. Dilating and contracting SDFs is easy: add or subtract a constant. So you can expand both P and N fields ahead of time geometrically, and then undo it numerically. This is done by pushing their respective gray pixel centers in opposite directions, by half a pixel, on top of the originally calculated offset:

This way, they can remain masked in in both fields, but are always pushed between 0 and 1 pixel inwards. The distance between the P and N gray pixel offsets is always exactly 1, so the non-zero overlap between P and N is guaranteed to be exactly 1 pixel wide everywhere. It's a perfect join anywhere we sample it, because the line between the two ends crosses through a pixel center.

When we then calculate the final SDF, we do the opposite, shifting each by half a pixel and trimming it off with a max:

SDF = max(0, P - 0.5) - max(0, N - 0.5)

Only one of P or N will be > 0.5 at a time, so this is exact.

To deal with pure black/white edges, I treat any black neighbor of a white pixel (horizontal or vertical only) as gray with a 0.5 pixel offset (before P/N dilation). No actual blurring needs to happen, and the result is numerically exact minus epsilon, which is nice.

ESDT state

The state for the ESDT then consists of remembering a signed X and Y offset for every pixel, rather than the squared distance. These are factored into the distance and threshold calculations, separated into its proper parallel and orthogonal components, i.e. X/Y or Y/X. Unlike an EDT, each X or Y pass has to be aware of both axes. But the algorithm is mostly unchanged otherwise, here X-then-Y.

The X pass:

At the start, only gray pixels have offsets and they are all in the range -1…1 (exclusive). With each pass of ESDT, a winning minima's offsets propagate to its affecting range, tracking the total distance (Δx, Δy) (> 1). At the end, each pixel's offset points to the nearest edge, so the squared distance can be derived as Δx² + Δy².

The Y pass:

You can see that the vertical distances in the top-left are practically vertical, and not oriented perpendicular to the contour on average: they have not had a chance to propagate horizontally. But they do factor in the vertical subpixel offset, and this is the dominant component. So even without correction it still creates a smooth SDF with a surprisingly small error.

Fix ups

The commutativity errors are all biased positively, meaning we get an upper bound of the true distance field.

You could take the min of X then Y and Y then X. This would re-use all the same prep and would restore rotation-independence at the cost of 2x as many ESDTs. You could try X then Y then X at 1.5x cost with some hacks. But neither would improve diagonal areas, which were still busted in the original EDT.

Instead I implemented an additional relaxation pass. It visits every pixel's target, and double checks whether one of the 4 immediate neighbors (with subpixel offset) isn't a better solution:

It's a good heuristic because if the target is >1px off there is either a viable commutative propagation path, or you're so far away the error is negligible. It fixes up the diagonals, creating tidy lines when the resolution allows for it:

You could take this even further, given that you know the offsets are supposed to be perpendicular to the glyph contour. You could add reprojection with a few dot products here, but getting it to not misfire on edge cases would be tricky.

While you can tell the unrelaxed offsets are wrong when visualized, and the fixed ones are better, the visual difference in the output glyphs is tiny. You need to blow glyphs up to enormous size to see the difference side by side. So it too is disabled by default. The diagonals in the original EDT were wrong too and you could barely tell.

Emoji

An emoji is generally stored as a full color transparent PNG or SVG. The ESDT can be applied directly to its opacity mask to get an SDF, so no problem there.

There are an extremely rare handful of emoji with semi-transparent areas, but you can get away with making those solid. For this I just use a filter that detects '+' shaped arrangements of pixels that have (almost) the same transparency level. Then I dilate those by 3x3 to get the average transparency level in each area. Then I divide by it to only keep the anti-aliased edges transparent.

The real issue is blending the colors at the edges, when the emoji is being rendered and scaled. The RGB color of transparent pixels is undefined, so whatever values are there will blend into the surrounding pixels, e.g. creating a subtle black halo:

Not Premultiplied

Premultiplied

A common solution is premultiplied alpha. The opacity is baked into the RGB channels as (R * A, G * A, B * A, A), and transparent areas must be fully transparent black. This allows you to use a premultiplied blend mode where the RGB channels are added directly without further scaling, to cancel out the error.

But the alpha channel of an SDF glyph is dynamic, and is independent of the colors, so it cannot be premultiplied. We need valid color values even for the fully transparent areas, so that up- or downscaling is still clean.

Luckily the ESDT calculates X and Y offsets which point from each pixel directly to the nearest edge. We can use them to propagate the colors outward in a single pass, filling in the entire background. It doesn't need to be very accurate, so no filtering is required.

RGB channel

Output

The result looks pretty great. At normal sizes, the crisp edge hides the fact that the interior is somewhat blurry. Emoji fonts are supported via the underlying ab_glyph library, but are too big for the web (10MB+). So you can just load .PNGs on demand instead, at whatever resolution you need. Hooking it up to the 2D canvas to render native system emoji is left as an exercise for the reader.

Use.GPU does not support complex Unicode scripts or RTL text yet—both are a can of worms I wish to offload too—but it does support composite emoji like "pirate flag" (white flag + skull and crossbones) or "male astronaut" (astronaut + man) when formatted using the usual Zero-Width Joiners (U+200D) or modifiers.

Shading

Finally, a note on how to actually render with SDFs, which is more nuanced than you might think.

I pack all the SDF glyphs into an atlas on-demand, the same I use elsewhere in Use.GPU. This has a custom layout algorithm that doesn't backtrack, optimized for filling out a layout at run-time with pieces of a similar size. Glyphs are rasterized at 1.5x their normal font size, after rounding up to the nearest power of two. The extra 50% ensures small fonts on low-DPI displays still use a higher quality SDF, while high-DPI displays just upscale that SDF without noticeable quality loss. The rounding ensures similar font sizes reuse the same SDFs. You can also override the detail independent of font size.

To determine the contrast factor to draw an SDF, you generally use screen-space derivatives. There are good and bad ways of doing this. Your goal is to get a ratio of SDF pixels to screen pixels, so the best thing to do is give the GPU the coordinates of the SDF texture pixels, and ask it to calculate the difference for that between neighboring screen pixels. This works for surfaces in 3D at an angle too. Bad ways of doing this will instead work off relative texture coordinates, and introduce additional scaling factors based on the view or atlas size, when they are all just supposed to cancel out.

As you then adjust the contrast of an SDF to render it, it's important to do so around the zero-level. The glyph's ideal vector shape should not expand or contract as you scale it. Like TinySDF, I use 75% gray as the zero level, so that more SDF range is allocated to the outside than the inside, as dilating glyphs is much more common than contraction.

At the same time, a pixel whose center sits exactly on the zero level edge is actually half inside, half outside, i.e. 50% opaque. So, after scaling the SDF, you need to add 0.5 to the value to get the correct opacity for a blend. This gives you a mathematically accurate font rendering that approximates convolution with a pixel-sized circle or box.

But I go further. Fonts were not invented for screens, they were designed for paper, with ink that bleeds. Certain renderers, e.g. MacOS, replicate this effect. The physical bleed distance is relatively constant, so the larger the font, the smaller the effect of the bleed proportionally. I got the best results with a 0.25 pixel bleed at 32px or more. For smaller sizes, it tapers off linearly. When you zoom out blocks of text, they get subtly fatter instead of thinning out, and this is actually a great effect when viewing document thumbnails, where lines of text become a solid mass at the point where the SDF resolution fails anyway.

In Use.GPU I prefer to use gamma correct, linear RGB color, even for 2D. What surprised me the most is just how unquestionably superior this looks. Text looks rock solid and readable even at small sizes on low-DPI. Because the SDF scales, there is no true font hinting, but it really doesn't need it, it would just be a nice extra.

Presumably you could track hinted points or edges inside SDF glyphs and then do a dynamic distortion somehow, but this is an order of magnitude more complex than what it is doing now, which is splat a contrasted texture on screen. It does have snapping you can turn on, which avoids jiggling of individual letters. But if you turn it off, you get smooth subpixel everything:

I was always a big fan of the 3x1 subpixel rendering used on color LCDs (i.e. ClearType and the like), and I was sad when it was phased out due to the popularity of high-DPI displays. But it turns out the 3x res only offered marginal benefits... the real improvement was always that it had a custom gamma correct blend mode, which is a thing a lot of people still get wrong. Even without RGB subpixels, gamma correct AA looks great. Converting the entire desktop to Linear RGB is also not going to happen in our lifetime, but I really want it more now.

The "blurry text" that some people associate with anti-aliasing is usually just text blended with the wrong gamma curve, and without an appropriate bleed for the font in question.

* * *

If you want to make SDFs from existing input data, subpixel accuracy is crucial. Without it, fully crisp strokes actually become uneven, diagonals can look bumpy, and you cannot make clean dilated outlines or shadows. If you use an EDT, you have to start from a high resolution source and then downscale away all the errors near the edges. But if you use an ESDT, you can upscale even emoji PNGs with decent results.

It might seem pretty obvious in hindsight, but there is a massive difference between getting it sort of working, and actually getting all the details right. There were many false starts and dead ends, because subpixel accuracy also means one bad pixel ruins it.

In some circles, SDF text is an old party trick by now... but a solid and reliable implementation is still a fair amount of work, with very little to go off for the harder parts.

By the way, I did see if I could use voronoi techniques directly, but in terms of computation it is much more involved. Pretty tho:

The ESDT is fast enough to use at run-time, and the implementation is available as a stand-alone import for drop-in use.

This post started as a single live WebGPU diagram, which you can play around with. The source code for all the diagrams is available too.

Fuck It, We'll Do It Live

2023-05-25T00:00:00+02:00

How the Live effect run-time is implemented

In this post I describe how the Live run-time internals are implemented, which drive Use.GPU. Some pre-existing React and FP effect knowledge is useful.

I have written about Live before, but in general terms. You may therefor have a wrong impression of this endeavor.

When a junior engineer sees an application doing complex things, they're often intimidated by the prospect of working on it. They assume that complex functionality must be the result of complexity in code. The fancier the app, the less understandable it must be. This is what their experience has been so far: seniority correlates to more and hairier lines of code.

After 30 years of coding though, I know it's actually the inverse. You cannot get complex functionality working if you've wasted all your complexity points on the code itself. This is the main thing I want to show here, because this post mainly describes 1 data structure and a handful of methods.

Live has a real-time inspector, so a lot of this can be demonstrated live. Reading this on a phone is not recommended, the inspector is made for grown-ups.

The story so far:

The main mechanism of Live is to allow a tree to expand recursively like in React, doing breadth-first expansion. This happens incrementally, and in a reactive, rewindable way. You use this to let interactive programs knit themselves together at run-time, based on the input data.

Like a simple CLI program with a main() function, the code runs top to bottom, and then stops. It produces a finite execution trace that you can inspect. To become interactive and to animate, the run-time will selectively rewind, and re-run only parts, in response to external events. It's a fusion of immediate and retained mode UI, offering the benefits of both and the downsides of neither, not limited to UI.

This relies heavily on FP principles such as pure functions, immutable data structures and so on. But the run-time itself is very mutable: the whole idea is to centralize all the difficult parts of tracking changes in one place, and then forget about them.

Live has no dependencies other than a JavaScript engine and these days consists of ~3600 lines.

If you're still not quite sure what the Live component tree actually is, it's 3 things at once:

a data dependency graph
an execution trace
a tree-shaped cache

The properties of the software emerge because these aspects are fully aligned inside a LiveComponent.

Functionally Mayonnaise

You can approach this from two sides, either from the UI side, or from the side of functional Effects.

Live Components

A LiveComponent (LC) is a React UI function component (FC) with 1 letter changed, at first:

const MyComponent: LC = (props: MyProps) => {
  const {wat} = props;

  // A memo hook
  // Takes dependencies as array
  const slow = useMemo(() => expensiveComputation(wat), [wat]);

  // Some local state
  const [state, setState] = useState(1);
  
  // JSX expressions with props and children
  // These are all names of LC functions to call + their props/arguments
  return (
    
      
      
    
  );
};

The data is immutable, and the rendering appears stateless: it returns a pure data structure for given input props and current state. The component uses hooks to access and manipulate its own state. The run-time will unwrap the outer layer of the onion, mount and reconcile it, and then recurse.

let _ = await (
  
    
    
  
);
return null;

The code is actually misleading though. Both in Live and React, the return keyword here is technically wrong. Return implies passing a value back to a parent, but this is not happening at all. A parent component decided to render , yes. But the function itself is being called by Live/React. it's yielding JSX to the Live/React run-time to make a call to OtherComponent(...). There is no actual return value.

Because a can't return a value to its parent, the received _ will always be null too. The data flow is one-way, from parent to child.

Effects

An Effect is basically just a Promise/Future as a pure value. To first approximation, it's a () => Promise: a promise that doesn't actually start unless you call it a second time. Just like a JSX tag is like a React/Live component waiting to be called. An Effect resolves asynchronously to a new Effect, just like will render more . Unlike a Promise, an Effect is re-usable: you can fire it as many times as you like. Just like you can keep rendering the same .

let value = yield (
  OtherEffect([
    Foo(foo),
    Bar(),
  ])
);
// ...
return value;

So React is like an incomplete functional Effect system. Just replace the word Component with Effect. OtherEffect is then some kind of decorator which describes a parallel dispatch to Effects Foo and Bar. A real Effect system will fork, but then join back, gathering the returned values, like a real return statement.

Unlike React components, Effects are ephemeral: no state is retained after they finish. The purity is actually what makes them appealing in production, to manage complex async flows. They're also not incremental/rewindable: they always run from start to finish.

React	✅	❌	✅	✅
	Pure	Returns	State	Incremental
Effects	✅	✅	❌	❌

You either take an effect system and make it incremental and stateful, or you take React and add the missing return data path

I chose the latter option. First, because hooks are an excellent emulsifier. Second, because the big lesson from React is that plain, old, indexed arrays are kryptonite for incremental code. Unless you've deliberately learned how to avoid them, you won't get far, so it's better to start from that side.

This breakdown is divided into three main parts:

the rendering of 1 component
the environment around a component
the overall tree update loop

Components

The component model revolves around a few core concepts:

Fibers
Hooks and State
Mounts
Memoization
Inlining

Components form the "user land" of Live. You can do everything you need there without ever calling directly into the run-time's "kernel".

Live however does not shield its internals. This is fine, because I don't employ hundreds of junior engineers who would gleefully turn that privilege into a cluster bomb of spaghetti. The run-time is not extensible anyhow: what you see is what you get. The escape hatch is there to support testing and debugging.

Shielding this would be a game of hide-the-reference, creating a shadow-API for privileged friend packages, and so on. Ain't nobody got time for that.

React has an export called DONT_USE_THIS_OR_YOU_WILL_BE_FIRED, Live has THIS_WAY___IF_YOU_DARE and it's called useFiber.

Fibers

Borrowing React terminology, a mounted Component function is called a fiber, despite this being single threaded.

Each persists for the component lifetime. To start, you call render(). This creates and renders the first fiber.

type LiveFiber = {
  // Fiber ID
  id: number,

  // Component function
  f: Function,

  // Arguments (props, etc.)
  args: any[],

  // ...
}

Fibers are numbered with increasing IDs. In JS this means you can create 2⁵³ fibers before it crashes, which ought to be enough for anybody.

It holds the component function f and its latest arguments args. Unlike React, Live functions aren't limited to only a single props argument.

Each fiber is rendered from a tag, which is a plain data structure. The Live version is very simple.

type Key = number | string;
type JSX.Element = {
  // Same as fiber
  f: Function,
  args: any[],

  // Element key={...}
  key?: string | number,

  // Rendered by fiber ID
  by: number,
}

Another name for this type is a DeferredCall. This is much leaner than React's JSX type, although Live will gracefully accept either. In Live, JSX syntax is also optional, as you can write use(Component, …) instead of .

Calls and fibers track the ID by of the fiber that rendered them. This is always an ancestor, but not necessarily the direct parent.

fiber.bound = () => {
  enterFiber(fiber);

  const {f, args} = fiber;
  const jsx = f(...args);

  exitFiber(fiber);

  return jsx;
};

The fiber holds a function bound. This binds f to the fiber itself, always using the current fiber.args as arguments. It wraps the call in an enter and exit function for state housekeeping.

This can then be called via renderFiber(fiber) to get jsx. This is only done during an ongoing render cycle.

{
  // ...

  state: any[],
  pointer: number,
}

Hooks and State

Each fiber holds a local state array and a temporary pointer:

Calling a hook like useState taps into this state without an explicit reference to it.

In Live, this is implemented as a global currentFiber variable, combined with a local fiber.pointer starting at 0. Both are initialized by enterFiber(fiber).

The state array holds flattened triplets, one per hook. They're arranged as [hookType, A, B]. Values A and B are hook-specific, but usually hold a value and a dependencies array. In the case useState, it's just the [value, setValue] pair.

The fiber.pointer advances by 3 slots every time a hook is called. Tracking the hookType allows the run-time to warn you if you call hooks in a different order than before.

The basic React hooks don't need any more state than this and can be implemented in ~20 lines of code each. This is useMemo:

export const useMemo = (
  callback: () => T,
  dependencies: any[] = NO_DEPS,
): T => {
  const fiber = useFiber();

  const i = pushState(fiber, Hook.MEMO);
  let {state} = fiber;

  let value = state![i];
  const deps = state![i + 1];

  if (!isSameDependencies(deps, dependencies)) {
    value = callback();

    state![i] = value;
    state![i + 1] = dependencies;
  }

  return value as unknown as T;
}

useFiber just returns currentFiber and doesn't count as a real hook (it has no state). It only ensures you cannot call a hook outside of a component render.

export const useNoHook = (hookType: Hook) => () => {
  const fiber = useFiber();

  const i = pushState(fiber, hookType);
  const {state} = fiber;

  state![i] = undefined;
  state![i + 1] = undefined;
};

No-hooks like useNoMemo are also implemented, which allow for conditional hooks: write a matching else branch for any if. To ensure consistent rendering, a useNoHook will dispose of any state the useHook had, rather than just being a no-op. The above is just the basic version for simple hooks without cleanup.

This also lets the run-time support early return cleanly in Components: when exitFiber(fiber) is called, all remaining unconsumed state is disposed of with the right no-hook.

If someone calls a setState, this is added to a dispatch queue, so changes can be batched together. If f calls setState during its own render, this is caught and resolved within the same render cycle, by calling f again. A setState which is a no-op is dropped (pointer equality).

You can see however that Live hooks are not pure: when a useMemo is tripped, it will immediately overwrite the previous state during the render, not after. This means renders in Live are not stateless, only idempotent.

This is very deliberate. Live doesn't have a useEffect hook, it has a useResource hook that is like a useMemo with a useEffect-like disposal callback. While it seems to throw React's orchestration properties out the window, this is not actually so. What you get in return is an enormous increase in developer ergonomics, offering features React users are still dreaming of, running off 1 state array and 1 pointer.

Live is React with the training wheels off, not with holodeck security protocols disabled, but this takes a while to grok.

Mounts

After rendering, the returned/yielded value is reconciled with the previous rendered result. This is done by updateFiber(fiber, value).

New children are mounted, while old children are unmounted or have their args replaced. Only children with the same f as before can be updated in place.

{
  // ...
  
  // Static mount
  mount?: LiveFiber,

  // Dynamic mounts
  mounts?: Map,
  lookup?: Map,
  order?: Key[],

  // Continuation
  next?: LiveFiber,

  // Fiber type
  type?: LiveComponent,

  // ...
}

Mounts are tracked inside the fiber, either as a single mount, or a map mounts, pointing to other fiber objects.

The key for mounts is either an array index 0..N or a user-defined key. Keys must be unique.

The order of the keys is kept in a list. A reverse lookup map is created if they're not anonymous indices.

The mount is only used when a component renders 1 other statically. This excludes arrays of length 1. If a component switches between mount and mounts, all existing mounts are discarded.

Continuations are implemented as a special next mount. This is mounted by one of the built-in fenced operators such as or .

In the code, mounting is done via:

mountFiberCall(fiber, call) (static)
reconcileFiberCalls(fiber, calls) (dynamic)
mountFiberContinuation(fiber, call) (next).

Each will call updateMount(fiber, mount, jsx, key?, hasKeys?).

If an existing mount (with the same key) is compatible it's updated, otherwise a replacement fiber is made with makeSubFiber(…). It doesn't update the parent fiber, rather it just returns the new state of the mount (LiveFiber | null), so it can work for all 3 mounting types. Once a fiber mount has been updated, it's queued to be rendered with flushMount.

If updateMount returns false, the update was a no-op because fiber arguments were identical (pointer equality). The update will be skipped and the mount not flushed. This follows the same implicit memoization rule that React has. It tends to trigger when a stateful component re-renders an old props.children.

A subtle point here is that fibers have no links/pointers pointing back to their parent. This is part practical, part ideological. It's practical because it cuts down on cyclic references to complicate garbage collection. It's ideological because it helps ensures one-way data flow.

There is also no global collection of fibers, except in the inspector. Like in an effect system, the job of determining what happens is entirely the result of an ongoing computation on JSX, i.e. something passed around like pure, immutable data. The tree determines its own shape as it's evaluated.

Queue and Path

Live needs to process fibers in tree order, i.e. as in a typical tree list view. To do so, fibers are compared as values with compareFibers(a, b). This is based on references that are assigned only at fiber creation.

It has a path from the root of the tree to the fiber (at depth depth), containing the indices or keys.

{
  // ...

  depth: number,
  path: Key[],
  keys: (
    number |
    Map
  )[],
}

A continuation next is ordered after the mount or mounts. This allows data fences to work naturally: the run-time only ensures all preceding fibers have been run first. For this, I insert an extra index into the path, 0 or 1, to distinguish the two sub-trees.

If many fibers have a static mount (i.e. always 1 child), this would create paths with lots of useless zeroes. To avoid this, a single mount has the same path as its parent, only its depth is increased. Paths can still be compared element-wise, with depth as the tie breaker. This easily reduces typical path length by 70%.

This is enough for children without keys, which are spawned statically. Their order in the tree never changes after creation, they can only be updated in-place or unmounted.

But for children with a key, the expectation is that they persist even if their order changes. Their keys are just unsorted ids, and their order is stored in the fiber.order and fiber.lookup of the parent in question.

This is referenced in the fiber.keys array. It's a flattened list of pairs [i, fiber.lookup], meaning the key at index i in the path should be compared using fiber.lookup. To keep these keys references intact, fiber.lookup is mutable and always modified in-place when reconciling.

Memoization

If a Component function is wrapped in memo(...), it won't be re-rendered if its individual props haven't changed (pointer equality). This goes deeper than the run-time's own oldArgs !== newArgs check.

{
  // ...

  version?: number,
  memo?: number,

  runs?: number,
}

For this, memoized fibers keep a version around. They also store a memo which holds the last rendered version, and a run count runs for debugging:

The version is used as one of the memo dependencies, along with the names and values of the props. Hence a memo(...) cache can be busted just by incrementing fiber.version, even if the props didn't change. Versions roll over at 32-bit.

To actually do the memoization, it would be nice if you could just wrap the whole component in useMemo. It doesn't work in the React model because you can't call other hooks inside hooks. So I've brought back the mythical useYolo... An earlier incarnation of this allowed fiber.state scopes to be nested, but lacked a good purpose. The new useYolo is instead a useMemo you can nest. It effectively hot swaps the entire fiber.state array with a new one kept in one of the slots:

This is then the first hook inside fiber.state. If the memo succeeds, the yolo'd state is preserved without treating it as an early return. Otherwise the component runs normally. Yolo'ing as the first hook has a dedicated fast path but is otherwise a perfectly normal hook.

The purpose of fiber.memo is so the run-time can tell whether it rendered the same thing as before, and stop. It can just compare the two versions, leaving the specifics of memoization entirely up to the fiber component itself. For example, to handle a custom arePropsEqual function in memo(…).

I always use version numbers as opposed to isDirty flags, because it leaves a paper trail. This provides the same ergonomics for mutable data as for immutable data: you can store a reference to a previous value, and do an O(1) equality check to know whether it changed since you last accessed it.

Whenever you have a handle which you can't look inside, such as a pointer to GPU memory, it's especially useful to keep a version number on it, which you bump every time you write to it. It makes debugging so much easier.

Inlining

Built-in operators are resolved with a hand-coded routine post-render, rather than being "normal" components. Their component functions are just empty and there is a big dispatch with if statements. Each is tagged with a isLiveBuiltin: true.

If a built-in operator is an only child, it's usually resolved inline. No new mount is created, it's immediately applied as part of updating the parent fiber. The glue in between tends to be "kernel land"-style code anyway, it doesn't need a whole new fiber, and it's not implemented in terms of hooks. The only fiber state it has is the type (i.e. function) of the last rendered JSX.

There are several cases where it cannot inline, such as rendering one built-in inside another built-in, or rendering a built-in as part of an array. So each built-in can always be mounted independently if needed.

From an architectural point of view, inlining is just incidental complexity, but this significantly reduces fiber overhead and keeps the user-facing component tree much tidier. It introduces a few minor problems around cleanup, but those are caught and handled.

Live also has a morph operator. This lets you replace a mount with another component, without discarding any matching children or their state. The mount's own state is still discarded, but its f, args, bound and type are modified in-place. A normal render follows, which will reconcile the children.

This is implemented in morphFiberCall. It only works for plain vanilla components, not other built-ins. The reason to re-use the fiber rather than transplant the children is so that references in children remain unchanged, without having to rekey them.

In Live, I never do a full recursive traversal of any sub-tree, unless that traversal is incremental and memoized. This is a core property of the system. Deep recursion should happen in user-land.

Environment

Fibers have access to a shared environment, provided by their parent. This is created in user-land through built-in ops and accessed via hooks.

Context and captures
Gathers and yeets
Fences and suspend
Quotes and reconcilers
Unquote + quote

Context and captures

Live extends the classic React context:

{
  // ...

  context: {
    values: Map>,
    roots: Map,
  },
}

A LiveContext provides 1 value to N fibers. A LiveCapture collects N values into 1 fiber. Each is just an object created in user-land with makeContext / makeCapture, acting as a unique key. It can also hold a default value for a context.

The values map holds the current value of each context/capture. This is boxed inside a Ref as {current: value} so that nested sub-environments share values for inherited contexts.

The roots map points to the root fibers providing or capturing. This is used to allow useContext and useCapture to set up the right data dependency just-in-time. For a context, this points upstream in the tree, so to avoid a reverse reference, it's a number. For a capture, this points to a downstream continuation, i.e. the next of an ancestor, and can be a LiveFiber.

Normally children just share their parent's context. It's only when you or that Live builds a new, immutable copy of values and roots with a new context/capture added. Each context and capture persists for the lifetime of its sub-tree.

Captures build up a map incrementally inside the Ref while children are rendered, keyed by fiber. This is received in tree order after sorting:

 {
    ...
  }}
/>

You can also just write capture(context, children, then), FYI.

This is an await or yield in disguise, where the then closure is spiritually part of the originating component function. Therefor it doesn't need to be memoized. The state of the next fiber is preserved even if you pass a new function instance every time.

Unlike React-style render props, then props can use hooks, because they run on an independent next fiber called Resume(…). This fiber will be re-run when values changes, and can do so without re-running Capture itself.

A then prop can render new elements, building a chain of next fibers. This acts like a rewindable generator, where each Resume provides a place where the code can be re-entered, without having to explicitly rewind any state. This requires the data passed into each closure to be immutable.

The logic for providing or capturing is in provideFiber(fiber, ...) and captureFiber(fiber, ...). Unlike other built-ins, these are always mounted separately and are called at the start of a new fiber, not the end of previous one. Their children are then immediately reconciled by inlineFiberCall(fiber, calls).

Gathers and yeets

Live offers a true return, in the form of yeet(value) (aka {value}). This passes a value back to a parent.

These values are gathered in an incremental map-reduce along the tree, to a root that mounted a gathering operation. It's similar to a Capture, except it visits every parent along the way. It's the complement to tree expansion during rendering.

This works for any mapper and reducer function via . There is also an optimized code path for a simple array flatMap , as well as struct-of-arrays flatMap . It works just like a capture:

 {
    ...
  }}
/>


  ) => {
    ...
  }}
/>

Each fiber in a reduction has a fiber.yeeted structure, created at mount time. Like a context, this relation never changes for the lifetime of the component.

It acts as a persistent cache for a yeeted value of type A and its map-reduction reduced of type B:

{
  yeeted: {
    // Same as fiber (for inspecting)
    id: number,

    // Reduction cache at this fiber
    value?: A,
    reduced?: B,

    // Parent yeet cache
    parent?: FiberYeet,

    // Reduction root
    root: LiveFiber,

    // ...
  },
}

The last value yeeted by the fiber is kept so that all yeets are auto-memoized.

Each yeeted points to a parent. This is not the parent fiber but its fiber.yeeted. This is the parent reduction, which is downstream in terms of data dependency, not upstream. This forms a mirrored copy of the fiber tree and respects one-way data flow:

Again the linked root fiber (sink) is not an ancestor, but the next of an ancestor, created to receive the final reduced value.

If the reduced value is undefined, this signifies an empty cache. When a value is yeeted, parent caches are busted recursively towards the root, until an undefined is encountered. If a fiber mounts or unmounts children, it busts its reduction as well.

Fibers that yeet a value cannot also have children. This isn't a limitation because you can render a yeet beside other children, as just another mount, without changing the semantics. You can also render multiple yeets, but it's faster to just yeet a single list.

If you yeet undefined, this acts as a zero-cost signal: it does not affect the reduced values, but it will cause the reducing root fiber to be re-invoked. This is a tiny concession to imperative semantics, wildly useful.

This may seem very impure, but actually it's the opposite. With clean, functional data types, there is usually a "no-op" value that you could yeet: an empty array or dictionary, an empty function, and so on. You can always force-refresh a reduction without meaningfully changing the output, but it causes a lot of pointless cache invalidation in the process. Zero-cost signals are just an optimization.

When reducing a fiber that has a gathering next, it takes precedence over the fiber's own reduction: this is so that you can gather and reyeet in series, with the final reduction returned.

Fences and suspend

The specifics of a gathering operation are hidden behind a persistent emit and gather callback, derived from a classic map and reduce:

{
  yeeted: {
    // ...

    // Emit a value A yeeted from fiber
    emit: (fiber: LiveFiber, value: A) => void,

    // Gather a reduction B from fiber
    gather: (fiber: LiveFiber, self: boolean) => B,

    // Enclosing yeet scope
    scope?: FiberYeet,
  },
}

Gathering is done by the root reduction fiber, so gather is not strictly needed here. It's only exposed so you can mount a inside an existing reduction, without knowing its specifics. A fence will grab the intermediate reduction value at that point in the tree and pass it to user-land. It can then be reyeeted.

One such use is to mimic React Suspense using a special toxic SUSPEND symbol. It acts like a NaN, poisoning any reduction it's a part of. You can then fence off a sub-tree to contain the spill and substitute it with its previous value or a fallback.

In practice, gather will delegate to one of gatherFiberValues, multiGatherFiberValues or mapReduceFiberValues. Each will traverse the sub-tree, reuse any existing reduced values (stopping the recursion early), and fill in any undefineds via recursion. Their code is kinda gnarly, given that it's just map-reduce, but that's because they're hand-rolled to avoid useless allocations.

The self argument to gather is such an optimization, only true for the final user-visible reduction. This lets intermediate reductions be type unsafe, e.g. to avoid creating pointless 1 element arrays.

At a gathering root, the enclosing yeet scope is also kept. This is to cleanly unmount an inlined gather, by restoring the parent's yeeted.

Quotes and reconcilers

Live has a reconciler in reconcileFiberCalls, but it can also mount as an effect via mountFiberReconciler.

This is best understood by pretending this is React DOM. When you render a React tree which mixes with , React reconciles it, and extracts the HTML parts into a new tree:

=>

Each HTML element is implicitly quoted inside React. They're only "activated" when they become real on the right. The ones on the left are only stand-ins.

That's also what a Live does. It mounts a normal tree of children, but it simultaneously mounts an independent second tree, under its next mount.

If you render this:

...

You will get:

It adds a quote environment to the fiber:

{
  // ...
  quote: {
    // Reconciler fiber
    root: number,

    // Fiber in current sub-tree
    from: number,

    // Fiber in other sub-tree
    to: LiveFiber,

    // Enclosing reconciler scope
    scope?: FiberQuote,
  }
}

When you render a ..., whatever's inside ends up mounted on the to fiber.

Quoted fibers will have a similar fiber.unquote environment. If they render an ..., the children are mounted back on the quoting fiber.

Each time, the quoting or unquoting fiber becomes the new to fiber on the other side.

The idea is that you can use this to embed one set of components inside another as a DSL, and have the run-time sort them out.

This all happens in mountFiberQuote(…) and mountFiberUnquote(…). It uses reconcileFiberCall(…) (singular). This is an incremental version of reconcileFiberCalls(…) (plural) which only does one mount/unmount at a time. The fiber id of the quote or unquote is used as the key of the quoted or unquoted fiber.

const Queue = ({children}) => (
  reconcile(
    quote(
      gather(
        unquote(children),
        (v: any[]) =>
          
      ))));

The quote and unquote environments are separate so that reconcilers can be nested: at any given place, you can unquote 'up' or quote 'down'. Because you can put multiple s inside one , it can also fork. The internal non-JSX dialect is very Lisp-esque, you can rap together some pretty neat structures with this.

Because quote are mounted and unmounted incrementally, there is a data fence Reconcile(…) after each (un)quote. This is where the final set is re-ordered if needed.

The data structure actually violates my own rule about no-reverse links. After you , the fibers in the second tree have a link to the quoting fiber which spawned them. And the same applies in the other direction after you .

The excuse is ergonomics. I could break the dependency by creating a separate sub-fiber of to serve as the unquoting point, and vice versa. But this would bloat both trees with extra fibers, just for purity's sake. It already has unavoidable extra data fences, so this matters.

At a reconciling root, the enclosing quote scope is added to fiber.quote, just like in yeeted, again for clean unmounting of inlined reconcilers.

Unquote-quote

There is an important caveat here. There are two ways you could implement this.

One way is that ... is a Very Special built-in, which does something unusual: it would traverse the children tree it was given, and go look for ...s inside. It would have to do so recursively, to partition the quoted and unquoted JSX. Then it would have to graft the quoted JSX to a previous quote, while grafting the unquoted parts to itself as mounts. This is the React DOM mechanism, obfuscated. This is also how quoting works in Lisp: it switches between evaluation mode and AST mode.

I have two objections. The first is that this goes against the whole idea of evaluating one component incrementally at a time. It wouldn't be working with one set of mounts on a local fiber: it would be building up args inside one big nested JSX expression. JSX is not supposed to be a mutable data structure, you're supposed to construct it immutably from the inside out, not from the outside in.

The second is that this would only work for 'balanced' ...... pairs appearing in the same JSX expression. If you render:

...then you couldn't have render a and render an and have it work. It wouldn't be composable as two separate portals.

The only way for the quotes/unquotes to be revealed in such a scenario is to actually render the components. This means you have to actively run the second tree as it's being reconciled, same as the first. There is no separate update + commit like in React DOM.

This might seem pointless, because all this does is thread the data flow into a zigzag between the two trees, knitting the quote/unquote points together. The render order is the same as if and weren't there. The path and depth of quoted fibers reveals this, which is needed to re-render them in the right order later.

The key difference is that for all other purposes, those fibers do live in that spot. Each tree has its own stack of nested contexts. Reductions operate on the two separate trees, producing two different, independent values. This is just "hygienic macros" in disguise, I think.

Use.GPU's presentation system uses a reconciler to wrap the layout system, adding slide transforms and a custom compositor. This is sandwiched in-between it and the normal renderer.

A plain declarative tree of markup can be expanded into:

I also use a reconciler to produce the WebGPU command queue. This is shared for an entire app and sits at the top. The second tree just contains quoted yeets. I use zero-cost signals here too, to let data sources signal that their contents have changed. There is a short-hand for .

Note that you cannot connect the reduction of tree 1 to the root of tree 2: does not have a then prop. It doesn't make sense because the next fiber gets its children from elsewhere, and it would create a rendering cycle if you tried anyway.

If you need to spawn a whole second tree based on a first, that's what a normal gather already does. You can use it to e.g. gather lambdas that return memoized JSX. This effectively acts as a two-phase commit.

The Use.GPU layout system does this repeatedly, with several trees + gathers in a row. It involves constraints both from the inside out and the outside in, so you need both tree directions. The output is UI shapes, which need to be batched together for efficiency and turned into a data-driven draw call.

The Run-Time

With all the pieces laid out, I can now connect it all together.

Before render() can render the first fiber, it initializes a very minimal run-time. So this section will be kinda dry.

This is accessed through fiber.host and exposes a handful of APIs:

a queue of pending state changes
a priority queue for traversal
a fiber-to-fiber dependency tracker
a resource disposal tracker
a stack slicer for reasons

State changes

When a setState is called, the state change is added to a simple queue as a lambda. This allows simultaneous state changes to be batched together. For this, the host exposes a schedule and a flush method.

{
  // ...

  host: {
    schedule: (fiber: LiveFiber, task?: () => boolean | void) => void,
    flush: () => void,

    // ... 
  }

This comes from makeActionScheduler(…). It wraps a native scheduling function (e.g. queueMicrotask) and an onFlush callback:

const makeActionScheduler = (
  schedule: (flush: ArrowFunction) => void,
  onFlush: (fibers: LiveFiber[]) => void,
) => {
  // ...
  return {schedule, flush};
}

The callback is set up by render(…). It will take the affected fibers and call renderFibers(…) (plural) on them.

The returned schedule(…) will trigger a flush, so flush() is only called directly for sync execution, to stay within the same render cycle.

Traversal

The host keeps a priority queue (makePriorityQueue) of pending fibers to render, in tree order:

{
  // ...

  host: {
    // ...

    visit: (fiber: LiveFiber) => void,
    unvisit: (fiber: LiveFiber) => void,
    pop: () => LiveFiber | null,
    peek: () => LiveFiber | null,
  }
}

renderFibers(…) first adds the fibers to the queue by calling host.visit(fiber).

A loop in renderFibers(…) will then call host.peek() and host.pop() until the queue is empty. It will call renderFiber(…) and updateFiber(…) on each, which will call host.unvisit(fiber) in the process. This may also cause other fibers to be added to the queue.

The priority queue is a singly linked list of fibers. It allows fast appends at the start or end. To speed up insertions in the middle, it remembers the last inserted fiber. This massively speeds up the very common case where multiple fibers are inserted into an existing queue in tree order. Otherwise it just does a linear scan.

It also has a set of all the fibers in the queue, so it can quickly do presence checks. This means visit and unvisit can safely be called blindly, which happens a lot.

// Re-insert all fibers that descend from fiber
const reorder = (fiber: LiveFiber) => {
  const {path} = fiber;
  const list: LiveFiber[] = [];
  let q = queue;
  let qp = null;

  while (q) {
    if (compareFibers(fiber, q.fiber) >= 0) {
      hint = qp = q;
      q = q.next;
      continue;
    }
    if (isSubNode(fiber, q.fiber)) {
      list.push(q.fiber);
      if (qp) {
        qp.next = q.next;
        q = q.next;
      }
      else {
        pop();
        q = q.next;
      }
    }
    break;
  }

  if (list.length) {
    list.sort(compareFibers);
    list.forEach(insert);
  }
};

There is an edge case here though. If a fiber re-orders its keyed children, the compareFibers fiber order of those children changes. But, because of long-range dependencies, it's possible for those children to already be queued. This might mean a later cousin node could render before an earlier one, though never a child before a parent or ancestor.

In principle this is not an issue because the output—the reductions being gathered—will be re-reduced in new order at a fence. From a pure data-flow perspective, this is fine: it would even be inevitable in a multi-threaded version. In practice, it feels off if code runs out of order for no reason, especially in a dev environment.

So I added optional queue re-ordering, on by default. This can be done pretty easily because the affected fibers can be found by comparing paths, and still form a single group inside the otherwise ordered queue: scan until you find a fiber underneath the parent, then pop off fibers until you exit the subtree. Then just reinsert them.

This really reminds me of shader warp reordering in raytracing GPUs btw.

Dependencies

To support contexts and captures, the host has a long-range dependency tracker (makeDependencyTracker):

{
  host: {
    // ...

    depend: (fiber: LiveFiber, root: number) => boolean,
    undepend: (fiber: LiveFiber, root: number) => void,
    traceDown: (fiber: LiveFiber) => LiveFiber[],
    traceUp: (fiber: LiveFiber) => number[],
  }
};

It holds two maps internally, each mapping fibers to fibers, for precedents and descendants respectively. These are mapped as LiveFiber -> id and id -> LiveFiber, once again following the one-way rule. i.e. It gives you real fibers if you traceDown, but only fiber IDs if you traceUp. The latter is only used for highlighting in the inspector.

The depend and undepend methods are called by useContext and useCapture to set up a dependency this way. When a fiber is rendered (and did not memoize), bustFiberDeps(…) is called. This will invoke traceDown(…) and call host.visit(…) on each dependent fiber. It will also call bustFiberMemo(…) to bump their fiber.version (if present).

Yeets could be tracked the same way, but this is unnecessary because yeeted already references the root statically. It's a different kind of cache being busted too (yeeted.reduced) and you need to bust all intermediate reductions along the way. So there is a dedicated visitYeetRoot(…) and bustFiberYeet(…) instead.

Yeets are actually quite tricky to manage because there are two directions of traversal here. A yeet must bust all the caches towards the root. Once those caches are busted, another yeet shouldn't traverse them again until filled back in. It stops when it encounters undefined. Second, when the root gathers up the reduced values from the other end, it should be able to safely accept any defined yeeted.reduced as being correctly cached, and stop as well.

The invariant to be maintained is that a trail of yeeted.reduced === undefined should always lead all the way back to the root. New fibers have an undefined reduction, and old fibers may be unmounted, so these operations also bust caches. But if there is no change in yeets, you don't need to reduce again. So visitYeetRoot is not actually called until and unless a new yeet is rendered or an old yeet is removed.

Managing the lifecycle of this is simple, because there is only one place that triggers a re-reduction to fill it back in: the yeet root. Which is behind a data fence. It will always be called after the last cache has been busted, but before any other code that might need it. It's impossible to squeeze anything in between.

It took a while to learn to lean into this style of thinking. Cache invalidation becomes a lot easier when you can partition your program into "before cache" and "after cache". Compared to the earliest versions of Live, the how and why of busting caches is now all very sensible. You use immutable data, or you pass a mutable ref and a signal. It always works.

Resources

The useResource hook lets a user register a disposal function for later. useContext and useCapture also need to dispose of their dependency when unmounted. For this, there is a disposal tracker (makeDisposalTracker) which effectively acts as an onFiberDispose event listener:

{
  host: {
    // ...

    // Add/remove listener
    track: (fiber: LiveFiber, task: Task) => void,
    untrack: (fiber: LiveFiber, task: Task) => void,

    // Trigger listener
    dispose: (fiber: LiveFiber) => void,
  }
}

Disposal tasks are triggered by host.dispose(fiber), which is called by disposeFiber(fiber). The latter will also set fiber.bound to undefined so the fiber can no longer be called.

A useResource may change during a fiber's lifetime. Rather than repeatedly untrack/track a new disposal function each time, I store a persistent resource tag in the hook state. This holds a reference to the latest disposal function. Old resources are explicitly disposed of before new ones are created, ensuring there is no overlap.

Stack Slicing

A React-like is a recursive tree evaluator. A naive implementation would use function recursion directly, using the native CPU stack. This is what Live 0.0.1 did. But the run-time has overhead, with its own function calls sandwiched in between (e.g. updateFiber, reconcileFiberCalls, flushMount). This creates towering stacks. It also cannot be time-sliced, because all the state is on the stack.

In React this is instead implemented with a flat work queue, so it only calls into one component at a time. A profiler shows it repeatedly calling performUnitOfWork, beginWork, completeWork in a clean, shallow trace.

Live could do the same with its fiber priority queue. But the rendering order is always just tree order. It's only interrupted and truncated by memoization. So the vast majority of the time you are adding a fiber to the front of the queue only to immediately pop it off again.

The queue is a linked list so it creates allocation overhead. This massively complicates what should just be a native function call.

Live says "¿Por qué no los dos?" and instead has a stack slicing mechanism (makeStackSlicer). It will use the stack, but stop recursion after N levels, where N is a global knob that current sits at 20. The left-overs are enqueued.

This way, mainly fibers pinged by state changes and long-range dependency end up in the queue. This includes fenced continuations, which must always be called indirectly. If a fiber is in the queue, but ends up being rendered in a parent's recursion, it's immediately removed.

{
  host: {
    // ...

    depth: (depth: number) => void,
    slice: (depth: number) => boolean,
  },
}

When renderFibers gets a fiber from the queue, it calls host.depth(fiber.depth) to calibrate the slicer. Every time a mount is flushed, it will then call host.slice(mount.depth) to check if it should be sliced off. If so, it calls host.visit(…) to add it to the queue, but otherwise it just calls renderFiber / updateFiber directly. The exception is when there is a data fence, when the queue is always used.

Here too there is a strict mode, on by default, which ensures that once the stack has been sliced, no further sync evaluation can take place higher up the stack.

One-phase commit

Time to rewind.

A Live app consists of a tree of such fiber objects, all exactly the same shape, just with different state and environments inside. It's rendered in a purely one-way data flow, with only a minor asterisk on that statement.

The host is the only thing coordinating, because it's the only thing that closes the cycle when state changes. This triggers an ongoing traversal, during which it only tells fibers which dependencies to ping when they render. Everything else emerges from the composition of components.

Hopefully you can appreciate that Live is not actually Cowboy React, but something else and deliberate. It has its own invariants it's enforcing, and its own guarantees you can rely on. Like React, it has a strict and a non-strict mode that is meaningfully different, though the strictness is not about it nagging you, but about how anally the run-time will reproduce your exact declared intent.

It does not offer any way to roll back partial state changes once made, unlike React. This idempotency model of rendering is good when you need to accommodate mutable references in a reactive system. Immediate mode APIs tend to use these, and Live is designed to be plugged in to those.

The nice thing about Live is that it's often meaningful to suspend a partially rendered sub-tree without rewinding it back to the old state, because its state doesn't represent anything directly, like HTML does. It's merely reduced into a value, and you can simply re-use the old value until it has unsuspended. There is no need to hold on to all the old state of the components that produced it. If the value being gathered is made of lambdas, you have your two phases: the commit consists of calling them once you have a full set.

In Use.GPU, you work with memory on the GPU, which you allocate once and then reuse by reference. The entire idea is that the view can re-render without re-rendering all components that produced it, the same way that a browser can re-render a page by animating only the CSS transforms. So I have to be all-in on mutability there, because updated transforms have to travel through the layout system without re-triggering it.

I also use immediate mode for the CPU-side interactions, because I've found it makes UI controllers 2-3x less complicated. One interesting aspect here is that the difference between capturing and bubbling events, i.e. outside-in or inside-out, is just before-fence and after-fence.

Live is also not a React alternative: it plays very nicely with React. You can nest one inside the other and barely notice. The Live inspector is written in React, because I needed it to work even if Live was broken. It can memoize effectively in React because Live is memoized. Therefor everything it shows you is live, including any state you open up.

The inspector is functionality-first so I throw purity out the window and just focus on UX and performance. It installs a host.__ping callback so it can receive fiber pings from the run-time whenever they re-render. The run-time calls this via pingFiber in the right spots. Individual fibers can make themselves inspectable by adding undocumented/private props to fiber.__inspect. There are some helper hooks to make this prettier but that's all. You can make any component inspector-highlightable by having it re-render itself when highlighted.

* * *

Writing this post was a fun endeavour, prompting me to reconsider some assumptions from early on. I also fixed a few things that just sounded bad when said out loud. You know how it is.

I removed some lingering unnecessary reverse fiber references. I was aware they weren't load bearing, but that's still different from not having them at all. The only one I haven't mentioned is the capture keys, which are a fiber so that they can be directly compared. In theory it only needs the id, path, depth, keys, and I could package those up separately, but it would just create extra objects, so the jury allows it.

Live can model programs shaped like a one-way data flow, and generates one-way data itself. There are some interesting correspondences here.

Live keep state entirely in fiber objects, while fibers run entirely on fiber.state. A fiber object is just a fixed dictionary of properties, always the same shape, just like fiber.state is for a component's lifetime.
Children arrays without keys must be fixed-length and fixed-order (a fragment), but may have nulls. This is very similar to how no-hooks will skip over a missing spot in the fiber.state array and zero out the hook, so as to preserve hook order.
Live hot-swaps a global currentFiber pointer to switch fibers, and useYolo hot-swaps a fiber's own local state to switch hook scopes.
Memoizing a component can be implemented as a nested useMemo. Bumping the fiber version is really a bespoke setState which is resolved during next render.

The lines between fiber, fiber.state and fiber.mounts are actually pretty damn blurry.

A lot of mechanisms appear twice, once in a non-incremental form and once in an incremental form. Iteration turns into mounting, sequences turn into fences, and objects get chopped up into fine bits of cached state, either counted or with keys. The difference between hooks and a gather of unkeyed components gets muddy. It's about eagerness and dependency.

If Live is react-react, then a self-hosted live-live is hiding in there somewhere. Create a root fiber, give it empty state, off you go. Inlining would be a lot harder though, and you wouldn't be able to hand-roll fast paths as easily, which is always the problem in FP. For a JS implementation it would be very dumb, especially when you know that the JS VM already manages object prototypes incrementally, mounting one prop at a time.

I do like the sound of an incremental Lisp where everything is made out of flat state lists instead of endless pointer chasing. If it had the same structure as Live, it might only have one genuine linked list driving it all: the priority queue, which holds elements pointing to elements. The rest would be elements pointing to linear arrays, a data structure that silicon caches love. A data-oriented Lisp maybe? You could even call it an incremental GPU. Worth pondering.

What Live could really use is a WASM pony with better stack access and threading. But in the mean time, it already works.

The source code for the embedded examples can be found on GitLab.

If your browser can do WebGPU (desktop only for now), you can load up any of the Use.GPU examples and inspect them.

Teardown Frame Teardown

2023-01-24T00:00:00+01:00

Rendering analysis

In this post I'll do a "one frame" breakdown of Tuxedo Labs' indie game Teardown.

The game is unique for having a voxel-driven engine, which provides a fully destructible environment. It embraces this boon, by giving the player a multitude of tools that gleefully alter and obliterate the setting, to create shortcuts between spaces. This enables a kind of gameplay rarely seen: where the environment is not just a passive backdrop, but a fully interactive part of the experience.

This is highly notable. In today's landscape of Unity/Unreal-powered gaming titles, it illustrates a very old maxim: that novel gameplay is primarily the result of having a dedicated game engine to enable that play. In doing so, it manages to evoke a feeling that is both incredibly retro and yet unquestionably futuristic. But it's more than that: it shows that the path graphics development has been walking, in search of ever more realistic graphics, can be bent and subverted entirely. It creates something wholly unique and delightful, without seeking true photorealism.

It utilizes raytracing, to present global illumination, with real-time reflections, and physically convincing smoke and fire. It not only has ordinary vehicles, like cars and vans, but also industrial machinery like bulldozers and cranes, as well as an assortment of weapons and explosives, to bring the entire experience together. Nevertheless, it does not require the latest GPU hardware: it is an "ordinary" OpenGL application. So how does it do it?

The classic way to analyze this would be to just fire up RenderDoc and present an analytical breakdown of every buffer rendered along the way. But that would be doing the game a disservice. Not only is it much more fun to try and figure it out on your own, the game actually gives you all the tools you need to do so. It would be negligent not to embrace it. RenderDoc is only part 2.

Teardown is, in my view, a love letter to decades of real-time games and graphics. It features a few winks and nods to those in the know, but on the whole its innovations have gone sadly unremarked. I'm disappointed we haven't seen an explosion of voxel-based games since. Maybe this will change that.

I will also indulge in some backseat graphics coding. This is not to say that any of this stuff is easy. Rather, I've been writing my own .vox renderer in Use.GPU, which draws heavily from Teardown's example.

Hunting for Clues

The Voxels

Let's start with the most obvious thing: the voxels. At a casual glance, every Teardown level is made out of a 3D grid. The various buildings and objects you encounter are made out of tiny cubes, all the same size, like this spiral glass staircase in the Villa Gordon:

However, closer inspection shows something curious. Behind the mansion is a ramp—placed there for obvious reasons—which does not conform to the strict voxel grid at all: it has diagonal surfaces. More detailed investigation of the levels will reveal various places where this is done.

The various dynamic objects, be they crates, vehicles or just debris, also don't conform to the voxel grid: they can be moved around freely. Therefore this engine is not strictly voxel-grid-based: rather, it utilizes cube-based voxels inside a freeform 3D environment.

There is another highly salient clue here, in the form of the game's map screen. When you press M, the game zooms out to an overhead view. Not only is it able to display all these voxels from a first person view, it is able to show an entire level's worth of voxels, and transition smoothly to-and-fro, without any noticeable pop-in. Even on a vertical, labyrinthine 3D level like Quilez Security.

This implies that however this is implemented, the renderer largely does not care how many voxels are on screen in total. It somehow utilizes a rendering technique that is independent of the overall complexity of the environment, and simply focuses on what is needed to show whatever is currently in view.

The Lighting

The next big thing to notice is the lighting in this game, which appears to be fully real-time.

Despite the chunky environment, shadows are cast convincingly. This casually includes features that are still a challenge in real-time graphics, such as lights which cast from a line, area or volume rather than a single point. But just how granular is it?

There are places where, to a knowing eye, this engine performs dark magic. Like the lighting around this elevator:

Not only is it rendering real-time shadows, it is doing so for area-lights in the floor and ceiling. This means a simple 2D shadow-map, rendering depth from a single vantage point, is insufficient. It is also unimaginable that it would do so for every single light-emitting voxel, yet at first sight, it does.

This keeps working even if you pick up a portable light and wave it around in front of you. Even if the environment has been radically altered, the renderer casts shadows convincingly, with no noticeable lag. The only tell is the all-pervasive grain: clearly, it is using noise-techniques to deal with gradients and sampling.

The Reflections

It's more than just lights. The spiral staircase from before is in fact reflected clearly in the surrounding glass. This is consistent regardless of whether the staircase is itself visible:

This is where the first limitations start to pop up. If you examine the sliding doors in the same area, you will notice something curious: while the doors slide smoothly, their reflections do not:

There are two interesting artifacts in this area:

The first is that glossy reflections of straight lines have a jagged appearance. The second is that you can sometimes catch moving reflections splitting before catching up, as if part of the reflection is not updated in sync with the rest.

The game also has actual mirrors:

Here we can begin to dissect the tricks. Most obvious is that some of the reflections are screen-space: mirrors will only reflect objects in full-color if they are already on screen. If you turn away, the reflection becomes dark and murky. But this is not an iron rule: if you blast a hole in a wall, it will still be correctly reflected, no matter the angle. It is only the light cast onto the floor through that hole which fails to be reflected under all circumstances.

This clip illustrates another subtle feature: up close, the voxels aren't just hard-edged cubes. Rather, they appear somewhat like plastic lego bricks, with rounded edges. These edges reflect the surrounding light smoothly, which should dispel the notion that what we are seeing is mere simple vector geometry.

There is a large glass surface nearby which we can use to reveal more. If we hold an object above a mirror, the reflection does not move smoothly. Rather, it is visibly discretized into cubes, only moving on a rough global grid, regardless of its own angle.

This explains the sliding doors. In order to reflect objects, the renderer utilizes some kind of coarse voxel map, which can only accommodate a finite resolution.

There is only one objectionable artifact which we can readily observe: whenever looking through a transparent surface like a window, and moving sideways, the otherwise smooth image suddenly becomes a jittery mess. Ghost trails appear behind the direction of motion:

This suggests that however the renderer is dealing with transparency, it is a poor fit for the rest of its bag of tricks. There is in fact a very concise explanation for this, which we'll get to.

Still, this is all broadly black magic. According to the commonly publicized techniques, this should simply not be possible, not on hardware incapable of accelerated raytracing.

The Solids

Time for the meat-and-potatoes: a careful breakdown of a single frame. It is difficult to find one golden frame that includes every single thing the renderer does. Nevertheless, the following is mostly representative:

Captures were done at 1080p, with uncompressed PNGs linked. Alpha channels are separated where relevant. The inline images have been adjusted for optimal viewing, while the linked PNGs are left pristine unless absolutely necessary.

G-buffer

If we fire up RenderDoc, a few things will become immediately apparent. Teardown uses a typical deferred G-buffer, with an unusual 5 render targets, plus the usual Z-buffer, laid out as follows:

Albedo (RT0)

Normal (RT1)

Material (RT2 RGB)

Emissive (RT2 Alpha)

Velocity + Water (RT3)

Linear Depth (RT4)

Draw calls

Every draw call renders exactly 36 vertices, i.e. 12 triangles, making up a box. But these are not voxels: each object in Teardown is rendered by drawing the shape's bounding box. All the individual cubes you see don't really exist as geometry. Rather, each object is stored as a 3D volume texture, with one byte per voxel.

Thus, the primary rendering stream consists of one draw call per object, each with a unique 3D texture bound. Each indexes into a 256-entry palette consisting of both color and material properties. The green car looks like this:

This only covers the chassis, as the wheels can move independently, handled as 4 separate objects.

The color and material palettes for all the objects are packed into one large texture each:

Having reflectivity separate from metallicness might seem odd, as they tend to be highly correlated. Some materials are reflective without being metallic, such as water and wet surfaces. Some materials are metallic without being fully reflective, perhaps to simulate dirt.

You may notice a lot of yellow in the palette: this is because of the game's yellow spray can, detailed in this blog post. It requires a blend of each color towards yellow, as it is applied smoothly. This is in fact the main benefit of this approach: as each object is just a 3D "sprite", it is easy and quick to remove individual voxels, or re-paint them for e.g. vehicle skid marks or bomb scorching.

When objects are blasted apart, the engine will separate them into disconnected chunks, and make a new individual object for each. This can be repeated indefinitely.

Rendering proceeds front-to-back, as follows:

The shader for this is tightly optimized and quite simple. It will raytrace through each volume, starting at the boundary, until it hits a solid voxel. It will repeatedly take a step in the X, Y or Z direction, whichever is less.

To speed up this process, the renderer uses 2 additional MIP maps, at half and quarter size, which allow it to skip over 2×2×2 or 4×4×4 empty voxels at a time. It will jump up and down MIP levels as it encounters solid or empty areas. Because MIP map sizes are divided by 2 and then rounded down, all object dimensions must be a multiple of 4, to avoid misalignment. This means many objects have a border of empty voxels around them.

Curiously, Teardown centers each object inside its expanded volume, which means the extra border tends to be 1 or 2 voxels on each side, rather than 2 or 3 on one. This means its voxel-skipping mechanism cannot work as effectively. Potentially this issue could be avoided entirely by not using native MIP maps at all, and instead just using 3 separately sized 3D textures, with dimensions that are rounded up instead of down.

As G-buffers can only handle solid geometry, the renderer applies a 50% screen-door effect to transparent surfaces. This explains the ghosting artifacts earlier, as it confuses the anti-aliasing logic that follows. To render transparency other than 50%, e.g. to ghost objects in third-person view, it uses a blue-noise texture with thresholding.

This might seem strange, as the typical way to render transparency in a deferred renderer is to apply it separately, at the very end. Teardown cannot easily do this however, as transparent voxels are mixed freely among solid ones.

Another thing worth noting here: because each raytraced pixel sits somewhere inside its bounding box volume, the final Z-depth of each pixel cannot be known ahead of time. The pixel shader must calculate it as part of the raytracing, writing it out using the gl_FragDepth API. As the GPU does not assume that this depth is actually deeper than the initial depth, the native Z-buffer cannot do any early Z rejection. This would mean that even 100% obscured objects would have to be raytraced fully, only to be entirely discarded.

To avoid this, Teardown has its own early-Z mechanism, which uses the additional depth target in the RT4 slot. Before it starts raytracing a pixel, it checks to see if the front of the volume is already obscured. However, GPUs forbid reading and writing from the same render target, to avoid race conditions. So Teardown must periodically pause and copy the current RT4 state to another buffer. For the scene above, there are 8 such "checkpoints". This means that objects part of the same batch will always be raytraced in full, even if one of them is in front of the other.

Certain modern GPU APIs have extensions to signal that gl_FragDepth will always be deeper than the initial Z. If Teardown could make use of this, it could avoid this extra work. In fact, we can wonder why GPU makers didn't do this from the start, because pushing pixels closer to the screen, out of a bounding surface, doesn't really make sense: they would disappear at glancing angles.

Once all the voxel objects are drawn, there are two more draws. First the various cables, ropes and wires, drawn using a single call for the entire level. This is the only "classic" geometry in the entire scene, e.g. the masts and tethers on the boats here:

Second, the various smoke particles. These are simulated on the CPU, so there are no real clues as to how. They appear to billow quite realistically. This presentation by the creator offers some possible clues as to what it might be doing.

Here too, the renderer makes eager use of blue-noise based screen door transparency. It will also alternate smoke pixels between forward-facing and backward-facing in the normal buffer, to achieve a faux light-scattering effect.

Finally, the drawing finishes by adding the map-wide water surface. While the water is generally murky, objects near the surface do refract correctly. For this, the albedo buffer is first copied to a new buffer (again to avoid race conditions), and then used as a source for the refraction shader. Water pixels are marked in the unused blue channel of the motion vector buffer.

The game also has dynamic foam ripples on the water, when swimming or driving a boat. For this, the last N ripples are stored and evaluated in the same water shader, expanding and fading out over time:

While all draw calls are finished, Teardown still has one trick up its sleeve here. To smooth off the sharp edges of the voxel cubes... it simply blurs the final normal buffer. This is applied only to voxels that are close to the camera, and is limited to nearby pixels that have almost the same depth. In the view above, the only close-by voxels are those of the player's first-person weapon, so those are the only ones getting smoothed.

Puddles and Volumes

Next up is the game's rain puddle effect. This is applied using a screen-wide shader, which uses perlin-like noise to create splotches in the material buffer. This applies on any upward facing surface, using the normal buffer, altering the roughness channel (zero roughness is stored as 1.0).

This wouldn't be remarkable except for one detail: how the renderer avoids drawing puddles indoors and under awnings. This is where the big secret appears for the first time. Remember that coarse voxel map whose existence we inferred earlier?

Yeah it turns out, Teardown will actually maintain a volumetric shadow map of the entire play area at all times. For the Marina level, it's stored in a 1752×100×1500 3D texture, a 262MB chonker. Here's a scrub through part of it:

But wait, there's more. Unlike the regular voxel objects, this map is actually 1-bit. Each of its 8-bit texels stores 2×2×2 voxels. So it's actually a 3504×200×3000 voxel volume. Like the other 3D textures, this has 2 additional MIP levels to accelerate raytracing, but it has that additional "-1" MIP level inside the bits, which requires a custom loop to trace through it.

This map is updated using many small texture uploads in the middle of the render. So it's actually CPU-rendered. Presumably this happens on a dedicated thread, which might explain the desynchronization we saw before. The visible explosion in the frame created many small moving fragments, so there are ~50 individual updates here, multiplied by 3 for the 3 MIP levels.

Because the puddle effect is all procedural, they disappear locally when you hold something over them, and appear on the object instead, which is kinda hilarious:

To know where to start tracing in world space, each pixel's position is reconstructed from the linear depth buffer. This is a pattern that reoccurs in everything that follows. A 16-bit depth buffer isn't very accurate, but it's good enough, and it doesn't use much bandwidth.

Unlike the object voxel tracing, the volumetric shadow map is always traced approximately. Rather than doing precise X/Y/Z steps, it will just skip ahead a certain distance until it finds itself inside a solid voxel. This works okay, but can miss voxels entirely. This is the reason why many reflections have a jagged appearance.

There are in fact two tracing modes coded: sparse and "super sparse". The latter will only do a few steps in each MIP level, starting at -1, before moving to the next coarser one. This effectively does a very rough version of voxel cone tracing, and is the mode used for puddle visibility.

The Lighting

On to the next part: how the renderer actually pulls off its eerily good lighting.

Contrary to first impressions, it is not the voxels themselves that are casting the light: emissive voxels must be accompanied by a manually placed light to illuminate their surroundings. When destroyed, this light is then removed, and the emissive voxels are turned off as a group.

As is typical in a deferred renderer, each source of light is drawn individually into a light buffer, affecting only the pixels within the light's volume. For this, the renderer has various meshes which match each light type's shape. These are procedurally generated, so that e.g. each spotlight's mesh has the right cone angle, and each line light is enclosed by a capsule with the right length and radius:

The volumetric shadow map is once again the main star, helped by a generous amount of blue noise and stochastic sampling. This uses Martin Roberts' quasi-random sequences to produce time-varying 1D, 2D and 3D noise from a static blue noise texture. The light itself is also split up, separated into diffuse, specular and volumetric irradiance components.

Diffuse light

It begins with ambient sky light:

This looks absolutely lovely, with large scale occlusion thanks to volumetric ray tracing in "super sparse" mode. This uses cosine-weighted sampling in a hemisphere around each point, with 2 samples per pixel. To render small scale occlusion, it will first do a single screen-space step one voxel-size out, using the linear depth buffer.

Notice that the tree tops at the very back do not have any large scale occlusion: they extend beyond the volume of the shadow map, which is vertically limited.

Next up are the individual lights. These are not point lights, they have an area or volume. This includes support for "screen" lights, which display an image, used in other scenes. To handle this, each lit pixel picks a random point somewhere inside the light's extent. The shadows are handled with a raytrace between the surface and the chosen light point, with one ray per pixel.

As this is irradiance, it does not yet factor in the color of each surface. This allows for aggressive denoising, which is the next step. This uses a spiral-shaped blur filter around each point, weighted by distance. The weights are also attenuated by both depth and normal: the depth of each sample must lie within the tangent plane of the center, and its normal must match.

This blurred result is immediately blended with the result of the previous frame, which is shifted using the motion vectors rendered for each pixel.

Finally, the blurred diffuse irradiance is multiplied with the non-blurred albedo (i.e. color) of every surface, to produce outgoing diffuse radiance:

Specular light

As the experiment with the mirror showed, the renderer doesn't really distinguish between glossy specular reflections and "real" mirror reflections. Both are handled as part of the same process, which uses the diffuse light buffer as an input. This is drawn using a single full-screen render.

As we saw, there are both screen-space and global reflections. Unconventionally, the screen-space reflections are also traced using the volumetric shadow map, rather than the normal 2D Z-buffer. Glossyness is handled using... you guessed it... stochastic sampling based on blue noise. The rougher the surface, the more randomly the direction of the reflected ray is altered. Voxels with zero reflectivity are skipped entirely, creating obvious black spots.

If a voxel was hit, its position is converted to its 2D screen coordinates, and its color is used, but only if it sits at the right depth. This must also fade out to black at the screen edges. If no hit could be found within a certain distance, it instead uses a cube environment map, attenuated by fog, here a deep red.

The alpha channel is used to store the final reflectivity of each surface, factoring in fresnel effects and viewing angle:

This is then all denoised similar to the diffuse lighting, but without an explicit blur. It's blended only with the previous reprojected specular result, blending more slowly the glossier—and noisier—the surface is:

Volumetric light

Volumetric lights are the most expensive, hence this part is rendered on a buffer half the width and height. It uses the same light meshes as the diffuse lighting, only with a very different shader.

For each pixel, the light position is again jittered stochastically. It will then raytrace through a volume around that position, to determine where the light contribution along the ray starts and ends. Finally it steps between those two points, accumulating in-scattered light along the way. As is common, it will also jitter the steps along the ray.

This is expensive because at every step, it must trace a secondary ray towards the light, to determine volumetric shadowing. To cut down on the number of extra rays, this is only done if the potential light contribution is actually large enough to make a visible difference. To optimize the trace and keep the ray short, it will trace towards the closest point on the light, rather than the jittered point.

The resulting buffer is still the noisiest of them all, so once again, there is a blurring and temporal blending step. This uses the same spiral filter as the diffuse lighting, but lacks the extra weights of planar depth and normal. Instead, the depth buffer is only used to prevent the fog from spilling out in front of nearby occluders:

Compositing

All the different light contributions are now added together, with depth fog and a skybox added to complete it. Interestingly, while it looks like a height-based fog which thins out by elevation, it is actually just based on vertical view direction. A clever trick, and a fair amount cheaper.

The Post-Processing

At this point we have a physically correct, lit, linear image. So now all that remains is to mess it up.

There are several effects in use:

Motion blur
Depth of field
Temporal anti-aliasing
Bloom
Lens distortion
Lens dirt
Vehicle outline

Several of these are optional.

Motion Blur

If turned on, it is applied here using the saved motion vectors. This uses a variable number of steps per pixel, up to 10. Unfortunately it's extremely subtle and difficult to get a good capture of, so I don't have a picture.

Depth of Field

This effect requires a dedicated capture to properly show, as it is hardly noticeable on long-distance scenes. I will use this shot, where the DOF is extremely shallow because I'm holding the gate in the foreground:

First, the renderer needs to know the average depth in the center of the image. To do so, it samples the linear depth buffer in the middle, again with a spiral blur filter. It's applied twice, one with a large radius and one small. This is done by rendering directly to a 1x1 size image, which is also blended over time with the previous result. This produces the average focal distance.

Next it will render a copy of the image, with the alpha channel (float) proportional to the amount of blur needed (the circle of confusion). This is essentially any depth past the focal point, though it will bias the center of the image to remain more in focus:

The renderer will now perform a 2x2 downscale, followed by a blue-noise jittered upscale. This is done even if DOF is turned off, which suggests the real purpose here is to even out the image and remove the effects of screen door transparency.

Actual DOF will now follow, rendered again to a half-sized image, to cut down on the cost of the large blur radius. This again uses a spiral blur filter. This will use the alpha channel to mask out any foreground samples, to prevent them from bleeding onto the background. Such samples are instead replaced with the average color so far, a trick documented here.

Now it combines the sharp-but-jittered image with the blurry DOF image, using the alpha channel as the blending mask.

Temporal anti-aliasing

At this point the image is smoothed with a variant of temporal anti-aliasing (TXAA), to further eliminate any left-over jaggies and noise. This is now the fourth time that temporal reprojection and blending was applied in one frame: this is no surprise, given how much stochastic sampling was used to produce the image in the first place.

To help with anti-aliasing, as is usual, the view point itself is jittered by a tiny amount every frame, so that even if the camera doesn't move, it gets varied samples to average out.

Exposure and bloom

For proper display, the renderer will determine the appropriate exposure level to use. For this, it needs to know the average light value in the image.

First it will render a 256x256 grayscale image. It then progressively downsamples this by 2, until it reaches 1x1. This is then blended over time with the previous result to smooth out the changes.

Using the exposure value, it then produces a bloom image: this is a heavily thresholded copy of the original, where all but the brightest areas are black. This image is half the size of the original.

This half-size bloom image is then further downscaled and blurred more aggressively, by 50% each time, down to ~8px. At each step it does a separate horizontal and vertical blur, achieving a 2D gaussian filter:

The resulting stack of images is then composed together to produce a soft glow with a very large effective radius, here exaggerated for effect:

Final composition

Almost done: the DOF'd image is combined with bloom, multiplied with the desired exposure, and then gamma corrected. If lens distortion is enabled, it is applied too. It's pretty subtle, and here it is just turned off. Lens dirt is missing too: it is only used if the sun is visible, and then it's just a static overlay.

All that remains is to draw the UI on top. For this it uses a signed-distance-field font atlas, and draws the crosshairs icon in the middle:

Bonus Shots

To conclude, a few bonus images.

Ghosting

While in third person vehicle view, the renderer will ghost any objects in front of it. As a testament to the power of temporal smoothing, compare the noisy "before" image with the final "after" result:

To render the white outline, the vehicle is rendered to an offscreen buffer in solid white, and then a basic edge detection filter is applied.

Mall Overdraw

The Evertides Mall map is one of the larger levels in the game, featuring a ton of verticality, walls, and hence overdraw. It is here that the custom early-Z mechanism really pays off:

That concludes this deep dive. Hope you enjoyed it as much as I did making it.

More reading/viewing:

Another Teardown teardown.
Video stream with an in-engine walkthrough.

Use.GPU Goes Trad

2023-01-14T00:00:00+01:00

Old is new again

I've released a new version of Use.GPU, my experimental reactive/declarative WebGPU framework, now at version 0.8.

My goal is to make GPU rendering easier and more sane. I do this by applying the lessons and patterns learned from the React world, and basically turning them all up to 11, sometimes 12. This is done via my own Live run-time, which is like a martian React on steroids.

The previous 0.7 release was themed around compute, where I applied my shader linker to a few challenging use cases. It hopefully made it clear that Use.GPU is very good at things that traditional engines are kinda bad at.

In comparison, 0.8 will seem banal, because the theme was to fill the gaps and bring some traditional conveniences, like:

Scenes and nodes with matrices
Meshes with instancing
Shadow maps for lighting
Visibility culling for geometry

These were absent mostly because I didn't really need them, and they didn't seem like they'd push the architecture in novel directions. That's changed however, because there's one major refactor underpinning it all: the previously standard forward renderer is now entirely swappable. There is a shiny deferred-style renderer to showcase this ability, where lights are rendered separately, using a g-buffer with stenciling.

This new rendering pipeline is entirely component-driven, and fully dogfooded. There is no core renderer per-se: the way draws are realized depends purely on the components being used. It effectively realizes that most elusive of graphics grails, which established engines have had difficulty delivering on: a data-driven, scriptable render pipeline, that mortals can hopefully use.

Root of the App

Deep inside the tree

I've spent countless words on Use.GPU's effect-based architecture in prior posts, which I won't recap. Rather, I'll just summarize the one big trick: it's structured entirely as if it needs to produce only 1 frame. Then in order to be interactive, and animate, it selectively rewinds parts of the program, and reactively re-runs them. If it sounds crazy, that's because it is. And yet it works.

So the key point isn't the feature list above, but rather, how it does so. It continues to prove that this way of coding can pay off big. It has all the benefits of immediate-mode UI, with none of the downsides, and tons of extensibility. And there are some surprises along the way.

Real Reactivity

You might think: isn't this a solved problem? There are plenty of JS 3D engines. Hasn't React-Three-Fiber (R3F) shown how to make that declarative? And aren't these just web versions of what native engines like Unreal and Unity already do well, and better?

My answer is no, but it might not be clear why. Let me give an example from my current job.

My client needs a specialized 3D editing tool. In gaming terms you might think of it as a level design tool, except the levels are real buildings. The details don't really matter, only that they need a custom 3D editing UI. I've been using Three.js and R3F for it, because that's what works today and what other people know.

Three.js might seem like a great choice for the job: it has a 3D scene, editing controls and so on. But, my scene is not the source of truth, it's the output of a process. The actual source of truth being live-edited is another tree that sits before it. So I need to solve a two-way synchronization problem between both. This requires careful reasoning about state changes.

Change handlers in Three.js and R3F

Sadly, the way Three.js responds to changes is ill-defined. As is common, its objects have "dirty" flags. They are resolved and cleared when the scene is re-rendered. But this is not an iron rule: many methods do trigger a local refresh on the spot. Worse, certain properties have an invisible setter, which immediately triggers a "change" event when you assign a new value to it. This also causes derived state to update and cascade, and will be broadcast to any code that might be listening.

The coding principle applied here is "better safe than sorry". Each of these triggers was only added to fix a particular stale data bug, so their effects are incomplete, creating two big problems. Problem 1 is a mix of old and new state... but problem 2 is you can only make it worse, by adding even more pre-emptive partial updates, sprinkled around everywhere.

These "change" events are oblivious to the reason for the change, and this is actually key: if a change was caused by a user interaction, the rest of the app needs to respond to it. But if the change was computed from something else, then you explicitly don't want anything earlier to respond to it, because it would just create an endless cycle, which you need to detect and halt.

R3F introduces a declarative model on top, but can't fundamentally fix this. In fact it adds a few new problems of it own in trying to bridge the two worlds. The details are boring and too specific to dig into, but let's just say it took me a while to realize why my objects were moving around whenever I did a hot-reload, because the second render is not at all the same as the first.

Yet this is exactly what one-way data flow in reactive frameworks is meant to address. It creates a fundamental distinction between the two directions: cascading down (derived state) vs cascading up (user interactions). Instead of routing both through the same mutable objects, it creates a one-way reverse-path too, triggered only in specific circumstances, so that cause and effect are always unambigious, and cycles are impossible.

Three.js is good for classic 3D. But if you're trying to build applications with R3F it feels fragile, like there's something fundamentally wrong with it, that they'll never be able to fix. The big lesson is this: for code to be truly declarative, changes must not be allowed to travel backwards. They must also be resolved consistently, in one big pass. Otherwise it leads to endless bug whack-a-mole.

What reactivity really does is take cache invalidation, said to be the hardest problem, and turn the problem itself into the solution. You never invalidate a cache without immediately refreshing it, and you make that the sole way to cause anything to happen at all. Crazy, and yet it works.

When I tell people this, they often say "well, it might work well for your domain, but it couldn't possibly work for mine." And then I show them how to do it.

Figuring out which way your cube map points:
just gfx programmer things.

And... Scene

One of the cool consequences of this architecture is that even the most traditional of constructs can suddenly bring neat, Lispy surprises.

The new scene system is a great example. Contrary to most other engines, it's actually entirely optional. But that's not the surprising part.

Normally you just have a tree where nodes contain other nodes, which eventually contain meshes, like this:

It's a way to compose matrices: they cascade and combine from parent to child. The 3D engine is then built to efficiently traverse and render this structure.

But what it ultimately does is define a transform for every mesh: a function vec3 => vec3 that maps one vertex position to another. So if you squint, is really just a marker for a place where you stop composing matrices and pass a composed matrix transform to something else.

Hence Use.GPU's equivalent, , could actually be called . What it does is escape from the scene model, mirroring the Lisp pattern of quote-unquote. A chain of parents is just a domain-specific-language (DSL) to produce a TransformContext with a shader function, one that applies a single combined matrix transform.

In turn, just becomes a combination of and a , i.e. triangle geometry that uses the transform. It all composes cleanly.

So if you just put meshes inside the scene tree, it works exactly like a traditional 3D engine. But if you put, say, a polar coordinate plot in there from the plot package, which is not a matrix transform, inside a primitive, then it will still compose cleanly. It will combine the transforms into a new shader function, and apply it to whatever's inside. You can unscene and scene repeatedly, because it's just exiting and re-entering a DSL.

In 3D this is complicated by the fact that tangents and normals transform differently from vertices. But, this was already addressed in 0.7 by pairing each transform with a differential function, and using shader fu to compose it. So this all just keeps working.

Another neat thing is how this works with instancing. There is now an component, which is exactly like , except that it gives you a dynamic to copy/paste via a render prop:

 (<>
     
     
   )
 />

As you might expect, it will gather the transforms of all instances, stuff all of them into a single buffer, and then render them all with a single draw call. The neat part is this: you can still wrap individual components in as many levels as you like. Because all does is pass its matrix transform back up the tree to the parent it belongs to.

This is done using Live captures, which are React context providers in reverse. It doesn't violate one-way data flow, because captures will only run after all the children have finished running. Captures already worked previously, the semantics were just extended and formalized in 0.8 to allow this to compose with other reduction mechanisms.

But there's more. Not only can you wrap in , you can also wrap either of them in , which is Use.GPU's keyframe animator, entirely unchanged since 0.7:

 (

    
      
        {seq(20).map(i => (
          
            
          
        ))}
      
    

  )}
/>

The scene DSL and the instancing DSL and the animation DSL all compose directly, with nothing up my sleeve. Each of these are still just ordinary functions. On the inside they look like constructors with all the other code missing. There is zero special casing going on here, and none of them are explicitly walking the tree to reach each other. The only one doing that is the reactive run-time... and all it does is enforce one-way data flow by calling functions, gathering results and busting caches in tree order. Because a capture is a long-distance yeet.

Personally I find this pretty magical. It's not as efficient as a hand-rolled scene graph with instancing and built-in animation, but in terms of coding lift it's literally O(0) instead of OO. I needed to add zero lines of code to any of the 3 sub-systems, in order to combine them into one spinning whole.

The entire scene + instancing package clocks in at about 300 lines and that's including empties and generous formatting. I don't need to architect the rest of the framework around a base Object3D class that everything has to inherit from either, which is a-ok in my book.

This architecture will never reach Unreal or Unity levels of hundreds of thousands of draw calls, but then, it's not meant to do that. It embraces the idea of a unique shader for every draw call, and then walks that back if and when it's useful. The prototype map package for example does this, and can draw a whole 3D vector globe in 2 draw calls: fill and stroke. Adding labels would make it 3. And it's not static: it's doing the usual quad-tree of LOD'd mercator map tiles.

Multi-Pass

Next up, the modular renderer passes. Architecturally and reactively-speaking, there isn't much here. This was mainly an exercise in slicing apart the existing glue.

The key thing to grok is that in Use.GPU, the component does not correspond to a literal GPU render pass. Rather, it's a virtual, logical render pass. It represents all the work needed to draw some geometry to a screen or off-screen buffer, in its fully shaded form. This seems like a useful abstraction, because it cleanly separates the nitty gritty rendering from later compositing (e.g. overlays).

For the forward renderer, this means first rendering a few shadow maps, and possibly rendering a picking buffer for interaction. For the deferred renderer, this involves rendering the g-buffer, stencils, lights, and so on.

My goal was for the toggle between the two to be as simple as replacing a with a ... but also to have both of those be flexible enough that you could potentially add on, say, SSAO, or bloom, or a Space Engine-style black hole, as an afterthought. And each can have its own renderer, rather than shoehorning everything into one big engine.

Neatly, that's mostly what it is now. The basic principle rests on three pillars.

Deferred rendering

First, there are a few different rendering modes, by default solid vs shaded vs ui. These define what kind of information is needed at every pixel, i.e. the classic varying attributes. But they have no opinion on where the data comes from or what it's used for: that's defined by the geometry layer being rendered. It renders a draw call, which it gives e.g. a getVertex and getFragment shader function with a particular signature for that mode. These functions are not complete shaders, just the core functions, which are linked into a stub. There are a few standard 'tropes' used here, not just these two.

Second, there are a few different rendering buckets, like opaque, transparent, shadow, picking and debug. These are used to group draws into. Different GPU render passes then pick and choose from that. opaque and transparent are drawn to the screen, while shadow is drawn repeatedly into all the shadow maps. This includes sorting front-to-back and back-to-front, as well as culling.

Finally, there's the renderer itself (forward vs deferred), and its associated pass components (e.g. , , , and so on). The renderer decides how to translate a particular "mode + bucket" combination into a concrete draw call, by lowering it into render components (e.g. ). The pass components decide which buffer to actually render stuff to, and how. So the renderer itself doesn't actually render, it merely spawns and delegates to other components that do.

The forward path works mostly the same as before, only the culling and shadow maps are new... but it's now split up into all its logical parts. And I verified this design by adding the deferred renderer, which is a lot more convoluted, but still needs to do some forward rendering.

It works like a treat, and they use all the same lighting shaders. You can extend any of the 3 pillars just by replacing or injecting a new component. And you don't need to fork either renderer to do so: you can just pick and choose à la carte by selectively overriding or extending its "mode + bucket" mapping table, or injecting a new actual render pass.

To really put a bow on top, I upgraded the Use.GPU inspector so that you can directly view any render target in a RenderDoc-like way. This will auto-apply useful colorization shaders, e.g. to visualize depth. This is itself implemented as a Use.GPU Live canvas, sitting inside the HTML-based inspector, sitting on top of Live, which makes this a Live-in-React-in-Live scenario.

For shits and giggles, you can also inspect the inspector's canvas, recursively, ad infinitum. Useful for debugging the debugger:

There are still of course some limitations. If, for example, you wanted to add a new light type, or add support for volumetric lights, you'd have to reach in more deeply to make that happen: the resulting code needs to be tightly optimized, because it runs per pixel and per light. But if you do, you're still going to be able to reuse 90% of the existing components as-is.

I do want a more comprehensive set of light types (e.g. line and area), I just didn't get around to it. Same goes for motion vectors and TXAA. However, with WebGPU finally nearing public release, maybe people will actually help out. Hint hint.

Port of a Reaction Diffusion system by Felix Woitzel.

A Clusterfuck of Textures

A final thing to talk about is 2D image effects and how they work. Or rather, the way they don't work. It seems simple, but in practice it's kind of ludicrous.

If you'd asked me a year ago, I'd have thought a very clean, composable post-effects pipeline was entirely within reach, with a unified API that mostly papered over the difference between compute and render. Given that I can link together all sorts of crazy shaders, this ought to be doable.

Well, I did upgrade the built-in fullscreen conveniences a bit, so that it's now easier to make e.g. a reaction diffusion sim like this (full code):

The devil here is in the details. If you want to process 2D images on a GPU, you basically have several choices:

Use a compute shader or render shader?
Which pixel format do you use?
Are you sampling one flat image or a MIP pyramid of pre-scaled copies?
Are you sampling color images, or depth/stencil images?
Use hardware filtering or emulate filtering in software?

The big problem is that there is no single approach that can handle all cases. Each has its own quirks. To give you a concrete example: if you wrote a float16 reaction-diffusion sim, and then decided you actually needed float32, you'd probably have to rewrite all your shaders, because float16 is always renderable and hardware filterable, but float32 is not.

Use.GPU has a pretty nice set of Compute/Stage/Kernel components, which are elegant on the outside; but they require you to write pretty gnarly shader code to actually use them. On the other side are the RenderToTexture/Pass/FullScreen components which conceptually do the same thing, and have much nicer shader code, but which don't work for a lot of scenarios. All of them can be broken by doing something seemingly obvious, that just isn't natively supported and difficult to check ahead of time.

Even just producing universal code to display any possible texture type on screen becomes a careful exercise in code-generation. If you're familiar with the history of these features, it's understandable how it got to this point, but nevertheless, the resulting API is abysmal to use, and is a never-ending show of surprise pitfalls.

Here's a non-exhaustive list of quirks:

Render shaders are the simplest, but can only be used to write those pixel formats that are "renderable".
Compute shaders must be dispatched in groups of N, even if the image size is not a multiple of N. You have to manually trim off the excess threads.
Hardware filtering only works on some formats, and some filtering functions only work in render shaders.
Hardware filtering (fast) uses [0..1] UV float coordinates, software emulation in a shader (slow) uses [0..N] XY uint coordinates.
Reading and writing from/to the same render texture is not allowed, you have to bounce between a read and write buffer.
Depth+stencil images have their own types and have an additional notion of "aspect" to select one or both.
Certain texture functions cannot be called conditionally, i.e. inside an if.
Copying from one texture to another doesn't work between certain formats and aspects.

My strategy so far has been to try and stick to native WGSL semantics as much as possible, meaning the shader code you do write gets inserted pretty much verbatim. But if you wanted to paper over all these differences, you'd have to invent a whole new shader dialect. This is a huge effort which I have not bothered with. As a result, compute vs render pretty much have to remain separate universes, even when they're doing 95% the same thing. There is also no easy way to explain to users which one they ought to use.

While it's unrealistic to expect GPU makers to support every possible format and feature on a fast path, there is little reason why they can't just pretend a little bit more. If a texture format isn't hardware filterable, somebody will have to emulate that in a shader, so it may as well be done once, properly, instead of in hundreds of other hand-rolled implementations.

If there is one overarching theme in this space, it's that limitations and quirks continue to be offloaded directly onto application developers, often with barely a shrug. To make matters worse, the "next gen" APIs like Metal and Vulkan, which WebGPU inherits from, do not improve this. They want you to become an expert at their own kind of busywork, instead of getting on with your own.

I can understand if the WebGPU designers have looked at the resulting venn-diagram of poorly supported features, and have had to pick their battles. But there's a few absurdities hidden in the API, and many non-obvious limitations, where the API spec suggests you can do a lot more than you actually can. It's a very mixed bag all things considered, and in certain parts, plain retarded. Ask me about minimum binding size. No wait, don't.

* * *

Most promising is that as Use.GPU grows to do more, I'm not touching extremely large parts of it. This to me is the sign of good architecture. I also continue to focus on specific use cases to validate it all, because that's the only way I know how to do it well.

There are some very interesting goodies lurking inside too. To give you an example... that R3F client app I mentioned at the start. It leverages Use.GPU's state package to implement a universal undo/redo system in 130 lines. A JS patcher is very handy to wrangle the WebGPU API's deep argument style, but it can do a lot more.

One more thing. As a side project to get away from the core architecting, I made a viewer for levels for Dark Engine games, i.e. Thief 1 (1998), System Shock 2 (1999) and Thief 2 (2000). I want to answer a question I've had for ages: how would those light-driven games have looked, if we'd had better lighting tech back then? So it actually relights the levels. It's still a work in progress, and so far I've only done slow-ass offline CPU bakes with it, using a BSP-tree based raytracer. But it works like a treat.

I basically don't have to do any heavy lifting if I want to draw something, be it normal geometry, in-place data/debug viz, or zoomable overlays. Integrating old-school lightmaps takes about 10 lines of shader code and 10 lines of JS, and the rest is off-the-shelf Use.GPU. I can spend my cycles working on the problem I actually want to be working on. That to me is the real value proposition here.

I've noticed that when you present people with refined code that is extremely simple, they often just do not believe you, or even themselves. They assume that the only way you're able to juggle many different concerns is through galaxy brain integration gymnastics. It's really quite funny. They go looking for the complexity, and they can't find it, so they assume they're missing something really vital. The realization that it's simply not there can take a very long time to sink in.

Visit usegpu.live for more and to view demos in a WebGPU capable browser.

Get in Zoomer, We're Saving React

2022-09-23T00:00:00+02:00

Looking back, and forward

Lately, it seems popular to talk smack about React. Both the orange and red site recently spilled the tea about how mean Uncle React has been, and how much nicer some of these next-gen frameworks supposedly are.

I find this bizarre for two reasons:

Most next-gen React spin-offs strike me as universally regressive, not progressive.
The few exceptions don't seem to have any actual complex, battle-hardened apps to point to, to prove their worth.

Now, before you close this tab thinking "ugh, not another tech rant", let me first remind you that a post is not a rant simply because it makes you angry. Next, let me point out that I've been writing code for 32 years. You should listen to your elders, for they know shit and have seen shit. I've also spent a fair amount of time teaching people how to get really good at React, so I know the pitfalls.

You may also notice that not even venerated 3rd party developers are particularly excited about React 18 and its concurrent mode, let alone the unwashed masses. This should tell you the React team itself is suffering a bit of an existential crisis. The framework that started as just the V in MVC can't seem to figure out where it wants to go.

So this is not the praise of a React fanboy. I built my own clone of the core run-time, and it was exactly because its limitations were grating, despite the potential there. I added numerous extensions, and then used it to tackle one of the most challenging domains around: GPU rendering. If one person can pull that off, that means there's actually something real going on here. It ties into genuine productivity boons, and results in robust, quality software, which seems to come together as if by magic.

To put it differently: when Figma recently announced they were acquired for $20B by Adobe, we all intuitively understood just how much of an exceptional black swan event that was. We know that 99.99…% of software companies are simply incapable of pulling off something similar. But do we know why?

Where we came from

If you're fresh off the boat today, React can seem like a fixture. The now-ancient saying "Nobody ever got fired for choosing IBM" may as well be updated for React. Nevertheless, when it appeared on the scene, it was wild: you're going to put the HTML and CSS in the JavaScript? Are you mad?

Yes, it was mad, and like Galileo, the people behind React were completely right, for they integrated some of the best ideas out there. They were so right that Angular pretty much threw in the towel on its abysmal two-way binding system and redesigned it to adopt a similar one-way data flow. They were so right that React also dethroned the previous fixture in web land, jQuery, as the diff-based Virtual DOM obsoleted almost all of the trickery people were using to beat the old DOM into shape. The fact that you could use e.g. componentDidUpdate to integrate legacy code was just a conceit, a transition mechanism that spelled out its own obsolescence as soon as you got comfortable with it.

Many competing frameworks acted like this wasn't so, and stuck to the old practice of using templates. They missed the obvious lesson here: every templating language inevitably turns into a very poor programming language over time. It will grow to add conditionals, loops, scopes, macros, and other things that are much nicer in actual code. A templating language is mainly an inner platform effect. It targets a weird imagined archetype of someone who isn't allergic to code, but somehow isn't smart enough to work in a genuine programming language. In my experience, this archetype doesn't actually exist. Designers don't want to code at all, while coders want native expressiveness. It's just that simple.

Others looked at the Virtual DOM and only saw inefficiency. They wanted to add a compiler, so they could reduce the DOM manipulations to an absolute minimum, smugly pointing to benchmarks. This was often just premature optimization, because it failed to recognize the power of dynamic languages: that they can easily reconfigure their behavior at run-time, in response to data, in a turing-complete way. This is essential for composing grown up apps that enable user freedom. The use case that most of the React spin-offs seem to be targeting is not apps but web sites. They are paving cow paths that are well-worn with some minor conveniences, while never transcending them.

var RouterMixin = {
  contextTypes: {
    router: React.PropTypes.object.isRequired
  },

  // The mixin provides a method so that
  // components don't have to use the
  // context API directly.
  push: function(path) {
    this.context.router.push(path)
  }
};

var Link = React.createClass({
  mixins: [RouterMixin],

  handleClick: function(e) {
    e.stopPropagation();

    // This method is defined in RouterMixin.
    this.push(this.props.to);
  },

  render: function() {
    return (
      
        {this.props.children}
      
    );
  }
});

module.exports = Link;

React circa 2016

It's also easy to forget that React itself had many architectural revisions. When old farts like me got in on it, components still had mix-ins, because genuine classes were a distant dream in JS. When ES classes showed up, React adopted those, but it didn't fundamentally change the way you structured your code. It wasn't until React 16.8 (!) that we got hooks, which completely changed the way you approached it. This reduced the necessary boilerplate by an order of magnitude, and triggered a cambrian explosion of custom hook development. That is, at least until the buzz wore off, and only the good ideas remained standing.

Along the way, third party React libraries have followed a similar path. Solutions like Redux appeared, got popular, and then were ditched as people realized the boilerplate just wasn't worth it. It was a necessary lesson to learn.

This legacy of evolution is also where the bulk of React's perceived bloat sits today. As browsers evolved, as libraries got smarter, and as more people ditched OO, much of it is now indeed unnecessary for many use cases. But while you can tweak React with a leaner-and-meaner reimplementation, this doesn't fundamentally alter the value proposition, or invalidate the existing appeal of it.

The fact remains that before React showed up, nobody really had any idea how to make concepts like URL routers, or drag and drop, or UI design systems, truly sing, not on the web. We had a lot of individual pieces, but nothing solid to puzzle them together with. Nevertheless, there is actual undiscovered country beyond, and that's really what this post is about: looking back and looking forward.

If there's one solid criticism I've heard of React, it's this: that no two React codebases ever look alike. This is generally true, but it's somewhat similar to another old adage: that happy families all look alike, but every broken family is broken in its own particular way. The reason bad React codebases are bad is because the people who code it have no idea what they're supposed to be doing. Without a model of how to reason about their code in a structured way, they just keep adding on hack upon hack, until it's better to throw the entire thing away and start from scratch. This is no different from any other codebase made up as they go along, React or not.

Where React came from is easy to explain, but difficult to grok: it's the solution that Facebook arrived at, in order to make their army of junior developers build a reliable front-end, that could be used by millions. There is an enormous amount of hard-earned experience encoded in its architecture today. Often though, it can be hard to sort the wheat from the chaff. If you stubbornly stick to what feels familiar and easy, you may never understand this. And if you never build anything other than a SaaS-with-forms, you never will.

I won't rehash the specifics of e.g. useEffect here, but rather, drop in a trickier question: what if the problem people have with useEffect + DOM Events isn't the fault of hooks at all, but is actually the fault of the DOM?

I only mention it because when I grafted an immediate-mode style interaction model onto my React clone instead, I discovered that complex gesture controllers suddenly became 2-3x shorter. What's more, declaring data dependencies that "violate the rules of React" wasn't an anti-pattern at all: it was actually key to the entire thing. So when I hear that people are proudly trying to replace dependencies with magic signals, I just shake my head and look elsewhere.

Which makes me wonder… why is nobody else doing things like this? Immediate mode UI isn't new, not by a long shot. And it's hardly the only sticking point.

Mac OS X Leopard - 2007

Where we actually came from

Here's another thing you may not understand: just how good old desktop software truly was.

The gold standard here is Mac OS X, circa 2008. It was right before the iPhone, when Apple was still uniquely focused on making its desktop the slickest, most accessible platform around. It was a time when sites like Ars Technica still published real content, and John Siracusa would lovingly post multi-page breakdowns of every new release, obsessing over every detail for years on end. Just imagine: tech journalists actually knowing the ins-and-outs of how the sausage was made, as opposed to copy/pasting advertorials. It was awesome.

This was supported by a blooming 3rd party app ecosystem, before anyone had heard of an App Store. It resulted in some genuine marvels, which fit seamlessly into the design principles of the platform. For example, Adium, a universal instant messenger, which made other open-source offerings seem clunky and downright cringe. Or Growl, a universal notification system that paired seamlessly with it. It's difficult to imagine this not being standard in every OS now, but Mac enthusiasts had it years before anyone else.

The monopolistic Apple of today can't hold a candle to the extended Apple cinematic universe from before. I still often refer to the Apple Human Interface Guidelines from that era, rather than the more "updated" versions of today, which have slowly but surely thrown their own wisdom in the trash.

The first section of three, Application Design Fundamentals, has almost nothing to do with Macs specifically. You can just tell from the chapter titles:

The Design Process
Characteristics of Great Software
Human Interface Design
Prioritizing Design Decisions

Like another favorite, The Design of Everyday Things, it approaches software first and foremost as tools designed for people to use. The specific choices made in app design can be the difference between something that's a joy to use and something that's resented and constantly fought against.

So what exactly did we lose? It's quite simple: by moving software into the cloud and turning them into web-based SaaS offerings, many of the basic affordances that used to be standard have gotten watered down or removed entirely. Here are some examples:

Menus let you cross over empty space and other menu items, instead of strictly enforcing hover rectangles.

You can drag and drop the file icon from a document's titlebar e.g. to upload it, instead of having to go look for it again.

Holding keyboard modifiers like CTRL or ALT is reflected instantly in menus, and used to make power-user features discoverable-yet-unobtrusive.

And here are some more:

You can browse years of documents, emails, … and instantly draft new ones. Fully off-line, with zero lag.
You can sort any table by any column, and it will remember prior keys to produce a stable sort for identical values.
Undo/redo is standard and expected, even when moving entire directories around in the Finder.
Copy/pasting rich content is normal, and entirely equivalent to dragging and dropping it.
When you rename or move a file that you're editing, its window instantly reflects the new name and location.
You can also drag a file into an "open file" dialog, to select it there.
When downloading a file, the partial download has a progress bar on the icon. It can be double clicked to resume, or even copied to another machine.

It's always amusing to me to watch a power user switch to a Mac late in life, because much of their early complaints stem from not realizing there are far more obvious ways to do what they've trained themselves to do in a cumbersome way.

On almost every platform, PDFs are just awful to use. Whereas out-of-the-box on a Mac, you can annotate them to your heart's content, or drag pages from one PDF to another to recompose it. You can also sign them with a signature read from your webcam, for those of us who still know what pens are for. This is what happens when you tell companies like Adobe to utterly stuff it and just show them how it's supposed to be done, instead of waiting for their approval. The productivity benefits were enormous.

As an aside, if all of this seems quaint or positively boomeresque, here's a tip: forcing yourself to slow down and work with information directly, with your hands, manipulating objects physical or virtual, instead of offloading it all to a cloud… this is not an anti-pattern. Neither is genuine note taking on a piece of paper. You should try it sometime.

At the time, many supposed software experts scoffed at Apple, deriding their products as mere expensive toys differentiated purely by "marketing". But this is the same company that seamlessly transitioned its entire stack from PowerPC, to x86, to x64, and eventually ARM, with most users remaining blissfully unaware this ever took place.

This is what the pinnacle of our craft can actually look like.

Apple didn't just knock it out of the park when it came to the OS or the overall UI: they also shipped powerful first-party apps like iMovie and Keynote, which made competing offerings look positively shabby. Steve Jobs used them for his own keynotes, arguably the best in the business.

Similarly, what set the iPhone apart was not just its touch interface, but that they actually ported a mature media and document stack to mobile wholesale. At that time, the "mobile web" was a complete and utter joke, and it would take Android years to catch up, whether it was video or music, or basic stuff like calendar invites and contacts.

It has nothing to do with marketing. Indeed, while many companies have since emulated and perfected their own Apple-style pitch, almost no-one manages to get away from that tell-tale "enterprise" feel. They don't know or care how their users actually want to use their products: the people in charge don't have the first clue about the fundamentals of product design. They just like shiny things when they see them.

iMovie - 2010

The Reactive Enterprise

What does any of this have to do with React? Well it's very simple. Mac OS X was the first OS that could actually seriously claim to be reactive.

The standard which virtually everyone emulated back then was Windows. And in Windows, the norm—which mostly remains to this day—is that when you query information, that information is fetched once, and never updated. The user was just supposed to know that in order to see it update, they had to manually refresh it, either by bumping a selection back and forth, or by closing and reopening a dialog.

Windows 95

The same applied to preferences: in Windows land, the established pattern was to present a user with a set of buttons, the triad of Ok, Cancel and Apply. This is awful, and here's why. If you click Ok, you are committing to a choice you haven't yet had the chance to see the implications of. If you click Cancel, you are completely discarding everything you did, without ever trying it out. If you click Apply, it's the same as pressing Ok, just the window stays open. None of the 3 buttons let you interact confidently, or easily try changes one by one, reinforcing the idea that it's the user's fault for being "bad at computers" if it doesn't do what they expect, or they don't know how to back out.

The bold Mac solution was that toggling a preference should take effect immediately. Even if that choice affects the entire desktop, such as changing the UI theme. So if that's not what you wanted, you simply clicked again to undo it right away. Macs were reactive, while Windows was transactional. The main reason it worked this way was because most programmers had no clue how to effectively make their software respond to arbitrary changes, and Microsoft couldn't go a few years without coming up with yet another ill-conceived UI framework.

This divide has mostly remained, with the only notable change being that on mobile devices, both iOS and Android tend to embrace the reactive model. However, given that much of the software used is made partially or wholly out of web views, this is a promise that is often violated and rarely seen as an inviolable constraint. It's just a nice-to-have. Furthermore, while it has become easier to display reactive information, the crucial second half of the equation—interaction—remains mostly neglected, also by design.

Tacoma Narrows bridge collapse (1940)

Hyatt Regency walkway collapse (1981)

I'm going to be cheeky and say if there's anyone who should take the blame for this, it's back-end engineers and the technology choices they continue to make. The very notion of "back-end" is a fallacy: it implies that one can produce a useful, working system, without ever having to talk to end-users.

Just imagine how alien this concept would be to an engineer before the software revolution happened: it'd be like suggesting you build a bridge without ever having to think about where it sits or who drives over it, because that's just "installation" and "surfacing". In civil engineering, catastrophes are rare, and each is a cautionary tale, never to be repeated: the loss of life was often visceral and brutal. But in software, we embraced never learning such lessons.

A specific evil here is the legacy of SQL and the associated practices, which fragments and normalizes data into rigid tables. As a result, the effect of any change is difficult to predict, and virtually impossible to reliably undo or synchronize after the fact.

This is also the fault of "enterprise", in a very direct sense: SQL databases and transactions are mainly designed to model business processes. They evolved to model bureaucratic workflows in actual enterprises, with a clear hierarchy of command, a need to maintain an official set of records, with the ability for auditing and oversight.

However, such classic enterprises were of course still run by people, by individuals. The bulk of the work they did was done offline, producing documents, spreadsheets and other materials through direct interaction and iteration. The bureaucracy was a means to an end, it wasn't the sole activity. The idea of an organization or country run entirely on bureaucracy was the stuff people made satirical movies about.

And yet, many jobs now follow exactly this template. The activity is entirely coordinated and routed through specific SaaS apps, either off-the-shelf or bespoke, which strictly limit the available actions. They only contain watered down mockeries of classic desktop concepts such as files and folders, direct manipulation of data, and parallel off-line workstreams. They have little to no affordances for drafts, iteration or errors. They are mainly designed to appeal to management, not the riff-raff.

The promise of adopting such software is that everything will run more smoothly, and that oversight becomes effortless thanks to a multitude of metrics and paper trails. The reality is that you often replace tasks that ordinary, capable employees could do themselves, with a cumbersome and restrictive process. Information becomes harder to find, mistakes are more difficult to correct, and the normal activity of doing your job is replaced with endless form filling, box-ticking and notification chasing. There is a reason nobody likes JIRA, and this is it.

What's more, by adopting SaaS, companies put themselves at the mercy of someone else's development process. When dealing with an unanticipated scenario, you often simply can't work around it with the tools given, by design. It doesn't matter how smart or self-reliant the employees are: the software forces them to be stupid, and the only solution is to pay the vendor and wait 3 months or more.

For some reason, everyone has agreed that this is the way forward. It's insane.

Oracle Cloud with AI Bullshit

Circling Back

Despite all its embedded architectural wisdom, this is a flaw that React shares: it was never meant to enable user freedom. Indeed, the very concept of Facebook precludes it, arguably the world's biggest lock-in SaaS. The interactions that are allowed there are exactly like any other SaaS: GET and POST to a monolithic back-end, which enforces rigid processes.

As an app developer, if you want to add robust undo/redo, comfy mouse interactions and drag-n-drop, keyboard shortcuts, and all the other goodies that were standard on the desktop, there are no easy architectural shortcuts available today. And if you want to add real-time collaboration, practically a necessity for real apps, all of these concerns spill out, because they cannot be split up neatly into a wholly separate front-end and back-end.

A good example is when people mistakenly equate undo/redo with a discrete, immutable event log. This is fundamentally wrong, because what constitutes an action from user's point of view is entirely different from how a back-end engineer perceives it. For example undo/redo needs to group multiple operations to enable sensible, logical checkpoints… but it also needs to do so on the fly, for actions which are rapid and don't conflict.

If you don't believe me, go type some text in your text editor and see what happens when you press CTRL-Z. It won't erase character by character, but did you ever think about that? Plus, if multiple users collaborate, each needs their own undo/redo stack, which means you need the equivalent of git rebasing and merging. You'd be amazed how many people don't realize this.

If we want to move forward, surely, we should be able to replicate what was normal 20 years ago?

Real-time databases

There are a few promising things happening in the field, but they are so, so rare… like the slow-death-and-rebirth of Firebase into open-source alternatives and lookalikes. But even then, robust real-time collaboration remains a 5-star premium feature.

Similarly, big canvas-based apps like Figma, and scrappy upstarts like TLDraw have to painstakingly reinvent all the wheels, as practically all the relevant knowledge has been lost. And heaven forbid you actually want a decent, GPU-accelerated renderer: you will need to pay a dedicated team of experts to write code nobody else in-house can maintain, because the tooling is awful and also they are scared of math.

What bugs me the most is that the React dev team and friends seem extremely unaware of any of this. The things they are prioritizing simply don't matter in bringing the quality of the resulting software forward, except at the margins. It'll just load the same HTML a bit faster. If you stubbornly refuse to learn what memo(…) is for, it'll render slightly less worse. But the advice they give for event handling, for data fetching, and so on… for advanced use it's simply wrong.

A good example is that GraphQL query subscriptions in Apollo split up the initial GET from the subsequent SUBSCRIBE. This means there is always a chance one or more events were dropped in between the two. Nevertheless, this is how the library is designed, and this is what countless developers are doing today. Well okay then.

Another good example is implementing mouse gestures, because mouse events happen quicker than React can re-render. Making this work right the "proper way" is an exercise in frustration, and eventually you will conclude that everything you've been told about non-HTML-element useRef is a lie: just embrace mutating state here.

In fact, despite being told this will cause bugs, I've never had any issues with it in React 17. This leads me to suspect that what they were really doing was trying to prevent people from writing code that would break in React 18's concurrent mode. If so: dick move, guys. Here's what I propose: if you want to warn people about "subtle bugs", post a concrete proof, or GTFO.

* * *

If you want to build a truly modern, robust, desktop-class web app with React, you will find that you still need to pretty much make apple pie from scratch, by first re-inventing the entire universe. You can try starting with the pre-made stuff, but you will hit a wall, and/or eventually corrupt your users' data. It's simply been my experience, and I've done the React real-time collaboration rodeo with GPU sprinkles on top multiple times now.

Crucially, none of the React alternatives solve this, indeed, they mostly just make it worse by trying to "helpfully" mutate state right away. But here's the annoying truth: you cannot skip learning to reason about well-ordered orchestration. It will just bite you in the ass, guaranteed.

What's really frustrating about all this is how passive and helpless the current generation of web developers seem to be in all this. It's as if they've all been lulled into complacency by convenience. They seem afraid to carve out their own ambitious paths, and lack serious gusto for engineering. If there isn't a "friendly" bot spewing encouraging messages with plenty of 👏 emoji at every turn, they won't engage.

As someone who took a classical engineering education, which included not just a broad scientific and mathematical basis, but crucially also the necessary engineering ethos, this is just alien to me. Call me cynical all you want, but it matches my experience. Coming after the generation that birthed Git and BitTorrent, and which killed IE with Firefox and Konqueror/WebKit, it just seems ridiculous.

Fuck, most zoomers don't even know how to dance. I don't mean that they are bad at dancing, I mean they literally won't try, and just stand around awkwardly.

Just know: nobody else is going to do it for you. So what are you waiting for?

The GPU Banana Stand

2022-07-21T00:00:00+02:00

Freshly whipped WebGPU, with ice cream

I recently rolled out version 0.7 of Use.GPU, my declarative/reactive WebGPU library.

This includes features and goodies by itself. But most important are the code patterns which are all nicely slotting into place. This continues to be welcome news, even to me, because it's a novel architecture for the space, drawing heavily from both reactive web tech and functional programming.

Some of the design choices are quite different from other frameworks, but that's entirely expected: I am not seeking the most performant solution, but the most composable. Nevertheless, it still has fast and minimal per-frame code, with plenty of batching. It just gets there via an unusual route.

WebGPU is not available for general public consumption yet, but behind the dev curtain Use.GPU is already purring like a kitten. So I mainly want more people to go poke at it. Cos everything I've been saying about incrementalism can work, and does what it says on the box. It's still alpha, but there are examples and documentation for the parts that have stabilized, and most importantly, it's already pretty damn fun.

If you have a dev build of Chrome or Firefox on hand, you can follow along with the actual demos. For everyone else, there's video.

Immediate + Retained

To recap, I built a clone of the core React run-time, called Live, and used it as the basis for a set of declarative and reactive components.

Here's how I approached it. In WebGPU, to render 1 image in pseudo code, you will have something like:

const main = (props) => {
  const device = useGPUDevice(); // access GPU
  const resource = useGPUResource(device); // allocate a resource

  // ...

  dispatch(device, ...); // do some compute
  draw(device, resource, ...); // and/or do some rendering
};

This is classic imperative code, aka immediate mode. It's simple but runs only once.

The classic solution to making this interactive is to add an event loop at the bottom. You then need to write specific code to update specific resources in response to specific events. This is called retained mode, because the resources are all created once and explicitly kept. It's difficult to get right and gets more convoluted as time goes by.

Declarative programming says instead that if you want to make this interactive, this should be equivalent to just calling main repeatedly with new input props aka args. Each use…() call should then either return the same thing as before or not, depending on whether its arguments changed: the use prefix signifies memoization, and in practice this involves React-like hooks such as useMemo or useState.

In a declarative model, resources can be dropped and recreated on the fly in response to changes, and code downstream is expected to cope. Existing resources are still kept somewhere, but the retention is implicit and hands-off. This might seem like an enormous source of bugs, but the opposite is true: if any upstream value is allowed to change, that means you are free to pass down changed values whenever you like too.

That's essentially what Use.GPU does. It lets you write code that feels immediate, but is heavily retained on the inside, tracking fine grained dependencies. It does so by turning every typical graphics component into a heavily memoized constructor, while throwing away most of the other usual code. It uses so instead of dispatch() you write , but the principle remains the same.

Like React, you don't actually re-run all of main(...) every time: every boundary is actually a resume checkpoint. If you crack open a random Use.GPU component, you will see the same main() shape inside.

A Live component tree, showing changes in green.

3 in 1

Live goes far beyond the usual React semantics, introducing continuations, tree reductions, captures, and more. These are used to make the entire library self-hosted: everything is made out of components. There is no special layer underneath to turn the declarative model into something else. There is only the Live run-time, which does not know anything about graphics or GPUs.

The result is a tree of functions which is simultaneously:

an execution trace
the application state
a dependency graph of that state

When these 3 concerns are aligned, you get a fully incremental program. It behaves like a big reactive AST expression that builds and rewrites itself. This way, Live is an evolution of React into a fully rewindable, memoized effect run-time.

That's a mouthful, but when working with Use.GPU, it all comes down to that main() function above. This is exactly the mental model you should be having. All the rest is just window dressing to assemble it.

Instead of hardcoded draw() calls, there is a loop for (let task of tasks) task(). Maintaining that list of tasks is what all the reactivity is ultimately in service of: to apply minimal changes to the code to be run every frame, or the resources it needs. And to determine if it needs to run at all, or if we're still good.

So the tree in Use.GPU is executable code knitting itself together, and not data at all. This is very different from most typical scene trees or render graphs: these are pure data representations of objects, which are traversed up and down by static code, chasing existing pointers.

The tree form captures more than hierarchy. It also captures order, which is crucial for both dispatch sequencing and 2D layering. Live map-reduce lets parents respond to children without creating cycles, so it's still all 100% one-way data flow. It's like a node graph, but there is no artificial separation between the graph and the code.

You already have to decide where in your code particular things happen; a reactive tree is merely a disciplined way to do that. Like a borrow checker, it's mainly there for your own good, turning something that would probably work fine in 95% of cases into something that works 100%. And like a borrow checker, you will sometimes want to tell it to just f off, and luckily, there are a few ways to do that too.

The question it asks is whether you still want to write classic GPU orchestration code, knowing that the first thing you'll have to do is allocate some resources with no convenient way to track or update them. Or whether you still want to use node-graph tools, knowing that you can't use functional techniques to prevent it from turning into spaghetti.

If this all sounds a bit abstract, below are more concrete examples.

Compute Pipelines

One big new feature is proper support for compute shaders.

GPU compute is meant to be rendering without all the awful legacy baggage: just some GPU memory buffers and some shader code that does reading and writing. Hence, compute shaders can inherit all the goodness in Use.GPU that has already been refined for rendering.

I used it to build a neat fluid dynamics smoke sim example, with fairly decent numerics too.

The basic element of a compute pipeline is just . This takes a shader, a workgroup count, and a few more optional props. It has two callbacks, one whether to dispatch conditionally, the other to initialize just-in-time data. Any of these props can change at any time, but usually they don't.

If you place this anywhere inside a ..., it will run as expected. WebGPU will manage the device, while Compute will gather up the compute calls. This simple arrangement can also recover from device loss. If there are other dispatches or computes beside it, they will be run in tree order. This works because WebGPU provides a DeviceContext and gathers up dispatches from children.

This is just minimum viable compute, but not very convenient, so other components build on this:

- creates a buffer of a particular format and size. It can auto-size to the screen, optionally at xN resolution. This can also track N frames of history, like a rotating double or triple buffer. You can use it as a data source, or pass it to to write to it.

- wraps and runs a compute shader once for every sample in the target. It has conveniences to auto-bind buffers with history, as well as textures and uniforms. It can cycle history every frame. It will also read workgroup size from the shader code and auto-size the dispatch to match the input on the fly.

With these ingredients, a fluid dynamics sim (without visualization) becomes:

The expanded result.

,
    // Divergence
    ,
    // Curl
    ,
    // Pressure
    
  ]}
  then={([
    velocity,
    divergence,
    curl,
    pressure,
  ]: StorageTarget[]) => (
    
      
        
          
            
          
          
            
              
            
          
          
            
            
            
            
            
          
        
      
    
  )
/>

Explaining why this simulates smoke is beyond the scope of this post, but you can understand most of what it does just by reading it top to bottom:

It will create 4 data buffers: velocity, divergence, curl and pressure
It will set up 3 compute stages in order, targeting the different buffers.
It will run a series of compute kernels on those targets, using the output of one kernel as the input of the next.
All this will loop live.

Each of the shaders is imported directly from a .wgsl file, because shader closures are a native data type in Use.GPU.

The appearance of in the middle mirrors the React mechanism of the same name. Here it will defer execution until all the shaders have been compiled, preventing a partial pipeline from running. The semantics of Suspense are realized via map-reduce over the tree inside: if any of them yeet a SUSPEND symbol, the entire tree is suspended. So it can work for anything, not just compute dispatches.

What is most appealing here is the ability to declare data sources, name them using variables, and just hook them up to a big chunk of pipeline. You aren't forced to use excessive nesting like in React, which comes with its own limitations and ergonomic issues. And you don't have to generate monolithic chunks of JSX, you can use normal code techniques to organize that part too.

A tree of layout components, reduced into shapes, reduced into layers.

HTML/GPU

The fluid sim example includes a visualization of the 3 internal vector fields. This leverages Use.GPU's HTML-like layout system. But the 3 "divs" are each directly displaying a GPU buffer.

The data is colored using a shader, defined using a wgsl template.

const debugShader = wgsl`
  @link fn getSample(i: u32) -> vec4 {};
  @link fn getSize() -> vec4 {};
  @optional @link fn getGain() -> f32 { return 1.0; };

  fn main(uv: vec2) -> vec4 {
    let gain = getGain(); // Configurable parameter
    let size = getSize(); // Source array size

    // Convert 2D UV to linear index
    let iuv = vec2(uv * vec2(size.xy));
    let i = iuv.x + iuv.y * size.x;

    // Get sample and apply orange/blue color palette
    let value = getSample(i).x * gain;
    return sqrt(vec4(value, max(value * .1, -value * .3), -value, 1.0));
  }
`;

const DEBUG_BINDINGS = bundleToAttributes(debugShader);

const DebugField = ({field, gain}) => {
  const boundShader = useBoundShader(
    debugShader,
    DEBUG_BINDINGS,
    [field, () => field.size, gain || 1]
  );
  const textureSource = useLambdaSource(boundShader, field);
  return (
    
  );
};

Above, the DebugField component binds the coloring shader to a vector field. It turns it into a lambda source, which just adds array size metadata (by copying from field).

DebugField returns an with the shader as its image. This works because the equivalent of CSS background-image in Use.GPU can accept a shader function (uv: vec2) -> vec4.

So this is all that is needed to slap a live, procedural texture on a UI element. You can use all the standard image alignment and sizing options here too, because why wouldn't you?

Most UI elements are simple and share the same basic archetype, so they will be batched together as much as drawing order allows. Elements with unique shaders however are realized using 1 draw call per element, which is fine because they're pretty rare.

This part is not new in 0.7, it's just gotten slightly more refined. But it's easy to miss that it can do this. Where web browsers struggle to make their rendering model truly extensible, Use.GPU instead invites you to jump right in using first-class tools. Cos again: shader closures are a native data type the same way that there was money in that banana stand. I don't know how to be any clearer than this.

The shader snippets will end up inlined in the right places with all the right bindings, so you can just go nuts.

Dual Contouring

3D plotting isn't complete without rendering implicit surfaces. In WebGL this was very hard to do well, but in WebGPU it's entirely doable. Hence there is a that can generate a surface for any level in a volume. I chose dual contouring over e.g. marching cubes because it's always topologically sound, and also easy to explain.

Given a volume of data, you can classify each data point as inside or outside. You can then create a "minecraft" or "q-bert" mesh of cube faces, which cleanly separates all inside points from outside. This mesh will be topologically closed, provided it fits within the volume.

BorisTheBrave.com

In practice, you check every X, Y and Z edge between every adjacent pair of points, and place a cube face that sits across perpendicular. This creates cubes that are offset by half a cell, which is where the "dual" in the name comes from.

The last step is to make it smooth by projecting all the vertices onto the actual surface (as best you can), somewhere inside each containing cell. For "proper" dual contouring, this uses both the field and its gradients, using a difficult-to-stabilize least-squares fit. But high quality gradients are usually not available for numeric data, so I use a simpler linear technique, which is more stable.

The resulting mesh looks smooth, but does not have clean edges on the volume boundary, revealing the cube-shaped nature. To hide this, I generate a border of 1 additional cell in each direction. This is trimmed off from the final mesh using a per-pixel scissor in a shader. I also apply anti-aliasing similar to SDFs, so it's indistinguishable from actual mesh edges.

is the currently the most complex geometry component in the whole set. But in use, it's a simple layer which you just feed volume data to get a shaded mesh. On the inside it's realized using 2 compute dispatches and an indirect draw call, as well as a non-trivial vertex and fragment shader. It also plays nice with the lighting system, and the material system, the transform system, and so on, each of which comes from the surrounding context.

I'm very happy with the result, though I'm pretty disappointed in compute shaders tbh. The GPU ergonomics are plain terrible: despite knowing virtually nothing about the hardware you're on, you're expected to carefully optimize your dispatch size, memory access patterns, and more. It's pretty absurd.

The most basic case of "embarrassingly parallel shader" isn't even optimized for: you have to dispatch at least as many threads as the hardware supports, or it may have up to 2x, 4x, 8x... slowdown as X% sits idle. Then, with a workgroup size of e.g. 64, if the data length isn't a multiple of 64, you have to manually trim off those last threads in the shader yourself.

There are basically two worlds colliding here. In one world, you would never dream to size anything other than some (multiple of) power-of-two, because that would be inefficient. In the other world, it's ridiculous to expect that data comes in power-of-two sizes. In some ways, this is the real GPU ↔︎ CPU gap.

Use.GPU obviously chooses the world where such trade-offs are unreasonable impositions. It has lots of ergonomics around getting data in, in various forms, and it tries to paper over differences where it can.

Transforms and Differentials

Most 3D engines will organize their objects in a tree using matrix transforms.

In React or Live, this is trivial because it maps to the normal component update cycle, which is batched and dispatched in tree order. You don't need dirty flags: if a matrix changes somewhere, all children affected by it will be re-evaluated.

const Node = ({matrix, children}) => {
  const parent = useContext(MatrixContext);
  const combined = matrixMultiply(parent, matrix);
  return provide(MatrixContext, combined, children);
};

This is a common theme in Use.GPU: a mechanism that normally would have to be coded disappears almost entirely, because it can just re-use native tree semantics. However, Use.GPU goes much further. Matrix transforms are just one kind of transform. While they are a very convenient sweet spot, it's insufficient as a general case.

So its TransformContext doesn't hold a matrix, it holds any shader function vec4 -> vec4. This operates on the positions. When you nest one transform in the other, it will chain both shader functions in series. The transforms are inlined directly into the affected vertex shaders. If a transform changes, downstream draw calls can incorporate it and get new shaders.

If you used this for ordinary matrices, they wouldn't merge and it would waste GPU cycles. Hence there are still classic matrix transforms in e.g. the GLTF package. This then compacts into a single vec4 -> vec4 transform per mesh, which can compose with other, general transforms.

You can compose e.g. a spherical coordinate transform with a stereographic one, animate both, and it works.

It's weird, but I feel like I have to stress and justify that this is Perfectly Fine™... even more, that it's Okay To Do Transcendental Ops In Your Vertex Shader, because I do. I think most graphics dev readers will grok what I mean: focusing on performance-über–alles can smother a whole category of applications in the crib, when the more important thing is just getting to try them out at all.

Dealing with arbitrary transforms poses a problem though. In order to get proper shading in 3D, you need to transform not just the positions, but also the tangents and normals. The solution is a DifferentialContext with a shader function (vector: vec4, base: vec4, contravariant: bool) -> vec4. It will transform the differential vector at a point base in either a covariant (tangent) or contravariant (normal) way.

There's also a differential combinator: it can chain analytical differentials if provided, transforming the base point along. If there's no analytic differential, it will substitute a numeric one instead.

You can e.g. place an implicit surface inside a cylindrical transform, and the result will warp and shade correctly. Differential indicators like tick marks on axes will also orient themselves automatically. This might seem like a silly detail, but it's exactly this sort of stuff that I'm after: ways to make 3D graphics parts more useful as general primitives to build on, rather than just serving as a more powerful triangle blaster.

It's all composable, so all optional. If you place a simple GLTF model into a bare draw pass, it will have a classic projection × view × model vertex shader with vanilla normals and tangents. In fact, if your geometry isn't shaded, it won't have normals or tangents at all.

Content like map tiles also benefits from Use.GPU's sophisticated z-biasing mechanism, to ensure correct visual layering. This is an evolution of classic polygon offset. The crucial trick here is to just size the offset proportionally to the actual point or line width, effectively treating the point as a sphere and the line as tube. However, as Use.GPU has 2.5D points and lines, getting this all right was quite tricky.

But, setting zBias={+1} on a line works to bias it exactly over a matching surface, regardless of the line width, regardless of 2D vs 3D, and regardless of which side it is viewed from. This is IMO the API that you want. At glancing angles zBias automatically loses effect, so there is no popping.

A DSL for DSLs

You could just say "oh, so this is just a domain-specific language for render and compute" and wonder how this is different from any previous plug-and-play graphics solution.

Well first, it's not a proxy for anything else. If you want to do something that you can't do with , you aren't boxed in, because a is just a with bells on. Even then, is also replaceable, because a is just a of a lambda you could write yourself. And a is ultimately also a yeet, of a per-frame lambda that calls the individual kernel lambdas.

This principle is pervasive throughout Use.GPU's API design. It invites you to use its well-rounded components as much as possible, but also, to crack them open and use the raw parts if they're not right for you. These components form a few different play sets, each suited to particular use cases and levels of proficiency. None of this has the pretense of being no-code; it merely does low-code in a way that does not obstruct full-code.

You can think of Use.GPU as a process of run-time macro-expansion. This seems quite appropriate to me, as the hairy problem being solved is preparing and dispatching code for another piece of hardware.

Second, there is a lot of value in DSLs for pipeline-like things. Graphs are just no substitute for real code, so DSLs should be real programming languages with escape hatches baked in by default. Much of the value here isn't in the comp-sci cred, but rather in the much harder work of untangling the mess of real-time rendering at the API level.

The resulting programs also have another, notable quality: the way they are structured is a pretty close match to how GPU code runs... as async dispatches of functions which are only partially ordered, and mainly only at the point where results are gathered up. In other words, Use.GPU is not just a blueprint for how the CPU side can look, it also points to a direction where CPU and GPU code can be made much more isomorphic than today.

When fully expanded, the resulting trees can still be quite the chonkers. But every component has a specific purpose, and the data flow is easy to follow using the included Live Inspector. A lot of work has gone into making the semantics of Live legible and memorable.

Quoting: it's just like Lisp, but incremental.

Re-re-re-concile

The neatest trick IMO is where the per-frame lambdas go when emitted.

In 0.7, Live treats the draw calls similar to how React treats the HTML DOM: as something to be reconciled out-of-band. But what is being reconciled is not HTML, it's just other Live JSX, which ends up in a new part of the current tree. So this will also run it. You can even portal back and forth at will between the two sub-trees, while respecting data causality and context scope.

Along the way Live has gained actual bona-fide and operators, to drive this recursive . This means Use.GPU now neatly sidesteps Greenspun's law by containing a complete and well-specified version of a Lisp. Score.

You could also observe that the Live run-time could itself be implemented in terms of Quote and Unquote, and you would probably be correct. But this is the kind of code transform that would buy only a modicum of algorithmic purity at the cost of a lot of performance. So I'm not going there, and leave that exercise for the programming language people. And likely that would eventually result in an optimization pass to bring it closer to what it already is today.

My real point is, when you need to write code to produce code, it needs to be Lisp or something very much like it. But not because of purity. It's because otherwise you will end up denying your API consumers affordances you would find essential yourself.

Typescript is not the ideal language to do this in, but under the circumstances, it is one of the least worst. AFAIK no language has the resumable generator semantics Live has, and I need a modern graphics API too, so practical concerns win out instead. Mirroring React is also good, because the tooling for it is abundant, and the patterns are well known by many.

This same tooling is also what lets me import WGSL into TS without reinventing all the wheels, and just piggy backing on the existing ES module system. Though try getting Node.js, TypeScript and Webpack to all agree what a .wgsl module should be for, it's uh... a challenge.

* * *

The story of Use.GPU continues to evolve and continues to get simpler too. 0.7 makes for a pretty great milestone, and the roadmap is looking pretty green already.

There are still a few known gaps and deliberate oversights. This is in part because Use.GPU focuses on use cases that are traditionally neglected in graphics engines: quality vector graphics, direct data visualization, generative geometry, scalable UI, and so on. It took months before I ever added lighting and PBR, because the unlit, unshaded case had enough to chew on by itself.

Two obvious missing features are post-FX and occlusion culling.

Post-FX ought to be a straightforward application of the same pipelines from compute. However, doing this right also means building a good solution for producing derived render passes, such as normal and depth. The same also applies to shadow maps, which are also absent for the same reason.

Occlusion culling is a funny one, because it's hard to imagine a graphics renderer without it. The simple answer is that so far I haven't needed it because rendering 3D worlds is not something that has come up yet. My Subpixel SDF visualization example reached 1 million triangles easily, without me noticing, because it wasn't an issue even on an older laptop.

Most of those triangles are generative points and lines, drawn directly from compact source data:

This is the same video from last time, I know, but here's the thing:

There is not a single browser engine where you could dump a million elements into a page and still have something that performs, at all. Just doesn't exist. In Use.GPU you can get there by accident. On a single thread too. Without the indirection of a retained DOM, you just have code that reduces code that dispatches code to produce pixels.

The Case for Use.GPU

2022-06-14T00:00:00+02:00

Reinventing rendering one shader at a time

The other day I ran into a perfect example of exactly why GPU programming is so foreign and weird. In this post I will explain why, because it's a microcosm of the issues that lead me to build Use.GPU, a WebGPU rendering meta-framework.

What's particularly fun about this post is that I'm pretty sure some seasoned GPU programmers will consider it pure heresy. Not all though. That's how I know it's good.

GLTF model, rendered with Use.GPU GLTF

A Big Blob of Code

The problem I ran into was pretty standard. I have an image at size WxH, and I need to make a stack of smaller copies, each half the size of the previous (aka MIP maps). This sort of thing is what GPUs were explicitly designed to do, so you'd think it would be straight-forward.

If this was on a CPU, then likely you would just make a function downScaleImageBy2 of type Image => Image. Starting from the initial Image, you apply the function repeatedly, until you end up with just a 1x1 size image:

let makeMips = (image: Image, n: number) => {
  let images: Image[] = [image];
  for (let i = 1; i < n; ++i) {
    image = downScaleImageBy2(image);
    images.push(image);
  }
  return images;
}

On a GPU, e.g. WebGPU in TypeScript, it's a lot more involved. Something big and ugly like this... feel free to scroll past:

// Uses:
// - device: GPUDevice
// - format: GPUTextureFormat (BGRA or RGBA)
// - texture: GPUTexture (the original image + initially blank MIPs)

// A vertex and pixel shader for rendering vanilla 2D geometry with a texture
let MIP_SHADER = `
  struct VertexOutput {
    @builtin(position) position: vec4,
    @location(0) uv: vec2,
  };

  @stage(vertex)
  fn vertexMain(
    @location(0) uv: vec2,
  ) -> VertexOutput {
    return VertexOutput(
      vec4(uv * 2.0 - 1.0, 0.5, 1.0),
      uv,
    );
  }

  @group(0) @binding(0) var mipTexture: texture_2d;
  @group(0) @binding(1) var mipSampler: sampler;

  @stage(fragment)
  fn fragmentMain(
    @location(0) uv: vec2,
  ) -> @location(0) vec4 {
    return textureSample(mipTexture, mipSampler, uv);
  }
`;

// Compile the shader and set up the vertex/fragment entry points
let module = device.createShaderModule(MIP_SHADER);
let vertex = {module, entryPoint: 'vertexMain'};
let fragment = {module, entryPoint: 'fragmentMain'};

// Create a mesh with a rectangle
let mesh = makeMipMesh(size);

// Upload it to the GPU
let vertexBuffer = makeVertexBuffer(device, mesh.vertices);

// Make a texture view for each MIP level
let views = seq(mips).map((mip: number) => makeTextureView(texture, 1, mip));

// Make a texture sampler that will interpolate colors
let sampler = makeSampler(device, {
  minFilter: 'linear',
  magFilter: 'linear',
});

// Make a render pass descriptor for each MIP level, with the MIP as the drawing buffer
let renderPassDescriptors = seq(mips).map(i => ({
  colorAttachments: [makeColorAttachment(views[i], null, [0, 0, 0, 0], 'load')],
} as GPURenderPassDescriptor));

// Set the right color format for the color attachment(s)
let colorStates = [makeColorState(format)];

// Make a rendering pipeline for drawing a strip of triangles
let pipeline = makeRenderPipeline(device, vertex, fragment, colorStates, undefined, 1, {
  primitive: {
    topology: "triangle-strip",
  },
  vertex:   {buffers: mesh.attributes},
  fragment: {},
});

// Make a bind group for each MIP as the texture input
let bindGroups = seq(mips).map((mip: number) => makeTextureBinding(device, pipeline, sampler, views[mip]));

// Create a command encoder
let commandEncoder = device.createCommandEncoder();

// For loop - Mip levels
for (let i = 1; i < mips; ++i) {

  // Begin a new render pass
  let passEncoder = commandEncoder.beginRenderPass(renderPassDescriptors[i]);
  
  // Bind render pipeline
  passEncoder.setPipeline(pipeline);

  // Bind previous MIP level
  passEncoder.setBindGroup(0, bindGroups[i - 1]);

  // Bind geometry
  passEncoder.setVertexBuffer(0, vertexBuffer);

  // Actually draw 1 MIP level
  passEncoder.draw(mesh.count, 1, 0, 0);

  // Finish
  passEncoder.end();
}

// Send to GPU
device.queue.submit([commandEncoder.finish()]);

The most important thing to notice is that it has a for loop just like the CPU version, near the end. But before, during, and after, there is an enormous amount of set up required.

For people learning GPU programming, this by itself represents a challenge. There's not just jargon, but tons of different concepts (pipelines, buffers, textures, samplers, ...). All are required and must be hooked up correctly to do something that the GPU should treat as a walk in the park.

That's just the initial hurdle, and by far not the worst one.

Use.GPU Plot aka MathBox 3

The Big Lie

You see, no real application would want to have the code above. Because every time this code runs, it would do all the set-up entirely from scratch. If you actually want to do this practically, you would need to rewrite it to add lots of caching. The shader stays the same every time for example, so you want to create it once and then re-use it. The shader also uses relative coordinates 0...1, so you can use the same geometry even if the image is a different size.

Other parts are less obvious. For example, the render pipeline and all the associated colorState depend entirely on the color format: RGBA or BGRA. If you need to handle both, you would need to cache two versions of everything. Do you need to?

The data dependencies are quite subtle. Some parts depend only on the data type (i.e. format), while other parts depend on an actual data value (i.e. the contents of texture)... but usually both are aspects of one and the same object, so it's very difficult to effectively separate them. Some dependencies are transitive: we have to create an array of views to access the different sizes of the texture (image), but then several other things depend on views, such as the colorAttachments (inside pipeline) and the bindGroups.

There is one additional catch. Everything you do with the GPU happens via a device context. It's entirely possible for that context to be dropped by the browser/OS. In that case, it's your responsibility to start anew, recreating every single resource you used. This is btw the API design equivalent of a pure dick move. So whatever caching solution you come up with, it cannot be fire-and-forget: you need to invalidate and refresh too. And we all know how hard that is.

This is what all GPU rendering code is like. You don't spend most of your time doing the work, you spend most of your time orchestrating for the work to happen. What's amazing is that it means every GPU API guide is basically a big book of lies, because it glosses over these problems entirely. It's just assumed that you will intuit automatically how it should actually be used, even though it actually takes weeks, months, years of trying. You need to be intimately familiar with the whys in order to understand the how.

One can only conclude that the people making the APIs rarely, if ever, talk to the people using the APIs. Like backend and frontend web developers, the backend side seems blissfully unaware of just how hairy things get when you actually have to let people interact with your software instead of just other software. Instead, you get lots of esoteric features and flags that are never used except in the rarest of circumstances.

Few people in the scene really think any of this is a problem. This is just how it is. The art of creating a GPU renderer is to carefully and lovingly choose every aspect of your particular solution, so that you can come up with a workable answer to all of the above. What formats do you handle, and which do you not? Do all meshes have the same attributes or not? Do you try to shoehorn everything through one uber-pipeline/shader, or do you have many? If so, do you create them by hand, or do you use code generation to automate it? Also, where do you keep the caches? And who owns them?

It shouldn't be a surprise that the resulting solutions are highly bespoke. Each has its own opinionated design decisions and quirks. Adopting one means buying into all of its assumptions wholesale. You can only really swap out two renderers if they are designed to render exactly the same kind of thing. Even then, upgrading e.g. from Unreal Engine 4 to 5 is the kind of migration only a consultant can love.

This goes a very long way towards explaining the problem, but it doesn't actually explain the why.

Use.GPU has first class GPU picking support.

Memory vs Compute

There is a very different angle you can approach this from.

GPUs are, essentially, massively parallel pure function applicators. You would expect that functional programming would be a huge influence. Except it's the complete opposite: pretty much all the established practices derive from C/C++ land, where the men are men, state is mutable and the pointers are unsafe. To understand why, you need to face the thing that FP is usually pretty bad at: dealing with the performance implications of its supposedly beautiful abstractions.

Let's go back to the CPU model, where we had a function Image => Image. The FP way is to compose it, threading together a chain of Image → Image → .... → Image. This acts as a new function Image => Image. The surrounding code does not have to care, and can't even notice the difference. Yay FP.

But suppose you have a function that makes an image grayscale, and another function that increases the contrast. In that case, their composition Image => Image + Image => Image makes an extra intermediate image, not just the result, so it uses twice as much memory bandwidth. On a GPU, this is the main bottleneck, not computation. A fused function Image => Image that does both things at the same time is typically twice as efficient.

The usual way we make code composable is to split it up and make it pass bits of data around. As this is exactly what you're not supposed to do on a GPU, it's understandable that the entire field just feels like bizarro land.

It's also trickier in practice. A grayscale or contrast adjustment is a simple 1-to-1 mapping of input pixels to output pixels, so the more you fuse operations, the better. But the memory vs compute trade-off isn't always so obvious. A classic example is a 2D blur filter, which reads NxN input pixels for every output pixel. Here, instead of applying a single 2D blur, you should do a separate 1D Nx1 horizontal blur, save the result, and then do a 1D 1xN vertical blur. This uses less bandwidth in total.

But this has huge consequences. It means that if you wish to chain e.g. Grayscale → Blur → Contrast, then it should ideally be split right in the middle of the two blur passes:

Image → (Grayscale + Horizontal Blur) → Memory → (Vertical Blur + Contrast) → ...

In other words, you have to slice your code along invisible internal boundaries, not along obvious external ones. Plus, this will involve all the same bureaucratic descriptor nonsense you saw above. This means that a piece of code that normally would just call a function Image => Image may end up having to orchestrate several calls instead. It must allocate a place to store all the intermediate results, and must manually wire up the relevant save-to-storage and load-from-storage glue on both sides of every gap. Exactly like the big blob of code above.

When you let C-flavored programmers loose on these constraints, it shouldn't be a surprise that they end up building massively complex, fused machines. They only pass data around when they actually have to, in highly packed and compressed form. It also shouldn't be a surprise that few people beside the original developers really understand all the details of it, or how to best make use of it.

There was and is a massive incentive for all this too, in the form of AAA gaming. Gaming companies have competed fiercely under notoriously harsh working conditions, mostly over marginal improvements in rendering quality. The progress has been steady, creeping ever closer to photorealism, but it comes at the enormous human cost of having to maintain code that pretty much becomes unmaintainable by design as soon as it hits the real world.

This is an important realization that I had a long time ago. That's because composing Image => Image is basically how Winamp's AVS visualizer worked, which allowed for fully user-composed visuals. This was at a time when CPUs were highly compute-constrained. In those days, it made perfect sense to do it this way. But it was also clear to anyone who tried to port this model to GPU that it would be slow and inefficient there. Ever since then, I have been exploring how to do serious fused composition for GPU rendering, while retaining full end-user control over it.

Use.GPU Render-To-Texture, aka Milkdrop / AVS (except in Float16 Linear RGB)

Burrito-GPU

Functional programmers aren't dumb, so they have their own solutions for this. It's much easier to fuse things together when you don't try to do it midstream.

For example, monadic IO. In that case, you don't compose functions Image => Image. Rather, you compose a list of all the operations to apply to an image, without actually doing them yet. You just gather them all up, so you can come up with an efficient execution strategy for the whole thing at the end, in one place.

This principle can be applied to shaders, which are pure functions. You know that the composition of function A => B and B => C is of type A => C, which is all you need to know to allow for further composition: you don't need to actually compose them yet. You can also use functions as arguments to other shaders. Instead of a value T, you pass a function (...) => T, which a shader calls in a pre-determined place. The result is a tree of shader code, starting from some main(), which can be linked into a single program.

To enable this, I defined some custom @attributes in WGSL which my shader linker understands:

@optional @link fn getTexture(uv: vec2) -> vec4 { return vec4(1.0, 1.0, 1.0, 1.0); };

@export fn getTextureFragment(color: vec4, uv: vec2) -> vec4 {
  return color * getTexture(uv);
}

The function getTextureFragment will apply a texture to an existing color, using uv as the texture coordinates. The function getTexture is virtual: it can be linked to another function, which actually fetches the texture color. But the texture could be entirely procedural, and it's also entirely optional: by default it will return a constant white color, i.e. a no-op.

It's important here that the functions act as real closures rather than just strings, with the associated data included. The goal is to not just to compose the shader code, but to compose all the orchestration code too. When I bind an actual texture to getTexture, the code will contain a texture binding, like so:

@group(...) @binding(...) var mipTexture: texture_2d;
@group(...) @binding(...) var mipSampler: sampler;

fn getTexture(uv: vec2) -> vec4 {
  return textureSample(mipTexture, mipSampler, uv);
}

When I go to draw anything that contains this piece of shader code, the texture should travel along, so it can have its bindings auto-generated, along with any other bindings in the shader.

That way, when our blur filter from earlier is assigned an input, that just means linking it to a function getTexture. That input could be a simple image, or it could be another filter being fused with. Similarly, the output of the blur filter can be piped directly to the screen, or it could be passed on to be fused with other shader code.

What's really neat is that once you have something like this, you can start taking over some of the work the GPU driver itself is doing today. Drivers already massage your shaders, because much of what used to be fixed-function hardware is now implemented on general purpose GPU cores. If you keep doing it the old way, you remain dependent on whatever a GPU maker decides should be convenient. If you have a monad-ish shader pipeline instead, you can do this yourself. You can add support for a new packed data type by polyfilling in the appropriate encoder/decoder code yourself automatically.

This is basically the story of how web developers managed to force browsers to evolve, even though they were monolithic and highly resistant to change. So I think it's a very neat trick to deploy on GPU makers.

There is of course an elephant in this particular room. If you know GPUs, the implication here is that every call you make can have its own unique shader... and that these shaders can even change arbitrarily at run-time for the same object. Compiling and linking code is not exactly fast... so how can this be made performant?

There are a few ingredients necessary to make this work.

The easy one is, as much as possible, pre-parse your shaders. I use a webpack plug-in for this, so that I can include symbols directly from .wgsl in TypeScript:

import { getFaceVertex } from '@use-gpu/wgsl/instance/vertex/face.wgsl';

A less obvious one is that if you do shader composition using source code, it's actually far less work than trying to compose byte code, because it comes down to controlled string concatenation and replacement. If guided by a proper grammar and parse tree, this is entirely sound, but can be performed using a single linear scan through a highly condensed and flattened version of the syntax tree.

This also makes perfect sense to me: byte code is "back end", it's designed for optimal consumption by a run-time made by compiler engineers. Source code is "front end", it's designed to be produced and typed by humans, who argue over convenience and clarity first and foremost. It's no surprise which format is more bureaucratic and which allows for free-form composition.

The final trick I deployed is a system of structural hashing. As we saw before, sometimes code depends on a value, sometimes it only depends on a value's type. A structural hash is a hash that only considers the types, not the values. This means if you draw the same kind of object twice, but with different parameters, they will still have the same structural hash. So you know they can use the exact same shader and pipeline, just with different values bound to it.

In other words, structural hashing of shaders allows you to do automatically what most GPU programmers orchestrate entirely by hand, except it works for any combination of shaders produced at run-time.

The best part is that you don't need to produce the final shader in order to know its hash: you can hash along the way as you build the monadic data structure. Even before you actually start linking it, you can know if you already have the result. This also means you can gather all the produced shaders from a program by running it, and then bake them to a more optimized form for production. It's a shame WebGPU has no non-text option for loading shaders then...

Use the GPU

If you're still following along, there is really only one unanswered question: where do you cache?

Going back to our original big blob of code, we observed that each part had unique data and type dependencies, which were difficult to reason about. Given rare enough circumstances, pretty much all of them could change in unpredictable ways. Covering all bases seems both impractical and insurmountable.

It turns out this is 100% wrong. Covering all bases in every possible way is not only practical, it's eminently doable.

Consider some code that calls some kind of constructor:

let foo = makeFoo(bar);

If you set aside all concerns and simply wish for a caching pony, then likely it sounds something like this: "When this line of code runs, and bar has been used before, it should return the same foo as before."

The problem with this wish is that this line of code has zero context to make such a decision. For example, if you only remember the last bar, then simply calling makeFoo(bar1) makeFoo(bar2) will cause the cache to be trashed every time. You cannot simply pick an arbitrary N of values to keep: if you pick a large N, you hold on to lots of irrelevant data just in case, but if you pick a small N, your caches can become worse than useless.

In a traditional heap/stack based program, there simply isn't any obvious place to store such a cache, or to track how many pieces of code are using it. Values on the stack only exist as long as the function is running: as soon as it returns, the stack space is freed. Hence people come up with various ResourceManagers and HandlePools instead to track that data in.

The problem is really that you have no way of identifying or distinguishing one particular makeFoo call from another. The only thing that identifies it, is its place in the call stack. So really, what you are wishing for is a stack that isn't ephemeral but permanent. That if this line of code is run in the exact same run-time context as before, that it could somehow restore the previous state on the stack, and pick up where it left off. But this would also have to apply to the function that this line of code sits in, and the one above that, and so on.

Storing a copy of every single stack frame after a function is done seems like an insane, impractical idea, certainly for interactive programs, because the program can go on indefinitely. But there is in fact a way to make it work: you have to make sure your application has a completely finite execution trace. Even if it's interactive. That means you have to structure your application as a fully rewindable, one-way data flow. It's essentially an Immediate Mode UI, except with memoization everywhere, so it can selectively re-run only parts of itself to adapt to changes.

For this, I use two ingredients:
- React-like hooks, which gives you permanent stack frames with battle-hardened API and tooling
- a Map-Reduce system on top, which allows for data and control flow to be returned back to parents, after children are done

What hooks let you do is to turn constructors like makeFoo into:

let foo = useFoo(bar, [...dependencies]);

The use prefix signifies memoization in a permanent stack frame, and this is conditional on ...dependencies not changing (using pointer equality). So you explicitly declare the dependencies everywhere. This seems like it would be tedious, but I find actually helps you reason about your program. And given that you pretty much stop writing code that isn't a constructor, you actually have plenty of time for this.

The map-reduce system is a bit trickier to explain. One way to think of it is like an async/await:

async () => {
  // ...
  let foo = await fetch(...);
  // ...
}

Imagine for example if fetch() didn't just do an HTTP request, but actually subscribed and kept streaming in updated results. In that case, it would need to act like a promise that can resolve multiple times, without being re-fetched. The program would need to re-run the part after the await, without re-running the code before it.

Neither promises nor generators can do this, so I implement it similar to how promises were first implemented, with the equivalent of a .then(...):

() => {
   // ...
   return gather(..., (foo) => {
     //...
   });
}

When you isolate the second half inside a plain old function, the run-time can call it as much as it likes, with any prior state captured as part of the normal JS closure mechanism. Obviously it would be neater if there was syntactic sugar for this, but it most certainly isn't terrible. Here, gather functions like the resumable equivalent of a Promise.all.

What it means is that you can actually write GPU code like the API guides pretend you can: simply by creating all the necessary resources as you need them, top to bottom, with no explicit work to juggle the caches, other than listing dependencies. Instead of bulky OO classes wrapping every single noun and verb, you write plain old functions, which mainly construct things.

In JS there is the added benefit of having a garbage collector to do the destructing, but crucially, this is not a hard requirement. React-like hooks make it easy to wrap imperative, non-reactive code, while still guaranteeing clean up is always run correctly: you can pass along the code to destroy an object or handle in the same place you construct it.

It really works. It has made me over 10x more productive in doing anything GPU-related, and I've done this in C++ and Rust before. It makes me excited to go try some new wild vertex/fragment shader combo, instead of dreading all the tedium in setting it up and not missing a spot. What's more, all the extra performance hacks and optimizations that I would have to add by hand, it can auto-insert, without me ever thinking about it. WGSL doesn't support 8-bit storage buffers and only has 32-bit? Well, my version does. I can pass a Uint8Array as a vec and not think about it.

The big blob of code in this post is all real, with only some details omitted for pedagogical clarity. I wrote it the other day as a test: I wanted to see if writing vanilla WebGPU was maybe still worth it for this case, instead of leveraging the compositional abstractions that I built. The answer was a resounding no: right away I ran into the problem that I had no place to cache things, and the solution would be to come up with yet another ad-hoc variant of the exact same thing the run-time already does.

Once again, I reach the same conclusion: the secret to cache invalidation is no mystery. A cache is impossible to clear correctly when a cache does not track its dependencies. When it does, it becomes trivial. And the best place to cache small things is in a permanent stack frame, associated with a particular run-time call site. You can still have bigger, more application-wide caches layered around that... but the keys you use to access global caches should generally come from local ones, which know best.

All you have to do is completely change the way you think about your code, and then you can make all the pretty pictures you want. I know it sounds facetious but it's true, and the code works. Now it's just waiting for WebGPU to become accessible without developer flags.

Veterans of GPU programming will likely scoff at a single-threaded run-time in a dynamic language, which I can somewhat understand. My excuse is very straightforward: I'm not crazy enough to try and build this multi-threaded from day 1, in a static language where every single I has to be dotted, and every T has to be crossed. Given that the run-time behaves like an async incremental data flow, there are few shady shortcuts I can take anyway... but the ability to leverage the any type means I can yolo in the few places I really want to. A native version could probably improve on this, but whether you can shoehorn it into e.g. Rust's type and ownership system is another matter entirely. I leave that to other people who have the appetite for it.

The idea of a "bespoke shader for every draw call" also doesn't prevent you from aggregating them into batches. That's how Use.GPU's 2D layout system works: it takes all the emitted shapes, and groups them into unique layers, so that shapes with the same kind of properties (i.e. archetype) are all batched together into one big buffer... but only if the z-layering allows for it. Similar to the shader system itself, the UI system assumes every component could be a special snowflake, even if it usually isn't. The result is something that works like dear-imgui, without its obvious limitations, while still performing spectacularly frame-to-frame.

Use.GPU Layout - aka HTML/CSS

For an encore, it's not just a box model, but the box model, meaning it replicates a sizable subset of HTML/CSS with pixel-perfect precision and perfectly smooth scaling. It just has a far more sensible and memorable naming scheme, and it excludes a bunch of things nobody needs. Seeing as I have over 20 years of experience making web things, I dare say you can trust I have made some sensible decisions here. Certainly more sensible than W3C on a good day, amirite?

* * *

Use.GPU is not "finished" yet, because there are still a few more things I wish to make composable; this is why only the shader compiler is currently on NPM. However, given that Use.GPU is a fully "user space" framework, where all the "native" functionality sits on an equal level with custom code, this is a matter of degree. The "kernel" has been ready for half a year.

One such missing feature is derived render passes, which are needed to make order-independent transparency pleasant to use, or to enable deferred lighting. I have consistently waited to build abstractions until I have a solid set of use cases for it, and a clear idea of how to do it right. Not doing so is how we got into this mess into the first place: with ill-conceived extensions, which often needlessly complicate the base case, and which nobody has really verified if it's actually what devs need.

In this, I can throw shade at both GPU land and Web land. Certain Web APIs like WebAudio are laughably inadequate, never tested on anything more than toys, and seemingly developed without studying what existing precedents do. This is a pitfall I have hopefully avoided. I am well aware of how a typical 3D renderer is structured, and I am well read on the state of the art. I just think it's horribly inaccessible, needlessly obtuse, and in high need of reinventing.

Edit: There is now more documentation at usegpu.live.

The code is on Gitlab. If you want to play around with it, or just shoot holes in it, please, be my guest. It comes with a dozen or so demo examples. It also has a sweet, fully reactive inspector tool, shown in the video above at ~1:30, so you don't even need to dig into the code to watch it work.

There will of course be bugs, but at least they will be novel ones... and so far, a lot fewer than usual.

The Hiker's Dilemma

2022-03-02T00:00:00+01:00

How to take care of your tribe

The other day I read:

"If you're hiking and you stop to let other people catch up, don't start walking immediately when they arrive. Because that means you got a rest and they didn't. I think about this a lot."

I want to dissect this sentiment because I also think it says a whole lot, but probably not the way the poster meant it. It's a perfect example of something that seems to pass for empathetic wisdom, but actually holds very little true empathy: an understanding of people who actually think differently from each other.

Giuseppe Milo

Point of Interest

Let's start with the obvious: the implication is that anyone who doesn't follow this advice is some kind of asshole. That's why people so readily shared it: it signals concern for the less able. A "fast hiker" denies others reasonable rest, mainly for their own selfish satisfaction, like some kind of bully or slave driver. But this implication is based on a few hidden assumptions.

Most obviously, it frames the situation as one in which only the slow hikers' needs are important. They don't get to enjoy the hike, because they arrive exhausted and beat. Meanwhile those "selfish" fast hikers are fully rested, and even get to walk at pace that is leisurely for them, if they want. So any additional rest is a luxury they don't even need. Still, they refuse to grant it to others unless they are properly educated. How rude.

To me, it seems that neither fast nor slow is actually happy in this situation. The kind of person who is fit enough to hike quickly, and faster than the rest, is likely the kind of person who wants to "feel the burn" in their muscles, and enjoys being exhausted at the end of the day. Meanwhile the kind of person who walks slowly, and complains about not being able to keep up, simply doesn't see extreme exertion and pushing their limits as a net plus.

Indeed, it assumes that it's very important for the entire group to stick together. That it would be bad to split up, or for someone to be left walking alone behind the pack. And also, that simply by walking ahead of others, you are forcing people to keep up, by excluding them and making them look bad. This implies that the goal of the hike is mainly social and tribal, and not e.g. exercise, or exploration, or developing self-sufficiency. But unless you're hiking in dangerous wilderness, there is no hard reason to prefer larger numbers.

Experienced hikers know that trails are typically classified by steepness and challenge. Certain places are also fine some times of the year, but not in snow or rain. Sometimes it involves ropes and mudslides. The entire idea of one-size-fits-all hiking trails is simply unrealistic, because those are called garden paths, and they usually have wheelchair ramps.

You can't even say that "average" walkers in the middle of the pack are automatically setting out the reasonable compromise, simply because that's what the majority in the group is comfortable with. Because what's considered average depends entirely on who shows up, and where they want to go.

The original "lesson" is not actually about respecting people's needs, or about ensuring accessibility for all. It's mainly about disregarding some people's preferences entirely in favor of certain others, holding up some arbitrary level of preference and skill as the norm. What's too far ahead is considered unreasonable. But if you take the advice to its logical conclusion, it would mean that everyone has to perform at the lowest common level, even if someone obviously doesn't belong there, and would be happier elsewhere.

In a world where many consider direct criticism a taboo, this to me is a far more valuable lesson, even if it's a far less agreeable and comfortable interpretation. If it seems absolute, that's itself a mistake: life is not a singular hike, measured on a single yardstick. We lead in some areas, and straggle in others. If you find yourself constantly lagging behind, you should find a different hiking group, instead of demanding that everyone else slow down. If you are leading and getting bored, don't be afraid to scout ahead: you'll be happier too.

It's Physics, Jim

There's another lesson buried here, which is worth exploring: why is it that some hikers can effortlessly go up and down winding paths for hours, while others can barely manage to keep up? Simply chalking it up to physical strength or fitness is not enough.

Imagine you are asked to move a bunch of heavy items from one place to another. You are given a choice of either a crate or a small wagon, both exactly the same size. I doubt anyone would prefer the crate, because we all understand the physics involved on an intuitive level. When you pull a wagon, you only exert yourself when you're trying to move it; but in order to use the crate, you must first lift it up, and then keep it suspended in the air. Even if you don't move while holding the crate, your arms will get tired.

This means that the effort required to use the wagon depends mainly on the distance and mass you need to move. Whereas the effort required for the crate also involves the amount of time you are holding the crate up. If you move it more slowly, you spend more of your energy simply staying in place. In contrast, the faster you move it, the less energy it wastes, even if it momentarily takes more effort. The next time you carry some heavy groceries into the house, observe your own movements, particularly the last "nudge" to get them onto the kitchen counter, and you will realize you already knew this.

This too applies to the hiking scenario: if you're climbing a slope, then simply staying upright takes significant physical effort. If you can ascend faster, you actually waste less of your energy doing so. When descending, the same applies: the harder you push back against gravity, the more tired you will get. Becoming an experienced hiker means developing a natural sense of balance and motion that takes maximum advantage of this. While climbing, you will learn to quickly push through any difficult spots, spending more time with your feet on solid, level ground. While descending, you will let yourself fall from ledge to ledge. You learn to move more like a wagon, less like a crate. Obviously it also helps to have the right wheels, aka footwear.

This is really general life advice. If you spend your time stressed, dealing with chaotic communication and planning, suffering the fallout of past mistakes, yours or others', then you're constantly standing on uneasy ground, wasting your energy just staying in place. If you can instead recognize trouble ahead, and know where you're going to plant your feet, it can feel effortless.

People make the same argument about e.g. obesity or poverty, that it creates a vicious cycle of reinforcing conditions. But they often fail to make the distinction in the two different ways to address it, because their main concern is a non-descript offer of aid and concern. If someone is standing on a slope, you don't just offer them your hand and let them hold on indefinitely, wasting both people's energy, because you will soon both fall down. You should instead get them on solid ground instead, and get them to move better on their own. If someone wants sympathy and aid but rejects offers of working on a solution, that means they don't want to expend any effort in solving it themselves.

There is however a flipside here: offers of aid have to be genuine and clearly stated. If someone is struggling socially, telling them to "just be yourself" is obliviousness masquerading as advice. Telling someone to open up, when you don't actually want to hear their point of view, is purely for self-satisfaction.

Here too, criticism and empathy are typically perceived as being at odds: the person who criticises unfruitful ways of offering aid is dismissed as uncaring, even if they are reading the situation better than most. But if everyone suddenly turns back and runs down a slippery slope again without thinking, being the loud asshole who asks what the hell they are doing is actually the sane thing to do.

Evo Psych Too

I don't think it's a coincidence that this morality lesson comes in the form of a hiking story. In our modern world, hiking is mainly a leisure activity, undertaken exactly because it speaks to our distant past of small tribes roaming in dangerous wilds: watch out for bears and bandits, stay in contact with each other, always be prepared, and don't underestimate the elements.

Unlike the clean and artificial environment of a gym, nature offers us an unfiltered and barely controllable obstacle course, where some of our lesser used instincts can come back to the forefront. All it takes is one storm to turn a carefully manicured park into a new wild challenge, which is accessible to some and inhospitable to others.

This is a contrast which contemporary society is very uneasy to acknowledge. Under the guise of equality and tolerance, anything that threatens to separate the men from the boys, or the men from the women, is considered improper in the "right" circles. Yet evolutionary psychology is impossible to ignore on this point: tens of thousands of years of selective pressure have cleaved humanity down the middle, creating entirely different social expectations. The most important data point here comes from our notions of bravery and cowardice.

Bravery is a virtue both men and women can have: fearlessness in the face of danger. Yet cowardice is a vice reserved uniquely for men: women can indulge in it as much as they like, with no social repercussions.

Today, you can see this split clearly in the discussion around refugees. While women are said to "flee from danger", men are accused of "leaving people behind". The presence of children is usually said to make the difference, but the crucial point here is of course whom the children are assumed to be safer with. If you think that's the mother, then you are tacitly admitting that you believe she is more likely to—and perhaps more deserving of—receiving unconditional aid and shelter, even in a war zone.

Furthermore, the archetype of a mean girl is someone who uses an authority figure to do her dirty work. This ought to register as cowardly, but simply doesn't. Despite half a century of organized gender study, I know of no feminist who has seriously endeavored for this patriarchal social construct to be dismantled. Indeed, women's groups shaming men for cowardice in a time of war is a historical fact.

In hiking terms, it means that those who have learned to navigate dangerous terrain out of necessity are oddly assumed to be unreasonably privileged, while those who instinctively expect the presence of ropes and steps are said to be disadvantaged. This is entirely backwards to me, but it also seems obvious there is no convincing people otherwise. All you can do is realize that there are some who persistently demand you help them up, but who will never extend the same courtesy in return. They do so without ever feeling any shame about it, so you must draw your lessons accordingly.

* * *

This is a far less agreeable and happy-go-lucky interpretation of the hiker's dilemma, and one I doubt typical virtue peddlers will be comfortable with.

The original underlying sentiment was that social concerns and group norms always override meritocracy. That there is no reasonable view otherwise. But social issues are themselves difficult hurdles to navigate and path around, almost always subjective and based entirely on the framing. In doing so, proponents are merely striving for a meritocracy based on a different scoring system, one where they come out on top and ahead, far from risk and danger. It's cowardly not to admit it. And if it threatens to wash away all that was built, then it's imperative to oppose it.

(If you think this is a call for war, you are not paying attention.)

React - The Missing Parts

2022-02-05T00:00:00+01:00

Question the rules for fun and profit

One of the nice things about having your own lean copy of a popular library's patterns is that you can experiment with all sorts of changes.

In my case, I have a React-clone, Live, which includes all the familiar basics: props, state and hooks. The semantics are all the same. The premise is simple: after noticing that React is shaped like an incremental, resumable effect system, I wanted to see if I could use the same patterns for non-UI application code too.

Thus, my version leaves out the most essential part of React entirely: actually rendering to HTML. There is no React-DOM as such, and nothing external is produced by the run-time. Live Components mainly serve to expand into either other Live Components, or nothing. This might sound useless, but it turns out it's not.

I should emphasize though, I am not talking about the React useEffect hook. The Effect-analog in React are the Components themselves.

Along the way, I've come up with some bespoke additions and tweaks to the React API, with some new ergonomics. Together, these form a possible picture of React: The Missing Parts that is fun to talk about. It's also a trip to a parallel universe where the React team made different decisions, subject to far less legacy constraints.

On the menu:

No-hooks and Early Return
Component Morphing
useMemo vs useEffect + setState
and some wild Yeet-Reduce results

Break the Rules

One of the core features of contemporary React is that it has rules. Many are checked by linters and validated at run-time (in dev mode). You're not allowed to break them. Don't mutate that state. Don't skip this dependency.

Mainly they are there to protect developers from old bad habits: each rule represents an entire class of UI bugs. These are easy to create, difficult to debug and even harder to fix. Teaching new React devs to stick to them can be hard, as they don't yet realize all the edge cases users will expect to work. Like for example, that external changes should be visible immediately, just like local changes.

Other rules are inherent limitations in how the React run-time works, which simply cannot be avoided. But some are not.

At its core, React captures an essential insight about incremental, resumable code: that ordinary arrays don't fit into such a model at all. If you have an array of some objects [A, B, C, D], which changes to an array [B*, A, E, D*, C], then it takes a slow deep diff to figure out that 4 elements were moved around, 2 of which were changed, and only 1 was added. If each element has a unique key and is immutable however, it's pretty trivial and fast.

Hence, when working incrementally, you pretty much always want to work with key/value maps, or some equivalent, not plain arrays.

Once you understand this, you can also understand why React hooks work the way they do. Hooks are simple, concise function calls that do one thing. They are local to an individual component, which acts as their scope.

Hooks can have a state, which is associated anonymously with each hook. When each hook is first called, its initial state is added to a list: [A, B, C, ...]. Later, when the UI needs to re-render, the previous state is retrieved in the same order. So you need to make the exact same calls each time, otherwise they would get the wrong state. This is why you can't call hooks from within if or for. You also can't decide to return early in the middle of a bunch of hook calls. Hooks must be called unconditionally.

If you do need to call a hook conditionally, or a variable number of times, you need to wrap it in a sub-component. Each such component instance is then assigned a key and mounted separately. This allows the state to be matched up, as separate nodes in the UI tree. The downside is that now it's a lot harder to pass data back up to the original parent scope. This is all React 101.

if (foo) {
  const value = useMemo(..);
  // ...
}
else {
  useNoMemo(..);
}

But there's an alternative. What if, in addition to hooks like useContext and useMemo, you had a useNoContext and useNoMemo?

When you call useNoMemo, the run-time can simply skip ahead by 1 hook. Graphics programmers will recognize this as shader-like control flow, where inactive branches are explicitly kept idle. While somewhat cumbersome to write, this does give you the affordance to turn hooks on or off with if statements.

However, a useNo... hook is not actually a no-op in all cases. It will have to run clean-up for the previous not-no-hook, and throw away its previous state. This is necessary to dispose of associated resources and event handlers. So you're effectively unmounting that hook.

This means this pattern can also enable early return: this should automatically run a no-hook for any hook that wasn't called this time. This just requires keeping track of the hook type as part of the state array.

Is this actually useful in practice? Well, early return and useNoMemo definitely is. It can mean you don't have to deal with null and if in awkward places, or split things out into subcomponents. On the other hand, I still haven't found a direct use for useNoState.

useNoContext is useful for the case where you wish to conditionally not depend on a context even if it has been provided upstream. This can selectively avoid an unnecessary dependency on a rapidly changing context.

The no-hook pattern can also apply to custom hooks: you can write a useNoFoo for a useFoo you made, which calls the built-in no-hooks. This is actually where my main interest lies: putting an if around one useState seems like an anti-pattern, but making entire custom hooks optional seems potentially useful. As an example, consider that Apollo's query and subscription hooks come with a dedicated skip option, which does the same thing. Early return is a bad idea for custom hooks however, because you could only use such a hook once per component, as the last call.

You can however imagine a work-around. If the run-time had a way to push and pop a new state array in place, starting from 0 anew, then you could safely run a custom hook with early return. Let's imagine such a useYolo:

// A hook
const useEarlyReturnHook = (...) => {
  useMemo(...);
  if (condition) return false;
  useMemo(...);
  return true;
}

{
  // Inside a component
  const value1 = useYolo(() => useEarlyReturnHook(...));
  const value2 = useYolo(() => useEarlyReturnHook(...));
}

But that's not all. If you call our hotline now, you also get hooks in for-loops for free. Because a for-loop is like a repeating function with a conditional early return. So just wrap the entire for loop in useYolo, right?

Except, this is a really bad idea in most cases. If it's looping over data, it will implicitly have the same [A, B, C, D] to [B*, A, E, D*, C] matching problem: every hook will have to refresh its state and throw away caches, because all the input data has seemingly changed completely, when viewed one element at a time.

So while I did actually make a working useYolo, I ended up removing it again, because it was more footgun than feature. Instead, I tried a few other things.

Morph

One iron rule in React is this: if you render one type of component in place of another one, then the existing component will be unmounted and thrown away entirely. This is required because each component could do entirely different things.

 renders:

 renders:

Logically this also includes any rendered children. If and both render a , and you swap out an with a at run-time, then that will not be re-used. All associated state will be thrown away, and any children too. If component has no state at all, and the same props as before, this is 100% redundant. This applies to all flavors of "styled components" for example, which are just passive, decorated HTML elements.

One case where this is important is in page routing for apps. In this case, you have a , a , and so on, which likely look very similar. They both contain some kind of and they likely share most of their navigation and sidebars. But because and are different components, the will not be reused. When you change pages, everything inside will be rebuilt, which is pretty inefficient. The solution is to lift the out somehow, which tends to make your route definitions very ugly, because you have to inline everything.

It's enough of a problem that React Router has redesigned its API for the 6th time, with an explicit solution. Now a can contain an , which is an explicit slot to be filled with dynamic page contents. You can also nest layouts and route definitions more easily, letting the Router do the work of wrapping.

It's useful, but to me, this feels kinda backwards. An serves the same purpose as an ordinary React children prop. This pattern is reinventing something that already exists, just to enable different semantics. And there is only one outlet per route. There is a simpler alternative: what if React could just keep all the children when it remounts a parent?

In Live, this is available on an opt-in basis, via a built-in wrapper. Any component directly inside will morph in-place when its type changes. This means its children can also be updated in place, as long as their type hasn't changed in turn. Or unless they are also wrapped in .

So from the point of view of the component being morphed, it's a full unmount/remount... but from the point of view of the matching children, nothing is changing at all.

Implementing this was relatively easy, again a benefit of no-hooks and built-in early return which makes it easy to reset state. Dealing with contexts was also easy, because they only change at context providers. So it's always safe to copy context between two ordinary sibling nodes.

You could wonder if it makes sense for morphing to be the default behavior in React, instead of the current strict remount. After all, it shouldn't ever break anything, if all the components are written "properly" in a pure and functional way. But the same goes for memo(…)... and that one is still opt-in?

Making opt-in also makes a lot of sense. It means the default is to err on the side of clean-slate reproducibility over performance, unless there is a reason for it. Otherwise, all child components would retain all their state by default (if compatible), which you definitely don't want in all cases.

For a , I do think it should automatically morph each routed page instead of remounting. That's the entire point of it: to take a family of very similar page components, and merge them into a single cohesive experience. With this one minor addition to the run-time, large parts of the app tree can avoid re-rendering, when they used to before.

You could however argue the API for this should not be a run-time , but rather a static morph(…) which wraps a Component, similar to memo(…). This would mean that is up to each Component to decide whether it is morphable, as opposed to the parent that renders it. But the result of a static morph(…) would just be to always render a run-time with the original component inside, so I don't think it matters that much. You can make a static morph(…) yourself in user-land.

Stateless

One thing React is pretty insistent about is that rendering should be a pure function. State should not be mutated during a render. The only exception is the initial render, where e.g. useState accepts a value to be set immediately:

const initialState = props.value.toString(); const [state, setState] = useState(initialState);

Once mounted, any argument to useState is always ignored. If a component wishes to mutate this state later, e.g. because props.value has changed, it must schedule a useEffect or useLayoutEffect afterwards:

useEffect(() => { if (...) setState(props.value.toString()); }, [props.value])

This seems simple enough, and stateless rendering can offer a few benefits, like the ability to defer effects, to render components concurrently, or to abort a render in case promises have not resolved yet.

In practice it's not quite so rosy. For one thing, this is also where widgets have to deal with parsing/formatting, validation, reconciling original and edited data, and so on. It's so much less obvious than the initialState pattern, that it's a typical novice mistake with React to not cover this case at all. Devs will build components that can only be changed from the inside, not the outside, and this causes various bugs later. You will be forced to use key as a workaround, to do the opposite of a : to remount a component even if its type hasn't changed.

With the introduction of the hooks API, React dropped any official notion of "update state for new props", as if the concept was not pure and React-y enough. You have to roll your own. But the consequence is that many people write components that don't behave like "proper" React components at all.

If the state is always a pure function of a prop value, you're supposed to use a useMemo instead. This will always run immediately during each render, unlike useEffect. But a useMemo can't depend on its own previous output, and it can't change other state (officially), so it requires a very different way of thinking about derived logic.

From experience, I know this is one of the hardest things to teach. Junior React devs reach for useEffect + setState constantly, as if those are the only hooks in existence. Then they often complain that it's just a more awkward way to make method calls. Their mental model of their app is still a series of unique state transitions, not declarative state values: "if action A then trigger change B" instead of "if state A then result B".

Still, sometimes useMemo just doesn't cut it, and you do need useEffect + setState. Then, if a bunch of nested components each do this, this creates a new problem. Consider this artificial component:

const Test = ({value = 0, depth = 5}) => { const [state, setState] = useState(value); useEffect(() => { setState(value); }, [value]) if (depth > 1) return ; return null; }

expands into:

Each will copy the value it's given into its state, and then pass it on. Let's pretend this is a real use case where state is actually meaningfully different. If a value prop changes, then the useEffect will change the state to match.

The problem is, if you change the value at the top, then it will not re-render 5 instances of Test, but 20 = 5 + 5 + 4 + 3 + 2 + 1.

Not only is this an N² progression in terms of tree depth, but there is an entire redundant re-render right at the start, whose only purpose is to schedule one effect at the very top.

That's because each useEffect only triggers after all rendering has completed. So each copy of Test has to wait for the previous one's effect to be scheduled and run before it can notice any change of its own. In the mean time it continues to re-render itself with the old state. Switching to the short-circuited useLayoutEffect doesn't change this.

In React, one way to avoid this is to wrap the entire component in memo(…). Even then that will still cause 10 = 5x2 re-renders, not 5: one to schedule the effect or update, and another to render its result.

Worse is, if Test passes on a mix of props and state to children, that means props will be updated immediately, but state won't. After each useEffect, there will be a different mix of new and old values being passed down. Components might act weird, and memo() will fail to cache until the tree has fully converged. Any hooks downstream that depend on such a mix will also re-roll their state multiple times.

This isn't just a rare edge case, it can happen even if you have only one layer of useEffect + setState. It will render things nobody asked for. It forces you to make your components fully robust against any possible intermediate state, which is a non-trivial ask.

To me this is an argument that useEffect + setState is a poor solution for having state change in response to props. It looks deceptively simple, but it has poor ergonomics and can cause minor re-rendering catastrophes. Even if you can't visually see it, it can still cause knock-on effects and slowdown. Lifting state up and making components fully controlled can address this in some cases, but this isn't a panacea.

Unintuitively, and buried in the docs, you can call a component's own setState(...) during its own render—but only if it's wrapped in an if to avoid an infinite loop. You also have to manually track the previous value in another useState and forego the convenient ergonomics of [...dependencies]. This will discard the returned result and immediately re-render the same component, without rendering children or updating the DOM. But there is still a double render for each affected component.

The entire point of something like React is to batch updates across the tree into a single, cohesive top-down data flow, with no redundant re-rendering cycles. Data ought to be calculated at the right place and the right time during a render, emulating the feeling of immediate mode UI.

Possibly a built-in useStateEffect hook could address this, but it requires that all such state is 100% immutable.

People already pass mutable objects down, via refs or just plain props, so I don't think "concurrent React" is as great an idea in practice as it sounds. There is a lot to be said for a reliable, single pass, top-down sync re-render. It doesn't need to be async and time-sliced if it's actually fast enough and memoized properly. If you want concurrency, manual fences will be required in practice. Pretending otherwise is naive.

My home-grown solution to this issue is a synchronous useResource instead, which is a useMemo with a useEffect-like auto-disposal ability. It runs immediately like useMemo, but can run a previous disposal function just-in-time:

const thing = useResource((dispose) => { const thing = makeThing(...); dispose(() => disposeThing(thing)); return thing; }, [...dependencies]);

This is particularly great when you need to set up a chain of resources that all have to have disposal afterwards. It's ideal for dealing with fussy derived objects during a render. Doing this with useEffect would create a forest of nullables, and introduce re-render lag.

Unlike all the previous ideas, you can replicate this just fine in vanilla React, as a perfect example of "cheating with refs":

const useResource = (callback, dependencies) => { // Ref holds the pending disposal function const disposeRef = useRef(null); const value = useMemo(() => { // Clean up prior resource if (disposeRef.current) disposeRef.current(); disposeRef.current = null; // Provide a callback to capture a new disposal function const dispose = (f) => disposeRef.current = f; // Invoke original callback return callback(dispose); }, dependencies); // Dispose on unmount // Note the double =>, this is for disposal only. useEffect(() => () => { if (disposeRef.current) disposeRef.current(); }, []); return value; }

It's worth mentioning that useResource is so handy that Live still has no useEffect at all. I haven't needed it yet, and I continue to not need it. With some minor caveats and asterisks, useEffect is just useResource + setTimeout. It's a good reminder that useEffect exists because of having to wait for DOM changes. Without a DOM, there's no reason to wait.

That said, the notion of waiting until things have finished rendering is still eminently useful. For that, I have something else.

Yeet

Consider the following UI requirement: you want an expandable tree view, where you can also drag-and-drop items between any two levels.

At first this seems like a textbook use case for React, with its tree-shaped rendering. Only if you try to build it, you discover it isn't. This is somewhat embarrassing for React aficionados, because as the dated screenshot hints, it's not like this is a particularly novel concept.

In order to render the tree, you have to enumerate each folder recursively. Ideally you do this in a pure and stateless way, i.e. via simple rendering of child components.

Each component only needs to know how to render its immediate children. This allows us to e.g. only iterate over the Folders that are actually open. You can also lazy load the contents, if the whole tree is huge.

But in order to do the drag-and-drop, you need to completely flatten what's actually visible. You need to know the position of every item in this list, counting down from the tree root. Each depends on the contents of all the previous items, including whether their state is open or closed. This can only be determined after all the visible child elements have recursively been loaded and rendered, which happens long after is done.

This is a scenario where the neat one-way data flow of React falls apart. React only allows for data to flow from parent to child during a render, not in the other direction.

If you wish to have respond when a or renders or changes, has to set up a callback so that it can re-render itself from the top down. You can set it up so it receives data gathered during the previous render from individual Items:

<––––. | ––˙

But, if you do this all "correctly", this will also re-render the originating . This will loop infinitely unless you ensure it converges to an inert new state.

If you think about it, it doesn't make much sense to re-run all of from scratch just to respond to a child it produced. The more appropriate place to do so would be at :

–––. | | | | | | <–––––˙

If Tree had not just a head, but also a tail, then that would be where it would resume. It would avoid the infinite loop by definition, and keep the data flow one-way.

If you squint and pretend this is a stack trace, then this is just a long-range yield statement... or a throw statement for non-exceptions... aka a yeet. Given that every can yeet independently, you would then gather all these values together, e.g. using a map-reduce. This produces a single set of values at , which can work on the lot. This set can be maintained incrementally as well, by holding on to intermediate reductions. This is yeet-reduce.

Also, there is no reason why can't render new components of its own, which are then reduced again, and so on, something like:

| ˙–>

If you put on your async/await goggles, then looks a lot like a rewindable/resumable await Promise.all, given that the data sources can re-render independently. Yeet-reduce allows you to reverse one-way data flow in local parts of your tree, flipping it from child to parent, without creating weird loops or conflicts. This while remaining fully incremental and react-y.

This may seem like an edge case if you think in terms of literal UI widgets. But it's a general pattern for using a reduction over one resumable tree to produce a new resumable tree, each having keys, and each being able to be mapped onto the next. Obviously it would be even better async and multi-threaded, but even single threaded it works like a treat.

Having it built into the run-time is a huge plus, which allows all the reductions to happen invisibly in the background, clearing out caches just-in-time. But today, you can emulate this in React with a structure like this:

... ...

Here, the YeetContext is assumed to provide some callback which is used to pass back values up the tree. This causes to re-render. It will then pass the collected values to Resume. Meanwhile Memo(Initial) remains inert, because it's memoized and its props don't change, avoiding an infinite re-rendering cycle.

This is mostly the same as what Live does, except that in Live the Memo is unnecessary: the run-time has a native concept of (a fiber continuation) and tracks its dependencies independently in the upwards direction as values are yeeted.

Such a YeetContext.Provider is really the opposite, a YeetContext.Consumer. This is a concept that also exists natively in Live: it's a built-in component that can gather values from anywhere downstream in the tree, exactly like a Context in reverse. The associated useConsumer hook consumes a value instead of providing it.

The only difference between Yeet-Reduce and a Consumer data flow, is that a Consumer explicitly skips over all the nodes in between: it doesn't map-reduce upwards along the tree, it just stuffs collected values directly into one flat set. So if the reverse of a Consumer is a Context, then the reverse of Yeet-Reduce is Prop Drilling. Although unlike Prop Drilling, Yeet-Reduce requires no boilerplate, it just happens automatically, by rendering a built-in inside a (array), (struct-of-arrays) or (any).

As an example of such a chain of expand-reduce-continuations, I built a basic HTML-like layout system, with a typical Absolute, Stack (Block) and Flex position model:

As I hover over components, the blue highlight shows which components were rendered by whom, while purple shows indirect, long-range data dependencies. In this video I'm not triggering any Live changes or re-renders. The inspector is external and implemented using vanilla React.

These particular layout components don't render anything themselves, rather, they yield lambdas that can perform layout. Once laid out, the result is applied to styled shapes. These styled shapes are themselves then aggregated together, so they can be drawn using a single draw call.

As I've demonstrated before, when you map-reduce lambdas, what you're really assembling incrementally is chunks of executable program code, which you can evaluate in an appropriate tree order to do all sorts of useful things. This includes generating GPU shaders on the fly: the bookkeeping needed to do so is mostly automatic, accomplished with hook dependencies, and by map-reducing the result over incremental sub-trees or sub-maps.

The actual shader linker itself is still plain old vanilla code: it isn't worth the overhead to try and apply these patterns at such a granular level. But for interactive code, which needs to respond to highly specific changes, it seems like a very good idea.

* * *

Most of all, I'm just having a lot of fun with this architecture. You may need a few years of labor in the front-end mines before you truly grok what the benefit is of structuring an entire application this way. It's a pretty simple value proposition tho: what if imgui, but limited to neither im nor gui?

The other day I was playing around with winit and wgpu in Rust, and I was struck by how weird it seemed that the code for setting up the window and device was entirely different whether I was initializing it, or responding to a resize. In my use.gpu prototype, the second type of code simply does not exist, except in the one place where it has to interface with a traditional .

That is to say, I hope it's not just the React team that is taking notes, but the Unreal and Unity teams too: this post isn't really about JavaScript or TypeScript… it's about how you can make the CPU side run and execute similar to the GPU side, while retaining as much of the data as possible every time.

The CPU/GPU gap is just a client/server gap of a different nature. On the web, we learned years ago that having both sides be isomorphic can bring entirely unexpected benefits, which nevertheless seem absurdly obvious in hindsight.

On Progress

2022-01-19T00:00:00+01:00

The known unknown knowns we lost

When people think of George Orwell's 1984, what usually comes to mind is the orwellianism: a society in the grip of a dictatorial, oppressive regime which rewrote history daily as if it was a casual matter.

Not me though. For whatever reason, since reading it as a teenager, what has stuck was something different and more specific. Namely that as time went on, the quality of all goods, services and tools that people relied on got unquestionably worse. In the story, this happened slowly enough that many people didn't notice. Even if they did, there was little they could do about it, because this degradation happened across the board, and the population had no choice but to settle for the only available options.

I think about this a lot, because these days, I see it everywhere around me. What's more, if you talk and listen to seniors, you will realize they see even more of it, and it's not just nostalgia. Do you know what you don't know?

Chickens roost and sleep in trees

A Chicken in Every Pot

From before I was born, my parents have grown their own vegetables. We also had chickens to provide us with more eggs than we usually knew what to do with. The first dish I ever cooked was an omelette, and in our family, Friday was Egg Day, where everyone would fry their own, any way they liked.

As a result, I remain very picky about the eggs I buy. A fresh egg from a truly free range chicken has an unmistakeable quality: the yolk is rich and deep orange. Nothing like factory-farmed cage eggs, whose yolks are bright yellow, flavorless and quite frankly, unappetizing. Another thing that stands out is how long our eggs would keep in the fridge. Aside from the freshness, this is because an egg naturally has a coating to protect it, when it comes out of the chicken. By washing them aggressively, you destroy this coating, increasing spoilage.

The same goes for the chickens themselves. I learned at an early age what it looks like to chop a chicken's head off with a machete. I also learned that chicken is supposed to be a flavorful meat with a distinct taste. The idea that other things would "taste like chicken" seems preposterous from this point of view. Rather, it's that most of the chicken we eat simply does not taste like chicken anymore. Industrial chickens are raised in entirely artificial circumstances, unhealthy and constrained, and this has a noticeable effect on the development and taste of the animal.

Here's another thing. These days when I fry a piece of store-bought meat, even when it's not frozen, the pan usually fills up with a layer of water after a minute. I have to pour it out, so I can properly brown it at high temperature and avoid steaming it. That's because a lot of meat is now bulked up with water, so it weighs more at the point of sale. This is not normal. If the only exposure you have to meat is the kind that comes in a styrofoam tray wrapped in plastic, you are missing out, and not even realizing it.

For vegetables and fruit, there is a similar degradation. Take tomatoes, which naturally bruise easily. In order to make them more suitable for transport, industrial tomatoes have mainly been selected for toughness. This again correlates to more water content. But as a side effect, most tomatoes simply don't taste like proper tomatoes anymore. The flavor that most people now associate with e.g. sun-dried, heirloom tomatoes, is simply what tomatoes used to taste like. Rather than buying them fresh, you are often better off buying canned Italian Roma tomatoes, which didn't suffer quite the same fate. Italians know their tomatoes, even if they are non-native to the country and continent.

For berries, it's the same story. Our yard had several bushes, with blueberries and red berries, and my mom would make jam out of them every year. But on a good day we would just eat them straight from the bush. I can tell you, the ones I buy in the store simply don't taste as good.

There is another angle to this too: preparation. Driven by the desire to serve more customers more quickly, industrial cooks prefer dishes that are easy to assemble and quick to make. But many traditional dishes involve letting stews and sauces simmer for hours at a time in a single pot, developing deep flavors over time. This is simply not compatible with rapid, mass production. It implies that you need to prepare it all ahead of time, in sufficient quantities. When was the last time you ordered something at a chain, and were told they had run out for the day?

Hence these days, growing your own food, raising your own animals, and cooking your own meals is not just a choice about self-sufficiency. It's a choice to favor artisanal methods over mass-scale production, which strongly affects the result. It's a choice to favor varieties for taste rather than what packages, transports and sells easily. To favor methods that are more labor intensive, but which build upon decades, even centuries of experience.

It also echoes a time when the availability of particular foods was incredibly seasonal, and building up preserves for winter was a necessity. People often had to learn to make do with basic, unglamorous ingredients, and they succeeded anyway. Add to this the fact that many countries suffered severe shortages during World War II, which is traceable in the local cuisine, and you end up with a huge amount of accumulated knowledge about food that we're slowly but surely losing.

Life in Plastic

It's difficult now to imagine a world without plastic. The first true plastic, bakelite, was developed in 1907. Since then, chemistry has delivered countless synthetic materials. But it would take over half a century for plastic to become truly common-place. With our oceans now full of floating micro-plastics, affecting the food chain, this seems to have been a dubious choice.

When I look at pictures of households from the 1950s, one thing that stands out to me is the materials. There is far more wood, metal, glass and fabric than there is plastic. These are all heavier materials, but also, tougher. When they did use plastic, the designs often look far bulkier than a modern equivalent. What's also absent is faux-materials: there's no plastic that's been painted glossy to look like metal, or particle board made to look like real wood, or acrylic substituting for real glass.

The problem is simple: when exposed to the UV rays in sunlight, plastic will degrade and discolor. When exposed to strain and tension, tough plastic will crack instead of flex. Hence, when you replace a metal or wooden frame with a plastic one, a product's lifespan will suffer. When it breaks, you can't simply manufacture a replacement using an ordinary tool shop either. Without a 3D printer and highly detailed measurements, you're usually out of luck, because you need one highly specific, molded part, which is typically attached not via explicit screws, but simply held in place via glue or tension. This tension will guarantee that such a part will fail sooner than later.

In fact, I have this exact problem with my freezer. The outside of the door is hooked up to the inside with 4 plastic brackets, each covering a metal piece. The metal is fine. But one plastic piece has already cracked from repeated opening, and probably the temperature shifts haven't helped either. The best thing I could do is glue it back on, because it's practically impossible to obtain the exact replacement I need. Whoever designed this, they did not plan for it to be used more than a few years. For an essential household appliance, this is shameful. And yet it is normal.

Products simply used to have a much longer lifespan. They were built to last and were expected to last. When you bought an appliance, even a small one, it was an investment. Whatever gains were made by producing something that is lighter and easier to transport were undone by the fact that you will now be transporting and disposing of 2 or 3 of them in the same time you used to only need just one.

This is also a difference that you can only notice in the long term. In the short term, people will prefer the cheaper product, even if it's more expensive eventually. Hence, the long-lasting products are pushed out of the market, replaced with imitations that seem more modern and less resource intensive, but which are in fact the exact opposite.

The only way to counter this is if there are sufficient craftsmen and experts around who provide sufficient demand for the "real" thing. If those craftsmen retire without passing on their knowledge, the degradation sets in. Even if the knowledge is passed on, it's worthless if the tools and parts those craftsmen depend on disappear or lose their luster.

This isn't limited to plastic either. Even parts that are made out of metal can be produced in good or bad ways. When cheap alloys replace expensive ones, when tolerances are slowly eroded away down to zero, the result is undeniably inferior. Yet it's difficult to tell without a detailed breakdown of the manufacturing process.

A striking example comes in the form of the Dubai Lamp. These are LED lamps, made specifically for the Dubai market, through an exclusive deal. They're identical in design to the normal ones, except the Dubai Lamp has far more LED filaments: it's designed to be underpowered instead of running close to tolerance. As a result, these lamps last much longer instead of burning out quickly.

Invisible Software

Luckily, the real world still provides plenty of sanity checks. The above is relatively easy to explain, because it can be stated in terms of our primary senses. If food tastes different, if a product feels shoddy and breaks more quickly, it's easy to notice, if you know what to look for.

But one domain where this does not apply at all is software. The reason is simple: software operates so quickly, it's beyond our normal ability to fathom. The primary goal of interactive software is to provide seamless experiences that deliberately hide many layers of complexity. As long as it feels fast enough, it is fast enough, even if it's actually enormously wasteful.

What's more, there's a perverse incentive for software developers here. At a glance, software developers are the most productive when they use the fastest computers: they spend the least amount of time waiting for code to be validated and compiled. In fact, when Apple released the new M1, which was at least 50% faster than the previous generation—sometimes far more—many companies rushed out and bought new laptops for their entire staff, as if it was a no-brainer.

However this has a terrible knock-on effect. If a developer has a machine that's faster than the vast majority of their users, then they will be completely misinformed what the typical experience actually is. They may not even notice performance problems, because a delay is small enough on their machine so as to be unobtrusive. This is made worse by the fact that most developers work in artificial environments, on reduced data sets. They will rarely reach the full complexity of a real world workload, unless they specifically set up tests for that purpose, informed by a detailed understanding of their users' needs.

On a slower machine, in a more complicated scenario, performance will inevitably suffer. For this reason, I make it a point to do all my development on a machine that is several years out of date. It guarantees that if it's fast enough for me, it will be fast enough for everyone. It means I can usually spot problems with my own eyes, instead of needing detailed profiling and analysis to even realize.

This is obvious, yet very few people in our industry do so. They instead prefer to have the latest shiny toys, even if it only provides a temporary illusion of being faster.

Apple Powerbook G4 Titanium (2001)

Dysfunctional Cloud

Where this problem really gets bad is with cloud-based services. The experience you get depends on the speed of your internet connection. Most developers will do their work entirely on their own machine, in a zero-latency environment, which no actual end-user can experience. The way the software is developed prevents everyday problems from being noticed until it's too late, by design.

Only in a highly connected urban environment, with fiber-to-the-door, and very little latency to the data center, will a user experience anything remotely closely to that. In that case, cloud-based software can provide an extremely quick and snappy experience that rivals local software. If not, it's completely different.

There is another huge catch. Implicit in the notion of cloud-based software is that most of the processing happens on the server. This means that if you wish to support twice as many users, you need twice as much infrastructure, to handle twice as many requests. For traditional off-line software, this simply does not apply: every user brings their own computer to the table, and provides their own CPU, memory and storage capacity for what they need. No matter how you structure it, software that can work off-line will always be cheaper to scale to a large user base in the long run.

From this point of view, cloud-based software is a trap in design space. It looks attractive at the start, and it makes it easy to on-board users seamlessly. It also provides ample control to the creator, which can be turned into artificial scarcity, and be monetized. But once it takes off, you are committed to never-ending investments, which grow linearly with the size of your user-base.

This means a cloud-based developer will have a very strong incentive to minimize the amount of resources any individual user can consume, limiting what they can do.

An obvious example is when you compare the experience of online e-mail vs offline e-mail. When using an online email client, you are typically limited to viewing one page of your inbox at a time, showing maybe 50 emails. If you need to find older messages, the primary way of doing so is via search; this search functionality has to be implemented on the server, indexed ahead of time, with little to no customization. There is also a functionality gap between the email itself and the attachments: the latter have to be downloaded and accessed separately.

In an offline email client, you simply have an endless inbox, which you can scroll through at will. You can search it whenever you want, even when not connected. And all the attachments are already there, and can be indexed by the OS' search mechanism. Even a cheap computer these days has ample resources to store and index decades worth of email and files.

Mozilla Thunderbird with integrated RSS

The New News

To illustrate the problems with monetization, you need only look at the average news site. To provide a source of income, they harvest data from their visitors, posting clickbait to attract them. But driven by GDPR and similar privacy laws, they now all have cookie dialogs, which make visiting such a site a miserable experience. As long as you keep rejecting cookies, you will keep having to reject cookies. Once you agree, you can no longer revoke consent. The geniuses who drafted such laws did not anticipate the obvious exception of letting sites set a single, non-identifiable "no" cookie, which would apply in perpetuity. Or likely they did, but it was lobbied out of consideration.

That's not all. In the early days of GDPR, these dialogs used to provide you with an actual choice, even if they did so reluctantly. But nowadays, even that has gone out of the window. Through the ridiculous concept of "legitimate interest", many now require you to explicitly object to fingerprinting and tracking, on a second panel which is buried. Simply clicking "Disagree" is not sufficient, because that button still means you agree to being "legitimately" tracked, for all the same purposes they used to need cookies for, including ad personalization. Fully objecting means manually unselecting half a dozen options with every visit, sometimes more.

The worst part is the excuse used to justify this: that newspapers have to make their money somehow. Yet this is a sham, because to my knowledge, no news site out there turns off the tracking for paying subscribers. You can pay to remove ads, but you can't pay to remove tracking. Why would they, when it's leaving money on the table, and fully legal? The resulting data sets are simply more valuable the more comprehensive they are.

In a different world, most people would do most of their reading via a subscription mechanism such as RSS. A social media client would be an aggregator that builds a feed from a variety of sources. Tracking users' interests would be difficult, because the act of reading is handled by local software.

Of course we can expect that in such a world, news sites would still try to use tracking pixels and other dubious tricks, but, as we have seen with email, remote images can be blocked, and it would at least give users a fighting chance to keep some of their privacy.

* * *

The conclusion seems obvious to me: the same kind of incentives that made industrial food what it is, and industrial manufacturing what it is, have made industrial software worse for everyone. And whereas web browsing used to be exactly that, browsing, it now means an active process where you are being tagged and tracked by software that spans a large chunk of the web, which makes the entire experience unquestionably worse.

The analogy is even stronger, because the news now seems equally bland and tasteless as the tomatoes most of us buy. The lore of RSS and distributed protocols has mostly been lost, and many software developers do not have the skills necessary to make off-line software a success in a connected world. Indeed, very few even bother to try.

It has all happened gradually, just like in 1984, and each individual has little power to stop it, except through their own choices.

Under the guise of progress, we tend to assume that changes are for the better, that the economy drives processes towards greater efficiency and prosperity. Unfortunately it's a fairy tale, a story contradicted by experience and lore, and something we can all feel in our bones.

The solution is to adopt a long-term perspective, to weigh choices over time instead of for convenience, and to think very carefully about what you give up. When you let others control the terms of engagement, don't be surprised if under the cover of polite every-day business, they absolutely screw you over.

Who Doesn't Go Nazi?

2021-12-16T00:00:00+01:00

The essay "Who goes Nazi" (1941) by Dorothy Thompson is a commonly cited classic. Through a fictional dinner party, we are introduced to various characters and personalities. Thompson analyzes whether they would or wouldn't make particularly good nazis.

Supposedly it comes down to this:

"Those who haven't anything in them to tell them what they like and what they don't—whether it is breeding, or happiness, or wisdom, or a code, however old-fashioned or however modern, go Nazi."

I have no doubt she was a keen social observer, that much is clear from the text. But I can't help but notice a big blind spot here.

If you're the kind of person to read and share this essay, satisfied about what it says about you and the world... what does that imply? Maybe that you needed someone else to tell you that? That you prefer to say it in their words rather than your own? Or even that you didn't have your own convictions sorted until then?

In other words, it seems "people who share Who goes Nazi?" is also a category of people who easily go nazi. What's more, in order to become an expert on what makes a particularly good nazi at a proto-nazi party, you have to be the kind of person who attends a lot of those parties in the first place.

So instead of two spidermen pointing at each other, let's ask a simpler question: who doesn't go nazi?

There's a pretty easy answer.

Brass Tacks

I bring this up because it's been impossible to miss lately: many people don't seem capable of recognizing totalitarianism unless it is polite enough to wear a swastika on its arm.

"Who doesn't go nazi" is anyone who is currently speaking up or protesting against lockdowns, curfews, QR-codes, mandatory vaccination, quarantine camps or similar. These are the people who, when a proto-fascist situation starts to develop, don't play along, or stand on the sidelines, but actually refuse to stay quiet. You can be pretty sure those people will not go nazi. It's everyone else you have to worry about.

I've gone to protest twice here already, and each time the crowd has been joyful, enormous and incredibly diverse. Not just left and right, white, brown and black. But upper and lower class. Christian or muslim. These were not anti-vax protests, and no wild riots either. Most people were there to oppose the QR-code, the harsh measures and the incompetent, lying politicians.

I go to represent myself, nobody else, but I've never felt any sense of embarrassment or shame to share a street with these people. On the whole they're fun, friendly and conscientious.

This opposition includes public servants like firemen, and also health care workers. Those last ones in particular have a very understandable grievance. They were heroes just a year ago, but today, they are threatened with job loss unless they get jabbed. In an already understaffed medical system, with an aged population. To make them undergo a medical procedure for which the manufacturer is not liable, and for which the governmental contracts have been kept secret.

A manufacturer paid with public money, in an industry with a proven track record of messing up human lives on enormous scales, and a history of trying to hide it.

The Real COVID Challenge

The reason we have to go along with all this, we are told, is solidarity. The need to look out for each other. Well, I find solidarity nowhere to be seen.

Because in many countries, a minority of people is being actively excluded from society and social life. In some places even cut off from buying groceries, even going outside. There is no limit to how many times they can be harassed and fined for their non-compliance.

At the same time, tons of people, who undoubtedly see themselves as empathetic and sensitive, are going out, acting like nothing's wrong. Some are even proudly saying the government should crack down harder, and make life truly miserable for those dirty vaccine refusers, until they comply.

To these people, I offer you the true COVID challenge. The pro-social, solidary thing to do is obvious: join them. Go out without your QR code, just once, for one afternoon or evening. See what happens.

Learn how it feels to have other citizens turn you away into the winter cold. Experience the drain of going door to door, wondering if the next one will be the one to let you have a simple drink or meal in peace. Maybe bring some QR'd friends along, so you can truly get into the role of being the 5th wheel nobody wants. Force everyone to sit outside with your mere presence. Ask them to buy and order things for you, like you're a teenager again.

Because that's what you want to inflict on other people every single hour of every single day for the rest of their free lives. Simply because they do not feel confident in a new medical treatment. Because let's face it: nobody knows if it's safe long term, if it failed to do what was promised after just 6 months. Why would you still believe anyone who claims otherwise?

And why, oh why, are the pillars of society dead set on shaming and punishing all the folks who weren't gullible enough? Shouldn't they be looking inward? Have they no shame?

Judas

There was recently a remarkable court judgement in the Netherlands. Thierry Baudet, of the Dutch Forum for Democracy, was forced to delete the following 4 tweets, which were judged to be unacceptably offensive (translated from Dutch):

"Deeply touched by the newsletter by @mauricedehond this morning. He's so right: the situation now is comparable to the '30s and '40s. The unvaccinated are the new jews, and the excluders who look away are the new nazi's and NSBers. There, I said it."

"Irony supreme! Ex-concentration camp Buchenwald is appying 2G policy [proof of recovery or vaccination] for an exhibit on... excluding people. How is it POSSIBLE to still not see how history is repeating?"

"Ask yourself: is this the country you want to live in? Where children who are "unvaccinated" can't go the Santa Claus parade? And have to be towelled off outside after swimming lessons? If not: RESIST! Don't participate in this apartheid, this exclusion! #FvD"

"Dear Jewish organizations:
1) The War does not belong to you but to us all.
2) Nobody compared the "holocaust" to the #apartheidspass, it was about the '30s
3) For 50 years, the "left" has done nothing but invoke the War
4) Look around you, what is happening NOW before our eyes!"

When people get outraged over supposedly offensive speech, often the person complaining isn't actually the one being insulted. Rather, they are taking offense on behalf of another party. When words are deemed hurtful, someone has a specific type of person in mind to whom those words are hurtful.

But in this case, Jewish organizations have gotten seriously offended over things some Jews are also saying, and doing so specifically as Jews. So who are these organizations actually representing?

Based on their behavior, it's as if they think nie wieder purely means that the Jewish people should never be persecuted again, as opposed to no group of people, of whatever ethnicity or conviction. That it inherently hurts the prospects of Jews to compare their historical plight to anyone else. It would seem they are taking an ethno-nationalist stance rather than a human rights stance. It ought to be painfully embarrassing for them, and it's not surprising they lash out. That doesn't make them right.

You can observe the same dynamic going on with the public and corona. When people are derisively labelled "anti-vaxxers" and selfish "hyperindividualists", the charge is that they are hurting society by helping spread the virus to the weaker members of society. But the people making the accusations seem to feel safe and confident enough to go out themselves and go party. Even though they can spread it too, and they are the majority of the population. In some places over 90% of adults. Who is being selfish?

The "unclean" are now actually stuck at home in many places, locked out of society. How are they still supposed to be driving anything now? It's absurd.

In fact, it seems to be the politicians and their royal advisors who are the hyperindividualists, deciding policy for millions. They never got consent to do so, and there is clearly no accountability for promises made. In some cases, they were literally never even elected.

* * *

It's all entirely backwards. It's not the unvaccinated who should feel ashamed, it's anyone who didn't speak up when an actual scapegoat underclass was created. When comparisons are judged not by their accuracy and implications, but by the emotional immaturity of anyone who might be listening.

They are now stuck with faith-based scientism, where matters are settled by unquestionable virologists and the PR departments of Pfizer and Moderna. But PR can't fix disasters, it can only pretend they didn't happen.

Know that the minute the tide turns, the loudest will immediately pretend to have believed so all along, to try and save face.

So stop blaming the scapegoats. It's not only stupid, it's inhumane. People like me will be here to remind you of that for the rest of time. Better get used to it.

Frickin' Shaders With Frickin' Laser Beams

2021-12-12T00:00:00+01:00

Hassle free GLSL

I've been working on a new library to compose GLSL shaders. This is part of a side project to come up with a composable and incremental way of driving WebGPU and GPUs in general.

#pragma import { getColor } from 'path/to/color' void main() { gl_FragColor = getColor(); }

The problem seems banal: linking together code in a pretty simple language. In theory this is a textbook computer science problem: parse the code, link the symbols, synthesize new program, done. But in practice it's very different. Explaining why feels itself like an undertaking.

From the inside, GPU programming can seem perfectly sensible. But from the outside, it's impenetrable and ridiculously arcane. It's so bad I made fun of it.

This might seem odd, given the existence of tools like ShaderToy: clearly GPUs are programmable, and there are several shader languages to choose from. Why is this not enough?

Well in fact, being able to render text on a GPU is still enough of a feat that someone has literally made a career out of it. There's a data point.

Another data point is that for almost every major engine out there, adopting it is virtually indistinguishable from forking it. That is to say, if you wish to make all but the most minor changes, you are either stuck at one version, or you have to continuously port your changes to keep up. There is very little shared cross-engine abstraction, even as the underlying native APIs remain stable over years.

When these points are raised, the usual responses are highly technical. GPUs aren't stack machines for instance, so there is no real recursion. This limits what you can do. There are also legacy reasons for certain features. Sometimes, performance and parallelism demands that some things cannot be exposed to software. But I think that's missing the forest for the trees. There's something else going on entirely. Much easier to fix.

Just Out of Reach

Let's take a trivial shader:

vec4 getColor(vec2 xy) { return vec4(xy, 0.0, 1.0); } void main() { vec2 xy = gl_FragIndex * vec2(0.001, 0.001); gl_FragColor = getColor(xy); }

This produces an XY color gradient.

In shaders, the main function doesn't return anything. The input and output are implicit, via global gl_… registers.

Conceptually a shader is just a function that runs for every item in a list (i.e. vertex or pixel), like so:

// On the GPU for (let i = 0; i < n; ++i) { // Run shader for every (i) and store result result[i] = shader(i); }

But the for loop is not in the shader, it's in the hardware, just out of reach. This shouldn't be a problem because it's such simple code: that's the entire idea of a shader, that it's a parallel map().

If you want to pass data into a shader, the specific method depends on the access pattern. If the value is constant for the entire loop, it's a uniform. If the value is mapped 1-to-1 to list elements, it's an attribute.

In GLSL:

// Constant layout (set = 0, binding = 0) uniform UniformType { vec4 color; float size; } UniformName;

// 1-to-1 layout(location = 0) in vec4 color; layout(location = 1) in float size;

Uniforms and attributes have different syntax, and each has its own position system that requires assigning numeric indices. The syntax for attributes is also how you pass data between two connected shader stages.

But all this really comes down to is whether you're passing color or colors[i] to the shader in the implicit for loop:

for (let i = 0; i < n; ++i) { // Run shader for every (i) and store result (uniforms) result[i] = shader(i, color, size); }

for (let i = 0; i < n; ++i) { // Run shader for every (i) and store result (attributes) result[i] = shader(i, colors[i], sizes[i]); }

If you want the shader to be able to access all colors and sizes at once, then this can be done via a buffer:

layout (std430, set = 0, binding = 0) readonly buffer ColorBufferType { vec4 colors[]; } ColorBuffer; layout (std430, set = 0, binding = 1) readonly buffer SizeBufferType { vec4 sizes[]; } SizeBuffer;

You can only have one variable length array per buffer, so here it has to be two buffers and two bindings. Unlike the single uniform block earlier. Otherwise you have to hardcode a MAX_NUMBER_OF_ELEMENTS of some kind.

Attributes and uniforms actually have subtly different type systems for the values, differing just enough to be annoying. The choice of uniform, attribute or buffer also requires 100% different code on the CPU side, both to set it all up, and to use it for a particular call. Their buffers are of a different type, you use them with a different method, and there are different constraints on size and alignment.

Only, it gets worse. Like CPU registers, bindings are a precious commodity on a GPU. But unlike CPU registers, typical tools do not help you whatsover in managing or hiding this. You will be numbering your bind groups all by yourself. Even more, if you have both a vertex and fragment shader, which is extremely normal, then you must produce a single list of bindings for both, across the two different programs.

And even then the above is all an oversimplification.

It's actually pretty crazy. If you want to make a shader of some type (A, B, C, D) => E, then you need to handroll a unique, bespoke definition for each particular A, B, C and D, factoring in a neighboring function that might run. This is based mainly on the access pattern for the underlying data: constant, element-wise or random, which forcibly determines all sorts of other unrelated things.

No other programming environment I know of makes it this difficult to call a plain old function: you have to manually triage and pre-approve the arguments on both the inside and outside, ahead of time. We normally just automate this on both ends, either compile or run-time.

It helps to understand why bindings exist. The idea is that most programs will simply set up a fixed set of calls ahead of time that they need to make, sharing much of their data. If you group them by kind, that means you can execute them in batches without needing to rebind most of the arguments. This is supposed to be highly efficient.

Though in practice, shader permutations do in fact reach high counts, and the original assumption is actually pretty flawed. Even a modicum of ability to modularize the complexity would work wonders here.

The shader from before could just be written to end in a pure function which is exported:

// ... #pragma export vec4 main(vec2 xy) { return getColor(xy * vec2(0.001, 0.001)); }

Using plain old functions and return values is not only simpler, but also lets you compose this module. This main can be called from somewhere else. It can be used by a new function vec2 => vec4 that you could substitute for it.

The crucial insight is that the rigid bureaucracy of shader bindings is just a very complicated calling convention for a function. It overcomplicates even the most basic programs, and throws composability out with the bathwater. The fact that there is a special set of globals for input/output, with a special way to specify 1-to-1 attributes, was a design mistake in the shader language.

It's not actually necessary to group the contents of a shader with the rules about how to apply that shader. You don't want to write shader code that strictly limits how it can be called. You want anyone to be able to call it any way they might possibly like.

So let's fix it.

Reinvent The Wheel

There is a perfectly fine solution for this already.

If you have a function, i.e. a shader, and some data, i.e. arguments, and you want to represent both together in a program... then you make a closure. This is just the same function with some of its variables bound to storage.

For each of the bindings above (uniform, attribute, buffer), we can define a function getColor that accesses it:

vec4 getColor(int index) { // uniform - constant return UniformName.color; }

vec4 getColor(int index) { // attribute - 1 to 1 return color; }

vec4 getColor(int index) { // buffer - random access return ColorBuffer.color[index]; }

Any other shader can define this as a function prototype without a body, e.g.:

vec4 getColor(int index);

You can then link both together. This is super easy when functions just have inputs and outputs. The syntax is trivial.

If it seems like I am stating the obvious here, I can tell you, I've seen a lot of shader code in the wild and virtually nobody takes this route.

The API of such a linker could be:

link : (module: string, links: Record) => string

Given some main shader code, and some named snippets of code, link them together into new code. This generates exactly the right shader to access exactly the right data, without much fuss.

But this isn't a closure, because this still just makes a code string. It doesn't actually include the data itself.

To do that, we need some kind of type T that represents shader modules at run-time. Then you can define a bind operation that accepts and returns the module type T:

bind : (module: T, links: Record) => T

This lets you e.g. express something like:

let dataSource: T = makeSource(buffer); let boundShader: T = bind(shader, {getColor: dataSource});

Here buffer is a GPU buffer, and dataSource is a virtual shader module, created ad-hoc and bound to that buffer. This can be made to work for any type of data source. When the bound shader is linked, it can produce the final manifest of all bindings inside, which can be used to set up and make the call.

That's a lot of handwaving, but believe me, the actual details are incredibly dull. Point is this:

If you get this to work end-to-end, you effectively get shader closures as first-class values in your program. You also end up with the calling convention that shaders probably should have had: the 1-to-1 and 1-to-N nature of data is expressed seamlessly through the normal types of the language you're in: is it an array or not? is it a buffer? Okay, thanks.

In practice you can also deal with array-of-struct to struct-of-arrays transformations of source data, or apply mathbox-like number emitters. Either way, somebody fills a source buffer, and tells a shader closure to read from it. That's it. That's the trick.

Shader closures can even represent things like materials too. Either as getters for properties, or as bound filters that directly work on values. It's just code + data, which can be run on a GPU.

When you combine this with a .glsl module system, and a loader that lets you import .glsl symbols directly into your CPU code, the effect is quite magical. Suddenly the gap between CPU and GPU feels like a tiny crack instead of the canyon it actually is. The problem was always just getting at your own data, which was not actually supposed to be your job. It was supposed to tag along.

Here is for example how I actually bind position, color, size, mask and texture to a simple quad shader, to turn it into an anti-aliased SDF point renderer:

import { getQuadVertex } from '@use-gpu/glsl/instance/vertex/quad.glsl'; import { getMaskedFragment } from '@use-gpu/glsl/mask/masked.glsl'; const vertexBindings = makeShaderBindings(VERTEX_BINDINGS, [ props.positions ?? props.position ?? props.getPosition, props.colors ?? props.color ?? props.getColor, props.sizes ?? props.size ?? props.getSize, ]); const fragmentBindings = makeShaderBindings(FRAGMENT_BINDINGS, [ (mode !== RenderPassMode.Debug) ? props.getMask : null, props.getTexture, ]); const getVertex = bindBundle( getQuadVertex, bindingsToLinks(vertexBindings) ); const getFragment = bindBundle( getMaskedFragment, bindingsToLinks(fragmentBindings) );

getVertex and getFragment are two new shader closures that I can then link to a general purpose main() stub.

I do not need to care one iota about the difference between passing a buffer, a constant, or a whole 'nother chunk of shader, for any of my attributes. The props only have different names so it can typecheck. The API just composes, and will even fill in default values for nulls, just like it should.

GP(GP(GP(GPU)))

What's neat is that you can make access patterns themselves a first-class value, which you can compose.

Consider the shader:

T getValue(int index); int getIndex(int index); T getIndexedValue(int i) { int index = getIndex(i); return getValue(index); }

This represents using an index buffer to read from a value buffer. This is something normally done by the hardware's vertex pipeline. But you can just express it as a shader module.

When you bind it to two data sources getValue and getIndex, you get a closure int => T that works as a new data source.

You can use similar patterns to construct virtual geometry generators, which start from one vertexIndex and produce complex output. No vertex buffers needed. This also lets you do recursive tricks, like using a line shader to make a wireframe of the geometry produced by your line shader. All with vanilla GLSL.

By composing higher-order shader functions, it actually becomes trivial to emulate all sorts of native GPU behavior yourself, without much boilerplate at all. Giving shaders a dead-end main function was simply a mistake. Everything done to work around that since has made it worse. void main() is just where currently one decent type system ends and an awful one begins, nothing more.

In fact, it is tempting to just put all your data into a few giant buffers, and use pointers into that. This already exists and is called "bindless rendering". But this doesn't remove all the boilerplate, it just simplifies it. Now instead of an assortment of native bindings, you mainly use them to pass around ints to buffers or images, and layer your own structs on top somehow.

This is a textbook case of the inner platform effect: when faced with an incomplete or limited API, eventually you will build a copy of it on top, which is more capable. This means the official API is so unproductive that adopting it actually has a negative effect. It would probably be a good idea to redesign it.

In my case, I want to construct and call any shader I want at run-time. Arbitrary composition is the entire point. This implies that when I want to go make a GPU call, I need to generate and link a new program, based on the specific types and access patterns of values being passed in. These may come from other shader closures, generated by remote parts of my app. I need to make sure that any subsequent draws that use that shader have the correct bindings ready to go, with all associated data loaded. Which may itself change. I would like all this to be declarative and reactive.

If you're a graphics dev, this is likely a horrible proposition. Each engine is its own unique snowflake, but they tend to have one thing in common: the only reason that the CPU side and the GPU side are in agreement is because someone explicitly spent lots of time making it so.

This is why getting past drawing a black screen is a rite of passage for GPU devs. It means you finally matched up all the places you needed to repeat yourself in your code, and kept it all working long enough to fix all the other bugs.

The idea of changing a bunch of those places simultaneously, especially at run-time, without missing a spot, is not enticing to most I bet. This is also why many games still require you to go back to the main screen to change certain settings. Only a clean restart is safe.

So let's work with that. If only a clean restart is safe, then the program should always behave exactly as if it had been restarted from scratch. As far as I know, nobody has been crazy enough to try and do all their graphics that way. But you can.

One way of doing that is with a memoized effect system. Mine is somewhere halfway between discount ZIO and discount React. The "effect" part ensures predictable execution, while the "memo" part ensures no redundant re-execution. It takes a while to figure out how to organize a basic WebGPU/Vulkan-like pipeline this way, but you basically just stare at the data dependencies for a very long time and keep untangling. It's just plain old code.

The main result is that changes are tracked only as granularly as needed. It becomes easy to ensure that even when a shader needs to be recompiled, you are still only recompiling 1 shader. You are not throwing away all other associated resources, state or caches, and the app does not need to do much work to integrate the new shader into subsequent calls immediately. That is, if you switch a binding to another of the same type, you can keep using the same shader.

The key thing is that I don't intend to make thousands of draw calls this way either. I just want to make a couple dozen of exactly the draw calls I need, preferably today, not next week. It's a radically different use case from what game engines need, which is what the current industry APIs are really mostly tailored for.

The best part is that the memoization is in no way limited to shaders. In fact, in this architecture, it always knows when it doesn't need to re-render, when nothing could have changed. Code doesn't actually run if that's the case. This is illustrated above by only having the points move around if the camera changes. For interactive graphics outside of games, this is actually a killer feature, yet it's something that's usually solved entirely ad-hoc.

One unanticipated side-effect is that when you add an inspector tool to a memoized effect system, you also get an inspector for every piece of significant state in your entire app.

On the spectrum of retained vs immediate mode, this perfectly hits that React-like sweet spot where it feels like immediate mode 90% of the time, even if it is retaining a lot behind the scenes. I highly recommend it, and it's not even finished yet.

* * *

A while ago I said something about "React VR except with Lisp instead of tears when you look inside". This is starting to feel a lot like that.

In the code, it looks absolutely nothing like any OO-style library I've seen for doing the same, which is a very good sign. It looks sort of similar, except it's as if you removed all code except the constructors from every class, and somehow, everything still keeps on working. It contains a fraction of the bookkeeping, and instead has a bunch of dependencies attached to hooks. There is not a single isDirty flag anywhere, and it's all driven by plain old functions, either Typescript or GLSL.

The effect system allows the run-time to do all the necessary orchestration, while leaving the specifics up to "user space". This does involve version counters on the inside, but only as part of automated change detection. The difference with a dirty flag might seem like splitting hairs, but consider this: you can write a linter for a hook missing a dependency, but you can't write a linter for code missing a dirty flag somewhere. I know which one I want.

Right now this is still just a mediocre rendering demo. But from another perspective, this is a pretty insane simplification. In a handful of reactive components, you can get a proof-of-concept for something like Deck.GL or MapBox, in a fraction of the code it takes those frameworks. Without a bulky library in between that shields you from the actual goodies.

The Coddling of the Professional Mind

2021-10-02T00:00:00+02:00

"The problem isn't that Johnny can't read. The problem isn't even that Johnny can't think. The problem is that Johnny doesn't know what thinking is; he confuses it with feeling."
– Thomas Sowell

I'm not one to miss an important milestone, so let me draw your attention to a shift in norms that's taking place in the Ruby open source community: it's now no longer expected to be tolerant of views that differ.

This ought to be a remarkable change: previously, a common refrain was that "in order to be tolerant, we cannot tolerate intolerance." This was the rationale for excluding certain people, under the guise of inclusivity. Well, that line of reasoning is now on its way out, and intolerance is now openly advocated for, with lots of heart emoji to boot.

The Anatomy of Man - Da Vinci (1513)

Code of Misconduct

Source for this is a series of changes to the Ruby Code of Conduct, which subtly tweak the language. The stated rationale is to "remove abuse enabling language."

There are a few specific shifts to notice here:

Objections no longer have to be based on reasonable concerns.

All that matters is that someone could consider something to be harassing behavior.

Behavior is now mainly unacceptable if it targets protected classes.

Tolerance of opposing views is removed entirely as expected conduct.

Also noticeable is that this is done through multiple small changes, each stacking on top of the next over a few days, as a perfect illustration of "boiling the frog."

This ought to set off alarm bells. If concerns no longer have to be reasonable, then completely unreasonable complaints will have to be taken seriously. If opposing views are no longer welcome, then casting doubt on accusations of abuse is also misconduct. If only protected classes are singled out as worthy of protection, then it creates a grey area of traits which are acceptable to use as weapons to bully people.

It shouldn't take much imagination to see how these changes can actually enable abuse, if you know how emotional blackmail works: it's when an abuser makes other people responsible for managing the abuser's feelings, which are unstable and not grounded in mutual respect and obligation. If Alice's behavior causes Bob to be upset, Bob castigates Alice as an offender. If Bob's behavior causes Alice to be upset, then Alice is making Bob feel unsafe, and it's still Alice's fault, who needs to make amends.

A good example is how the social interaction style of people with autism can be trivially recast as deliberate insensitivity. Cancelled Googler James Damore made exactly this point in The Neurodiversity Case for Free Speech. This is also excellently illustrated in Splain it to Me which highlights how one person's gift of information can almost always be recast as an attempt to embarrass another as ignorant.

For all this to seem sensible, the people involved have to have enormous blinders on, suffering from the phenomenon that Sowell so aptly described: the focus isn't on thinking out a set of effective and consistent rules, but rather on letting the feelings do the driving, letting the most volatile members dominate over everyone else. Quite possibly they themselves have one or more emotional abusers in their lives, who have trained them to see such asymmetry as normal. "Heads I win, tails you lose" is a recipe for gaslighting, after all.

The Ruby community is of course free to decide what constitutes acceptable behavior. But there is little evidence there is widespread support for such a change. On HackerNews, the change in policy was widely criticized. Discussion on the proposals themselves was locked within a day, for being "too heated," despite involving only a handful of people. This moderator action seems itself an example of the new policy, letting feelings dominate over reality: after proposing a controversial change, maintainers plug their ears because they do not wish to hear opposing views, even before they are actually uttered in full.

Marco Dente (ca. 1515-1527)

Harassment Policy

Way back in 2013, something similar happened at the PyCon conference in the notorious DongleGate incident. After overhearing a joke between two men seated in the audience, activist Adria Richards decided to take the offenders' picture and post it on Twitter. She was widely praised in media for doing so, and it resulted in the loss of the jokester's job.

What was crucial to notice, and which many people didn't, was that "harassing photography" was explicitly against the conference's anti-harassment policy. By any reasonable interpretation of the rules, Richards was the harasser, who wielded social media as a weapon for intimidation. She should've been sanctioned and told in no uncertain terms that such behavior was not welcome.

Of course, that did not happen. Citing concerns about women in tech, she appealed exactly to those "protected classes" to justify her behavior. She cast herself in the role of defender of women, while engaging in an unquestionable attack.

It's easy to show that this was not motivated by fairness or equality: had the joke been made by a woman instead, Richards wouldn't have been able to make the same argument. The accusation of sexism seemed to derive from the sexual innuendo in the joke, an assumed male-only trait. Indeed, the only reason it worked was because of her own sexism: she assumed that when one man makes a joke, he is an avatar of oppression by men in the entire industry. She treated him differently because of his sex, so her accusation of sexism was a cover for her own.

Even more ridiculous was that her actual job was "Developer Relations." She was supposedly tasked with improving relations with and between developers, but did the exact opposite, creating a scandal that would resonate for years. What it really showed was that she was volatile and a liability for any company that would hire her in this role.

Somehow, this all went unnoticed. Nobody involved seemed to actually think it through. The entire story ran purely on hurt feelings, narrating the entire experience from one person's subjective point of view. This is now a common thread in many environments that are supposed to be professional: the people in charge have no idea how to keep their own members in check, and allow them to hijack everyone's resources and time for grievances and external drama.

As a rare counter-example, consider crypto-exchange CoinBase. They explicitly went against the grain a year ago, by announcing they were a mission-focused company, who would concentrate their efforts on their actual core competence. Today, things are looking much brighter for them, as the negative response and doom-saying in media turned out to be entirely irrelevant. On the inside, the reaction was mostly positive. The employees that left in anger were eventually replaced, with a group of equally diverse people.

The School of Athens - Raphael (1508)

Professing

Professionalism seems to be a concept that is very poorly understood. In the direct sense, it's a set of policies and strategies that allow people with wildly different interests to come together and get productive work done regardless.

In a world where many people wish to bring "their entire selves to work," this can't happen. If it's more important to keep everyone's feelings in check, and less important to actually deliver results, then there's no room for fixing mistakes. It creates an environment where pointing out problems is considered an unwelcome insensitivity, to which the response is to gang up on the messenger and shoot them for being abusive.

The most common strategy is simply to shame people into silence. If that doesn't work, their objections are censored out of sight, and then reframed as bigotry if anyone asks. The narrative machine will spin up again, using emotionally charged terms such as "harassment" and "sexism."

The idea of "victim blaming" is particularly pernicious here: any time someone invokes it, without knowing all the details, they must have pre-assumed they know who is the victim and who is the offender. This is where the concept of "protected classes" comes into play again.

While it's supposed to mean that we cannot discriminate e.g. on the basis of sex, what it means in practice is that one assumes automatically that men are the offenders and that women are being victimized. Even if it's the other way around. Indeed, such a model is the cornerstone of intersectionality, a social theory which teaches that on every demographic axis, one can identify exclusive categories of oppressors and the oppressed. White oppresses black, straight oppresses gay, cis oppresses trans, and so on.

If you engage such bigoteers in debate, the experience is pretty much like talking to a brick wall. You are not speaking to someone who is interested in being correct, merely in remaining on the right side. This seems to be the axiom from which they start, and a core part of their self-image. If you insist on peeling off the fallacies and mistakes in reasoning, you only invoke more ire. Your line of reasoning is upsetting to them, and therefor, you are a bigot who needs to leave, or be forcefully expelled. In the name of tolerance, for the sake of diversity and inclusion, they flatten the actual complexities of life and become utterly intolerant and exclusionary.

It's no coincidence that these cultural flare ups first came to a head in environments like open source, where results speak the loudest. Or in STEM and video games, where merit reigns supreme. When faced with widespread competence, the incompetent resort to lesser weapons and begin to undermine social norms, to try and mend the gap between their self-image and what they are actually able to do.

* * *

Personally, I'm quite optimistic, because the game is now clearly visible. In their zeal for ideological purity, activists have blown straight past their own end zone. When they tell you they are no longer interested in tolerance, you should believe them. It represents a complete abandonment of the principles that allowed liberal society to grow and flourish.

That means tolerance now again belongs to the adults in the room, who are able to separate fact from fiction, and feelings from actual principled conviction. We can only hope these children finally learn.

In Search of Sophistication

2021-09-11T00:00:00+02:00

Cultural Assimilation, Theory vs Practice

The other day, I read the following, shared 22,000+ times on social media:

"Broken English is a sign of courage and intelligence, and it would be nice if more people remembered that when interacting with immigrants and refugees."

This resonates with me, as I spent 10 years living on the other side of the world. Eventually I lost my accent in English, which took conscious effort and practice. These days I live in a majority French city and neighborhood, as a native Dutch speaker. When I need to call a plumber, I first have to go look up the words for "drainage pipe." When my barber asks me what kind of cut I want, it mostly involves gesturing and "short".

This is why I am baffled by the follow-up, by the same person:

"Thanks to everyone commenting on the use of 'broken' to describe language. You're right. It is problematic. I'll use 'beginner' from now on."

It's not difficult to imagine the pile-on that must've happened for the author to add this note. What is difficult to imagine is that anyone who raised the objection has actually ever thought about it.

Minesweeper

Consider what this situation looks like to an actual foreigner who is learning English and trying to speak it. While being ostensibly lauded for their courage, they are simultaneously shown that the English language is a minefield where an expression as plain as "broken English" is considered a faux pas, enough to warrant a public correction and apology.

To stay in people's good graces, you must speak English not as the dictionary teaches you, but according to the whims and fashions of a highly volatile and easily triggered mass. They effectively demand you speak a particular dialect, one which mostly matches the sensibilities of the wealthier, urban parts of coastal America. This is an incredibly provincial perspective.

The objection relies purely on the perception that "broken" is a word with a negative connotation. It ignores the obvious fact that people who speak a language poorly do so in a broken way: they speak with interruptions, struggling to find words, and will likely say things they don't quite mean. The dialect demands that you pretend this isn't so, by never mentioning it directly.

But in order to recognize the courage and intelligence of someone speaking a foreign language, you must be able to see past such connotations. You must ignore the apparent subtleties of the words, and try to deduce the intended meaning of the message. Therefor, the entire sentiment is self-defeating. It fell on such deaf ears that even the author seemingly missed the point. One must conclude that they don't actually interact with foreigners much, at least not ones who speak broken English.

The sentiment is a good example of what is often called a luxury belief: a conviction that doesn't serve the less fortunate or abled people it claims to support. Often the opposite. It merely helps privileged, upper-class people feel better about themselves, by demonstrating to everyone how sophisticated they are. That is, people who will never interact with immigrants or refugees unless they are already well integrated and wealthy enough.

By labeling it as "beginner English," they effectively demand an affirmation that the way a foreigner speaks is only temporary, that it will get better over time. But I can tell you, this isn't done out of charity. Because I have experienced the transition from speaking like a foreigner to speaking like one of them. People treat you and your ideas differently. In some ways, they cut you less slack. In other ways, it's only then that they finally start to take you seriously.

Let me illustrate this with an example that sophisticates will surely be allergic to. One time, while at a bar, when I still had my accent, I attempted to colloquially use a particular word. That word is "nigga." With an "a" at the end. In response, there was a proverbial record scratch, and my companions patiently and carefully explained to me that that was a word that polite people do not use.

No shit, Sherlock. You live on a continent that exports metric tons of gangsta rap. We can all hear and see it. It's really not difficult to understand the particular rules. Bitch, did I stutter?

Even though I had plenty of awareness of the linguistic sensitivities they were beholden to, in that moment, they treated me like an idiot, while playing the role of a more sophisticated adult. They saw themselves as empathetic and concerned, but actually demonstrated they didn't take me fully seriously. Not like one of them at all.

If you want people's unconditional respect, here's what did work for me: you go toe-to-toe with someone's alcoholic wine aunt at a party, as she tries to degrade you and your friend, who is the host. You effortlessly spit back fire in her own tongue and get the crowd on your side. Then you casually let them know you're not even one of them, not one bit. Jawdrops guaranteed.

This is what peak assimilation actually looks like.

The Ethnic Aisle

In a similar vein, consider the following, from NYT Food:

"Why do American grocery stores still have an ethnic aisle?

The writer laments the existence of segregated foods in stores, and questions their utility. "Ethnic food" is a meaningless term, we are told, because everyone has an ethnicity. Such aisles even personify a legacy of white supremacy and colonialism. They are an anachronism which must be dismantled and eliminated wholesale, though it "may not be easy or even all that popular."

We do get other perspectives: shop owners simply put products where their customers are most likely to go look for them. Small brands tend to receive obscure placement, while larger brands get mixed in with the other foods, which is just how business goes. The ethnic aisle can also signal that the products are the undiluted original, rather than a version adapted to local palates. Some native shoppers explicitly go there to discover new ingredients or flavors, and find it convenient.

More so, the point about colonialism seems to be entirely undercut by the mention of "American aisles" in other countries, containing e.g. peanut butter, BBQ sauce and boxed cake mix. It cannot be colonialism on "our" part both when "we" import "their" products, as well as when "they" import "ours". That's just called trade.

Along the way, the article namedrops the exotic ingredients and foreign brands that apparently should just be mixed in with the rest: cassava flour, pomegranate molasses, dal makhani, jollof rice seasoning, and so on. We are introduced to a whole cast of business owners "of color," with foreign-sounding names. We are told about the "desire for more nuanced storytelling," including two sisters who bypassed stores entirely by selling online, while mocking ethnic aisles on TikTok. Which we all know is the most nuanced of places.

I find the whole thing preposterous. In order to even consider the premise, you already have to live in an incredibly diverse, cosmopolitan city. You need to have convenient access to products imported from around the world. This is an enormous luxury, enabled by global peace and prosperity, as well as long-haul and just-in-time logistics. There, you can open an app on your phone and have top-notch world cuisine delivered to your doorstep in half an hour.

For comparison, my parents are in their 70s and they first ate spaghetti as teenagers. Also, most people here still have no clue what to do with fish sauce other than throw it away as soon as possible, lest you spill any. This is fine. The expectation that every cuisine is equally commoditized in your local corner store is a huge sign of privilege, which reveals how provincial the premise truly is. It ignores that there are wide ranging differences between countries in what is standard in a grocery store, and what people know how to make at home.

Even chips flavors can differ wildly from country to country, from the very same multinational brands. Did you know paprika chips are the most common thing in some places, and not a hipster food?

Crucially, in a different time, you could come up with the same complaints. In the past it would be about foods we now consider ordinary. In the future it would be about things we've never even heard of. While the story is presented as a current issue for the current times, there is nothing to actually support this.

To me, this ignorance is a feature, not a bug. The point of the article is apparently to waffle aimlessly while namedropping a lot of things the reader likely hasn't heard of. The main selling point is novelty, which paints the author and their audience as being particularly in-the-know. It lets them feel they are sophisticated because of the foods they cook and eat, as well as the people they know and the businesses they frequent. If you're not in this loop, you're supposed to feel unsophisticated and behind the times.

It's no coincidence that this is published in the New York Times. New Yorkers have a well-earned reputation for being oblivious about life outside their bubble: the city offers the sense that you can have access to anything, but its attention is almost always turned inwards. It's not hard to imagine why, given the astronomical cost of living: surely it must be worth it! And yes, I have in fact spent a fair amount of time there, working. It couldn't just be that life elsewhere is cheaper, safer, cleaner and friendlier. That you can reach an airport in less than 2 hours during rush hour. On a comfortable, modern train. Which doesn't look and smell like an ashtray that hasn't been emptied out since 1975.

But I digress.

"Ethnic aisles are meaningless because everyone has an ethnicity" is revealed to be a meaningless thought. It smacks headfirst into the reality of the food business, which is a lesson the article seems determined not to learn. When "diversity" turns out to mean that people are actually diverse, have different needs and wants, and don't all share the same point of view, they just think diversity is wrong, or at least, outmoded, a "necessary evil." Even if they have no real basis of comparison.

Negative Progress

I think both stories capture an underlying social affliction, which is about progress and progressivism.

The basic premise of progressivism is seemingly one of optimism: we aim to make the future better than today. But the way it often works is by painting the present as fundamentally flawed, and the past as irredeemable. The purpose of adopting progressive beliefs is then to escape these flaws yourself, at least temporarily. You make them other people's fault by calling for change, even demanding it.

What is particularly noticeable is that perceived infractions are often in defense of people who aren't actually present at all. The person making the complaint doesn't suffer any particular injury or slight, but others might, and this is enough to condemn in the name of progress. "If an [X] person saw that, they'd be upset, so how dare you?" In the story of "broken English," the original message doesn't actually refer to a specific person or incident. It's just a general thing we are supposed to collectively do. That the follow-up completely contradicts the premise, well, that apparently doesn't matter. In the case of the ethnic aisle, the contradictory evidence is only reluctantly acknowledged, and you get the impression they had hoped to write a very different story.

This too is a provincial belief masquerading as sophistication. It mashes together groups of people as if they all share the exact same beliefs, hang-ups and sensitivities. Even if individuals are all saying different things, there is an assumed archetype that overrules it all, and tells you what people really think and feel, or should feel.

To do this, you have to see entire groups as an "other," as people that are fundamentally less diverse, self-aware and curious than the group you're in. That they need you to stand up for them, that they can't do it themselves. It means that "inclusion" is often not about including other groups, but about dividing your own group, so you can exclude people from it. The "diversity" it seeks reeks of blandness and commodification.

In the short term it's a zero-sum game of mining status out of each other, but in the long run everyone loses, because it lets the most unimaginative, unworldly people set the agenda. The sense of sophistication that comes out of this is imaginary: it relies on imagining fault where there is none, and playing meaningless word games. It's not about what you say, but how you say it, and the rules change constantly. Better keep up.

Usually this is associated with a profound ignorance about the actual past. This too is a status-mining move, only against people who are long gone and can't defend themselves. Given how much harsher life was, with deadly diseases, war and famine regular occurences, our ancestors had to be far smarter, stronger and self-sufficient, just to survive. They weren't less sophisticated, they came up with all the sophisticated things in the first place.

When it comes to the more recent past, you get the impression many people still think 1970 was 30, not 51 years ago. The idea that everyone was irredeemably sexist, racist and homophobic barely X years ago just doesn't hold up. Real friendships and relationships have always been able to transcend larger social matters. Vice versa, the idea that one day, everyone will be completely tolerant flies in the face of evidence and human nature. Especially the people who loudly say how tolerant they are: there are plenty of skeletons in those closets, you can be sure of that.

* * *

There's a Dutch expression that applies here: claiming to have invented hot water. To American readers, I gotta tell you: it really isn't hard to figure out that America is a society stratified by race, or exactly how. I figured that out the first time I visited in 2001. I hadn't even left the airport in Philadelphia when it occurred to me that every janitor I had seen was both black and morbidly obese. Completely unrelated, McDonald's was selling $1 cheeseburgers.

Later in the day, a black security guard had trouble reading an old-timey handwritten European passport. Is cursive racist? Or is American literacy abysmal because of fundamental problems in how school funding is tied to property taxes? You know this isn't a thing elsewhere, right?

In the 20 years since then, nothing substantial has improved on this front. Quite the opposite: many American schools and universities have abandoned their mission of teaching, in favor of pushing a particular worldview on their students, which leaves them ill-equipped to deal with the real world.

Ironically this has created a wave of actual American colonialism, transplanting the ideology of intersectionality onto other Western countries where it doesn't apply. Each country has their own long history of ethnic strife, with entirely different categories. The aristocrats who ruled my ancestors didn't even let them get educated in our own language. That was a right people had to fight for in the late 1960s. You want to tell me which words I should capitalize and which I shouldn't? Take a hike.

Not a year ago, someone trying to receive health care here in Dutch was called racist for it, by a French speaker. It should be obvious the person who did so was 100% projecting. I suspect insecurity: Dutch speakers are commonly multi-lingual, but French speakers are not. When you are surrounded by people who can speak your language, when you don't speak a word of theirs, the moron is you, but the ego likes to say otherwise. So you pretend yours is the sophisticated side.

All it takes to pierce this bubble is to actually put the platitudes and principles to the test. No wonder people are so terrified.

On Variance and Extensibility

2021-08-29T00:00:00+02:00

Making code reusable is not an art, it's a job

Extensibility of software is a weird phenomenon, very poorly understood in the software industry. This might seem strange to say, as you are reading this in a web browser, on an operating system, desktop or mobile. They are by all accounts, quite extensible and built out of reusable, shared components, right?

But all these areas are suffering enormous stagnation. Microsoft threw in the towel on browsers. Mozilla fired its engineers. Operating systems have largely calcified around a decades-old feature set, and are just putting up fortifications. The last big shift here was Apple's version of the mobile web and app store, which ended browser plug-ins like Flash or Java in one stroke.

Most users are now silo'd inside an officially approved feature set. Except for Linux, which is still figuring out how audio should work. To be fair, so is the web. There's WebAssembly on the horizon, but the main thing it will have access to is a creaky DOM and an assortment of poorly conceived I/O APIs.

It sure seems like the plan was to have software work much like interchangeable legos. Only it didn't happen at all, not as far as end-users are concerned. Worse, the HTTP-ificiation of everything has largely killed off the cow paths we used to have. Data sits locked behind proprietary APIs. Interchange doesn't really happen unless there is a business case for it. The default of open has been replaced with a default of closed.

This death of widespread extensibility ought to seem profoundly weird, or at least, ungrappled with.

We used to collect file types like Pokémon. What happened? If you dig into this, you work your way through types, but then things quickly get existential: how can a piece of code do anything useful with data it does not understand? And if two programs can interpret and process the same data the same way, aren't they just the same code written twice?

Most importantly: does this actually tell us anything useful about how to design software?

The Birds of America, John James Audubon (1827)

Well?

Let's start with a simpler question.

If I want a system to be extensible, I want to replace a part with something more specialized, more suitable to my needs. This should happen via the substitution principle: if it looks like a duck, walks like a duck and quacks like a duck, it's a duck, no matter which kind. You can have any number of sub-species of ducks, and they can do things together, including making weird new little ducks.

So, consider:

If I have a valid piece of code that uses the type Animal, I should be able to replace Animal with the subtype Duck, Pig or Cow and still have valid code.

True or False? I suspect your answer will depend on whether you've mainly written in an object-oriented or functional style. It may seem entirely obvious, or not at all.

This analogy by farm is the usual intro to inheritance: Animal is the supertype. When we call .say(), the duck quacks, but the cow moos. The details are abstracted away and encapsulated. Easy. We teach inheritance and interfaces this way to novices, because knowing what sounds your objects make is very important in day-to-day coding.

But, seriously, this obscures a pretty important distinction. Understanding it is crucial to making extensible software. Because the statement is False.

So, the farmer goes to feed the animals:

type GetAnimal = () => Animal; type FeedAnimal = (animal: Animal) => void;

How does substitution apply here? Well, it's fine to get ducks when you were expecting animals. Because anything you can do to an Animal should also work on a Duck. So the function () => Duck can stand-in for an () => Animal.

But what about the actions? If I want to feed the ducks breadcrumbs, I might use a function feedBread which is a Duck => void. But I can't feed that same bread to the cat and I cannot pass feedBread to the farmer who expects an Animal => void. He might try to call it on the wrong Animal.

This means the allowable substitution here is reversed depending on use:

A function that provides a Duck also provides an Animal.

A function that needs an Animal will also accept a Duck.

But it doesn't work in the other direction. It seems pretty obvious when you put it this way. In terms of types:

Any function () => Duck is a valid substitute for () => Animal.

Any function Animal => void is a valid substitute for Duck => void.

It's not about using a type T, it's about whether you are providing it or consuming it. The crucial distinction is whether it appears after or before the =>. This is why you can't always replace Animal with Duck in just any code.

This means that if you have a function of a type T => T, then T appears on both sides of =>, which means neither substitution is allowed. You cannot replace the function with an S => S made out of a subtype or supertype S, not in general. It would either fail on unexpected input, or produce unexpected output.

This shouldn't be remarkable at all among people who code in typed languages. It's only worth noting because intros to OO inheritance don't teach you this, suggesting the answer is True. We use the awkward words covariant and contravariant to describe the two directions, and remembering which is which is hard.

I find this quite strange. How is it people only notice one at first?

let duck: Duck = new Duck(); let animal: Animal = duck;

class Duck extends Animal { method() { // ... } }

Here's one explanation. First, you can think of ordinary values as being returned from an implicit getter () => value. This is your default mental model, even if you never really thought about it.

Second, it's OO's fault. When you override a method in a subclass, you are replacing a function (this: Animal, ...) => with a function (this: Duck, ...) => . According to the rules of variance, this is not allowed, because it's supposed to be the other way around. To call it on an Animal, you must invoke animal.say() via dynamic dispatch, which the language has built-in.

Every non-static method of class T will have this: T as a hidden argument, so this constrains the kinds of substitutions you're allowed to describe using class methods. Because when both kinds of variance collide, you are pinned at one level of abstraction and detail, because there, T must be invariant.

This is very important for understanding extensibility, because the common way to say "neither co- nor contravariant" is actually just "vendor lock-in".

The Mirage of Extensibility

The goal of extensibility is generally threefold:

Read from arbitrary sources of data

Perform arbitrary operations on that data

Write to arbitrary sinks of data

Consider something like ImageMagick or ffmpeg. It operates on a very concrete data type: one or more images (± audio). These can be loaded and saved in a variety of different formats. You can apply arbitrary filters as a processing pipeline, configurable from the command line. These tools are swiss army knives which seem to offer real extensibility.

type Input = () => T; type Process = T => T; type Output = T => void;

Formally, you decode your input into some shared representation T. This forms the glue between your processing blocks. Then it can be sent back to any output to be encoded.

It's crucial here that Process has the same input and output type, as it enables composition of operations like lego. If it was Process instead, you would only be able to chain certain combinations (A → B, B → C, C → D, ...). We want to have a closed, universal system where any valid T produces a new valid T.

Of course you can also define operators like (T, T) => T. This leads to a closed algebra, where every op always works on any two Ts. For the sake of brevity, operators are implied below. In practice, most blocks are also configurable, which means it's an options => T => T.

This seems perfectly extensible, and a textbook model for all sorts of real systems. But is it really? Reality says otherwise, because it's engineering, not science.

Consider a PNG: it's not just an image, it's a series of data blocks which describe an image, color information, physical size, and so on. To faithfully read a PNG and write it out again requires you to understand and process the file at this level. Therefor any composition of a PNGInput with a PNGOutput where T is just pixels is insufficient: it would throw away all the metadata, producing an incomplete file.

Now add in JPEG: same kind of data, very different compression. There are also multiple competing metadata formats (JFIF, EXIF, ...). So reading and writing a JPEG faithfully requires you to understand a whole new data layout, and store multiple kinds of new fields.

This means a swiss-army-knife's T is really some kind of wrapper in practice. It holds both data and metadata. The expectation is that operations on T will preserve that metadata, so it can be reattached to the output. But how do you do that in practice? Only the actual raw image data is compatible between PNG and JPEG, yet you must be able to input and output either.

meta = { png?: {...} jpeg?: {...} }

If you just keep the original metadata in a struct like this, then a Process interested in metadata has to be aware of all the possible image formats that can be read, and try them all. This means it's not really extensible: adding a new format means updating all the affected Process blocks. Otherwise Input and Process don't compose in a useful sense.

meta = { color: {...}, physical: {...}, geo: {...}, }

If you instead harmonize all the metadata into a single, unified schema, then this means new Input and Output blocks are limited to metadata that's already been anticipated. This is definitely not extensible, because you cannot support any new concepts faithfully.

If you rummage around inside ImageMagick you will in fact encounter this. PNG and JPEG's unique flags and quirks are natively supported.

meta = { color: {...}, physical: {...}, geo: {...}, x-png?: {...} x-jpeg?: {...} }

One solution is to do both. You declare a standard schema upfront, with common conventions that can be relied on by anyone. But you also provide the ability to extend it with custom data, so that specific pairs of Input/Process/Output can coordinate. HTTP and e-mail headers are X-able.

meta = { img?: { physical?: {...}, color?: {...}, }, fmt?: { png?: {...}, jfif?: {...}, exif?: {...}, }, },

The problem is that there is no universal reason why something should be standard or not. Standard is the common set of functionality "we" are aware of today. Non-standard is what's unanticipated. This is entirely haphazard. For example, instead of an x-jpeg, it's probably better to define an x-exif because Exif tags are themselves reusable things. But why stop there?

Mistakes stick and best practices change, so the only way to have a contingency plan in place is for it to already exist in the previous version. For example, through judicious use of granular, optional namespaces.

The purpose is to be able to make controlled changes later that won't mess with most people's stuff. Some breakage will still occur. The structure provides a shared convention for anticipated needs, paving the cow paths. Safe extension is the default, but if you do need to restructure, you have to pick a new namespace. Conversion is still an issue, but at least it is clearly legible and interpretable which parts of the schema are being used.

One of the smartest things you can do ahead of time is to not version your entire format as v1 or v2. Rather, remember the version for any namespace you're using, like a manifest. It allows you to define migration not as a transaction on an entire database or dataset at once, but rather as an idempotent process that can be interrupted and resumed. It also provides an opportunity to define a reverse migration that is practically reusable by other people.

This is how you do it if you plan ahead. So naturally this is not how most people do it.

N+1 Standards

X- fields and headers are the norm and have a habit of becoming defacto standards. When they do, you find it's too late to clean it up into a new standard. People try anyway, like with X-Forwarded-For vs Forwarded. Or -webkit-transform vs transform. New software must continue to accept old input. It must also produce output compatible with old software. This means old software never needs to be updated, which means new software can never ditch its legacy code.

Let's look at this story through a typed lens.

What happens is, someone turns an Animal => Animal into a Duck => Duck without telling anyone else, by adding an X- field. This is fine, because Animal ignores unknown metadata, and X- fields default to none. Hence every Animal really is a valid Duck, even though Duck specializes Animal.

Slowly more people replace their Animal => Animal type with Duck => Duck. Which means ducks are becoming the new defacto Animal. But then someone decides it needed to be a Chicken => Chicken instead, and that chickens are the new Animal. Not everyone is on-board with that.

So you need to continue to support the old Duck and the new Chicken on the input side. You also need to output something that passes as both Duck and Chicken, that is, a ChickenDuck. Your signature becomes:

(Duck | Chicken | ChickenDuck) => ChickenDuck

This is not what you wanted at all, because it always lays two eggs at the same time, one with an X and one without. This is also a metaphor for IPv4 vs IPv6.

If you have one standard, and you make a new standard, now you have 3 standards: the old one, the theoretical new one, and the actual new one.

T in a chain with S=>S">

Invariance pops up again. When you have a system built around signature T => T, you cannot simply slot in a S => S of a super or sub S. Most Input and Output in the wild still only produces and consumes T. You have to slot in an actual T => S somewhere, and figure out what S => T means.

Furthermore, for this to do something useful in a real pipeline, T already has to be able to carry all the information that S needs. And S => T cannot strip it out again. The key is to circumvent invariance: neither type is really a subtype or supertype of the other. They are just different views and interpretations of the same underlying data, which must already be extensible enough.

Backwards compatibility is then the art of changing a process Animal => Animal into a Duck => Duck while avoiding a debate about what specifically constitutes quacking. If you remove an X- prefix to make it "standard", this stops being true. The moment you have two different sources of the same information, now you have to decide whose job it is to resolve two into one, and one back into two.

This is particularly sensitive for X-Forwarded-For, because it literally means "the network metadata is wrong, the correct origin IP is ..." This must come from a trusted source like a reverse proxy. It's the last place you want to create compatibility concerns.

If you think about it, this means you can never be sure about any Animal either: how can you know there isn't an essential piece of hidden metadata traveling along that you are ignoring, which changes the meaning significantly?

Consider what happened when mobile phones started producing JPEGs with portrait vs landscape orientation metadata. Pictures were randomly sideways and upside down. You couldn't even rely on something as basic as the image width actually being, you know, the width. How many devs would anticipate that?

The only reason this wasn't a bigger problem is because for 99% of use-cases, you can just apply the rotation once upfront and then forget about it. That is, you can make a function RotatableImage => Image aka a Duck => Animal. This is an S => T that doesn't lose any information anyone cares about. This is the rare exception, only done occasionally, as a treat.

If you instead need to upgrade a whole image and display pipeline to support, say, high-dynamic range or P3 color, that's a different matter entirely. It will never truly be 100% done everywhere, we all know that. But should it be? It's another ChickenDuck scenario, because now some code wants images to stay simple, 8-bit and sRGB, while other code wants something else. Are you going to force each side to deal with the existence of the other, in every situation? Or will you keep the simplest case simple?

A plain old 2D array of pixels is not sufficient for T in the general case, but it is too useful on its own to simply throw it out. So you shouldn't make an AbstractImage which specializes into a SimpleImage and an HDRImage and a P3Image, because that means your SimpleImage isn't simple anymore. You should instead make an ImageView with metadata, which still contains a plain Image with only raw pixels. That is, a SimpleImage is just an ImageView. That way, there is still a regular Image on the inside. Code that provides or needs an Image does not need to change.

It's important to realize these are things you can only figure out if you have a solid idea of how people actually work with images in practice. Like knowing that we can just all agree to "bake in" a rotation instead of rewriting a lot of code. Architecting from the inside is not sufficient, you must position yourself as an external user of what you build, someone who also has a full-time job.

If you want a piece of software to be extensible, that means the software will become somebody else's development dependency. This puts huge constraints on its design and how it can change. You might say there is no such thing as an extensible schema, only an unfinished schema, because every new version is really a fork. But this doesn't quite capture it, and it's not quite so absolute in practice.

Colonel Problem

Interoperability is easy in pairs. You can model this as an Input "A" connecting to an Output "B". This does not need to cover every possible T, it can be a reduced subset R of T. For example, two apps exchange grayscale images (R) as color PNGs (T). Every R is also a T, but not every T is an R. This means:

Any function () => grayscaleImage is a valid substitute for () => colorImage.

Any function (colorImage) => void is a valid substitute for (grayscaleImage) => void.

This helps A, which is an => R pretending to be a => T. But B still needs to be an actual T =>, even if it only wants to be an R =>. Turning an R => into a T => is doable as long as you have a way to identify the R parts of any T, and ignore the rest. If you know your images are grayscale, just use any of the RGB channels. Therefore, working with R by way of T is easy if both sides are in on it. If only one side is in on it, it's either scraping or SEO.

But neither applies to arbitrary processing blocks T => T that need to mutually interoperate. If A throws away some of the data it received before sending it to B, and then B throws away other parts before sending it back to A, little will be left. For reliable operation, either A → B → A or B → A → B ought to be a clean round-trip. Ideally, both. Just try to tell a user you preserve <100% of their data every time they do something.

Consider interoperating with e.g. Adobe Photoshop. A Photoshop file isn't just an image, it's a collection of image layers, vector shapes, external resources and filters. These are combined into a layer tree, which specifies how the graphics ought to be combined. This can involve arbitrary nesting, with each layer having unique blend modes and overlaid effects. Photoshop's core here acts like a kernel in the OS sense, providing a base data model and surrounding services. It's responsible for maintaining the mixed raster/vector workspace of the layered image. The associated "user space" is the drawing tools and inspectors.

Being mutually compatible with Photoshop means being a PSD => PSD back-end, which is equivalent to re-implementing all the "kernel space" concepts. Changing a single parameter or pixel requires re-composing the entire layer stack, so you must build a kernel or engine that can do all the same things.

Also, let's be honest here. The average contemporary dev eyes legacy desktop software somewhat with suspicion. Sure, it's old and creaky, and their toolbars are incredibly out of fashion. But they get the job done, and come with decades of accumulated, deep customizability. The entrenched competition is stiff.

This reflects what I call the Kernel Problem. If you have a processing kernel revolving around an arbitrary T => T block, then the input T must be more like a program than data. It's not just data and metadata, it's also instructions. This means there is only one correct way to interpret them, aside from differences in fidelity or performance. If you have two such kernels which are fully interoperable in either direction, then they must share the same logic on the inside, at least up to equivalence.

If you are trying to match an existing kernel T => T's features in your S => S, your S must be at least as expressive as their original T. To do more, every T must also be a valid S. You must be the Animal to their Duck, not a Duck to their Animal, which makes this sort of like reverse inheritance: you adopt all their code but can then only add non-conflicting changes, so as to still allow for real substitution. A concrete illustration is what "Linux Subsystem for Windows" actually means in practice: put a kernel inside the kernel, or reimplement it 1-to-1. It's also how browsers evolved over time, by adding, not subtracting.

Therefor, I would argue an "extensible kernel" is in the broad sense an oxymoron, like an "extensible foundation" of a building. The foundation is the part that is supposed to support everything else. Its purpose is to enable vertical extension, not horizontal.

If you expand a foundation without building anything on it, it's generally considered a waste of space. If you try to change a foundation underneath people, they rightly get upset. The work isn't done until the building actually stands. If you keep adding on to the same mega-building, maintenance and renewal become impossible. The proper solution for that is called a town or a city.

Naturally kernels can have plug-ins too, so you can wonder if that's actually a "motte user-land" or not. What's important is to notice the dual function. A kernel should enable and support things, by sticking to the essentials and being solid. At the same time, it needs to also ship with a useful toolset working with a single type T that behaves extensibly: it must support arbitrary programs with access to processes, devices, etc.

If extensibility + time = kitchen sink bloat, how do you counter entropy?

You must anticipate, by designing even your core T itself to be decomposable and hence opt-in à la carte. A true extensible kernel is therefor really a decomposable kernel, or perhaps a kernel generator, which in the limit becomes a microkernel. This applies whether you are talking about Photoshop or Linux. You must build it so that it revolves around an A & B & C & ..., so that both A => A and B => B can work directly on an ABC and implicitly preserve the ABC-ness of the result. If all you need to care about is A or B, you can use them directly in a reduced version of the system. If you use an AB, only its pertinent aspects should be present.

Entity-Component Systems are a common way to do this. But they too have a kernel problem: opting in to a component means adopting a particular system that operates on that type of component. Such systems also have dependency chains, which have to be set up in the right order for the whole to behave right. It is not really A & B but A or B in practice. So in order for two different implementations of A or B to be mutually compatible, they again have to be equivalent. Otherwise you can't replace a Process without replacing all the associated input, or getting unusably different output.

The main effect of à la carte architecture is that it never seems like a good idea to force anyone else to turn their Duck into a Chicken, by adopting all your components. You should instead try to agree on a shared Animal. Any ChickenDuck that you do invent will have a limited action radius. Because other people can decide for themselves whether they truly need to deal with chickens on their own time.

None of this is new, I'm just recapturing old wisdom. It frankly seems weird to use programming terminology to have described this problem, when the one place it is not a big deal is inside a single, comfy programming environment. We do in fact freely import modules à la carte when we code, because our type T is the single environment of our language run-time.

But it's not so rosy. The cautionary tale of Python 2 vs 3: if you mess with the internals and standard lib, it's a different language, no matter how you do it. You still have a kernel everyone depends on and it can take over a decade to migrate a software ecosystem.

Everyone has also experienced the limits of modularity, in the form of overly wrapped APIs and libraries, which add more problems than they solve. In practice, everyone on a team must still agree on one master, built incrementally, where all the types and behavior is negotiated and agreed upon. This is either a formal spec, or a defacto one. If it is refactored, that's just a fork everyone agrees to run with. Again, it's not so much extensible, just perpetually unfinished.

À la carte architecture is clearly necessary but not sufficient on its own. Because there is one more thing that people tend to overlook when designing a schema for data: how a normal person will actually edit the data inside.

Oil and Water

Engineering trumps theoretical models, hence the description of PSD above actually omits one point deliberately.

It turns out, if all you want to do is display a PSD, you don't need to reimplement Photoshop's semantics. Each .PSD contains a pre-baked version of the image, so that you don't need to interpret it. A .PSD is really two file formats in one, a layered PSD and something like a raw PNG. It is not a ChickenDuck but a DuckAnimal. They planned ahead so that the Photoshop format can still work if all you want to be is MSPaint => MSPaint. For example, if you're a printer.

This might lead you to wonder.

Given that PNG is itself extensible, you can imagine a PNG-PSD that does the same thing as a PSD. It contains an ordinary image, with all the Photoshop specific data embedded in a separate PSD section. Wouldn't that be better? Now any app that can read PNG can read PSD, and can preserve the PSD-ness. Except, no. If anyone blindly edits the PNG part of the PNG-PSD, while preserving the PSD data, they produce a file where both are out of sync. What you see now depends on which app reads it. PNG-PSDs would be landmines in a mixed ecosystem.

It's unavoidable: if some of the data in a schema is derived from other data in it, the whole cannot be correctly edited by a "dumb", domain-agnostic editor, because of the Kernel Problem. This is why "single source of truth" should always be the end-goal.

A fully extensible format is mainly just kicking the can down the road, saving all the problems for later. It suggests a bit of a self-serving lie: "Extensibility is for other people." It is a successful business recipe, but a poor engineering strategy. It results in small plug-ins, which are not first class, and not fundamentally changing any baked in assumptions.

But the question isn't whether plug-ins are good or bad. The question is whether you actually want to lock your users of tomorrow into how your code works today. You really don't, not unless you've got something battle-hardened already.

If you do see an extensible system working in the wild on the Input, Process and Output side, that means it's got at least one defacto standard driving it. Either different Inputs and Outputs have agreed to convert to and from the same intermediate language... or different middle blocks have agreed to harmonize data and instructions the same way.

This must either flatten over format-specific nuances, or be continually forked to support every new concept being used. Likely this is a body of practices that has mostly grown organically around the task at hand. Given enough time, you can draw a boundary around a "kernel" and a "user land" anywhere. To make this easier, a run-time can help do auto-conversion between versions or types. But somebody still has to be doing it.

This describes exactly what happened with web browsers. They cloned each other's new features, while web developers added workarounds for the missing pieces. Not to make it work differently, but to keep it all working exactly the same. Eventually people got fed up and just adopted a React-like.

That is, you never really apply extensibility on all three fronts at the same time. It doesn't make sense: arbitrary code can't work usefully on arbitrary data. The input and output need to have some guarantees about the process, or vice versa.

Putting data inside a free-form key/value map doesn't change things much. It's barely an improvement over having a unknownData byte[] mix-in on each native type. It only pays off if you actually adopt a decomposable model and stick with it. That way the data is not unknown, but always provides a serviceable view on its own. Arguably this is the killer feature of a dynamic language. The benefit of "extensible data" is mainly "fully introspectable without recompilation."

The success of JSON is an obvious example here. The limited set of types means it can be round-tripped cleanly into most languages. Despite its shortcomings, it can go anywhere text can, and that includes every text editor too. Some find it distasteful to e.g. encode a number as text, but the more important question is: will anyone ever be editing this number by hand or not?

The issue of binary data can be mitigated with a combo of JSON manifest + binary blob, such as in GLTF 3D models. The binary data is packed arrays, ready for GPU consumption. This is a good example of a practical composition A & B. It lets people reuse both the code and practices they know. The only real 'parsing' needed is the slicing of a buffer at known offsets. It also acts as a handy separation between light-weight metadata and heavy-duty data, useful in a networked environment. The format allows for both to be packed in one file, but it's not required.

For a given purpose, you need a well-defined single type T that sets the ground rules for both data and code, which means T must be a common language. It must be able to work equally well as an A, B and C, which are needs that must have been anticipated. Yet it should be built such that you can just use a D of your own, without inconvenient dependency. The key quality to aim for is not creativity but discipline.

If you can truly substitute a type with something else everywhere, it can't be arbitrarily extended or altered, it must retain the exact same interface. In the real world, that means it must actually do the same thing, only marginally better or in a different context. A tool like ffmpeg only exists because we invented a bajillion different ways to encode the same video, and the only solution is to make one thing that supports everything. It's the Unicode of video.

If you extend something into a new type, it's not actually a substitute, it's a fork trying to displace the old standard. As soon as it's used, it creates a data set that follows a new standard. Even when you build your own parsers and/or serializers, you are inventing a spec of your own. Somebody else can do the same thing to you, and that somebody might just be you 6 months from now. Being a programmer means being an archivist-general for the data your code generates.

* * *

If you actually think about it, extensibility and substitution are opposites in the design space. You must not extend, you must decompose, if you wish to retain the option of substituting it with something simpler yet equivalent for your needs. Because the other direction is one-way only, only ever adding complexity, which can only be manually refactored out again.

If someone is trying to sell you on something "extensible," look closely. Is it actually à la carte? Does it come with a reasonable set of proven practices on how to use it? If not, they are selling you a fairy tale, and possibly themselves too. They haven't actually made it reusable yet: if two different people started using it to solve the same new problem, they would not end up with compatible solutions. You will have 4 standards: the original, the two new variants, and the attempt to resolve all 3.

Usually it is circumstance, hierarchy and timing that decides who adapts their data and their code to whom, instead of careful consideration and iteration. Conway's law reigns, and most software is shaped like the communication structure of the people who built it. "Patch," "minor" or "major" release is just the difference between "Pretty please?", "Right?" and "I wasn't asking."

We can do a lot better. But the mindset it requires at this point is not extensibility. The job at hand is salvage.

If you want a concise example of how to do this right, check out Lottie. It's an animation system for the web, which is fed by an Adobe AfterEffects export plug-in. This means that animators can use the same standard tools they are familiar with. The key trick here is to reduce complex AfterEffects data to a much simpler animation model: that of eased bezier curves. Not all of AfterEffects' features are supported, but all the essentials do work. So Lottie is a Bezier => Bezier back-end, fed by an AfterEffects => Bezier converter.

It's so sensible, that when I was asked how to build an animation system in Rust, I suggested doing exactly the same thing. So we did.

Headless React

2021-07-06T00:00:00+02:00

Part 1: Climbing Mount Effect - Declarative Code and Effects

Part 2: Reconcile All The Things - Memoization, Data Flow and reconciliation

Part 3: Headless React - Live, Yeet Reduce, No-API, WebGPU

Live

It is actually pretty easy to build a mediocre headless React today, i.e. an implementation of React that isn't hooked directly into anything else.

react-reconciler is an official package that lets you hook up React to anything already. That's how both React-DOM and React-Native share a run-time.

Most third-party libraries that use it (like react-three-fiber) follow the same approach. They are basically fully wrapped affairs: each notable Three.js object (mesh, geometry, material, light, ...) will tend to have a matching node in the React tree. Three.js has its own scene tree, like the browser has a DOM, so react-reconciler will sync up the two trees one-to-one.

The libraries need to do this, because the target is a retained in-memory model. It must be mutated in-place, and then re-drawn. But what would it look like to target an imperative API directly, like say 2D Canvas?

You can't just call an imperative API directly in a React component, because the idea of React is to enable minimal updates. There is no guarantee every component that uses your imperative API will actually be re-run as part of an update. So you still need a light-weight reconciler.

Implementing your own back-end to the reconciler is a bit of work, but entirely doable. You build a simple JS DOM, and hook React into that. It doesn't even need to support any of the fancy React features, or legacy web cruft: you can stub it out with no-ops. Then you can make up any tags you like, with any JS value as a property, and have React reconcile them.

Then if you want to turn something imperative into something declarative, you can render elements with an ordinary render prop like this:

{ context.fillStyle = "blue"; context.drawRect(/*...*/); }} />

This code doesn't run immediately, it just captures all the necessary information from the surrounding scope, allowing somebody else to call it. The reconciler will gather these multiple "native" elements into a shallow tree. They can then be traversed and run, to form a little ad-hoc program. In other words, it's an Effect-like model again, just with all the effects neatly arranged and reconciled ahead of time. Compared to a traditional retained library, it's a lot more lightweight. It can re-paint without having to re-render any Components in React.

You can also add synthetic events like in React-DOM. These can be forwarded with conveniences like event.stopPropagation() replicated.

I've used this with great success before. Unfortunately I can't show the results here—maybe in the future—but I do have something else that should demonstrate the same value proposition.

React works hard to synchronize its own tree with a DOM-like tree, but it's just a subset of the tree it already has. If you remove that second tree, what's left? Does that one tree still do something useful by itself?

I wagered that it would and built a version of it. It's pretty much just a straight up re-implementation of React's core pattern, from the ground up. It has some minor tweaks and a lot of omissions, but all the basics of hook-driven React are there. More importantly, it has one extra superpower: it's designed to let you easily collect lambdas. It's still an experiment, but the parts that are there seem to work fine already. It also has tests.

Yeet Reduce

As we saw, a reconciler derives all its interesting properties from its one-way data flow. It makes it so that the tree of mounted components is also the full data dependency graph.

So it seems like a supremely bad idea to break it by introducing arbitrary flow the other way. Nevertheless, it seems clear that we have two very interesting flavors just asking to be combined: expanding a tree downstream to produce nodes in a resumable way, and yielding values back upstream in order to aggregate them.

Previously I observed that trying to use a lambda in a live DFG is equivalent to potentially creating new outputs out of thin air. Changing part of a graph means it may end up having different outputs than before. The trick is then to put the data sinks higher up in the tree, instead of at the leaves. This can be done by overlaying a memoized map-reducer which is only allowed to pass things back in a stateless way.

The resulting data flow graph is not in fact a two-way tree, which would be a no-no: it would have a cycle between every parent and child. Instead it is a DFG consisting of two independent copies of the same tree, one forwards, one backwards, glued together. Though in reality, the second half is incomplete, as it only needs to include edges and nodes leading back to a reducer.

Thus we can memoize both the normal forward pass of generating nodes and their sinks, as well as the reverse pass of yielding values back to them. It's two passes of DFG, one expanding, one contracting. It amplifies input in the first half by generating more and more nodes. But it will simultaneously install reducers as the second half to gather and compress it back into a collection or a summary.

When we memoize a call in the forward direction, we will also memoize the yield in the other direction. Similarly, when we bust a cache on the near side, we also bust the paired cache on the far side, and keep busting all the way to the end. That's why it's called Yeet Reduce. Well that and yield is a reserved keyword.

What's also not obvious is that this process can be repeated: after a reduction pass is complete, we can mount a new fiber that receives the result as input. As such, the data flow graph is not a single expansion and contraction, but rather, many of them, separated by a so-called data fence.

This style of coding is mainly suited for use near the top of an application's data dependency graph, or in a shallow sub-tree, where the number of nodes in play is typically a few dozen. When you have tons of tiny objects instead, you want to rely on data-level parallelism rather than mounting each item individually.

Horse JS

I used to think a generalized solution for memoized data flow would be something crazy and mathematical. The papers I read certainly suggested so, pushing towards the equivalent of automatic differentiation of any code. It would just work. It would not require me to explicitly call memo on and in every single Component. It should not impose weird rules banning control flow. It would certainly not work well with non-reactive code. And so on.

There seemed to be an unbridgeable gap between a DFG and a stack machine. This meant that visual, graph-based coding tools would always be inferior in their ability to elegantly capture Turing-complete programs.

Neither seems to be the case. For one, having to memoize things by hand doesn't feel wrong in the long run. A minimal recomputation doesn't necessarily mean a recomputation that is actually small and fast. It feels correct to make it legible exactly how often things will change in your code, as a substitute for the horrible state transitions of old. Caching isn't always a net plus either, so fully memoized code would just be glacial for real use cases. That's just how the memory vs CPU trade-off falls these days.

That said, declaring dependencies by hand is annoying. You need linter rules for it because even experienced engineers occasionally miss a dep. Making a transpiler do it or adding it into the language seems like a good idea, at least if you could still override it. I also find syntax is only convenient for quickly nesting static inside other . Normal JS {object} syntax is often more concise, at least when the keys match the names. Once you put a render prop in there, JSX quickly starts looking like Lisp with a hangover.

When your Components are just resources and effects instead of widgets, it feels entirely wrong that you can't just write something like:

live (arg) => { let [service, store] = mount [ Service(...), Store(...), ]; }

Without any JSX or effect-like wrappers. Here, mount would act somewhat like a reactive version of the classic new operator, with a built-in yield, except for fiber-mounted Components instead of classes.

I also have to admit to being sloppy here. The reason you can think of a React component as an Effect is because its ultimate goal is to create e.g. an HTML DOM. Whatever code you run exists, in theory, mostly to generate that DOM. If you take away that purpose, suddenly you have to be a lot more conscious of whether a piece of code can actually be skipped or not, even if it has all the same inputs as last time.

This isn't actually as simple as merely checking if a piece of code is side-effect free: when you use declarative patterns to interact with stateful code, like a transaction, it is still entirely contextual whether that transaction needs to be repeated, or would be idempotent and can be skipped. That's the downside of trying to graft statelessness onto legacy tech, which also requires some mutable water in your immutable wine.

I did look into writing a Babel parser for a JS/TS dialect, but it turns out the insides are crazy and it takes three days just to make it correctly parse live / mount with the exact same rules as async / await. That's because it's a chain of 8 classes, each monkey patching the previous one's methods, creating a flow that's impractical to trace step by step. Tower of Babel indeed. It's the perfect example to underscore this entire article series with.

It also bothers me that each React hook is actually pretty bad from a garbage collection point of view:

const memoized = useMemo(() => slow(foo), [foo]);

This will allocate both a new dependency array [foo] and a new closure () => slow(foo). Even if nothing has changed and the closure is not called. This is unavoidable if you want this to remain a one-liner JS API. An impractical workaround would be to split up and inline useMemo into into its parts which avoid all GC:

// One useMemo() call let memoized; { useMemoNext(); useMemoPushDependency(foo); memoized = useMemoSameDependencies() ? useMemoValue() : slow(foo); }

But a language with a built-in reconciler could actually be quite efficient on the assembly level. Dependencies could e.g. be stored and checked in a double buffered arrangement, alternating the read and write side.

I will say this: React has done an amazing job. It got popular because its Virtual DOM finally made HTML sane to work with again. But what it actually was in the long run, was a Trojan horse for Lisp-like thinking and a move towards Effects.

No-API

So, headless React works pretty much exactly as described. Except, without the generators, because JS generators are stateful and not rewindable/resumable. So for now I have to write my code in the promise.then(…) style instead of using a proper yield.

I tried to validate it by using WebGPU as a test case, building out a basic set of composable components. First I hid the uglier parts of the WebGPU API inside some pure wrappers (the makeFoo(...) calls below) for conciseness. Then I implemented a blinking cube like this:

export const Cube: LiveComponent = memo((fiber) => (props) => { const { device, colorStates, depthStencilState, defs, uniforms, compileGLSL } = props; // Blink state, flips every second const [blink, setBlink] = useState(0); useResource((dispose) => { const timer = setInterval(() => { setBlink(b => 1 - b); }, 1000); dispose(() => clearInterval(timer)); }); // Cube vertex data const cube = useOne(makeCube); const vertexBuffers = useMemo(() => makeVertexBuffers(device, cube.vertices), [device]); // Rendering pipeline const pipeline = useMemo(() => { const pipelineDesc: GPURenderPipelineDescriptor = { primitive: { topology: "triangle-list", cullMode: "back", }, vertex: makeShaderStage( device, makeShader(compileGLSL(vertexShader, 'vertex')), {buffers: cube.attributes} ), fragment: makeShaderStage( device, makeShader(compileGLSL(fragmentShader, 'fragment')), {targets: colorStates} ), depthStencil: depthStencilState, }; return device.createRenderPipeline(pipelineDesc); }, [device, colorStates, depthStencilState]); // Uniforms const [uniformBuffer, uniformPipe, uniformBindGroup] = useMemo(() => { const uniformPipe = makeUniforms(defs); const uniformBuffer = makeUniformBuffer(device, uniformPipe.data); const entries = makeUniformBindings([{resource: {buffer: uniformBuffer}}]); const uniformBindGroup = device.createBindGroup({ layout: pipeline.getBindGroupLayout(0), entries, }); return ([uniformBuffer, uniformPipe, uniformBindGroup] as [GPUBuffer, UniformDefinition, GPUBindGroup]); }, [device, defs, pipeline]); // Return a lambda back to parent(s) return yeet((passEncoder: GPURenderPassEncoder) => { // Draw call uniformPipe.fill(uniforms); uploadBuffer(device, uniformBuffer, uniformPipe.data); passEncoder.setPipeline(pipeline); passEncoder.setBindGroup(0, uniformBindGroup); passEncoder.setVertexBuffer(0, vertexBuffers[0]); passEncoder.draw(cube.count, 1, 0, 0); }); }

This is 1 top-level function, with zero control flow, and a few hooks. The cube has a state (blink), that it decides to change on a timer. Here, useResource is like a sync useEffect which the runtime will manage for us. It's not pure, but very convenient.

All the external dependencies are hooked up, using the react-like useMemo hook and its mutant little brother useOne (for 0 or 1 dependency). This means if the WebGPU device were to change, every variable that depends on it will be re-created on the next render. The parts that do not (e.g. the raw cube data) will be reused.

This by itself is remarkable to me: to be able to granularly bust caches like this deep inside a program, written in purely imperative JS, that nevertheless is almost a pure declaration of intent. When you write code like this, you focus purely on construction, not on mutation. It also lets you use an imperative API directly, which is why I refer to this as "No API": the only wrappers are those which you want to add yourself.

Notice the part at the end: I'm not actually yeeting a real draw command. I'm just yeeting a lambda that will insert a draw command into a vanilla passEncoder from the WebGPU API. It's these lambdas which are reduced together in this sub-tree. These can then just be run in tree order to produce the associated render pass.

What's more, the only part of the entire draw call that actually changes regularly is the GPU uniform values. This is why uniforms is not an immutable object, but rather an immutable reference with mutable registers inside. In react-speak it's a ref, aka a pointer. This means if only the camera moves, the Cube component does not need to be re-evaluated. No lambda is re-yeeted, and nothing is re-reduced. The same code from before would keep working.

Therefor the entirety of Cube() is wrapped in a memo(...). It memoizes the entire Component in one go using all the values in props as the dependencies. If none of them changed, no need to do anything, because it cannot have any effect by construction. The run-time takes advantage of this by not re-evaluating any children of a successfully memoized node, unless its internal state changed.

The very top of the (reactive) part is:

export const App: LiveComponent = () => (props) => { const {canvas, device, adapter, compileGLSL} = props; return use(AutoCanvas)({ canvas, device, adapter, render: (renderContext: CanvasRenderingContextGPU) => { const { width, height, gpuContext, colorStates, colorAttachments, depthStencilState, depthStencilAttachment, } = renderContext; return use(OrbitControls)({ canvas, render: (radius: number, phi: number, theta: number) => use(OrbitCamera)({ canvas, width, height, radius, phi, theta, render: (defs: UniformAttribute[], uniforms: ViewUniforms) => use(Draw)({ device, gpuContext, colorAttachments, children: [ use(Pass)({ device, colorAttachments, depthStencilAttachment, children: [ use(Cube)({device, colorStates, depthStencilState, compileGLSL, defs, uniforms}), ] }) ], }) }) }); } }); };

This is a poor man's JSX, but also not actually terrible. It may not look like much, but, pretty much everyone who's coded any GL, Vulkan, etc. has written a variation of this.

This tree composes things that are completely heterogeneous: a canvas auto-sizer, interactive controls, camera uniforms, frame buffer attachments, and more, into one neat, declarative structure. This is quite normal in React-land these days. The example above is static to keep things simple, but it doesn't need to be, that's the point.

The nicest part is that unlike in a traditional GPU renderer, it is trivial for it to know exactly when to re-paint the image or not. Even those mutable uniforms come from a Live component, the effects of which are tracked and reconciled: OrbitCamera takes mutable values and produces an immutable container ViewUniforms.

You get perfect battery-efficient sparse updates for free. It's actually more work to get it to render at a constant 60 fps, because for that you need the ability to independently re-evaluate a subtree during a requestAnimationFrame(). I had to explicitly add that to the run-time. It's around 1100 lines now, which I'm happy with.

Save The Environment

If it still seems annoying to have to pass variables like device into everything, there's the usual solution: context providers, aka environments, which act as invisible skip links across the tree:

export const GPUDeviceContext = makeContext(); export const App: LiveComponent = () => (props) => { const {canvas, device, adapter, compileGLSL} = props; return provide(GPUDeviceContext, device, use(AutoCanvas)({ /*...*/ }) ); } export const Cube: LiveComponent = memo((fiber) => (props) => { const device = useContext(GPUDeviceContext); /* ... */ }

You also don't need to pass one variable at a time, you can pass arbitrary structs.

In this situation it is trickier for the run-time to track changes, because you may need to skip past a memo(…) parent that didn't change. But doable.

Yeet-reduce is also a generalization of the chunking and clustering processes of a modern compute-driven renderer. That's where I got it from anyway. Once you move that out, and make it a native op on the run-time, magic seems to happen.

This is remarkable to me because it shows you how you can wrap, componentize and memoize a completely foreign, non-reactive API, while making it sing and dance. You don't actually have to wrap and mount a for every WebGPUThing that exists, which is the popular thing to do. You don't need to do O(N) work to control the behavior of N foreign concepts. You just wrap the things that make your code more readable. The main thing something like React provides is a universal power tool for turning things off and on again: expansion, memoization and reconciliation of effects. Now you no longer need to import React and pretend to be playing DOM-jot either.

The only parts of the WebGPU API that I needed to build components for to pull this off, were the parts I actually wanted to compose things with. This glue is so minimal it may as well not be there: each of AutoSize, Canvas, Cube, Draw, OrbitCamera, OrbitControls and Pass is 1 reactive function with some hooks inside, most of them half a screen.

I do make use of some non-reactive WebGPU glue, e.g. to define and fill binary arrays with structured attributes. Those parts are unremarkable, but you gotta do it.

If I now generalize my Cube to a generic Mesh, I have the basic foundation of a fully declarative and incremental WebGPU toolkit, without any OO. The core components look the same as the ones you'd actually build for yourself on the outside. Its only selling point is a supernatural ability to get out of your way, which it learnt mainly from React. It doesn't do anything else. It's great when used to construct the outside of your program, i.e. the part that loads resources, creates workloads, spawns kernels, and so on. You can use yeet-reduce on the inside to collect lambdas for the more finicky stuff, and then hand the rest of the work off to traditional optimized code or a GPU. It doesn't need to solve all your problems, or even know what they are.

I should probably reiterate: this is not a substitute for typical data-level parallelism, where all your data is of the exact same type. Instead it's meant for composing highly heterogeneous things. You will still want to call out to more optimized code inside to do the heavy lifting. It's just a lot more straightforward to route.

For some reason, it is incredibly difficult to get this across. Yet algorithmically there is nothing here that hasn't been tried before. The main trick is just engineering these things from the point of view of the person who actually has to use it: give them the same tools you'd use on the inside. Don't force them to go through an interface if there doesn't need to be one.

The same can be said for React and Live, naturally. If you want to get nerdy about it, the reconciler can itself be modeled as a live effect. Its actions can themselves become regular nodes in the tree. If there were an actual dialect with a real live keyword, and WeakMaps on steroids, that would probably be doable. In the current implementation, it would just slow things down.

Throughout this series, I've used Javascript syntax as a lingua franca. Some might think it's insane to stretch the language to this point, when more powerful languages exist where effects fit more natively into the syntax and the runtime. I think it's better to provide stepping stones to actually get from here to there first.

I know that once you have gone through the trouble of building O(N²) lines of code for something, and they work, the prospect of rewriting all of them can seem totally insane. It probably won't be as optimized on the micro-level, which in some domains does actually still matter, even in this day and age. But how big is that N? It may actually be completely worth it, and it may not take remotely as long as you think.

As for me, all I had to do was completely change the way I structure all my code, and now I can finally start making proper diagrams.

Source code on GitLab.