Compute to Render
Using compute output in rendering
The Compute-Render Pipeline
Compute shaders generate data. Render pipelines draw it. The real power emerges when you connect them—compute writes to buffers that render reads, creating a seamless flow from parallel computation to pixels on screen.
This pattern appears everywhere in modern graphics. Particle systems spawn and simulate thousands of particles in compute, then render them as points or quads. Terrain generators compute height maps and normals in parallel, feeding the results to vertex shaders. Animation systems update skeletal transforms in compute before the mesh is drawn.
Compute to Render Data Flow
Data flows from compute to render through shared buffers. The storage buffer serves as both compute output and vertex input.
The key insight is that buffers are shared. A buffer created with both STORAGE and VERTEX usage can be written by a compute shader and read as vertex data in the same frame. The GPU's command queue handles synchronization—submit compute work, then submit render work, and the driver ensures correct ordering.
Storage Buffers as Vertex Data
The simplest compute-to-render pattern uses a storage buffer as a vertex buffer. Compute fills the buffer with vertex positions, and the render pipeline draws them.
// Compute shader: generate a wave of points
struct Vertex {
    // Plain members: @location attributes are only valid on shader
    // entry-point inputs/outputs, not on storage-buffer structs
    position: vec3<f32>,
    color: vec3<f32>,
}

@group(0) @binding(0) var<storage, read_write> vertices: array<Vertex>;
@group(0) @binding(1) var<uniform> params: Params;

@compute @workgroup_size(64)
fn generate_wave(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let idx = global_id.x;
    if (idx >= params.vertex_count) { return; }

    let x = f32(idx) / f32(params.vertex_count) * 2.0 - 1.0;
    let y = sin(x * 10.0 + params.time) * 0.3;

    vertices[idx].position = vec3<f32>(x, y, 0.0);
    vertices[idx].color = vec3<f32>(
        0.5 + 0.5 * sin(params.time),
        0.5 + 0.5 * cos(params.time * 0.7),
        0.8
    );
}

// Vertex shader: read the computed vertices
struct VertexInput {
    @location(0) position: vec3<f32>,
    @location(1) color: vec3<f32>,
}

struct VertexOutput {
    @builtin(position) position: vec4<f32>,
    @location(0) color: vec3<f32>,
}

@vertex
fn vs_main(input: VertexInput) -> VertexOutput {
    var output: VertexOutput;
    output.position = vec4<f32>(input.position, 1.0);
    output.color = input.color;
    return output;
}

The buffer must be created with both usages:
const vertexBuffer = device.createBuffer({
  // Each vec3<f32> is padded to 16 bytes in WGSL storage layout,
  // so the struct stride is 32 bytes, not 24
  size: vertexCount * 32,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.VERTEX,
});

Each frame: dispatch compute to update vertex data, then render using that buffer. The GPU automatically handles the synchronization—compute results are visible when the render pass begins.
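One subtlety worth pinning down: in WGSL storage layout, each vec3&lt;f32&gt; member is aligned to 16 bytes, so this struct occupies 32 bytes per vertex, and the render pipeline's vertex buffer layout must use that same stride and the matching attribute offsets. A minimal host-side sketch of the size calculation and layout (the shader locations assume the VertexInput struct above):

```javascript
// Layout for the Vertex struct above: vec3<f32> members align to 16 bytes
// in WGSL storage layout, so each vertex occupies 32 bytes, not 24.
const VEC3_ALIGNED_SIZE = 16; // vec3<f32> padded to 16 bytes
const VERTEX_STRIDE = 2 * VEC3_ALIGNED_SIZE; // position + color

function vertexBufferSize(vertexCount) {
  return vertexCount * VERTEX_STRIDE;
}

// Vertex state the render pipeline would use to read the same buffer
const vertexBufferLayout = {
  arrayStride: VERTEX_STRIDE,
  attributes: [
    { shaderLocation: 0, offset: 0, format: "float32x3" },  // position
    { shaderLocation: 1, offset: 16, format: "float32x3" }, // color
  ],
};
```

Getting the stride wrong is one of the most common compute-to-render bugs: the compute shader writes at 32-byte intervals while the vertex fetch reads at 24, producing scrambled geometry with no validation error.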
Interactive: Procedural Geometry
Simulates compute generating vertex positions and colors, then rendering as a line strip with points. In real WebGPU, this would run entirely on GPU.
Procedural Geometry
Compute shaders excel at generating geometry procedurally. Instead of loading static meshes, you generate them on the fly—adapting to parameters, player position, or simulation state.
Consider a simple terrain heightmap. A compute shader samples noise at each grid position, computing vertex positions and normals:
struct TerrainVertex {
    position: vec3<f32>,
    normal: vec3<f32>,
}

@group(0) @binding(0) var<storage, read_write> terrain: array<TerrainVertex>;
@group(0) @binding(1) var<uniform> params: TerrainParams;

@compute @workgroup_size(8, 8)
fn generate_terrain(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let x = global_id.x;
    let z = global_id.y;
    if (x >= params.width || z >= params.depth) { return; }

    let idx = z * params.width + x;
    let world_x = f32(x) * params.scale;
    let world_z = f32(z) * params.scale;

    // Sample noise for height
    let height = fbm_noise(vec2<f32>(world_x, world_z) * 0.1) * params.amplitude;
    terrain[idx].position = vec3<f32>(world_x, height, world_z);

    // Compute normal from neighbors (simplified finite differences)
    let dx = fbm_noise(vec2<f32>(world_x + 0.1, world_z) * 0.1) * params.amplitude - height;
    let dz = fbm_noise(vec2<f32>(world_x, world_z + 0.1) * 0.1) * params.amplitude - height;
    terrain[idx].normal = normalize(vec3<f32>(-dx, 1.0, -dz));
}

The advantage over CPU generation is massive. A 512×512 terrain requires over 262,000 vertices. On the CPU, generating this takes tens of milliseconds. On the GPU, it completes in under a millisecond.
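The dispatch must cover the whole grid: with @workgroup_size(8, 8), each workgroup handles an 8×8 tile, so the workgroup count is the grid size divided by 8, rounded up (the shader's bounds check discards the overshoot). A sketch of the host-side calculation, using the 512×512 figures from the text:

```javascript
// Workgroup count needed to cover a width × depth grid with 8×8 workgroups.
// Rounding up matters when the grid size is not a multiple of 8.
function terrainDispatchSize(width, depth, workgroupSize = 8) {
  return {
    x: Math.ceil(width / workgroupSize),
    y: Math.ceil(depth / workgroupSize),
  };
}

const { x, y } = terrainDispatchSize(512, 512);
// pass.dispatchWorkgroups(x, y) launches 64 × 64 workgroups,
// i.e. 64 * 64 * 8 * 8 = 262,144 invocations — one per vertex
```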
Indirect Drawing
Indirect drawing takes compute-to-render further: compute not only generates vertex data but also determines how much to draw.
Normal draw calls specify the vertex count as a fixed number:
renderPass.draw(1000); // Always draw 1000 vertices

With indirect drawing, these parameters come from a buffer:
// Indirect buffer structure for draw()
// [vertexCount, instanceCount, firstVertex, firstInstance]
const indirectBuffer = device.createBuffer({
  size: 16,
  usage: GPUBufferUsage.INDIRECT | GPUBufferUsage.STORAGE,
});

// Later, in the render pass
renderPass.drawIndirect(indirectBuffer, 0);

Why does this matter? Because compute can decide at runtime how many primitives to render. Consider a particle system where particles can die:
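The indirect buffer is just four 32-bit unsigned integers. One way to seed it from the CPU (for instance, before the first frame) is to pack the arguments into a Uint32Array and upload it with writeBuffer; the quad-per-particle values below are illustrative:

```javascript
// Pack draw() arguments in the order drawIndirect expects:
// [vertexCount, instanceCount, firstVertex, firstInstance]
function packDrawArgs(vertexCount, instanceCount, firstVertex = 0, firstInstance = 0) {
  return new Uint32Array([vertexCount, instanceCount, firstVertex, firstInstance]);
}

const args = packDrawArgs(4, 0); // 4-vertex quad, zero instances until compute fills it in
// device.queue.writeBuffer(indirectBuffer, 0, args); // upload (needs a GPUDevice)
```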
struct IndirectArgs {
    vertexCount: u32,
    instanceCount: u32,
    firstVertex: u32,
    firstInstance: u32,
}

// Compacting into the same buffer being read would race with other
// invocations, so alive particles are written to a separate output buffer
@group(0) @binding(0) var<storage, read> particlesIn: array<Particle>;
@group(0) @binding(1) var<storage, read_write> particlesOut: array<Particle>;
@group(0) @binding(2) var<storage, read_write> indirect: IndirectArgs;
@group(0) @binding(3) var<storage, read_write> aliveCount: atomic<u32>;

@compute @workgroup_size(256)
fn update_particles(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let idx = global_id.x;
    if (idx >= arrayLength(&particlesIn)) { return; }

    var p = particlesIn[idx];

    // Update particle
    p.velocity.y -= 0.01;  // Gravity
    p.position += p.velocity;
    p.life -= 0.016;       // Decrease lifetime

    if (p.life > 0.0) {
        // Particle survives - claim an output slot
        // (aliveCount must be reset to zero before each dispatch)
        let outIdx = atomicAdd(&aliveCount, 1u);
        particlesOut[outIdx] = p;  // Compact alive particles
    }
}

@compute @workgroup_size(1)
fn prepare_indirect() {
    let count = atomicLoad(&aliveCount);
    indirect.vertexCount = 4u;       // Vertices per particle quad
    indirect.instanceCount = count;  // Number of alive particles
    indirect.firstVertex = 0u;
    indirect.firstInstance = 0u;
}

Interactive: Indirect Draw
The compute shader determines instanceCount based on alive particles. The render pass reads this from the indirect buffer, drawing only what exists. No CPU round-trip needed.
Without indirect drawing, you would need to read the particle count back to the CPU, then issue a draw call with that count. This CPU-GPU round-trip adds latency and kills performance. With indirect drawing, everything stays on the GPU.
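The compaction logic is easier to verify with a CPU reference model before debugging it on the GPU. The sketch below mirrors the shader's behavior serially, using hypothetical plain-object particles; the GPU version does the same work in parallel, claiming output slots with an atomic counter:

```javascript
// Serial reference for the GPU update + compact pass: apply gravity,
// integrate, age, and keep only particles whose lifetime stays positive.
function updateAndCompact(particles, dt = 0.016, gravity = 0.01) {
  const alive = [];
  for (const p of particles) {
    const vy = p.velocity.y - gravity; // velocity update first, as in the shader
    const next = {
      position: { x: p.position.x + p.velocity.x, y: p.position.y + vy },
      velocity: { x: p.velocity.x, y: vy },
      life: p.life - dt,
    };
    if (next.life > 0.0) alive.push(next); // compacted output slot
  }
  // alive.length is what prepare_indirect writes into instanceCount
  return alive;
}
```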
Pipeline Barriers
When compute and render share buffers, you must ensure correct ordering. WebGPU handles most synchronization automatically through its command submission model, but understanding the underlying requirements helps you structure code correctly.
The key rule: all commands in a command encoder execute in order, and commands from different encoders execute in submission order.
// Correct ordering
const computeEncoder = device.createCommandEncoder();
const computePass = computeEncoder.beginComputePass();
computePass.setPipeline(computePipeline);
computePass.setBindGroup(0, computeBindGroup);
computePass.dispatchWorkgroups(workgroupCount);
computePass.end();

const renderEncoder = device.createCommandEncoder();
const renderPass = renderEncoder.beginRenderPass(renderPassDescriptor);
renderPass.setPipeline(renderPipeline);
renderPass.setVertexBuffer(0, sharedBuffer); // Same buffer
renderPass.draw(vertexCount);
renderPass.end();

// Submit in order: compute first, then render
device.queue.submit([computeEncoder.finish(), renderEncoder.finish()]);

WebGPU guarantees that when the render pass begins, compute writes to sharedBuffer are complete and visible.
You can also use a single command encoder with multiple passes. The passes execute sequentially:
const encoder = device.createCommandEncoder();

// Compute pass first
const computePass = encoder.beginComputePass();
// ... dispatch compute work
computePass.end();

// Render pass second (automatically waits for compute)
const renderPass = encoder.beginRenderPass(descriptor);
// ... draw using compute output
renderPass.end();

device.queue.submit([encoder.finish()]);

Particle Systems Preview
Particle systems are the classic compute-to-render application. Thousands of particles, each with position, velocity, color, and lifetime, updated every frame in parallel.
Particle System Preview
A preview of GPU particle systems. In production, all simulation runs in compute shaders, with indirect draw calls rendering only alive particles. The full implementation is covered in the Simulation section.
The full implementation (covered in the Simulation section) involves:
- Spawn compute shader: Creates new particles, writing to available slots
- Update compute shader: Applies physics, ages particles, removes dead ones
- Compact compute shader: Optionally packs alive particles contiguously
- Prepare indirect: Sets draw call parameters based on alive count
- Render pipeline: Draws particles as point sprites or billboards
Each step reads from and writes to shared buffers. The entire simulation runs without CPU involvement beyond submitting the command buffer.
Design Considerations
When building compute-to-render systems, consider:
Buffer sizing: Compute might produce a variable amount of output. Size buffers for the worst case, and use indirect drawing to draw only the actual count.
Memory layout: Vertex buffers need specific layouts matching the vertex shader attributes. Plan the struct layout before writing compute code.
Double buffering: For effects that read previous frame state (like simulations), use two buffers and swap each frame. This prevents read-write hazards.
Debugging: Compute errors are silent. Add validation checks, use timestamp queries to profile, and start with small test cases.
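The double-buffering advice above reduces to simple index arithmetic on the host: keep two buffers, plus two bind groups binding them in opposite read/write roles, and flip which is which every frame. A minimal sketch with hypothetical names:

```javascript
// Ping-pong state for a two-buffer simulation: frame N reads buffer A and
// writes buffer B; frame N+1 swaps the roles, avoiding read-write hazards.
function pingPong(frameIndex) {
  const read = frameIndex % 2; // buffer the compute pass reads this frame
  const write = 1 - read;      // buffer the compute pass writes this frame
  return { read, write };
}

// Usage each frame (bindGroups[i] binds buffers[i] as read and
// buffers[1 - i] as read_write — setup not shown):
// pass.setBindGroup(0, bindGroups[pingPong(frame).read]);
```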
Key Takeaways
- Shared buffers connect compute and render—write in compute, read in render
- Buffers need both STORAGE and VERTEX (or INDEX) usage flags
- Indirect drawing lets compute determine draw call parameters, avoiding CPU round-trips
- WebGPU handles synchronization automatically through command submission order
- Particle systems, procedural geometry, and animation are classic use cases
- The entire GPU frame can run without CPU intervention beyond command submission