Compute to Render
Using compute output in rendering
The Compute-Render Pipeline
Compute shaders generate data. Render pipelines draw it. The real power emerges when you connect them—compute writes to buffers that render reads, creating a seamless flow from parallel computation to pixels on screen.
This pattern appears everywhere in modern graphics. Particle systems spawn and simulate thousands of particles in compute, then render them as points or quads. Terrain generators compute height maps and normals in parallel, feeding the results to vertex shaders. Animation systems update skeletal transforms in compute before the mesh is drawn.
Compute to Render Data Flow
Data flows from compute to render through shared buffers. The storage buffer serves as both compute output and vertex input.
The key insight is that buffers are shared. A buffer created with both STORAGE and VERTEX usage can be written by a compute shader and read as vertex data in the same frame. The GPU's command queue handles synchronization—submit compute work, then submit render work, and the driver ensures correct ordering.
Storage Buffers as Vertex Data
The simplest compute-to-render pattern uses a storage buffer as a vertex buffer. Compute fills the buffer with vertex positions, and the render pipeline draws them.
// Compute shader: generate a wave of points
struct Vertex {
    // Plain members: @location attributes are only valid on shader
    // entry-point inputs/outputs, not on storage-buffer structs
    position: vec3<f32>,
    color: vec3<f32>,
}

@group(0) @binding(0) var<storage, read_write> vertices: array<Vertex>;
@group(0) @binding(1) var<uniform> params: Params;

@compute @workgroup_size(64)
fn generate_wave(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let idx = global_id.x;
    if (idx >= params.vertex_count) { return; }

    let x = f32(idx) / f32(params.vertex_count) * 2.0 - 1.0;
    let y = sin(x * 10.0 + params.time) * 0.3;

    vertices[idx].position = vec3<f32>(x, y, 0.0);
    vertices[idx].color = vec3<f32>(
        0.5 + 0.5 * sin(params.time),
        0.5 + 0.5 * cos(params.time * 0.7),
        0.8
    );
}

// Vertex shader: read the computed vertices
struct VertexInput {
    @location(0) position: vec3<f32>,
    @location(1) color: vec3<f32>,
}

struct VertexOutput {
    @builtin(position) position: vec4<f32>,
    @location(0) color: vec3<f32>,
}

@vertex
fn vs_main(input: VertexInput) -> VertexOutput {
    var output: VertexOutput;
    output.position = vec4<f32>(input.position, 1.0);
    output.color = input.color;
    return output;
}

The buffer must be created with both usages:
const vertexBuffer = device.createBuffer({
  // Each vec3<f32> is padded to 16 bytes in WGSL storage layout,
  // so the struct stride is 32 bytes, not 24
  size: vertexCount * 32,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.VERTEX,
});

Each frame: dispatch compute to update vertex data, then render using that buffer. The GPU automatically handles the synchronization—compute results are visible when the render pass begins.
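One subtlety worth pinning down: in WGSL storage layout, each vec3&lt;f32&gt; member is aligned to 16 bytes, so this struct occupies 32 bytes per vertex, and the render pipeline's vertex buffer layout must use that same stride and the matching attribute offsets. A minimal host-side sketch of the size calculation and layout (the shader locations assume the VertexInput struct above):

```javascript
// Layout for the Vertex struct above: vec3<f32> members align to 16 bytes
// in WGSL storage layout, so each vertex occupies 32 bytes, not 24.
const VEC3_ALIGNED_SIZE = 16; // vec3<f32> padded to 16 bytes
const VERTEX_STRIDE = 2 * VEC3_ALIGNED_SIZE; // position + color

function vertexBufferSize(vertexCount) {
  return vertexCount * VERTEX_STRIDE;
}

// Vertex state the render pipeline would use to read the same buffer
const vertexBufferLayout = {
  arrayStride: VERTEX_STRIDE,
  attributes: [
    { shaderLocation: 0, offset: 0, format: "float32x3" },  // position
    { shaderLocation: 1, offset: 16, format: "float32x3" }, // color
  ],
};
```

Getting the stride wrong is one of the most common compute-to-render bugs: the compute shader writes at 32-byte intervals while the vertex fetch reads at 24, producing scrambled geometry with no validation error.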
Interactive: Procedural Geometry
Simulates compute generating vertex positions and colors, then rendering as a line strip with points. In real WebGPU, this would run entirely on GPU.
Procedural Geometry
Compute shaders excel at generating geometry procedurally. Instead of loading static meshes, you generate them on the fly—adapting to parameters, player position, or simulation state.
Consider a simple terrain heightmap. A compute shader samples noise at each grid position, computing vertex positions and normals:
struct TerrainVertex {
    position: vec3<f32>,
    normal: vec3<f32>,
}

@group(0) @binding(0) var<storage, read_write> terrain: array<TerrainVertex>;
@group(0) @binding(1) var<uniform> params: TerrainParams;

@compute @workgroup_size(8, 8)
fn generate_terrain(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let x = global_id.x;
    let z = global_id.y;
    if (x >= params.width || z >= params.depth) { return; }

    let idx = z * params.width + x;
    let world_x = f32(x) * params.scale;
    let world_z = f32(z) * params.scale;

    // Sample noise for height
    let height = fbm_noise(vec2<f32>(world_x, world_z) * 0.1) * params.amplitude;
    terrain[idx].position = vec3<f32>(world_x, height, world_z);

    // Compute normal from neighbors (simplified finite differences)
    let dx = fbm_noise(vec2<f32>(world_x + 0.1, world_z) * 0.1) * params.amplitude - height;
    let dz = fbm_noise(vec2<f32>(world_x, world_z + 0.1) * 0.1) * params.amplitude - height;
    terrain[idx].normal = normalize(vec3<f32>(-dx, 1.0, -dz));
}

The advantage over CPU generation is massive. A 512×512 terrain requires over 262,000 vertices. On the CPU, generating this takes tens of milliseconds. On the GPU, it completes in under a millisecond.
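The dispatch must cover the whole grid: with @workgroup_size(8, 8), each workgroup handles an 8×8 tile, so the workgroup count is the grid size divided by 8, rounded up (the shader's bounds check discards the overshoot). A sketch of the host-side calculation, using the 512×512 figures from the text:

```javascript
// Workgroup count needed to cover a width × depth grid with 8×8 workgroups.
// Rounding up matters when the grid size is not a multiple of 8.
function terrainDispatchSize(width, depth, workgroupSize = 8) {
  return {
    x: Math.ceil(width / workgroupSize),
    y: Math.ceil(depth / workgroupSize),
  };
}

const { x, y } = terrainDispatchSize(512, 512);
// pass.dispatchWorkgroups(x, y) launches 64 × 64 workgroups,
// i.e. 64 * 64 * 8 * 8 = 262,144 invocations — one per vertex
```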
Indirect Drawing
Indirect drawing takes compute-to-render further: compute not only generates vertex data but also determines how much to draw.
Normal draw calls specify the vertex count as a fixed number:
renderPass.draw(1000); // Always draw 1000 vertices

With indirect drawing, these parameters come from a buffer:
// Indirect buffer structure for draw()
// [vertexCount, instanceCount, firstVertex, firstInstance]
const indirectBuffer = device.createBuffer({
  size: 16,
  usage: GPUBufferUsage.INDIRECT | GPUBufferUsage.STORAGE,
});

// Later, in the render pass
renderPass.drawIndirect(indirectBuffer, 0);

Why does this matter? Because compute can decide at runtime how many primitives to render. Consider a particle system where particles can die:
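The indirect buffer is just four 32-bit unsigned integers. One way to seed it from the CPU (for instance, before the first frame) is to pack the arguments into a Uint32Array and upload it with writeBuffer; the quad-per-particle values below are illustrative:

```javascript
// Pack draw() arguments in the order drawIndirect expects:
// [vertexCount, instanceCount, firstVertex, firstInstance]
function packDrawArgs(vertexCount, instanceCount, firstVertex = 0, firstInstance = 0) {
  return new Uint32Array([vertexCount, instanceCount, firstVertex, firstInstance]);
}

const args = packDrawArgs(4, 0); // 4-vertex quad, zero instances until compute fills it in
// device.queue.writeBuffer(indirectBuffer, 0, args); // upload (needs a GPUDevice)
```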
struct IndirectArgs {
    vertexCount: u32,
    instanceCount: u32,
    firstVertex: u32,
    firstInstance: u32,
}

// Compacting into the same buffer being read would race with other
// invocations, so alive particles are written to a separate output buffer
@group(0) @binding(0) var<storage, read> particlesIn: array<Particle>;
@group(0) @binding(1) var<storage, read_write> particlesOut: array<Particle>;
@group(0) @binding(2) var<storage, read_write> indirect: IndirectArgs;
@group(0) @binding(3) var<storage, read_write> aliveCount: atomic<u32>;

@compute @workgroup_size(256)
fn update_particles(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let idx = global_id.x;
    if (idx >= arrayLength(&particlesIn)) { return; }

    var p = particlesIn[idx];

    // Update particle
    p.velocity.y -= 0.01;  // Gravity
    p.position += p.velocity;
    p.life -= 0.016;       // Decrease lifetime

    if (p.life > 0.0) {
        // Particle survives - claim an output slot
        // (aliveCount must be reset to zero before each dispatch)
        let outIdx = atomicAdd(&aliveCount, 1u);
        particlesOut[outIdx] = p;  // Compact alive particles
    }
}

@compute @workgroup_size(1)
fn prepare_indirect() {
    let count = atomicLoad(&aliveCount);
    indirect.vertexCount = 4u;       // Vertices per particle quad
    indirect.instanceCount = count;  // Number of alive particles
    indirect.firstVertex = 0u;
    indirect.firstInstance = 0u;
}

Interactive: Indirect Draw
The compute shader determines instanceCount based on alive particles. The render pass reads this from the indirect buffer, drawing only what exists. No CPU round-trip needed.
Without indirect drawing, you would need to read the particle count back to the CPU, then issue a draw call with that count. This CPU-GPU round-trip adds latency and kills performance. With indirect drawing, everything stays on the GPU.
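The compaction logic is easier to verify with a CPU reference model before debugging it on the GPU. The sketch below mirrors the shader's behavior serially, using hypothetical plain-object particles; the GPU version does the same work in parallel, claiming output slots with an atomic counter:

```javascript
// Serial reference for the GPU update + compact pass: apply gravity,
// integrate, age, and keep only particles whose lifetime stays positive.
function updateAndCompact(particles, dt = 0.016, gravity = 0.01) {
  const alive = [];
  for (const p of particles) {
    const vy = p.velocity.y - gravity; // velocity update first, as in the shader
    const next = {
      position: { x: p.position.x + p.velocity.x, y: p.position.y + vy },
      velocity: { x: p.velocity.x, y: vy },
      life: p.life - dt,
    };
    if (next.life > 0.0) alive.push(next); // compacted output slot
  }
  // alive.length is what prepare_indirect writes into instanceCount
  return alive;
}
```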
Pipeline Barriers
When compute and render share buffers, you must ensure correct ordering. WebGPU handles most synchronization automatically through its command submission model, but understanding the underlying requirements helps you structure code correctly.
The key rule: all commands in a command encoder execute in order, and commands from different encoders execute in submission order.
// Correct ordering
const computeEncoder = device.createCommandEncoder();
const computePass = computeEncoder.beginComputePass();
computePass.setPipeline(computePipeline);
computePass.setBindGroup(0, computeBindGroup);
computePass.dispatchWorkgroups(workgroupCount);
computePass.end();

const renderEncoder = device.createCommandEncoder();
const renderPass = renderEncoder.beginRenderPass(renderPassDescriptor);
renderPass.setPipeline(renderPipeline);
renderPass.setVertexBuffer(0, sharedBuffer); // Same buffer
renderPass.draw(vertexCount);
renderPass.end();

// Submit in order: compute first, then render
device.queue.submit([computeEncoder.finish(), renderEncoder.finish()]);

WebGPU guarantees that when the render pass begins, compute writes to sharedBuffer are complete and visible.
You can also use a single command encoder with multiple passes. The passes execute sequentially:
const encoder = device.createCommandEncoder();

// Compute pass first
const computePass = encoder.beginComputePass();
// ... dispatch compute work
computePass.end();

// Render pass second (automatically waits for compute)
const renderPass = encoder.beginRenderPass(descriptor);
// ... draw using compute output
renderPass.end();

device.queue.submit([encoder.finish()]);

Particle Systems Preview
Particle systems are the classic compute-to-render application. Thousands of particles, each with position, velocity, color, and lifetime, updated every frame in parallel.
Particle System Preview
A preview of GPU particle systems. In production, all simulation runs in compute shaders, with indirect draw calls rendering only alive particles. The full implementation is covered in the Simulation section.
The full implementation (covered in the Simulation section) involves:
- Spawn compute shader: Creates new particles, writing to available slots
- Update compute shader: Applies physics, ages particles, removes dead ones
- Compact compute shader: Optionally packs alive particles contiguously
- Prepare indirect: Sets draw call parameters based on alive count
- Render pipeline: Draws particles as point sprites or billboards
Each step reads from and writes to shared buffers. The entire simulation runs without CPU involvement beyond submitting the command buffer.
Design Considerations
When building compute-to-render systems, consider:
Buffer sizing: Compute might produce a variable amount of output. Size buffers for the worst case, and use indirect drawing to draw only the actual count.
Memory layout: Vertex buffers need specific layouts matching the vertex shader attributes. Plan the struct layout before writing compute code.
Double buffering: For effects that read previous frame state (like simulations), use two buffers and swap each frame. This prevents read-write hazards.
Debugging: Compute errors are silent. Add validation checks, use timestamp queries to profile, and start with small test cases.
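The double-buffering advice above reduces to simple index arithmetic on the host: keep two buffers, plus two bind groups binding them in opposite read/write roles, and flip which is which every frame. A minimal sketch with hypothetical names:

```javascript
// Ping-pong state for a two-buffer simulation: frame N reads buffer A and
// writes buffer B; frame N+1 swaps the roles, avoiding read-write hazards.
function pingPong(frameIndex) {
  const read = frameIndex % 2; // buffer the compute pass reads this frame
  const write = 1 - read;      // buffer the compute pass writes this frame
  return { read, write };
}

// Usage each frame (bindGroups[i] binds buffers[i] as read and
// buffers[1 - i] as read_write — setup not shown):
// pass.setBindGroup(0, bindGroups[pingPong(frame).read]);
```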
Key Takeaways
- Shared buffers connect compute and render—write in compute, read in render
- Buffers need both STORAGE and VERTEX (or INDEX) usage flags
- Indirect drawing lets compute determine draw call parameters, avoiding CPU round-trips
- WebGPU handles synchronization automatically through command submission order
- Particle systems, procedural geometry, and animation are classic use cases
- The entire GPU frame can run without CPU intervention beyond command submission