Your First Compute Shader
Parallel computation on the GPU
Beyond Rendering
Everything we have seen so far uses the GPU for rendering—transforming vertices, rasterizing triangles, shading pixels. The render pipeline is optimized for one specific task: turning 3D geometry into 2D images.
But GPUs are parallel processors at their core. All those thousands of cores can do more than push pixels. Compute shaders unlock the GPU for general-purpose computation: physics simulations, image processing, machine learning, cryptography—anything that benefits from massive parallelism.
The render pipeline is a fixed sequence: vertex shader → rasterizer → fragment shader. Compute shaders break free from this structure. There are no vertices, no fragments, no render targets. You simply launch threads that execute your code in parallel.
The Dispatch Model
In a fragment shader, the GPU automatically launches one thread per pixel. You do not choose how many threads to run; the output resolution determines it.
Compute shaders work differently. You explicitly dispatch a grid of workgroups by calling dispatchWorkgroups with up to three dimensions. The GPU then launches every thread in those workgroups, each one running your shader code.
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3u) {
// This code runs in parallel across all dispatched threads
// id.x uniquely identifies this thread
}

The @workgroup_size(64) attribute declares that threads are organized into groups of 64. When you dispatch, you specify how many workgroups to launch. The total thread count equals workgroup size multiplied by dispatch count.
Interactive: Dispatch Grid
Hover over individual threads to see their IDs. Each workgroup runs its threads in parallel.
If you dispatch 4 workgroups of size 64, you get 256 threads. If you dispatch 100 workgroups, you get 6,400 threads. The GPU schedules these threads across its compute units, running as many in parallel as the hardware allows.
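This arithmetic is worth writing down once, since every dispatch starts with it:

```javascript
// Total threads launched = threads per workgroup × workgroups dispatched.
function totalThreads(workgroupSize, workgroupCount) {
  return workgroupSize * workgroupCount;
}

totalThreads(64, 4);   // 256
totalThreads(64, 100); // 6400
```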
A Minimal Example: Doubling Numbers
Let us write a compute shader that doubles every number in an array. The shader reads from an input buffer, multiplies each value by 2, and writes to an output buffer.
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3u) {
output[id.x] = input[id.x] * 2.0;
}

Each thread processes one element. Thread 0 doubles input[0], thread 1 doubles input[1], and so on. If the input array has 1,000 elements, we dispatch enough workgroups to cover all 1,000: ceil(1000 / 64) = 16 workgroups.
The global_invocation_id builtin gives each thread its unique index across the entire dispatch. Thread 0 in workgroup 0 has id.x = 0. Thread 63 in workgroup 0 has id.x = 63. Thread 0 in workgroup 1 has id.x = 64.
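Under the hood, the global index is just the workgroup index times the workgroup size, plus the thread's position within its workgroup. In JavaScript terms (the 64 matches the @workgroup_size above):

```javascript
// global_invocation_id.x = workgroup index × workgroup size + local thread index
function globalIdX(workgroupId, localId, workgroupSize = 64) {
  return workgroupId * workgroupSize + localId;
}

globalIdX(0, 0);  // 0  (thread 0 in workgroup 0)
globalIdX(0, 63); // 63 (thread 63 in workgroup 0)
globalIdX(1, 0);  // 64 (thread 0 in workgroup 1)
```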
Interactive: Buffer Doubling
Each thread reads one input value, doubles it, and writes to the output buffer. In reality, all threads run in parallel—the animation shows them sequentially for clarity.
The computation is trivial, but the pattern scales. With thousands of threads running in parallel, operations that would take milliseconds on a CPU complete in microseconds on a GPU.
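To build intuition, the whole dispatch can be mimicked sequentially on the CPU. This is a mental model only: the GPU runs these iterations in parallel, not in a loop.

```javascript
// CPU mental model of the doubling dispatch: one loop iteration per thread.
function simulateDoublingDispatch(input, workgroupSize = 64) {
  const workgroups = Math.ceil(input.length / workgroupSize);
  const output = new Float32Array(input.length);
  for (let id = 0; id < workgroups * workgroupSize; id++) {
    if (id >= input.length) continue; // threads past the end do nothing
    output[id] = input[id] * 2.0;
  }
  return output;
}
```

With 1,000 inputs the loop runs 1,024 times, matching the 16 dispatched workgroups of 64 threads each.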
Setting Up the Pipeline
Creating a compute pipeline involves fewer steps than a render pipeline. No vertex layouts, no render targets—just the shader and bind groups.
// Create the shader module
const shaderModule = device.createShaderModule({
code: `
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3u) {
output[id.x] = input[id.x] * 2.0;
}
`
});
// Create the compute pipeline
const pipeline = device.createComputePipeline({
layout: 'auto',
compute: {
module: shaderModule,
entryPoint: 'main'
}
});

The pipeline needs buffers to read from and write to. Storage buffers with the STORAGE usage flag work for compute shaders.
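One detail the buffer snippet below assumes: dataSize is a byte count, not an element count. For example, for 1,000 f32 values (the count is an illustrative choice):

```javascript
// Buffer sizes are in bytes: element count × 4 bytes per 32-bit float.
const elementCount = 1000; // example value
const dataSize = elementCount * Float32Array.BYTES_PER_ELEMENT; // 4000 bytes
```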
const inputBuffer = device.createBuffer({
size: dataSize,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
const outputBuffer = device.createBuffer({
size: dataSize,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});

Dispatching Work
Compute work is recorded in its own kind of pass: a compute pass rather than a render pass. Inside the pass, you set the pipeline, bind your buffers, and dispatch workgroups.
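The bindGroup used in the pass below has to be created first. A minimal sketch, assuming the pipeline, inputBuffer, and outputBuffer from the earlier snippets; since the pipeline used layout: 'auto', the bind group layout is queried from the pipeline itself:

```javascript
// Wire the storage buffers to the shader's @group(0) @binding(0) and @binding(1).
// Assumes `device`, `pipeline`, `inputBuffer`, and `outputBuffer` from above.
const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: { buffer: inputBuffer } },
    { binding: 1, resource: { buffer: outputBuffer } },
  ],
});
```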
const commandEncoder = device.createCommandEncoder();
const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(pipeline);
passEncoder.setBindGroup(0, bindGroup);
passEncoder.dispatchWorkgroups(Math.ceil(elementCount / 64));
passEncoder.end();
device.queue.submit([commandEncoder.finish()]);

The dispatchWorkgroups call specifies how many workgroups to launch. With a workgroup size of 64 and 1,000 elements, we dispatch 16 workgroups (1,024 threads). The extra 24 threads do no harm—their indices exceed the array bounds, and a well-written shader checks for this.
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3u) {
if (id.x >= arrayLength(&input)) {
return;
}
output[id.x] = input[id.x] * 2.0;
}

Reading Results Back
GPU memory is not directly accessible from JavaScript. To see the compute results, you must copy the output buffer to a mappable staging buffer, then read the data.
// Create a staging buffer for readback
const stagingBuffer = device.createBuffer({
size: dataSize,
usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});
// Copy output to staging
const copyEncoder = device.createCommandEncoder();
copyEncoder.copyBufferToBuffer(outputBuffer, 0, stagingBuffer, 0, dataSize);
device.queue.submit([copyEncoder.finish()]);
// Map and read
await stagingBuffer.mapAsync(GPUMapMode.READ);
const resultData = new Float32Array(stagingBuffer.getMappedRange());
console.log(resultData); // The doubled values
stagingBuffer.unmap();

This mapping process is asynchronous. The GPU might still be processing when you call mapAsync, so the promise resolves only after the GPU finishes and the data is available for CPU access. Note that the Float32Array is a view into the mapped range; calling unmap() detaches its underlying memory, so copy the data with slice() first if you need the values afterward.
Compute vs Render
When should you use compute shaders versus the render pipeline?
Interactive: When to Use Each
Select different use cases to see which approach fits best. The percentages indicate suitability, not performance.
The render pipeline excels at its designed purpose: rasterizing geometry into images. The fixed-function stages (vertex assembly, primitive assembly, rasterization, blending) are highly optimized for this workload. If you are drawing triangles to a screen, the render pipeline is the right choice.
Compute shaders shine when your problem does not fit the render model. Physics simulations update particle positions—no triangles involved. Image filters process textures without rendering anything. Machine learning inference performs matrix multiplications across thousands of weights.
Some workloads use both. A particle system might update positions in a compute shader, then render the particles using the render pipeline. The compute pass writes to a buffer that the vertex shader reads in the next frame.
The key distinction is flexibility versus specialization. The render pipeline is fast because it makes assumptions about your workload. Compute shaders are general because they make no assumptions—you control everything.
Thread Organization
You have already seen global_invocation_id, but compute shaders provide several built-in values for thread identification:
- global_invocation_id: Unique index across the entire dispatch
- local_invocation_id: Index within the current workgroup (0 to workgroup_size - 1)
- workgroup_id: Which workgroup this thread belongs to
- num_workgroups: Total number of dispatched workgroups
These indices can be 1D, 2D, or 3D. For image processing, you might use a 2D dispatch where id.xy corresponds to pixel coordinates. For volume data, a 3D dispatch makes sense.
@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) id: vec3u) {
let x = id.x;
let y = id.y;
// Process pixel (x, y)
}

With a workgroup size of 8×8, each workgroup contains 64 threads arranged in a grid. Dispatch (10, 10) workgroups and you cover an 80×80 image.
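Covering an arbitrary image works the same way as the 1D case, applied once per dimension:

```javascript
// Workgroups needed to cover a width × height image with 8×8 workgroups.
function workgroupsFor2D(width, height, wgX = 8, wgY = 8) {
  return [Math.ceil(width / wgX), Math.ceil(height / wgY)];
}

workgroupsFor2D(80, 80);     // [10, 10]
workgroupsFor2D(1920, 1080); // [240, 135]
```

As in the 1D case, edge workgroups may contain threads that fall outside the image, so the shader should check id.x and id.y against the image dimensions before writing.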
Key Takeaways
- Compute shaders unlock the GPU for general-purpose parallel computation, not just rendering
- You explicitly dispatch workgroups; total threads = workgroup_size × dispatch_count
- global_invocation_id uniquely identifies each thread across the entire dispatch
- Storage buffers hold input and output data for compute shaders
- Reading results requires copying to a staging buffer and mapping it for CPU access
- Use compute when your problem does not fit the render model; use render for drawing geometry