Optimization Strategies

Memory coalescing and batching

GPU performance comes down to two resources: compute and memory bandwidth. Modern GPUs have enormous compute capacity but comparatively limited memory throughput, so most real-world bottlenecks stem from memory access patterns rather than arithmetic. Improving how you access memory often yields larger gains than further algorithmic tuning.

Memory Coalescing

GPUs read memory in chunks. When threads in a workgroup access consecutive memory addresses, the hardware combines these into a single efficient transaction. When threads access scattered addresses, each access becomes a separate transaction, wasting bandwidth.

[Interactive demo: coalesced vs. scattered memory access.]
Coalesced access: 8 threads read 8 consecutive addresses in 1 transaction at 100% bandwidth efficiency. Scattered: each thread triggers a separate transaction, wasting bandwidth.

Consider a compute shader processing an image. Each thread handles one pixel. If thread 0 reads pixel 0, thread 1 reads pixel 1, and so on, all reads coalesce into large sequential fetches. But if thread 0 reads pixel 0, thread 1 reads pixel 1000, thread 2 reads pixel 42—each read becomes a separate memory transaction.

// Good: Consecutive access pattern
@compute @workgroup_size(256)
fn process(@builtin(global_invocation_id) id: vec3u) {
    let index = id.x;  // Thread N accesses element N
    let value = data[index];
    // ...
}

// Bad: Scattered access pattern
@compute @workgroup_size(256)
fn process(@builtin(global_invocation_id) id: vec3u) {
    let index = someIndirectionTable[id.x];  // Random access
    let value = data[index];
    // ...
}

When indirect access is unavoidable—like following an index buffer—consider reorganizing your data to improve locality. Sort indices so nearby threads access nearby memory. The preprocessing cost pays off if the data is accessed multiple times.
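That reordering can be sketched in a few lines. A hedged example in plain TypeScript (`sortIndicesForLocality` and the sample indices are illustrative; if the processing order of results matters, you would also need to permute the outputs to match):

```typescript
// Sketch: reorder an indirection table so adjacent threads touch adjacent memory.
// Sorting indices by target address means thread N and thread N+1 read
// neighbouring elements of the underlying buffer.
function sortIndicesForLocality(indices: number[]): number[] {
  return [...indices].sort((a, b) => a - b);
}

const scattered = [1000, 0, 42, 1001, 1, 43];
const localityFriendly = sortIndicesForLocality(scattered);
// → [0, 1, 42, 43, 1000, 1001]: consecutive threads now hit nearby addresses
```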

Reducing Overdraw

Overdraw occurs when the GPU processes fragments that never reach the screen. A pixel might be computed multiple times if objects overlap, with earlier results discarded by depth testing.

[Interactive demo: overdraw visualization.]
In the example scene, 39,800 fragment invocations produce only 26,400 visible fragments (66% efficiency). Front-to-back sorting with early-Z reduces wasted fragment shader work; the overdraw view highlights pixels drawn multiple times.

The simplest fix: render opaque objects front-to-back. When the depth test runs before the fragment shader (early-Z), fragments behind existing geometry get rejected without running the shader. But this only works if:

  1. Depth testing is enabled
  2. The fragment shader doesn't write to frag_depth
  3. The shader doesn't discard fragments

// Sort opaque objects by distance to camera
opaqueObjects.sort((a, b) => {
  const distA = distanceToCamera(a);
  const distB = distanceToCamera(b);
  return distA - distB;  // Front to back
});
 
// Render in sorted order
for (const obj of opaqueObjects) {
  renderObject(obj);
}

For transparent objects, you typically need back-to-front ordering instead (for correct blending), which unavoidably causes overdraw. Techniques like order-independent transparency exist but add complexity.
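The transparent pass mirrors the opaque sort with the comparison reversed. A sketch (the `Drawable` shape and precomputed distances are assumptions for illustration):

```typescript
// Sketch: transparent objects sorted back-to-front so blending composites
// correctly — farther objects draw first, nearer objects blend on top.
interface Drawable {
  name: string;
  distance: number; // precomputed distance to camera
}

function sortBackToFront(objects: Drawable[]): Drawable[] {
  return [...objects].sort((a, b) => b.distance - a.distance);
}

const glass = [
  { name: "near pane", distance: 2 },
  { name: "far pane", distance: 10 },
];
// sortBackToFront(glass) draws "far pane" first, "near pane" last
```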

Another source of overdraw: fullscreen effects that cover pixels unnecessarily. If your bloom pass only affects bright areas, consider using a stencil mask to skip dark regions.

Batching Draw Calls

Each draw call has CPU overhead: setting up state, issuing commands, and synchronizing with the GPU. A scene with 10,000 objects, each drawn separately, spends more time on overhead than actual rendering.

[Interactive demo: individual vs. batched draws.]
With 100 separate draw calls, CPU time (~1.00 ms) dwarfs GPU time (~0.10 ms): each object requires its own draw call, and CPU overhead dominates. Batching the same work is correspondingly faster.

Instancing addresses this directly. Instead of 1,000 draw calls for 1,000 trees, issue one instanced draw call. The vertex shader receives an instance ID to differentiate objects:

struct InstanceData {
    transform: mat4x4f,
    color: vec4f,
}
 
@group(0) @binding(0) var<storage, read> instances: array<InstanceData>;
 
@vertex
fn main(
    @builtin(instance_index) instanceIdx: u32,
    @location(0) position: vec3f
) -> @builtin(position) vec4f {
    let instance = instances[instanceIdx];
    return instance.transform * vec4f(position, 1.0);
}

When objects can't be instanced (different meshes), merge them into a single buffer where possible. A level's static geometry—walls, floors, props—can often become one mega-mesh with one draw call.
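Mesh merging itself is mechanical: concatenate vertex data and rebase each mesh's indices by the running vertex count. A CPU-side sketch (the `Mesh` layout — packed xyz floats — is an assumption):

```typescript
// Sketch: merge several static meshes into one vertex/index pair so a
// single draw call covers all of them.
interface Mesh {
  vertices: number[]; // packed xyz triples, 3 floats per vertex
  indices: number[];
}

function mergeMeshes(meshes: Mesh[]): Mesh {
  const vertices: number[] = [];
  const indices: number[] = [];
  let baseVertex = 0;
  for (const mesh of meshes) {
    vertices.push(...mesh.vertices);
    // Rebase this mesh's indices past the vertices already emitted
    indices.push(...mesh.indices.map((i) => i + baseVertex));
    baseVertex += mesh.vertices.length / 3;
  }
  return { vertices, indices };
}
```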

Texture atlasing helps here too. If every object binds a different texture, batching breaks down. Pack textures into atlases and pass per-instance UV offsets instead.
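The per-instance remap is a scale-and-offset into the atlas. A sketch assuming a uniform grid of equally sized tiles (`atlasUV` and the row-major tile numbering are illustrative):

```typescript
// Sketch: map a mesh's local UV (0..1) into its tile of a texture atlas.
// Assumes a uniform grid; `tileIndex` selects the tile, numbered row-major.
function atlasUV(
  u: number,
  v: number,
  tileIndex: number,
  tilesPerRow: number
): [number, number] {
  const scale = 1 / tilesPerRow;
  const col = tileIndex % tilesPerRow;
  const row = Math.floor(tileIndex / tilesPerRow);
  return [(col + u) * scale, (row + v) * scale];
}

// Tile 5 of a 4×4 atlas sits at column 1, row 1:
// atlasUV(0, 0, 5, 4) → [0.25, 0.25]
```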

Pipeline State Changes

Switching render pipelines is expensive. Each pipeline represents compiled GPU state—shaders, blend modes, depth settings. Minimize switches by grouping objects that share pipelines:

// Group by pipeline, then render each group
const byPipeline = new Map<GPURenderPipeline, GameObject[]>();
for (const obj of objects) {
  const pipeline = obj.material.pipeline;
  if (!byPipeline.has(pipeline)) {
    byPipeline.set(pipeline, []);
  }
  byPipeline.get(pipeline)!.push(obj);
}
 
for (const [pipeline, group] of byPipeline) {
  renderPass.setPipeline(pipeline);
  for (const obj of group) {
    // Render all objects using this pipeline
    renderPass.setBindGroup(0, obj.bindGroup);
    renderPass.draw(obj.vertexCount);
  }
}

[Interactive demo: pipeline switch ordering.]
A suboptimal render order — Floor (pbr), Wall 1 (pbr), Wall 2 (pbr), Lamp (emissive), Window (glass), Table (pbr), Plant (foliage), Light 2 (emissive) — incurs 6 pipeline switches across 8 draw calls. Sorting objects by pipeline minimizes expensive state changes.

Bind group changes are cheaper than pipeline changes, but still have cost. The hierarchy of state change expense, roughly:

  1. Pipeline change (most expensive)
  2. Bind group 0 change
  3. Bind group 1+ change (less expensive)
  4. Push constants (cheapest, if available)

Structure your bind groups so frequently-changing data (per-object transforms) lives in higher-numbered groups, while shared data (scene-wide lighting, camera matrices) lives in group 0.
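That cost hierarchy suggests a render-queue sort key: order draws first by pipeline, then by bind group, so the most expensive changes happen least often. A sketch with hypothetical numeric IDs:

```typescript
// Sketch: sort draws so pipeline changes (most expensive) are minimized
// first, then bind group changes within each pipeline run.
interface Draw {
  pipelineId: number;
  bindGroupId: number;
}

function sortByStateCost(draws: Draw[]): Draw[] {
  return [...draws].sort(
    (a, b) => a.pipelineId - b.pipelineId || a.bindGroupId - b.bindGroupId
  );
}

// Count how many pipeline switches a given draw order would incur
function countSwitches(draws: Draw[]): number {
  let switches = 0;
  for (let i = 1; i < draws.length; i++) {
    if (draws[i].pipelineId !== draws[i - 1].pipelineId) switches++;
  }
  return switches;
}
```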

Compute Shader Optimization

Compute shaders have their own optimization concerns. Workgroup size matters: too small underutilizes the hardware's SIMD width, while too large limits how many workgroups can be resident concurrently.

A reasonable starting point: use workgroup sizes that are multiples of the warp/wavefront size (32 on NVIDIA; 32 or 64 on AMD, depending on architecture). Common choices are 64, 128, or 256 threads per workgroup.
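Whatever size you choose, the dispatch must round up so a partial final workgroup still covers the tail of the data — a ceiling division, paired with a bounds check in the shader. A sketch:

```typescript
// Sketch: how many workgroups to dispatch for `elements` items.
// Rounds up so the last, possibly partial, workgroup covers the tail;
// the shader then guards with something like `if (id.x >= elementCount) { return; }`.
function dispatchCount(elements: number, workgroupSize: number): number {
  return Math.ceil(elements / workgroupSize);
}

// 1000 elements at workgroup_size(256) → dispatch 4 workgroups (last one partial)
```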

// Decent default for many workloads
@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) id: vec3u) {
    // ...
}

For 2D data like images, use 2D workgroups:

// 16×16 = 256 threads, good for image processing
@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) id: vec3u) {
    let x = id.x;
    let y = id.y;
    // ...
}

Shared memory (workgroup variables) enables cooperation between threads. When multiple threads need the same data, load it once into shared memory rather than having each thread fetch from global memory:

// Storage buffer assumed bound at group 0; workgroup dispatch assumed 1D
@group(0) @binding(0) var<storage, read> globalData: array<f32>;

var<workgroup> sharedTile: array<f32, 256>;

@compute @workgroup_size(16, 16)
fn main(
    @builtin(local_invocation_index) localIdx: u32,
    @builtin(workgroup_id) groupId: vec3u
) {
    // Each thread loads one element of the workgroup's 256-element tile
    sharedTile[localIdx] = globalData[groupId.x * 256u + localIdx];

    workgroupBarrier();  // Ensure all loads complete

    // Now all threads can read any element from shared memory
}

When Not to Optimize

Premature optimization wastes development time on code that isn't a bottleneck. Profile first. If your fragment shader takes 0.1ms and your physics simulation takes 10ms, optimizing the shader won't help.

Some "optimizations" hurt readability without measurable benefit. Manually unrolling a loop that the compiler already unrolls just makes the code harder to maintain.

Target the actual bottleneck. If you're memory-bound, reducing arithmetic won't help. If you're compute-bound, improving memory access won't help. Timestamp queries (from the previous chapter) reveal which constraint you're hitting.

Key Takeaways

  • Memory bandwidth limits most GPU workloads—optimize access patterns before compute
  • Coalesced memory access (consecutive threads accessing consecutive addresses) enables efficient bulk transfers
  • Front-to-back rendering of opaque objects enables early-Z rejection, skipping wasted fragment shader work
  • Batching draw calls through instancing or mesh merging reduces CPU overhead
  • Group objects by pipeline to minimize expensive state changes
  • Profile before optimizing—fix the actual bottleneck, not the assumed one