Optimization Strategies
Memory coalescing and batching
GPU performance comes down to two resources: compute and memory bandwidth. Modern GPUs have enormous compute capacity but comparatively limited memory throughput, so most real-world bottlenecks stem from memory access patterns, not arithmetic. Improving how memory is accessed often yields larger gains than tuning the arithmetic itself.
Memory Coalescing
GPUs read memory in chunks. When threads in a workgroup access consecutive memory addresses, the hardware combines these into a single efficient transaction. When threads access scattered addresses, each access becomes a separate transaction, wasting bandwidth.
Interactive: Coalesced vs scattered memory access
Coalesced access: 8 threads read 8 consecutive addresses in 1 transaction. Scattered: each thread triggers a separate transaction, wasting bandwidth.
Consider a compute shader processing an image. Each thread handles one pixel. If thread 0 reads pixel 0, thread 1 reads pixel 1, and so on, all reads coalesce into large sequential fetches. But if thread 0 reads pixel 0, thread 1 reads pixel 1000, thread 2 reads pixel 42—each read becomes a separate memory transaction.
```wgsl
// Good: Consecutive access pattern
@compute @workgroup_size(256)
fn process(@builtin(global_invocation_id) id: vec3u) {
    let index = id.x; // Thread N accesses element N
    let value = data[index];
    // ...
}
```

```wgsl
// Bad: Scattered access pattern
@compute @workgroup_size(256)
fn process(@builtin(global_invocation_id) id: vec3u) {
    let index = someIndirectionTable[id.x]; // Random access
    let value = data[index];
    // ...
}
```

When indirect access is unavoidable—like following an index buffer—consider reorganizing your data to improve locality. Sort indices so nearby threads access nearby memory. The preprocessing cost pays off if the data is accessed multiple times.
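One way to sketch that preprocessing step (an illustrative example, not from the text—names are hypothetical, and it assumes per-element work is order-independent so reordering is safe):

```typescript
// Sketch: reorder an indirection table so that consecutive threads
// read nearby elements, turning scattered fetches into coalesced ones.
// Valid only when the per-element work doesn't depend on order.
function sortForLocality(indices: Uint32Array): Uint32Array {
  // Typed-array sort is numeric ascending by default; copying first
  // leaves the caller's buffer untouched.
  return new Uint32Array(indices).sort();
}

const scattered = new Uint32Array([900, 3, 512, 4, 2, 901]);
const sorted = sortForLocality(scattered);
// Neighbors in the sorted table are now neighbors in memory,
// so adjacent threads fetch adjacent addresses.
```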
Reducing Overdraw
Overdraw occurs when the GPU processes fragments that never reach the screen. A pixel might be computed multiple times if objects overlap, with earlier results discarded by depth testing.
Interactive: Overdraw visualization
Front-to-back sorting with early-Z reduces wasted fragment shader work. Toggle overdraw view to see which pixels are drawn multiple times.
The simplest fix: render opaque objects front-to-back. When the depth test runs before the fragment shader (early-Z), fragments behind existing geometry get rejected without running the shader. But this only works if:
- Depth testing is enabled
- The fragment shader doesn't write to `frag_depth`
- The shader doesn't discard fragments
```typescript
// Sort opaque objects by distance to camera
opaqueObjects.sort((a, b) => {
  const distA = distanceToCamera(a);
  const distB = distanceToCamera(b);
  return distA - distB; // Front to back
});

// Render in sorted order
for (const obj of opaqueObjects) {
  renderObject(obj);
}
```

For transparent objects, you typically need back-to-front ordering instead (for correct blending), which unavoidably causes overdraw. Techniques like order-independent transparency exist but add complexity.
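Putting both orderings together might look like this (a sketch with precomputed distances and an illustrative `Drawable` type, not an API from the text):

```typescript
// Sketch: opposite sort orders for opaque vs transparent objects.
// `dist` stands in for a distance-to-camera computation.
interface Drawable { dist: number; transparent: boolean }

function sortForRendering(objects: Drawable[]): Drawable[] {
  const opaque = objects
    .filter(o => !o.transparent)
    .sort((a, b) => a.dist - b.dist);   // front to back: early-Z rejection
  const transparent = objects
    .filter(o => o.transparent)
    .sort((a, b) => b.dist - a.dist);   // back to front: correct blending
  return [...opaque, ...transparent];   // draw opaque first, then blend
}
```

Drawing the opaque list first also means transparent fragments hidden behind opaque geometry still get depth-rejected.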
Another source of overdraw: fullscreen effects that cover pixels unnecessarily. If your bloom pass only affects bright areas, consider using a stencil mask to skip dark regions.
Batching Draw Calls
Each draw call has CPU overhead: setting up state, issuing commands, and synchronizing with the GPU. A scene with 10,000 objects, each drawn separately, spends more time on overhead than actual rendering.
Interactive: Individual draws vs batched draws
Each object requires a separate draw call. CPU overhead dominates.
Instancing addresses this directly. Instead of 1,000 draw calls for 1,000 trees, issue one instanced draw call. The vertex shader receives an instance ID to differentiate objects:
```wgsl
struct InstanceData {
    transform: mat4x4f,
    color: vec4f,
}

@group(0) @binding(0) var<storage, read> instances: array<InstanceData>;

@vertex
fn main(
    @builtin(instance_index) instanceIdx: u32,
    @location(0) position: vec3f
) -> @builtin(position) vec4f {
    let instance = instances[instanceIdx];
    return instance.transform * vec4f(position, 1.0);
}
```

When objects can't be instanced (different meshes), merge them into a single buffer where possible. A level's static geometry—walls, floors, props—can often become one mega-mesh with one draw call.
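The merge itself is mostly bookkeeping: concatenate vertex data and rebase each mesh's indices by the number of vertices that precede it. A sketch with illustrative types (not from the text):

```typescript
// Sketch: merge static meshes into one vertex/index buffer so they
// can be drawn with a single call.
interface Mesh { vertices: Float32Array; indices: Uint32Array }

function mergeMeshes(meshes: Mesh[], floatsPerVertex: number): Mesh {
  let vertexFloats = 0, indexCount = 0;
  for (const m of meshes) {
    vertexFloats += m.vertices.length;
    indexCount += m.indices.length;
  }
  const vertices = new Float32Array(vertexFloats);
  const indices = new Uint32Array(indexCount);
  let vOffset = 0, iOffset = 0, baseVertex = 0;
  for (const m of meshes) {
    vertices.set(m.vertices, vOffset);
    // Rebase indices so they point into the merged vertex buffer.
    for (let i = 0; i < m.indices.length; i++) {
      indices[iOffset + i] = m.indices[i] + baseVertex;
    }
    vOffset += m.vertices.length;
    iOffset += m.indices.length;
    baseVertex += m.vertices.length / floatsPerVertex;
  }
  return { vertices, indices };
}
```

This assumes all meshes share a vertex layout; differing layouts would need conversion first.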
Texture atlasing helps here too. If every object needs a different texture, you can't batch them. Pack textures into atlases and pass UV offsets per instance.
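The per-instance UV data reduces to an offset and scale in normalized atlas coordinates, which the shader applies as `uv * scale + offset`. A sketch with hypothetical names:

```typescript
// Sketch: compute a per-instance UV transform for a texture packed
// into an atlas. Region coordinates are in texels.
interface AtlasRegion { x: number; y: number; w: number; h: number }

// Returns [offsetU, offsetV, scaleU, scaleV] in normalized [0, 1]
// atlas space.
function atlasUvTransform(
  region: AtlasRegion, atlasW: number, atlasH: number
): [number, number, number, number] {
  return [
    region.x / atlasW, region.y / atlasH,
    region.w / atlasW, region.h / atlasH,
  ];
}
```

One caveat worth knowing: tiling (wrap-mode repeat) doesn't work naively inside an atlas region, since the sampler wraps over the whole texture.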
Pipeline State Changes
Switching render pipelines is expensive. Each pipeline represents compiled GPU state—shaders, blend modes, depth settings. Minimize switches by grouping objects that share pipelines:
```typescript
// Group by pipeline, then render each group
const byPipeline = new Map<GPURenderPipeline, GameObject[]>();
for (const obj of objects) {
  const pipeline = obj.material.pipeline;
  if (!byPipeline.has(pipeline)) {
    byPipeline.set(pipeline, []);
  }
  byPipeline.get(pipeline)!.push(obj);
}

for (const [pipeline, group] of byPipeline) {
  renderPass.setPipeline(pipeline);
  for (const obj of group) {
    // Render all objects using this pipeline
    renderPass.setBindGroup(0, obj.bindGroup);
    renderPass.draw(obj.vertexCount);
  }
}
```

Interactive: Pipeline caching benefits
Dashed lines indicate pipeline switches. Sorting objects by pipeline minimizes expensive state changes.
Bind group changes are cheaper than pipeline changes, but still have cost. The hierarchy of state change expense, roughly:
- Pipeline change (most expensive)
- Bind group 0 change
- Bind group 1+ change (less expensive)
- Push constants (cheapest, if available)
Structure your bind groups so frequently-changing data (per-object transforms) lives in higher-numbered groups, while shared data (scene-wide lighting, camera matrices) lives in group 0.
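In a render loop that split looks roughly like this (a sketch: the "pass" is a hypothetical recorder standing in for a GPURenderPassEncoder so the shape of the calls can be shown without a GPU):

```typescript
// Sketch: bind shared scene data (group 0) once per pass, and only
// the per-object group (group 1) inside the loop.
interface Obj { bindGroup: string }

function recordDraws(sceneBindGroup: string, objects: Obj[]): string[] {
  const calls: string[] = [];
  calls.push(`setBindGroup(0, ${sceneBindGroup})`); // once per pass
  for (const obj of objects) {
    calls.push(`setBindGroup(1, ${obj.bindGroup})`); // once per object
    calls.push("draw");
  }
  return calls;
}
```

For N objects this records one group-0 bind and N group-1 binds, instead of rebinding everything N times.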
Compute Shader Optimization
Compute shaders have their own optimization concerns. Workgroup size matters—too small leaves execution units underoccupied, too large limits how many workgroups can be resident on the GPU concurrently.
A reasonable starting point: use workgroup sizes that are multiples of the warp/wavefront size (32 for NVIDIA, 64 for AMD). Common choices are 64, 128, or 256 threads per workgroup.
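Whatever size you pick, the dispatch must cover every element; ceiling division handles the remainder (a small helper sketch, not from the text):

```typescript
// Sketch: number of workgroups needed to cover `elements` items
// with a given workgroup size. The last workgroup may be partial,
// so the shader should bounds-check its global invocation id.
function dispatchCount(elements: number, workgroupSize: number): number {
  return Math.ceil(elements / workgroupSize);
}

// e.g. dispatchWorkgroups(dispatchCount(1_000_000, 256))
// launches 3907 workgroups.
```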
```wgsl
// Decent default for many workloads
@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) id: vec3u) {
    // ...
}
```

For 2D data like images, use 2D workgroups:
```wgsl
// 16×16 = 256 threads, good for image processing
@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) id: vec3u) {
    let x = id.x;
    let y = id.y;
    // ...
}
```

Shared memory (workgroup variables) enables cooperation between threads. When multiple threads need the same data, load it once into shared memory rather than having each thread fetch from global memory:
```wgsl
var<workgroup> sharedTile: array<f32, 256>;

@compute @workgroup_size(16, 16)
fn main(
    @builtin(local_invocation_index) localIdx: u32,
    @builtin(workgroup_id) groupId: vec3u
) {
    // Each thread loads one element
    sharedTile[localIdx] = globalData[groupId.x * 256 + localIdx];
    workgroupBarrier(); // Ensure all loads complete
    // Now all threads can read any element from shared memory
}
```

When Not to Optimize
Premature optimization wastes development time on code that isn't a bottleneck. Profile first. If your fragment shader takes 0.1ms and your physics simulation takes 10ms, optimizing the shader won't help.
Some "optimizations" hurt readability without measurable benefit. Manually unrolling a loop that the compiler already unrolls just makes the code harder to maintain.
Target the actual bottleneck. If you're memory-bound, reducing arithmetic won't help. If you're compute-bound, improving memory access won't help. Timestamp queries (from the previous chapter) reveal which constraint you're hitting.
Key Takeaways
- Memory bandwidth limits most GPU workloads—optimize access patterns before compute
- Coalesced memory access (consecutive threads accessing consecutive addresses) enables efficient bulk transfers
- Front-to-back rendering of opaque objects enables early-Z rejection, cutting the fragment shader work wasted on overdraw
- Batching draw calls through instancing or mesh merging reduces CPU overhead
- Group objects by pipeline to minimize expensive state changes
- Profile before optimizing—fix the actual bottleneck, not the assumed one