Instancing
Drawing many objects efficiently with a single draw call
The Problem: Drawing Many Objects
Imagine you need to render a forest with 10,000 trees. Each tree uses the same mesh—trunk, branches, leaves—just positioned differently. The naive approach is straightforward: loop through all trees and issue a draw call for each one.
for (const tree of trees) {
  setUniform('modelMatrix', tree.transform);
  draw(treeMesh);
}
This works, but it is catastrophically slow. Each draw call has overhead: the CPU must prepare the command, validate state, and communicate with the GPU driver. With 10,000 draw calls, that overhead dominates. The GPU spends more time waiting for instructions than actually rendering triangles.
The draw call count is one of the most important performance metrics in graphics. Professional game engines obsess over reducing it. The solution for rendering many similar objects is instancing—a technique that lets you draw thousands of copies with a single draw call.
The Instancing Solution
Instancing tells the GPU: "Draw this mesh N times, and I will give you per-instance data to differentiate each copy."
Instead of 10,000 draw calls, you issue one:
pass.draw(treeMesh.vertexCount, 10000); // Draw 10,000 instances
The GPU launches the vertex shader once per vertex, per instance. For a tree with 1,000 vertices and 10,000 instances, that is 10 million vertex shader invocations, all from a single draw call. The GPU handles this parallelism naturally; it is what GPUs are built for.
[Interactive demo: Instanced Cubes. All cubes render with a single draw call; adjust the count and watch the FPS stay stable.]
Notice how smoothly hundreds of cubes render. Each cube is a separate instance of the same mesh, but the GPU processes them all in one batch. The draw call count stays at one, regardless of how many cubes appear.
Instance Buffers: Per-Instance Data
Each instance needs something unique—at minimum, a position. Usually you want per-instance color, scale, rotation, or other properties. This data lives in an instance buffer, structured like a vertex buffer but stepped per instance rather than per vertex.
Consider a buffer holding position and color for each instance:
// Per-instance data: position (vec3) + color (vec3) = 24 bytes per instance
const instanceData = new Float32Array(instanceCount * 6);
for (let i = 0; i < instanceCount; i++) {
  const offset = i * 6;
  instanceData[offset + 0] = positions[i].x;
  instanceData[offset + 1] = positions[i].y;
  instanceData[offset + 2] = positions[i].z;
  instanceData[offset + 3] = colors[i].r;
  instanceData[offset + 4] = colors[i].g;
  instanceData[offset + 5] = colors[i].b;
}
const instanceBuffer = device.createBuffer({
  size: instanceData.byteLength,
  usage: GPUBufferUsage.VERTEX | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(instanceBuffer, 0, instanceData);
The vertex buffer layout in the pipeline descriptor describes how the GPU should interpret this buffer:
const pipeline = device.createRenderPipeline({
  vertex: {
    module: shaderModule,
    entryPoint: 'vertexMain',
    buffers: [
      // Buffer 0: Per-vertex data (positions, normals, UVs)
      {
        arrayStride: 32, // 8 floats per vertex
        stepMode: 'vertex',
        attributes: [
          { shaderLocation: 0, offset: 0, format: 'float32x3' }, // position
          { shaderLocation: 1, offset: 12, format: 'float32x3' }, // normal
          { shaderLocation: 2, offset: 24, format: 'float32x2' }, // uv
        ],
      },
      // Buffer 1: Per-instance data
      {
        arrayStride: 24, // 6 floats per instance
        stepMode: 'instance', // Key difference!
        attributes: [
          { shaderLocation: 3, offset: 0, format: 'float32x3' }, // instance position
          { shaderLocation: 4, offset: 12, format: 'float32x3' }, // instance color
        ],
      },
    ],
  },
  // ... fragment, etc.
});
The critical difference is stepMode: 'instance'. For vertex buffers, the GPU advances to the next element after each vertex. For instance buffers, it advances after each instance: all vertices of instance 0 see the same instance data, then all vertices of instance 1 see the next row.
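Hand-computing arrayStride and offset values becomes error-prone once the per-instance data grows. As a minimal sketch, the helper below derives them from an ordered list of attribute formats, assuming the attributes are tightly packed. The names buildVertexBufferLayout and FORMAT_BYTES are illustrative and not part of WebGPU, and the size table covers only a few float formats.

```javascript
// Byte sizes for a few common GPUVertexFormat values (not exhaustive).
const FORMAT_BYTES = {
  float32: 4,
  float32x2: 8,
  float32x3: 12,
  float32x4: 16,
};

// Hypothetical helper: builds a vertex buffer layout object for tightly
// packed attributes, assigning consecutive shader locations.
function buildVertexBufferLayout(formats, stepMode, firstLocation) {
  let offset = 0;
  const attributes = formats.map((format, i) => {
    const size = FORMAT_BYTES[format];
    if (size === undefined) {
      throw new Error(`Unsupported format: ${format}`);
    }
    const attribute = { shaderLocation: firstLocation + i, offset, format };
    offset += size;
    return attribute;
  });
  // After the loop, offset equals the total size of one element.
  return { arrayStride: offset, stepMode, attributes };
}

// Reproduces buffer 1 from the pipeline above: position + color per instance.
const instanceLayout = buildVertexBufferLayout(
  ['float32x3', 'float32x3'], 'instance', 3);
console.log(instanceLayout.arrayStride);          // 24
console.log(instanceLayout.attributes[1].offset); // 12
```

The returned object has the same shape as the entries in the buffers array, so it could be passed there directly.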
[Interactive diagram: Instance Buffer Layout. The vertex buffer (stepMode: 'vertex') holds the same vertex data for all instances; the instance buffer (stepMode: 'instance') holds unique data per instance (offset, color). Selecting an instance and a vertex shows how the two combine in a shader invocation.]
The diagram shows how vertex and instance data combine. Each vertex shader invocation receives both its per-vertex attributes (position, normal, UV) and the per-instance attributes (instance position, color) for whichever instance it belongs to.
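The same combination can be modelled on the CPU. The sketch below is a simplified mental model of attribute fetch, not a description of how a GPU implements it; fetchAttributes and the sample data are made up for illustration, and the model assumes a non-indexed draw with firstVertex and firstInstance both 0.

```javascript
// Simplified model: which buffer row does each shader invocation read?
function fetchAttributes(buffers, vertexIndex, instanceIndex) {
  const inputs = {};
  for (const { stepMode, rows } of buffers) {
    // 'vertex' buffers advance per vertex; 'instance' buffers per instance.
    const row = stepMode === 'instance' ? rows[instanceIndex] : rows[vertexIndex];
    Object.assign(inputs, row);
  }
  return inputs;
}

const buffers = [
  { stepMode: 'vertex', rows: [
    { position: [0, 0, 0] }, { position: [1, 0, 0] }, { position: [0, 1, 0] },
  ] },
  { stepMode: 'instance', rows: [
    { instancePosition: [0, 0, 0], instanceColor: [1, 0, 0] },
    { instancePosition: [5, 0, 0], instanceColor: [0, 1, 0] },
  ] },
];

// Every vertex of instance 1 sees the same instance row,
// while the per-vertex position changes with each vertex.
for (let v = 0; v < 3; v++) {
  const input = fetchAttributes(buffers, v, 1);
  console.log(v, input.position, input.instanceColor);
}
```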
@builtin(instance_index): Identifying Instances
In the shader, you often need to know which instance you are processing. WGSL provides @builtin(instance_index) for this:
// View-projection matrix; the group and binding numbers here are an assumption
@group(0) @binding(0) var<uniform> viewProjection: mat4x4f;
struct VertexInput {
  @location(0) position: vec3f,
  @location(1) normal: vec3f,
  @location(2) uv: vec2f,
  @location(3) instancePosition: vec3f,
  @location(4) instanceColor: vec3f,
}
struct VertexOutput {
  @builtin(position) position: vec4f,
  @location(0) color: vec3f,
  @location(1) normal: vec3f,
}
@vertex
fn vertexMain(
  input: VertexInput,
  @builtin(instance_index) instanceIndex: u32
) -> VertexOutput {
  var output: VertexOutput;
  // Offset vertex by instance position
  let worldPosition = input.position + input.instancePosition;
  output.position = viewProjection * vec4f(worldPosition, 1.0);
  // Pass instance color to fragment shader
  output.color = input.instanceColor;
  output.normal = input.normal;
  return output;
}
The instance index counts instances within the draw call: with firstInstance left at 0, the first instance is 0, the second is 1, and so on (a nonzero firstInstance offsets these values). This is useful when your per-instance data lives in a storage buffer instead of a vertex buffer, or when you need to compute something based on instance ID.
For example, you might procedurally generate colors:
@vertex
fn vertexMain(
  input: VertexInput,
  @builtin(instance_index) instanceIndex: u32
) -> VertexOutput {
  // Generate a unique color per instance
  // (hsv_to_rgb is a helper you define yourself, not a WGSL built-in)
  let hue = f32(instanceIndex) * 0.618033988749; // Golden ratio for good distribution
  let color = hsv_to_rgb(vec3f(fract(hue), 0.8, 0.9));
  // ...
}
Instancing vs. Loop: Performance Comparison
The performance difference between instanced and non-instanced rendering is dramatic. Let us compare drawing 1,000 cubes both ways.
[Interactive demo: Instancing vs Loop Performance. Side-by-side comparison of a loop (per-object draw calls) and instanced rendering (single draw call).]
The loop approach issues 1,000 draw calls. Each call has CPU overhead—typically 10-100 microseconds—and prevents the GPU from fully parallelizing the work. The instanced approach issues one draw call. The GPU processes all instances in parallel, limited only by its core count rather than by CPU-GPU communication.
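A back-of-the-envelope calculation shows why this matters. The sketch below takes the 10-100 microsecond range quoted above at face value (real per-call cost varies widely with the driver, API, and state changes) and compares the result to a 60 FPS frame budget. The function name estimateDrawOverheadMs is illustrative.

```javascript
// Rough CPU-side cost of issuing draw calls, in milliseconds.
function estimateDrawOverheadMs(drawCalls, microsecondsPerCall) {
  return (drawCalls * microsecondsPerCall) / 1000;
}

const frameBudgetMs = 1000 / 60; // about 16.7 ms per frame at 60 FPS

// Loop approach: 1,000 draw calls
const best = estimateDrawOverheadMs(1000, 10);   // 10 ms
const worst = estimateDrawOverheadMs(1000, 100); // 100 ms
console.log(`loop: ${best}-${worst} ms of a ${frameBudgetMs.toFixed(1)} ms budget`);

// Instanced approach: one draw call, even at the pessimistic per-call cost
const instanced = estimateDrawOverheadMs(1, 100); // 0.1 ms
console.log(`instanced: at most ${instanced} ms`);
```

Under these assumptions, the loop consumes most or all of the frame budget on submission alone, before the GPU has rendered anything.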
In real applications, the difference is often 10-50× for thousands of objects, though the exact gain depends on the hardware, the driver, and how CPU-bound the scene is. Strategy games with massive armies and particle systems with millions of particles rely heavily on instancing to achieve acceptable performance.
Indirect Instancing: Count from Buffer
Standard instancing requires the CPU to specify the instance count. But what if the count is computed on the GPU—say, by a compute shader that culls invisible objects?
Indirect drawing solves this. Instead of passing the instance count as a function parameter, you point to a buffer containing the draw parameters:
// Create an indirect buffer holding draw arguments
const indirectBuffer = device.createBuffer({
  size: 16, // 4 uint32 values
  // STORAGE is needed so a compute shader can write the arguments
  usage: GPUBufferUsage.INDIRECT | GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
// The buffer holds: [vertexCount, instanceCount, firstVertex, firstInstance]
const args = new Uint32Array([36, 500, 0, 0]); // 36 vertices, 500 instances
device.queue.writeBuffer(indirectBuffer, 0, args);
// Draw using the buffer
pass.drawIndirect(indirectBuffer, 0);
The power is that a compute shader can write to this buffer, determining the instance count dynamically:
@group(0) @binding(0) var<storage, read_write> drawArgs: array<u32, 4>;
@group(0) @binding(1) var<storage, read> visibleInstances: array<u32>;
@compute @workgroup_size(1)
fn updateDrawArgs() {
  // Set the instance count. arrayLength gives the size of the bound array,
  // so this stands in for a real culling pass, which would instead write
  // the number of instances that passed its visibility test.
  drawArgs[1] = arrayLength(&visibleInstances);
}
This enables GPU-driven rendering pipelines where the GPU decides what to draw without round-tripping to the CPU.
[Interactive demo: Indirect Draw. Panels show the scene objects, the indirect buffer contents, and the pipeline. The GPU determines the instance count, so no CPU round-trip is needed.]
Indirect drawing is an advanced technique used in modern engines for culling, LOD selection, and dynamic scene management. The CPU sets up the scene once; the GPU continuously updates what actually gets rendered.
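To make the data flow concrete, here is a CPU reference for what a GPU culling pass might compute: a compacted list of surviving instance indices plus the four draw arguments. In a real pipeline this logic would run in a compute shader; the function name cullAndBuildDrawArgs and the toy visibility test are assumptions made for illustration.

```javascript
// CPU reference for a culling pass. Only illustrates the data flow;
// a GPU-driven pipeline would do this work in a compute shader.
function cullAndBuildDrawArgs(positions, isVisible, vertexCount) {
  const visibleInstances = [];
  positions.forEach((position, index) => {
    if (isVisible(position)) {
      visibleInstances.push(index); // compacted list of surviving instances
    }
  });
  // Same layout as the indirect buffer:
  // [vertexCount, instanceCount, firstVertex, firstInstance]
  const drawArgs = new Uint32Array([vertexCount, visibleInstances.length, 0, 0]);
  return { visibleInstances: new Uint32Array(visibleInstances), drawArgs };
}

// Toy visibility test (an assumption): keep instances whose x coordinate
// lies within 10 units of the origin.
const positions = [
  { x: 0, y: 0, z: 0 },
  { x: 25, y: 0, z: 0 },
  { x: -4, y: 0, z: 3 },
  { x: 11, y: 0, z: 0 },
];
const result = cullAndBuildDrawArgs(positions, (p) => Math.abs(p.x) <= 10, 36);
console.log(Array.from(result.visibleInstances)); // [0, 2]
console.log(Array.from(result.drawArgs));         // [36, 2, 0, 0]
```

With a compacted list like this, a vertex shader could use the instance index to look up the original instance ID and then fetch that instance's data from a storage buffer.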
Practical Patterns
Instancing shines in specific scenarios:
Vegetation and foliage. Forests, grass fields, and decorative plants are classic instancing candidates. The same mesh appears thousands of times with varying position, scale, and rotation.
Particle systems. Each particle is an instance of a simple quad or mesh. The instance buffer holds position, velocity, color, and lifetime.
Crowds and armies. Strategy games render armies of units using instancing. Each unit type is one mesh instanced many times.
Debris and decals. Scattered objects like rocks, trash, bullet holes—anything that repeats frequently.
The common thread: many copies of the same geometry with small per-instance variations.
Limitations and Considerations
Instancing is not free. Each instance still requires vertex processing, and the instance buffer consumes memory. If your instances are visually distinct (different meshes, not just transforms), instancing does not help—you need separate draw calls.
Memory bandwidth can also become a bottleneck. With millions of instances, the instance buffer grows large, and reading all that data becomes the limiting factor. At extreme scales, techniques like GPU-driven culling and indirect rendering become necessary.
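A quick sizing calculation gives a feel for the numbers. This sketch assumes tightly packed 32-bit floats; the second layout (a 4x4 matrix plus an RGBA color per instance) is an assumed example of a heavier format, not one used earlier in this article.

```javascript
// Size of a tightly packed instance buffer in bytes.
function instanceBufferBytes(instanceCount, floatsPerInstance) {
  return instanceCount * floatsPerInstance * Float32Array.BYTES_PER_ELEMENT;
}

const MiB = 1024 * 1024;

// position (vec3) + color (vec3), as in the earlier layout
console.log(instanceBufferBytes(10_000, 6) / MiB);     // about 0.23 MiB

// assumed heavier layout: 4x4 matrix (16 floats) + RGBA color (4 floats)
console.log(instanceBufferBytes(1_000_000, 20) / MiB); // about 76.3 MiB
```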
Finally, instancing works best when instances share rendering state. If different instances need different textures or shaders, you either batch by state (multiple instanced draw calls, one per unique state) or use bindless textures and uber-shaders.
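Batching by state can be sketched as a simple grouping step on the CPU. The object shape below (mesh, material, transform) and the function name batchByState are hypothetical; a real engine would key on whatever determines its pipeline and bind groups.

```javascript
// Groups scene objects so that each unique (mesh, material) pair becomes
// one instanced draw.
function batchByState(objects) {
  const batches = new Map();
  for (const object of objects) {
    const key = `${object.mesh}|${object.material}`;
    if (!batches.has(key)) {
      batches.set(key, {
        mesh: object.mesh,
        material: object.material,
        transforms: [],
      });
    }
    batches.get(key).transforms.push(object.transform);
  }
  return [...batches.values()];
}

const scene = [
  { mesh: 'oak', material: 'bark', transform: [0, 0, 0] },
  { mesh: 'pine', material: 'bark', transform: [4, 0, 1] },
  { mesh: 'oak', material: 'bark', transform: [9, 0, 2] },
  { mesh: 'rock', material: 'stone', transform: [3, 0, 7] },
];

for (const batch of batchByState(scene)) {
  // One instanced draw per batch; the instance count is transforms.length.
  console.log(batch.mesh, batch.material, batch.transforms.length);
}
```

Here four objects collapse into three draws; in a scene with thousands of objects and a handful of mesh and material combinations, the reduction is far larger.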
Key Takeaways
- Draw call overhead is a major performance bottleneck when rendering many objects
- Instancing lets you draw thousands of copies with a single draw call
- Instance buffers use stepMode: 'instance' to provide per-instance data
- @builtin(instance_index) identifies which instance the shader is processing
- Indirect drawing lets the GPU control the instance count, enabling GPU-driven pipelines
- Instancing is ideal for vegetation, particles, crowds, and anything with many similar objects
- The performance difference can be 10-50× compared to naive per-object draw calls