Instancing

Drawing many objects efficiently with a single draw call

The Problem: Drawing Many Objects

Imagine you need to render a forest with 10,000 trees. Each tree uses the same mesh—trunk, branches, leaves—just positioned differently. The naive approach is straightforward: loop through all trees and issue a draw call for each one.

for (const tree of trees) {
  setUniform('modelMatrix', tree.transform);
  draw(treeMesh);
}


This works, but it is catastrophically slow. Each draw call has overhead: the CPU must prepare the command, validate state, and communicate with the GPU driver. With 10,000 draw calls, that overhead dominates. The GPU spends more time waiting for instructions than actually rendering triangles.

The draw call count is one of the most important performance metrics in graphics. Professional game engines obsess over reducing it. The solution for rendering many similar objects is instancing—a technique that lets you draw thousands of copies with a single draw call.

The Instancing Solution

Instancing tells the GPU: "Draw this mesh N times, and I will give you per-instance data to differentiate each copy."

Instead of 10,000 draw calls, you issue one:

pass.draw(treeMesh.vertexCount, 10000); // Draw 10,000 instances


The GPU launches the vertex shader once per vertex, per instance. For a tree with 1,000 vertices and 10,000 instances, that is 10 million vertex shader invocations—all from a single draw call. The GPU handles this parallelism naturally; it is what GPUs are built for.
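The index space of an instanced draw can be sketched on the CPU. The helper names below (`enumerateInvocations`, `countInvocations`) are hypothetical and exist only for illustration; the nested loop models which (vertex, instance) pairs the vertex shader sees, not the order or parallelism of real hardware.

```javascript
// Hypothetical helper: lists the (vertexIndex, instanceIndex) pairs that a
// non-indexed draw(vertexCount, instanceCount, firstVertex, firstInstance)
// feeds to the vertex shader. Real GPUs run these in parallel with no
// guaranteed order; the nested loop only illustrates the index space.
function enumerateInvocations(vertexCount, instanceCount, firstVertex = 0, firstInstance = 0) {
  const invocations = [];
  for (let i = 0; i < instanceCount; i++) {
    for (let v = 0; v < vertexCount; v++) {
      invocations.push({
        vertexIndex: firstVertex + v,
        instanceIndex: firstInstance + i,
      });
    }
  }
  return invocations;
}

// Total invocation count without materializing the list.
function countInvocations(vertexCount, instanceCount) {
  return vertexCount * instanceCount;
}

const small = enumerateInvocations(3, 2);   // one triangle, two instances
console.log(small.length);                  // 6
console.log(countInvocations(1000, 10000)); // 10000000
```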

Interactive: Instanced Cubes (instance count: 200 cubes)

All cubes render with a single draw call. Adjust the count and watch the FPS stay stable.

Notice how smoothly hundreds of cubes render. Each cube is a separate instance of the same mesh, but the GPU processes them all in one batch. The draw call count stays at one, regardless of how many cubes appear.

Instance Buffers: Per-Instance Data

Each instance needs something unique—at minimum, a position. Usually you want per-instance color, scale, rotation, or other properties. This data lives in an instance buffer, structured like a vertex buffer but stepped per instance rather than per vertex.

Consider a buffer holding position and color for each instance:

// Per-instance data: position (vec3) + color (vec3) = 24 bytes per instance
const instanceData = new Float32Array(instanceCount * 6);
 
for (let i = 0; i < instanceCount; i++) {
  const offset = i * 6;
  instanceData[offset + 0] = positions[i].x;
  instanceData[offset + 1] = positions[i].y;
  instanceData[offset + 2] = positions[i].z;
  instanceData[offset + 3] = colors[i].r;
  instanceData[offset + 4] = colors[i].g;
  instanceData[offset + 5] = colors[i].b;
}
 
const instanceBuffer = device.createBuffer({
  size: instanceData.byteLength,
  usage: GPUBufferUsage.VERTEX | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(instanceBuffer, 0, instanceData);

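The packing loop above assumes that `positions` and `colors` arrays already exist. As one possible way to produce them, the sketch below lays instances out on a grid; `makeGridInstances`, its `spacing` parameter, and the color scheme are all assumptions made for illustration, not part of any API.

```javascript
// Hypothetical helper: lays out instanceCount objects on a square grid in the
// XZ plane and gives each one a color that varies with its grid cell.
function makeGridInstances(instanceCount, spacing = 2.0) {
  const side = Math.ceil(Math.sqrt(instanceCount));
  const positions = [];
  const colors = [];
  for (let i = 0; i < instanceCount; i++) {
    const col = i % side;
    const row = Math.floor(i / side);
    // Center the grid on the origin.
    positions.push({
      x: (col - (side - 1) / 2) * spacing,
      y: 0,
      z: (row - (side - 1) / 2) * spacing,
    });
    colors.push({
      r: side > 1 ? col / (side - 1) : 0,
      g: side > 1 ? row / (side - 1) : 0,
      b: 0.5,
    });
  }
  return { positions, colors };
}

const { positions, colors } = makeGridInstances(9);

// Pack into the same interleaved layout: 6 floats (24 bytes) per instance.
const instanceData = new Float32Array(positions.length * 6);
positions.forEach((p, i) => {
  const c = colors[i];
  instanceData.set([p.x, p.y, p.z, c.r, c.g, c.b], i * 6);
});
console.log(instanceData.byteLength); // 216 (9 instances * 24 bytes)
```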

The pipeline layout describes how the GPU should interpret this buffer:

const pipeline = device.createRenderPipeline({
  layout: 'auto', // required; 'auto' derives the bind group layouts from the shaders
  vertex: {
    module: shaderModule,
    entryPoint: 'vertexMain',
    buffers: [
      // Buffer 0: Per-vertex data (positions, normals, UVs)
      {
        arrayStride: 32, // 8 floats per vertex
        stepMode: 'vertex',
        attributes: [
          { shaderLocation: 0, offset: 0, format: 'float32x3' },  // position
          { shaderLocation: 1, offset: 12, format: 'float32x3' }, // normal
          { shaderLocation: 2, offset: 24, format: 'float32x2' }, // uv
        ],
      },
      // Buffer 1: Per-instance data
      {
        arrayStride: 24, // 6 floats per instance
        stepMode: 'instance', // Key difference!
        attributes: [
          { shaderLocation: 3, offset: 0, format: 'float32x3' },  // instance position
          { shaderLocation: 4, offset: 12, format: 'float32x3' }, // instance color
        ],
      },
    ],
  },
  // ... fragment, etc.
});


The critical difference is stepMode: 'instance'. For vertex buffers, the GPU advances to the next element after each vertex. For instance buffers, it advances after each instance—all vertices of instance 0 see the same instance data, then all vertices of instance 1 see the next row.
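The two step modes can be modeled on the CPU to make the row selection concrete. This is a minimal sketch, not how a GPU implements vertex fetch; `fetchAttributes` and the buffer description objects (`strideFloats`, `offsetFloats`, `size`) are hypothetical names that loosely mirror the pipeline layout.

```javascript
// Hypothetical CPU-side model of vertex fetch. Each buffer description mirrors
// the pipeline layout: a stride in floats, a step mode, and attribute offsets.
function fetchAttributes(buffers, vertexIndex, instanceIndex) {
  const result = {};
  for (const { data, strideFloats, stepMode, attributes } of buffers) {
    // The only difference between the two modes is which index selects the row.
    const row = stepMode === 'instance' ? instanceIndex : vertexIndex;
    for (const { name, offsetFloats, size } of attributes) {
      const start = row * strideFloats + offsetFloats;
      result[name] = Array.from(data.subarray(start, start + size));
    }
  }
  return result;
}

// A triangle (position only) and two instances (offset + color).
const vertexBuffer = {
  data: new Float32Array([-0.5, -0.5, 0, 0.5, -0.5, 0, 0, 0.5, 0]),
  strideFloats: 3,
  stepMode: 'vertex',
  attributes: [{ name: 'position', offsetFloats: 0, size: 3 }],
};
const instanceBuffer = {
  data: new Float32Array([0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0]),
  strideFloats: 6,
  stepMode: 'instance',
  attributes: [
    { name: 'instancePosition', offsetFloats: 0, size: 3 },
    { name: 'instanceColor', offsetFloats: 3, size: 3 },
  ],
};

// Vertex 1 of instance 1: per-vertex row 1, per-instance row 1.
const a = fetchAttributes([vertexBuffer, instanceBuffer], 1, 1);
const finalPosition = a.position.map((p, k) => p + a.instancePosition[k]);
console.log(finalPosition); // [ 2.5, -0.5, 0 ]
```

Every vertex of instance 1 receives the same `instancePosition` and `instanceColor`; only `position` changes from vertex to vertex.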

Interactive: Instance Buffer Layout

Vertex buffer (stepMode: vertex): same vertex data for all instances. Instance buffer (stepMode: instance): unique data per instance (offset, color).

Example shader invocation (instance 0, vertex 0): vertex position [-0.5, -0.5, 0] + instance offset [0, 0, 0] = final position [-0.5, -0.5, 0]. Click instances and vertices to see how data combines in the shader.

The diagram shows how vertex and instance data combine. Each vertex shader invocation receives both its per-vertex attributes (position, normal, UV) and the per-instance attributes (instance position, color) for whichever instance it belongs to.

@builtin(instance_index): Identifying Instances

In the shader, you often need to know which instance you are processing. WGSL provides @builtin(instance_index) for this:

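// Camera matrix used below; the group and binding numbers are assumptions for this example
@group(0) @binding(0) var<uniform> viewProjection: mat4x4f;

struct VertexInput {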
  @location(0) position: vec3f,
  @location(1) normal: vec3f,
  @location(2) uv: vec2f,
  @location(3) instancePosition: vec3f,
  @location(4) instanceColor: vec3f,
}
 
struct VertexOutput {
  @builtin(position) position: vec4f,
  @location(0) color: vec3f,
  @location(1) normal: vec3f,
}
 
@vertex
fn vertexMain(
  input: VertexInput,
  @builtin(instance_index) instanceIndex: u32
) -> VertexOutput {
  var output: VertexOutput;
  
  // Offset vertex by instance position
  let worldPosition = input.position + input.instancePosition;
  output.position = viewProjection * vec4f(worldPosition, 1.0);
  
  // Pass instance color to fragment shader
  output.color = input.instanceColor;
  output.normal = input.normal;
  
  return output;
}


The instance index counts instances within the draw call: the first instance has index firstInstance (0 unless the draw specifies otherwise), the second has firstInstance + 1, and so on. This is useful when your per-instance data lives in a storage buffer instead of a vertex buffer, or when you need to compute something based on instance ID.

For example, you might procedurally generate colors:

@vertex
fn vertexMain(
  input: VertexInput,
  @builtin(instance_index) instanceIndex: u32
) -> VertexOutput {
  // Generate a unique color per instance
  let hue = f32(instanceIndex) * 0.618033988749; // Golden ratio conjugate for good distribution
  // hsv_to_rgb is not a WGSL built-in; it must be defined elsewhere in the module
  let color = hsv_to_rgb(vec3f(fract(hue), 0.8, 0.9));
  
  // ...
}

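The shader above calls `hsv_to_rgb`, which WGSL does not provide, so the module has to define it. As a rough guide to what such a helper computes, here is a JavaScript sketch of the standard HSV-to-RGB conversion together with the same hue formula. `hsvToRgb` and `instanceColor` are hypothetical names, and the author's WGSL helper may differ in detail.

```javascript
// Standard HSV -> RGB conversion; h, s, v and the returned channels are in [0, 1].
function hsvToRgb(h, s, v) {
  const i = Math.floor(h * 6); // which of the six hue sectors
  const f = h * 6 - i;         // position inside that sector
  const p = v * (1 - s);
  const q = v * (1 - f * s);
  const t = v * (1 - (1 - f) * s);
  switch (i % 6) {
    case 0: return [v, t, p];
    case 1: return [q, v, p];
    case 2: return [p, v, t];
    case 3: return [p, q, v];
    case 4: return [t, p, v];
    default: return [v, p, q];
  }
}

// Same idea as the shader: step the hue by the golden ratio conjugate so that
// consecutive instances land far apart on the color wheel.
function instanceHue(instanceIndex) {
  const hue = instanceIndex * 0.618033988749;
  return hue - Math.floor(hue); // fract()
}

function instanceColor(instanceIndex) {
  return hsvToRgb(instanceHue(instanceIndex), 0.8, 0.9);
}

for (let i = 0; i < 4; i++) {
  console.log(i, instanceColor(i).map((c) => c.toFixed(2)).join(', '));
}
```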

Instancing vs. Loop: Performance Comparison

The performance difference between instanced and non-instanced rendering is dramatic. Let us compare drawing 1,000 cubes both ways.

Interactive: Instancing vs Loop Performance (object count: 1,000)

Loop (per-object draw calls): 1,000 draw calls. Instanced (single draw call): 1 draw call.

The loop approach issues 1,000 draw calls. Each call carries CPU overhead, often quoted at 10-100 microseconds, though the real figure varies widely with the API, the driver, and how much state changes between calls. Many small draws also make it harder for the GPU to keep all of its execution units busy. The instanced approach issues one draw call, so the GPU can process all instances in parallel, limited by its own throughput rather than by CPU-GPU communication.

In real applications, the difference can reach 10-50× for thousands of objects. Strategy games with massive armies and particle systems with millions of particles rely heavily on instancing to achieve acceptable performance.
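To get a feel for the CPU side of this comparison, the arithmetic can be sketched as below. The per-call costs of 10, 50, and 100 microseconds are assumed inputs, not measurements, and `drawCallCpuMs` is a hypothetical helper; real overhead should be profiled on the target platform.

```javascript
// Hypothetical estimate: CPU time spent only on submitting draw calls.
// perCallMicros is an assumed figure; measure it on your own target.
function drawCallCpuMs(drawCalls, perCallMicros) {
  return (drawCalls * perCallMicros) / 1000;
}

const FRAME_BUDGET_MS = 1000 / 60; // about 16.7 ms per frame at 60 FPS

for (const perCallMicros of [10, 50, 100]) {
  const loopMs = drawCallCpuMs(1000, perCallMicros);
  const instancedMs = drawCallCpuMs(1, perCallMicros);
  const share = ((loopMs / FRAME_BUDGET_MS) * 100).toFixed(0);
  console.log(
    `${perCallMicros} us/call: loop ${loopMs} ms (${share}% of a 60 FPS frame), ` +
    `instanced ${instancedMs} ms`
  );
}
```

Under these assumptions, 1,000 draw calls at 100 microseconds each would consume 100 ms of CPU time, several times the whole 60 FPS frame budget, before the GPU renders anything.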

Indirect Instancing: Count from Buffer

Standard instancing requires the CPU to specify the instance count. But what if the count is computed on the GPU—say, by a compute shader that culls invisible objects?

Indirect drawing solves this. Instead of passing the instance count as a function parameter, you point to a buffer containing the draw parameters:

// Create an indirect buffer holding draw arguments
const indirectBuffer = device.createBuffer({
  size: 16, // 4 uint32 values
  // STORAGE is needed so that a compute shader can write the arguments
  usage: GPUBufferUsage.INDIRECT | GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
 
// The buffer holds: [vertexCount, instanceCount, firstVertex, firstInstance]
const args = new Uint32Array([36, 500, 0, 0]); // 36 vertices, 500 instances
device.queue.writeBuffer(indirectBuffer, 0, args);
 
// Draw using the buffer
pass.drawIndirect(indirectBuffer, 0);


The power is that a compute shader can write to this buffer, determining the instance count dynamically:

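@group(0) @binding(0) var<storage, read_write> drawArgs: array<u32, 4>;
// Number of surviving instances, assumed to be written by an earlier culling pass
@group(0) @binding(1) var<storage, read> visibleCount: u32;
 
@compute @workgroup_size(1)
fn updateDrawArgs() {
  // Set instance count based on GPU-computed visibility.
  // Note: arrayLength() on a list of visible indices would return the bound
  // buffer's capacity, not the number of entries written, so a counter is used.
  drawArgs[1] = visibleCount;
}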


This enables GPU-driven rendering pipelines where the GPU decides what to draw without round-tripping to the CPU.
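The data flow of such a pipeline can be modeled on the CPU. The sketch below is an assumption-laden stand-in for a culling compute pass: it uses a simple distance test instead of frustum culling, and `cullAndWriteDrawArgs` is a hypothetical name. What it does share with the GPU version is the output: a compacted list of visible instance indices and a four-value argument array in the drawIndirect layout.

```javascript
// Hypothetical CPU model of a GPU culling pass. Instances are kept when their
// center lies within maxDistance of the camera; real culling would usually
// test bounding volumes against the view frustum.
function cullAndWriteDrawArgs(centers, camera, maxDistance, vertexCount) {
  const visibleInstances = new Uint32Array(centers.length); // capacity = total
  let visibleCount = 0; // plays the role of an atomic counter on the GPU
  centers.forEach(([x, y, z], index) => {
    const d = Math.hypot(x - camera[0], y - camera[1], z - camera[2]);
    if (d <= maxDistance) {
      visibleInstances[visibleCount++] = index; // compact survivors to the front
    }
  });
  // Same layout as the drawIndirect arguments:
  // [vertexCount, instanceCount, firstVertex, firstInstance]
  const drawArgs = new Uint32Array([vertexCount, visibleCount, 0, 0]);
  return { drawArgs, visibleInstances };
}

const centers = [[0, 0, 0], [5, 0, 0], [50, 0, 0], [0, 0, -8], [0, 100, 0]];
const { drawArgs, visibleInstances } = cullAndWriteDrawArgs(centers, [0, 0, 0], 10, 36);
console.log(Array.from(drawArgs)); // [ 36, 3, 0, 0 ]
console.log(Array.from(visibleInstances.subarray(0, drawArgs[1]))); // [ 0, 1, 3 ]
```

In a real renderer the vertex shader would then look up `visibleInstances[instance_index]` to find which original object it is drawing. Note that the list keeps its full capacity of 5 entries even though only 3 are valid, which is why the count is tracked separately.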

Interactive: Indirect Draw Demo

Scene: 100 total objects, 50 visible after the GPU cull (50% visibility).

Indirect buffer contents: [0] vertexCount: 36, [1] instanceCount: 50, [2] firstVertex: 0, [3] firstInstance: 0.

Pipeline: compute shader → write instanceCount → drawIndirect(). The GPU determines the instance count; no CPU round-trip is needed.

Indirect drawing is an advanced technique used in modern engines for culling, LOD selection, and dynamic scene management. The CPU sets up the scene once; the GPU continuously updates what actually gets rendered.

Practical Patterns

Instancing shines in specific scenarios:

Vegetation and foliage. Forests, grass fields, and decorative plants are classic instancing candidates. The same mesh appears thousands of times with varying position, scale, and rotation.

Particle systems. Each particle is an instance of a simple quad or mesh. The instance buffer holds position, velocity, color, and lifetime.

Crowds and armies. Strategy games render armies of units using instancing. Each unit type is one mesh instanced many times.

Debris and decals. Scattered objects like rocks, trash, bullet holes—anything that repeats frequently.

The common thread: many copies of the same geometry with small per-instance variations.

Limitations and Considerations

Instancing is not free. Each instance still requires vertex processing, and the instance buffer consumes memory. If your instances are visually distinct (different meshes, not just transforms), instancing does not help—you need separate draw calls.

Memory bandwidth can also become a bottleneck. With millions of instances, the instance buffer grows large, and reading all that data becomes the limiting factor. At extreme scales, techniques like GPU-driven culling and indirect rendering become necessary.
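A back-of-the-envelope calculation shows how quickly instance data adds up. The sketch assumes every instance's data is read exactly once per frame, which ignores caching and culling, and `instanceBufferStats` is a hypothetical helper.

```javascript
// Hypothetical estimate of instance-buffer size and per-frame read traffic,
// assuming every instance's data is read once per frame.
function instanceBufferStats(instanceCount, bytesPerInstance, framesPerSecond = 60) {
  const bytes = instanceCount * bytesPerInstance;
  return {
    megabytes: bytes / 1e6,
    gigabytesPerSecond: (bytes * framesPerSecond) / 1e9,
  };
}

// Position + color as two vec3 values: 24 bytes per instance.
const compact = instanceBufferStats(1000000, 24);
// A full 4x4 float matrix (64 bytes) plus a vec4 color (16 bytes): 80 bytes.
const full = instanceBufferStats(1000000, 80);

console.log(compact); // { megabytes: 24, gigabytesPerSecond: 1.44 }
console.log(full);    // { megabytes: 80, gigabytesPerSecond: 4.8 }
```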

Finally, instancing works best when instances share rendering state. If different instances need different textures or shaders, you either batch by state (multiple instanced draw calls, one per unique state) or fold the variation into a single draw using texture arrays, bindless textures where the API supports them, and uber-shaders.
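Batching by state can be sketched as a grouping step that runs before drawing. The object shape (`mesh`, `texture`, `transform`) and the `batchByState` helper are assumptions for illustration; a real engine would key on whatever pipeline and bind group state it actually uses.

```javascript
// Hypothetical scene objects: each names the mesh and texture it needs.
// Objects that share both end up in the same batch, in first-seen order.
function batchByState(objects) {
  const batches = new Map();
  for (const obj of objects) {
    const key = `${obj.mesh}|${obj.texture}`;
    if (!batches.has(key)) {
      batches.set(key, { mesh: obj.mesh, texture: obj.texture, transforms: [] });
    }
    batches.get(key).transforms.push(obj.transform);
  }
  return [...batches.values()];
}

const objects = [
  { mesh: 'oak', texture: 'bark', transform: [0, 0, 0] },
  { mesh: 'rock', texture: 'granite', transform: [3, 0, 1] },
  { mesh: 'oak', texture: 'bark', transform: [5, 0, 2] },
  { mesh: 'oak', texture: 'birch', transform: [7, 0, 4] },
];

const batches = batchByState(objects);
for (const batch of batches) {
  // One instanced draw per batch, with batch.transforms.length instances.
  console.log(`${batch.mesh}/${batch.texture}: ${batch.transforms.length} instance(s)`);
}
```

Here four objects collapse into three instanced draws; the more objects share state, the closer the draw count gets to the number of unique states rather than the number of objects.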

Key Takeaways

Draw call overhead is a major performance bottleneck when rendering many objects
Instancing lets you draw thousands of copies with a single draw call
Instance buffers use stepMode: 'instance' to provide per-instance data
@builtin(instance_index) identifies which instance the shader is processing
Indirect drawing lets the GPU control the instance count, enabling GPU-driven pipelines
Instancing is ideal for vegetation, particles, crowds, and anything with many similar objects
The performance difference can be 10-50× compared to naive per-object draw calls