Profiling and Timestamps

Measuring GPU performance accurately

Optimizing GPU code without measurement is guesswork. You might optimize code that accounts for 1% of your frame time while ignoring the actual bottleneck. WebGPU provides timestamp queries to measure exactly how long GPU operations take, giving you the data needed to make informed decisions.

The Measurement Problem

GPU execution is asynchronous. When you call queue.submit(), the work doesn't happen immediately—it gets scheduled for later execution. This means you can't simply wrap GPU calls with performance.now() and expect meaningful results.

```typescript
// This measures nothing useful
const start = performance.now();
queue.submit([commandBuffer]);
const end = performance.now();
// end - start ≈ 0.01ms (just the CPU time to enqueue)
```

The CPU returns almost immediately after submitting work. The GPU might not even start executing for several milliseconds, and the actual computation takes longer still. To measure GPU time, you need timestamps recorded on the GPU itself.

Timestamp Queries

WebGPU provides the timestamp-query feature for GPU-side timing. You create a query set, write timestamps at specific points in your command buffer, and later read the results back to the CPU.

```typescript
// Check if timestamps are supported
const adapter = await navigator.gpu.requestAdapter();
if (!adapter?.features.has('timestamp-query')) {
  throw new Error('Timestamp queries not supported');
}

// Request the feature when creating the device
const device = await adapter.requestDevice({
  requiredFeatures: ['timestamp-query'],
});
```

Once you have a device with timestamp support, create a query set to hold the timestamp values:

```typescript
const querySet = device.createQuerySet({
  type: 'timestamp',
  count: 2, // Start and end timestamps
});

// Buffer to resolve query results into
const queryBuffer = device.createBuffer({
  size: 2 * 8, // 2 timestamps × 8 bytes each (uint64)
  usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC,
});

// Buffer readable by CPU
const readBuffer = device.createBuffer({
  size: 2 * 8,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});
```

Interactive: How timestamp queries work

Step through to see how timestamps capture GPU execution time independent of CPU submission timing.

Writing Timestamps

You write timestamps at the beginning and end of passes using the timestampWrites option:

```typescript
const encoder = device.createCommandEncoder();

const computePass = encoder.beginComputePass({
  timestampWrites: {
    querySet,
    beginningOfPassWriteIndex: 0,
    endOfPassWriteIndex: 1,
  },
});

computePass.setPipeline(pipeline);
computePass.setBindGroup(0, bindGroup);
computePass.dispatchWorkgroups(64, 64);
computePass.end();

// Resolve timestamps into buffer
encoder.resolveQuerySet(querySet, 0, 2, queryBuffer, 0);

// Copy to mappable buffer
encoder.copyBufferToBuffer(queryBuffer, 0, readBuffer, 0, 16);

device.queue.submit([encoder.finish()]);
```

Render passes work the same way:

```typescript
const renderPass = encoder.beginRenderPass({
  colorAttachments: [...],
  timestampWrites: {
    querySet,
    beginningOfPassWriteIndex: 0,
    endOfPassWriteIndex: 1,
  },
});
```

Reading Results

After the GPU finishes executing and you've resolved the query set, map the read buffer to access the timestamps:

```typescript
await readBuffer.mapAsync(GPUMapMode.READ);
// Timestamps are unsigned 64-bit values, so use BigUint64Array
const data = new BigUint64Array(readBuffer.getMappedRange());

const startTime = data[0];
const endTime = data[1];
const durationNanoseconds = Number(endTime - startTime);
const durationMs = durationNanoseconds / 1_000_000;

console.log(`Pass took ${durationMs.toFixed(3)}ms`);
readBuffer.unmap();
```

Timestamps are reported in nanoseconds, though browsers may quantize the values for security reasons, so the effective resolution is implementation-dependent. Even so, it is far finer than anything you can measure from the CPU: for typical frame budgets measured in milliseconds, this precision reveals performance characteristics invisible to coarser measurements.

Interactive: Measuring GPU operation time

Each run shows slight variation—this is normal GPU behavior. Average multiple runs for reliable measurements.

Multiple Passes

Real applications have many passes—shadow maps, G-buffer, lighting, post-processing. Create a larger query set to timestamp each one:

```typescript
const querySet = device.createQuerySet({
  type: 'timestamp',
  count: 10, // 5 passes × 2 timestamps each
});

// Shadow pass: indices 0, 1
// G-buffer pass: indices 2, 3
// Lighting pass: indices 4, 5
// ...
```
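Once the timestamps for several passes are resolved and read back, turning them into a labeled breakdown is pure arithmetic. A sketch, where `passDurationsMs` and the pass names are illustrative:

```typescript
// Convert 2N resolved timestamps (begin/end pairs) into per-pass
// durations in milliseconds. Pure function, so it is easy to test.
function passDurationsMs(
  timestamps: BigUint64Array,
  passNames: string[],
): Map<string, number> {
  const durations = new Map<string, number>();
  passNames.forEach((name, i) => {
    const begin = timestamps[2 * i];
    const end = timestamps[2 * i + 1];
    durations.set(name, Number(end - begin) / 1_000_000);
  });
  return durations;
}

// Example with synthetic timestamps (nanoseconds):
const breakdown = passDurationsMs(
  new BigUint64Array([0n, 1_800_000n, 1_800_000n, 4_200_000n]),
  ['shadow', 'gbuffer'],
);
// breakdown.get('shadow') === 1.8, breakdown.get('gbuffer') === 2.4
```

Keeping the conversion separate from the WebGPU plumbing also makes it easy to unit-test without a GPU.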

After resolving all queries, you get a complete breakdown of where time is spent:

Interactive: Performance metrics breakdown

Frame timeline (9.6ms total):

  • Shadow Map: 1.80ms (19%)
  • G-Buffer: 2.40ms (25%)
  • SSAO: 1.20ms (12%)
  • Lighting: 3.10ms (32%)
  • Bloom: 0.80ms (8%)
  • Tone Mapping: 0.30ms (3%)

Identified bottleneck: the lighting pass, at 32% of frame time. Consider reducing light count or using tiled/clustered shading.

The largest segment of the timeline indicates where optimization effort yields the biggest returns.

Finding Bottlenecks

Raw timing data becomes useful when you identify patterns. Common scenarios:

Memory-bound workloads show little improvement when you reduce arithmetic complexity. If cutting your shader math in half doesn't halve execution time, you're waiting on memory.

Compute-bound workloads scale linearly with operation count. Simplifying the math directly reduces execution time.

Synchronization issues appear as gaps between passes. If pass B starts long after pass A ends, something is forcing serialization.

Bandwidth saturation manifests as constant throughput regardless of data locality. Reading scattered versus sequential memory should differ dramatically; if it doesn't, you've hit the memory wall.
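The memory-bound versus compute-bound distinction can be turned into a crude automated check: re-run the pass with its arithmetic halved and compare the measured times. A sketch; `classifyBottleneck` and the 1.5× threshold are illustrative assumptions, not standard values:

```typescript
// Heuristic: if halving shader arithmetic barely changes the measured
// time, the pass is likely memory-bound; a near-2x speedup that tracks
// the arithmetic reduction suggests compute-bound.
function classifyBottleneck(
  fullMathMs: number,
  halvedMathMs: number,
): 'memory-bound' | 'compute-bound' {
  const speedup = fullMathMs / halvedMathMs;
  return speedup > 1.5 ? 'compute-bound' : 'memory-bound';
}

const verdict = classifyBottleneck(4.0, 3.9);
// verdict === 'memory-bound': halving the math barely helped
```

In practice you would feed this with averaged timestamp-query results from two shader variants, not single runs.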

Interactive: Profiling workflow

Step 1 of 5: Establish Baseline

Measure current performance without any changes. Run multiple frames to get stable averages.

```typescript
// Collect 100 frames of timing data
const samples: number[] = [];
for (let i = 0; i < 100; i++) {
  const duration = await measureFrame();
  samples.push(duration);
}
const baseline = average(samples);
```

Result: baseline of 12.4ms average, 15.2ms 99th percentile.

Systematic profiling catches problems that intuition misses. Measure, don't guess.

Practical Profiling Tips

Run multiple iterations and take averages. GPU timings vary due to thermal throttling, power management, and scheduling. A single measurement tells you almost nothing.
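The averaging advice can be captured in a small helper that reports both a mean and a high percentile, since worst-case frames are what users notice. A sketch; `summarize` and the nearest-rank percentile method are illustrative choices:

```typescript
// Summarize repeated GPU timings: mean plus a simple 99th percentile.
// The percentile surfaces worst-case frames that an average hides.
function summarize(samplesMs: number[]): { mean: number; p99: number } {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, s) => sum + s, 0) / sorted.length;
  // Nearest-rank percentile: the sample at or above 99% of the data
  const index = Math.min(sorted.length - 1, Math.ceil(0.99 * sorted.length) - 1);
  return { mean, p99: sorted[index] };
}

const { mean, p99 } = summarize([10, 12, 11, 13, 50]);
// mean === 19.2, p99 === 50 — one slow frame dominates the percentile
```

With only a handful of samples the percentile is noisy; collect on the order of 100 frames before trusting it.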

Disable vsync and frame rate limits when profiling. You want to measure raw GPU capability, not time spent waiting for the next display refresh.

Profile on target hardware. Integrated GPUs, discrete GPUs, and mobile GPUs have wildly different characteristics. Code that runs well on your development machine might struggle on users' devices.

Watch for timestamp query overhead. The queries themselves have non-zero cost. For very fast passes (under 0.1ms), the measurement overhead may distort results.

When Timestamps Aren't Available

Not all browsers and devices support timestamp queries. Always check for the feature and provide a fallback:

```typescript
let canMeasure = false;

const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('WebGPU not supported');

let device: GPUDevice;
if (adapter.features.has('timestamp-query')) {
  device = await adapter.requestDevice({
    requiredFeatures: ['timestamp-query'],
  });
  canMeasure = true;
} else {
  device = await adapter.requestDevice();
}
```

Without timestamps, you can still get approximate measurements using queue.onSubmittedWorkDone():

```typescript
const start = performance.now();
device.queue.submit([commandBuffer]);
await device.queue.onSubmittedWorkDone();
const end = performance.now();
// This includes CPU-GPU latency, but gives rough GPU time
```

This measures total round-trip time, not pure GPU execution, but it's better than nothing for identifying major bottlenecks.
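The fallback can be wrapped in a reusable helper. A sketch; the `SubmittableQueue` interface is a stand-in for the subset of `GPUQueue` used here so the example is self-contained, and in real code you would pass `device.queue` directly:

```typescript
// Structural type matching the subset of GPUQueue this helper needs
interface SubmittableQueue {
  submit(commandBuffers: unknown[]): void;
  onSubmittedWorkDone(): Promise<void>;
}

// Round-trip timer: includes CPU-GPU latency, not pure GPU execution,
// but works without the timestamp-query feature.
async function measureRoundTripMs(
  queue: SubmittableQueue,
  commandBuffers: unknown[],
): Promise<number> {
  const start = performance.now();
  queue.submit(commandBuffers);
  await queue.onSubmittedWorkDone();
  return performance.now() - start;
}
```

Because the helper only depends on the two methods it calls, it can be exercised with a mock queue in tests.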

Key Takeaways

  • CPU-side timing cannot measure GPU work accurately due to asynchronous execution
  • Timestamp queries record timing on the GPU itself, reported in nanosecond units
  • Use timestampWrites in pass descriptors and resolveQuerySet to extract data
  • Profile multiple passes to build a complete picture of where frame time is spent
  • Run multiple iterations and profile on target hardware for reliable measurements
  • Check for timestamp-query feature support and provide fallbacks when unavailable