Profiling and Timestamps
Measuring GPU performance accurately
Optimizing GPU code without measurement is guesswork. You might optimize code that accounts for 1% of your frame time while ignoring the actual bottleneck. WebGPU provides timestamp queries to measure exactly how long GPU operations take, giving you the data needed to make informed decisions.
The Measurement Problem
GPU execution is asynchronous. When you call queue.submit(), the work doesn't happen immediately—it gets scheduled for later execution. This means you can't simply wrap GPU calls with performance.now() and expect meaningful results.
```typescript
// This measures nothing useful
const start = performance.now();
queue.submit([commandBuffer]);
const end = performance.now();
// end - start ≈ 0.01ms (just the CPU time to enqueue)
```

The CPU returns almost immediately after submitting work. The GPU might not even start executing for several milliseconds, and the actual computation takes longer still. To measure GPU time, you need timestamps recorded on the GPU itself.
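The asynchrony is easy to demonstrate without any GPU at all. In this sketch, an ordinary async task stands in for the GPU work (the 20ms delay is an arbitrary stand-in for execution time):

```typescript
// Stand-in for GPU work: an async task that completes ~20ms after being enqueued.
function enqueueWork(): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, 20));
}

async function demo(): Promise<{ enqueueMs: number; totalMs: number }> {
  const start = performance.now();
  const pending = enqueueWork(); // returns immediately, like queue.submit()
  const enqueueMs = performance.now() - start; // tiny: just the scheduling cost
  await pending; // actually wait for the work to finish
  const totalMs = performance.now() - start; // includes the real execution time
  return { enqueueMs, totalMs };
}
```

Timing the enqueue call alone measures scheduling cost, not the work itself, which is exactly what happens when you wrap `queue.submit()` with `performance.now()`.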
Timestamp Queries
WebGPU provides the timestamp-query feature for GPU-side timing. You create a query set, write timestamps at specific points in your command buffer, and later read the results back to the CPU.
```typescript
// Check if timestamps are supported
const adapter = await navigator.gpu.requestAdapter();
if (!adapter.features.has('timestamp-query')) {
  console.log('Timestamp queries not supported');
}

// Request the feature when creating the device
// (requesting a feature the adapter lacks rejects; see the fallback section below)
const device = await adapter.requestDevice({
  requiredFeatures: ['timestamp-query'],
});
```

Once you have a device with timestamp support, create a query set to hold the timestamp values:
```typescript
const querySet = device.createQuerySet({
  type: 'timestamp',
  count: 2, // start and end timestamps
});

// Buffer to resolve query results into
const queryBuffer = device.createBuffer({
  size: 2 * 8, // 2 timestamps × 8 bytes each (64-bit values)
  usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC,
});

// Buffer readable by CPU
const readBuffer = device.createBuffer({
  size: 2 * 8,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});
```

These timestamps capture GPU execution time independent of CPU submission timing.
Writing Timestamps
You write timestamps at the beginning and end of passes using the timestampWrites option:
```typescript
const encoder = device.createCommandEncoder();

const computePass = encoder.beginComputePass({
  timestampWrites: {
    querySet,
    beginningOfPassWriteIndex: 0,
    endOfPassWriteIndex: 1,
  },
});
computePass.setPipeline(pipeline);
computePass.setBindGroup(0, bindGroup);
computePass.dispatchWorkgroups(64, 64);
computePass.end();

// Resolve timestamps into buffer
encoder.resolveQuerySet(querySet, 0, 2, queryBuffer, 0);
// Copy to mappable buffer
encoder.copyBufferToBuffer(queryBuffer, 0, readBuffer, 0, 16);

device.queue.submit([encoder.finish()]);
```

Render passes work the same way:
```typescript
const renderPass = encoder.beginRenderPass({
  colorAttachments: [...],
  timestampWrites: {
    querySet,
    beginningOfPassWriteIndex: 0,
    endOfPassWriteIndex: 1,
  },
});
```

Reading Results
After the GPU finishes executing and you've resolved the query set, map the read buffer to access the timestamps:
```typescript
await readBuffer.mapAsync(GPUMapMode.READ);
const data = new BigUint64Array(readBuffer.getMappedRange()); // timestamps are unsigned 64-bit
const startTime = data[0];
const endTime = data[1];

const durationNanoseconds = Number(endTime - startTime);
const durationMs = durationNanoseconds / 1_000_000;
console.log(`Pass took ${durationMs.toFixed(3)}ms`);

readBuffer.unmap();
```

Timestamps are reported in nanoseconds, though browsers may reduce the effective resolution to mitigate timing attacks. Even so, for typical frame budgets measured in milliseconds, this precision reveals performance characteristics invisible to coarser CPU-side measurements.
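The subtraction order above matters: do the difference in BigInt first, then convert, because absolute timestamp values can exceed Number's 53-bit safe-integer range while the delta stays small. A small helper (the name is mine, not part of the API):

```typescript
// Hypothetical helper: convert a pair of 64-bit nanosecond timestamps
// into a millisecond duration. Subtract as BigInt first; the *delta*
// is small enough to convert to Number safely.
function timestampDeltaMs(begin: bigint, end: bigint): number {
  return Number(end - begin) / 1_000_000;
}
```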
Multiple Passes
Real applications have many passes—shadow maps, G-buffer, lighting, post-processing. Create a larger query set to timestamp each one:
```typescript
const querySet = device.createQuerySet({
  type: 'timestamp',
  count: 10, // 5 passes × 2 timestamps each
});

// Shadow pass:   indices 0, 1
// G-buffer pass: indices 2, 3
// Lighting pass: indices 4, 5
// ...
```

After resolving all queries, you get a complete breakdown of where time is spent; the largest share points to where optimization effort yields the biggest returns.
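With that index convention, turning the resolved values into a per-pass report is mechanical. A sketch, where the pass names are illustrative:

```typescript
// Sketch: compute per-pass durations from resolved timestamps, assuming
// pass i wrote its begin/end timestamps at indices 2i and 2i + 1.
function passBreakdown(
  timestamps: BigUint64Array,
  passNames: string[],
): { name: string; ms: number }[] {
  return passNames.map((name, i) => ({
    name,
    ms: Number(timestamps[2 * i + 1] - timestamps[2 * i]) / 1_000_000,
  }));
}
```

Feed it the `BigUint64Array` view of the resolved buffer along with your pass names.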
Finding Bottlenecks
Raw timing data becomes useful when you identify patterns. Common scenarios:
Memory-bound workloads show little improvement when you reduce arithmetic complexity. If cutting your shader math in half doesn't halve execution time, you're waiting on memory.
Compute-bound workloads scale linearly with operation count. Simplifying the math directly reduces execution time.
Synchronization issues appear as gaps between passes. If pass B starts long after pass A ends, something is forcing serialization.
Bandwidth saturation manifests as constant throughput regardless of data locality. Reading scattered versus sequential memory should differ dramatically; if it doesn't, you've hit the memory wall.
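A quick sanity check for the memory-wall hypothesis is to compare achieved bandwidth against the hardware's theoretical peak. A rough sketch; `bytesMoved` is your own estimate of bytes read plus written, and `peakGBps` comes from vendor specs:

```typescript
// Rough estimate of memory-bandwidth utilization for a measured pass.
// A result near 1.0 suggests the pass is memory-bound; well below 1.0,
// the bottleneck is likely elsewhere (compute, latency, occupancy).
function bandwidthUtilization(
  bytesMoved: number, // bytes read + written by the pass (your estimate)
  passMs: number,     // measured pass duration in milliseconds
  peakGBps: number,   // theoretical peak bandwidth, GB/s
): number {
  const achievedGBps = bytesMoved / 1e9 / (passMs / 1e3);
  return achievedGBps / peakGBps;
}
```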
Establish a baseline first: measure current performance without any changes, and run enough frames to get stable averages.

```typescript
// Collect 100 frames of timing data
const samples: number[] = [];
for (let i = 0; i < 100; i++) {
  const duration = await measureFrame();
  samples.push(duration);
}
const baseline = average(samples);
```

Systematic profiling catches problems that intuition misses. Measure, don't guess.
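The `average` helper above isn't part of any API; a minimal definition, with a `median` variant that resists the occasional outlier frame:

```typescript
// Minimal statistics helpers for timing samples. `average` matches the
// call in the baseline snippet above; `median` is less sensitive to the
// occasional frame inflated by thermal throttling or scheduling hiccups.
function average(samples: number[]): number {
  return samples.reduce((sum, s) => sum + s, 0) / samples.length;
}

function median(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}
```

Comparing the two can itself be diagnostic: a mean well above the median suggests a few pathological frames rather than a uniformly slow pass.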
Practical Profiling Tips
Run multiple iterations and take averages. GPU timings vary due to thermal throttling, power management, and scheduling. A single measurement tells you almost nothing.
Disable vsync and frame rate limits when profiling. You want to measure raw GPU capability, not time spent waiting for the next display refresh.
Profile on target hardware. Integrated GPUs, discrete GPUs, and mobile GPUs have wildly different characteristics. Code that runs well on your development machine might struggle on users' devices.
Watch for timestamp query overhead. The queries themselves have non-zero cost. For very fast passes (under 0.1ms), the measurement overhead may distort results.
When Timestamps Aren't Available
Not all browsers and devices support timestamp queries. Always check for the feature and provide a fallback:
```typescript
let canMeasure = false;
let device: GPUDevice;

const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('WebGPU not supported');

if (adapter.features.has('timestamp-query')) {
  device = await adapter.requestDevice({
    requiredFeatures: ['timestamp-query'],
  });
  canMeasure = true;
} else {
  device = await adapter.requestDevice();
}
```

Without timestamps, you can still get approximate measurements using queue.onSubmittedWorkDone():
```typescript
const start = performance.now();
device.queue.submit([commandBuffer]);
await device.queue.onSubmittedWorkDone();
const end = performance.now();
// This includes CPU-GPU latency, but gives rough GPU time
```

This measures total round-trip time, not pure GPU execution, but it's better than nothing for identifying major bottlenecks.
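Wrapped as a helper, the pattern looks like this. The structural `QueueLike` type is mine, so the function also accepts stubs in tests; a real `GPUQueue` satisfies it:

```typescript
// Sketch: measure submit-to-completion round-trip time. This includes
// CPU-GPU scheduling latency, so treat it as an upper bound on GPU time.
interface QueueLike {
  submit(commandBuffers: unknown[]): void;
  onSubmittedWorkDone(): Promise<void>;
}

async function measureRoundTripMs(
  queue: QueueLike,
  commandBuffers: unknown[],
): Promise<number> {
  const start = performance.now();
  queue.submit(commandBuffers);
  await queue.onSubmittedWorkDone(); // resolves after the GPU finishes
  return performance.now() - start;
}
```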
Key Takeaways
- CPU-side timing cannot measure GPU work accurately due to asynchronous execution
- Timestamp queries record timing on the GPU itself, giving nanosecond precision
- Use `timestampWrites` in pass descriptors and `resolveQuerySet` to extract data
- Profile multiple passes to build a complete picture of where frame time is spent
- Run multiple iterations and profile on target hardware for reliable measurements
- Check for `timestamp-query` feature support and provide fallbacks when unavailable