Atomics and Synchronization

Atomic operations and barriers

When Threads Collide

Parallel programming promises speed through simultaneous execution. But when multiple threads modify the same memory location, the results can be disastrous.

Consider the simplest possible shared operation: incrementing a counter. Each thread reads the current value, adds one, and writes the result back. With a thousand threads running concurrently, we expect the counter to end up at 1000. Instead, we might see 847, 923, or some other seemingly random value that changes every run.

This is a race condition—the correctness of the program depends on the unpredictable timing of thread execution. Race conditions are among the most insidious bugs in parallel programming because they are intermittent, hard to reproduce, and can lie dormant for years before manifesting.

Interactive: Race Condition

All four threads read, compute, then write simultaneously. Since they all read the same initial value (0), they all write 1. Expected: 4, actual: usually 1.

Watch what happens when threads increment without coordination. Each thread's read-modify-write sequence can interleave with others in countless ways. Two threads might read the same value, both increment it, and both write the same result—losing an increment entirely.

The Anatomy of a Data Race

The increment operation counter++ looks atomic in source code, but it compiles to multiple instructions:

  1. Load the current value from memory
  2. Add one to the loaded value
  3. Store the result back to memory

When thread A loads value 5 and thread B also loads value 5 (before A's store), both compute 6 and store 6. The counter should be 7 but ends up as 6. This is called a lost update.

The window between load and store is a critical section where the program's state is inconsistent. Any operation that reads, modifies, and writes shared data is vulnerable to this pattern.
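
The lost-update interleaving described above can be replayed deterministically in a few lines of Python by splitting each thread's work into its load, compute, and store phases (a hypothetical sketch; the names are illustrative):

```python
# Deterministic replay of the lost-update interleaving: both "threads"
# load before either stores, so one increment is silently overwritten.

def lost_update():
    counter = 5
    a_loaded = counter          # thread A: load 5
    b_loaded = counter          # thread B: load 5 (before A's store)
    a_result = a_loaded + 1     # thread A: compute 6
    b_result = b_loaded + 1     # thread B: compute 6
    counter = a_result          # thread A: store 6
    counter = b_result          # thread B: store 6, overwriting A's update
    return counter

print(lost_update())  # 6, although two increments should give 7
```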

Atomic Operations

An atomic operation executes as an indivisible unit. No other thread can observe an intermediate state or interleave its own operations. The entire read-modify-write sequence happens atomically.

WGSL provides atomic types and operations:

// Declare an atomic variable
@group(0) @binding(0) var<storage, read_write> counter: atomic<u32>;
 
@compute @workgroup_size(256)
fn increment_safely(@builtin(global_invocation_id) global_id: vec3<u32>) {
    // This is safe - atomicAdd is indivisible
    atomicAdd(&counter, 1u);
}

The atomicAdd function performs the load-add-store sequence as a single atomic operation. It returns the previous value, which can be useful for detecting order or managing indices.
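
The same fetch-and-add semantics can be sketched on the CPU in Python; the AtomicU32 class here is a hypothetical stand-in for the hardware primitive, built on a lock:

```python
import threading

# Hypothetical CPU analogue of atomicAdd: a lock-protected fetch-and-add.
# Like WGSL's atomicAdd, it returns the value *before* the increment,
# so each caller receives a unique index.

class AtomicU32:
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_add(self, n):
        with self._lock:
            old = self._value
            self._value += n
            return old

counter = AtomicU32()
claimed = []
claimed_lock = threading.Lock()

def worker():
    index = counter.fetch_add(1)   # indivisible read-modify-write
    with claimed_lock:
        claimed.append(index)

threads = [threading.Thread(target=worker) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()

# No increment was lost, and every thread got a distinct index.
print(counter.fetch_add(0))                  # 100
print(sorted(claimed) == list(range(100)))   # True
```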

Interactive: Atomic Counter

Each thread calls atomicAdd(), which returns the value before incrementing. No two threads receive the same value—each increment is counted. Expected: 4, actual: 4.

With atomic operations, the counter reaches exactly the expected value every time. The hardware guarantees that concurrent atomic operations on the same location are serialized—each sees the result of all previous operations.

The Atomic Operation Family

WGSL provides a family of atomic operations, each performing a specific read-modify-write pattern:

Atomic Operations Reference

All operations work on atomic<u32> or atomic<i32> types:

  • atomicLoad, atomicStore — read or write the value atomically
  • atomicAdd, atomicSub — add or subtract, returning the old value
  • atomicMin, atomicMax — keep the smaller or larger value
  • atomicAnd, atomicOr, atomicXor — bitwise read-modify-write operations
  • atomicExchange — swap in a new value, returning the old one
  • atomicCompareExchangeWeak — swap only if the current value matches an expected value

Each operation returns the value that was in the location before the modification. This old value is often crucial—for example, atomicAdd returns the index where your thread should write, because that index is now "claimed" and will not be given to another thread.

// Parallel append pattern
@group(0) @binding(0) var<storage, read_write> count: atomic<u32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;
 
@compute @workgroup_size(256)
fn append_if_valid(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let value = compute_something(global_id.x);
    
    if (is_valid(value)) {
        // Atomically claim an output slot
        let index = atomicAdd(&count, 1u);
        output[index] = value;
    }
}

The atomicCompareExchangeWeak operation is the most powerful primitive. It atomically compares the current value to an expected value and, only if they match, replaces it with a new value. The "weak" in the name means the exchange may spuriously fail even when the values match, so it belongs inside a retry loop. This primitive enables locks, lock-free data structures, and complex synchronization patterns.

// Try to acquire a simple spinlock
fn acquire_lock(lock: ptr<storage, atomic<u32>, read_write>) {
    // Keep trying until we successfully change 0 to 1
    loop {
        let result = atomicCompareExchangeWeak(lock, 0u, 1u);
        if (result.exchanged) {
            break;  // We got the lock
        }
        // Spin and try again
    }
}
 
fn release_lock(lock: ptr<storage, atomic<u32>, read_write>) {
    atomicStore(lock, 0u);
}
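
A CPU analogue of this compare-exchange spinlock can be sketched in Python; the AtomicCell class is a hypothetical stand-in for the atomic, and note that on lockstep GPU hardware a naive spinlock like this can risk livelock when threads in the same subgroup contend:

```python
import threading
import time

# Hypothetical CPU analogue of atomicCompareExchangeWeak: compare the current
# value with `expected` and swap in `new` only on a match, all as one
# indivisible step (here, under a lock).

class AtomicCell:
    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def compare_exchange(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True   # exchanged
            return False

    def store(self, value):
        with self._guard:
            self._value = value

lock = AtomicCell(0)
shared = {"n": 0}

def acquire(lock):
    # Keep trying until we successfully change 0 to 1.
    while not lock.compare_exchange(0, 1):
        time.sleep(0)  # yield so the lock holder can make progress

def release(lock):
    lock.store(0)

def worker():
    for _ in range(200):
        acquire(lock)
        shared["n"] += 1   # critical section: safe under the lock
        release(lock)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(shared["n"])  # 800: no update was lost
```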

When to Use Atomics

Atomics solve coordination problems, but they carry real costs. Use them when:

  1. Counting: Multiple threads increment/decrement a shared counter
  2. Histograms: Many threads update bin counts
  3. Reduction alternatives: When tree reduction is awkward, atomic accumulation works
  4. Index allocation: Threads claim unique indices for output
  5. Flags and signaling: Setting completion flags or status bits

Avoid atomics when:

  1. Uncontended access: If each thread accesses different memory, atomics add unnecessary overhead
  2. High contention: Many threads hitting the same location creates serialization bottlenecks
  3. Complex data structures: Building thread-safe containers is subtle and often slower than restructuring the algorithm

Performance Cost

Atomics are not free. Each atomic operation requires:

  1. Cache line coordination: The cache line containing the atomic variable must be exclusive to the executing thread
  2. Serialization: Concurrent atomics on the same location execute sequentially
  3. Memory traffic: Invalidation messages propagate between caches

When many threads atomically update the same location, they serialize—the parallel speedup disappears. This is called contention. A histogram with few bins and many threads will bottleneck on the popular bins.

Mitigate contention with:

  1. Local accumulation: Each workgroup maintains a local count in shared memory, then atomically adds to global once
  2. Privatization: Each thread maintains its own count, then combines at the end
  3. Binning strategies: Distribute work so different threads hit different locations

// Reducing contention with local accumulation
@group(0) @binding(0) var<storage, read_write> global_count: atomic<u32>;
 
var<workgroup> local_count: atomic<u32>;
 
@compute @workgroup_size(256)
fn count_with_local(@builtin(local_invocation_id) local_id: vec3<u32>,
                    @builtin(workgroup_id) wg_id: vec3<u32>) {
    // Each thread increments the workgroup-local counter (less contention)
    if (should_count(local_id.x + wg_id.x * 256u)) {
        atomicAdd(&local_count, 1u);
    }
    
    workgroupBarrier();
    
    // One thread per workgroup updates the global counter (much less contention)
    if (local_id.x == 0u) {
        atomicAdd(&global_count, atomicLoad(&local_count));
    }
}
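
The payoff of local accumulation can be demonstrated on the CPU; in this hypothetical Python sketch each worker plays the role of a workgroup, counting into a private variable and touching the contended shared counter exactly once:

```python
import threading

# Privatization sketch: each worker counts locally (no synchronization),
# then performs a single contended update on the shared counter.

global_count = 0
global_lock = threading.Lock()

def count_with_local(items):
    global global_count
    local_count = 0
    for x in items:
        if x % 2 == 0:          # stand-in for should_count()
            local_count += 1    # private: no contention, no lock
    with global_lock:           # one contended update per worker
        global_count += local_count

data = list(range(10_000))
chunks = [data[i::4] for i in range(4)]   # 4 workers, disjoint slices
threads = [threading.Thread(target=count_with_local, args=(c,)) for c in chunks]
for t in threads: t.start()
for t in threads: t.join()
print(global_count)  # 5000 even numbers in 0..9999
```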

Memory Barriers

Atomics guarantee the indivisibility of individual operations, but what about the order of operations across different memory locations? Modern GPUs aggressively reorder memory accesses for performance, which can break assumptions in multi-threaded code.

Memory barriers (also called fences) enforce ordering constraints. WGSL provides two barriers:

// Wait for all shared memory writes in the workgroup to complete
workgroupBarrier();
 
// Wait for all storage buffer writes to be visible
storageBarrier();

The workgroupBarrier() ensures that all threads in a workgroup have completed their workgroup-shared memory writes before any thread proceeds. We have seen this in reduction and scan algorithms—without the barrier, threads might read values that have not yet been written.

Interactive: Barrier Behavior

Each thread writes to shared memory, then waits at the barrier. Only when all four threads have arrived (barrier count = 4) can they safely read each other's values.

The storageBarrier() plays the same role for storage buffers: it ensures that storage buffer writes made by threads in the workgroup are visible to the other threads in that same workgroup before any of them proceeds. Like workgroupBarrier(), it does not synchronize across workgroups.
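
The write-barrier-read discipline has a direct CPU analogue in Python's threading.Barrier; this hypothetical sketch mirrors the pattern from the interactive demo:

```python
import threading

# CPU analogue of workgroupBarrier(): every thread writes its slot in
# shared memory, waits until all four have arrived, and only then reads
# the other threads' values.

NUM_THREADS = 4
shared = [0] * NUM_THREADS
barrier = threading.Barrier(NUM_THREADS)
sums = [0] * NUM_THREADS

def worker(tid):
    shared[tid] = tid + 1      # write phase
    barrier.wait()             # no thread proceeds until all have written
    sums[tid] = sum(shared)    # read phase: all writes are now visible

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads: t.start()
for t in threads: t.join()
print(sums)  # [10, 10, 10, 10] -- every thread saw every write
```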

The Barrier Rules

Using barriers correctly requires understanding their semantics:

  1. Uniform control flow: All threads in a workgroup must encounter a workgroupBarrier() together. If some threads take a branch that skips the barrier while others hit it, the behavior is undefined.

// WRONG - not all threads reach the barrier
if (local_id.x < 128u) {
    shared_data[local_id.x] = input[local_id.x];
    workgroupBarrier();  // Only half the threads reach this
}
 
// CORRECT - all threads reach the barrier
// (select evaluates both operands, so input[local_id.x] must be
// in bounds for every thread)
shared_data[local_id.x] = select(0.0, input[local_id.x], local_id.x < 128u);
workgroupBarrier();  // All threads reach this
  2. Barriers do not synchronize data: A barrier ensures threads have reached a certain point, but atomics are still needed for safe concurrent modification. Barriers prevent reordering; atomics prevent lost updates.

  3. No cross-workgroup synchronization: There is no barrier that synchronizes threads in different workgroups. If workgroups need to communicate, you must use atomic operations on storage buffers, and even then, the ordering is limited.

Synchronization Patterns

Combining atomics and barriers enables sophisticated patterns:

Parallel reduction with barrier sync:

@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;
 
var<workgroup> shared_sum: array<f32, 256>;
 
@compute @workgroup_size(256)
fn reduce(@builtin(local_invocation_id) local_id: vec3<u32>,
          @builtin(workgroup_id) workgroup_id: vec3<u32>) {
    shared_sum[local_id.x] = input[workgroup_id.x * 256u + local_id.x];
    workgroupBarrier();
    
    for (var stride = 128u; stride > 0u; stride = stride >> 1u) {
        if (local_id.x < stride) {
            shared_sum[local_id.x] += shared_sum[local_id.x + stride];
        }
        workgroupBarrier();  // Ensure all threads complete before the next iteration
    }
    
    if (local_id.x == 0u) {
        output[workgroup_id.x] = shared_sum[0];
    }
}
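
The same tree reduction can be mirrored on the CPU, with threading.Barrier standing in for workgroupBarrier() (a hypothetical sketch, not production code):

```python
import threading

# CPU mirror of the reduction kernel: 8 threads sum an 8-element array in
# shared memory, halving the stride each round and waiting at a barrier
# between rounds, just like workgroupBarrier().

SIZE = 8
shared_sum = list(range(1, SIZE + 1))   # 1..8, true sum = 36
barrier = threading.Barrier(SIZE)

def reduce(tid):
    stride = SIZE // 2
    while stride > 0:
        if tid < stride:
            shared_sum[tid] += shared_sum[tid + stride]
        barrier.wait()   # all threads finish this round before the next
        stride //= 2

threads = [threading.Thread(target=reduce, args=(i,)) for i in range(SIZE)]
for t in threads: t.start()
for t in threads: t.join()
print(shared_sum[0])  # 36
```

Note that every thread executes the barrier the same number of times regardless of whether it did any work that round, mirroring the uniform-control-flow rule.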

Cooperative loading:

var<workgroup> tile: array<f32, 256>;
 
@compute @workgroup_size(16, 16)
fn process_tile(@builtin(local_invocation_id) local_id: vec3<u32>,
                @builtin(workgroup_id) wg_id: vec3<u32>) {
    // Cooperatively load a tile
    let flat_idx = local_id.y * 16u + local_id.x;
    let global_x = wg_id.x * 16u + local_id.x;
    let global_y = wg_id.y * 16u + local_id.y;
    
    tile[flat_idx] = texture_data[global_y * width + global_x];
    workgroupBarrier();  // All threads have loaded
    
    // Now all threads can read any tile element safely
    let center = tile[flat_idx];
    // Neighbor reads shown without edge handling; real code must clamp
    // flat_idx at tile and image boundaries
    let left = tile[flat_idx - 1u];
    let right = tile[flat_idx + 1u];
    // ... process using neighbors
}

The Limits of Synchronization

GPUs are not designed for fine-grained synchronization. Excessive barrier calls hurt performance—they force threads to wait, destroying the parallelism that makes GPUs fast. Excessive atomic contention serializes threads, again destroying parallelism.

The best GPU algorithms minimize synchronization by:

  1. Structuring work to avoid conflicts: Different threads process different data
  2. Batching synchronization: One barrier per major phase, not per element
  3. Using local data: Workgroup shared memory is faster than storage buffers
  4. Accepting approximate results: Sometimes a slight race is acceptable (randomized algorithms, progressive refinement)

Synchronization is a necessary tool, but it is the enemy of parallel performance. Use it judiciously.

Key Takeaways

  • Race conditions occur when multiple threads modify shared data without coordination
  • Atomic operations perform read-modify-write sequences indivisibly
  • atomicAdd, atomicMax, atomicCompareExchangeWeak and others solve common patterns
  • Atomics have performance costs—high contention serializes execution
  • Memory barriers enforce ordering: workgroupBarrier() for shared memory, storageBarrier() for storage buffers
  • All threads must reach workgroupBarrier() together—divergent control flow causes undefined behavior
  • The best parallel code minimizes synchronization through algorithmic design