Image Processing

GPU image filters

Images as Textures

An image is a 2D array of color values—pixels arranged in rows and columns. On the GPU, images become textures: memory organized for efficient 2D access with hardware-accelerated sampling and filtering. When we process an image, we read from one texture and write the transformed result to another.

This texture-to-texture pattern is the foundation of GPU image processing. Each pixel's output depends only on the input texture (plus some parameters), making the work embarrassingly parallel. A 4K image has over 8 million pixels, but the GPU can process them all simultaneously.

The basic structure is a fragment shader that samples the input and computes the output:

```wgsl
@fragment
fn filter(@location(0) uv: vec2<f32>) -> @location(0) vec4<f32> {
  let input = textureSample(image, s, uv);  // sampler binding named `s`; `sampler` itself is a reserved word in WGSL
  let output = transform(input);
  return output;
}
```

We render a fullscreen quad, and the fragment shader runs once per output pixel. This is the simplest and most common pattern for image filters.

Per-Pixel Operations

The simplest filters transform each pixel independently. Brightness multiplies all color channels. Contrast pushes colors away from middle gray. Saturation controls the intensity of colors relative to grayscale.

Interactive: Per-Pixel Filters

[Interactive demo: Original and Filtered panels]

Per-pixel operations: each pixel is transformed independently based on its own color values.

Per-pixel operations are fast because they have perfect memory access patterns—each thread reads one texel and writes one pixel. No neighbor sampling, no synchronization, no complexity. The math is direct:

```wgsl
// Brightness: multiply (rgb only, leaving alpha untouched)
color.rgb *= brightness;

// Contrast: scale around midpoint
color.rgb = (color.rgb - 0.5) * contrast + 0.5;

// Saturation: interpolate toward grayscale
let gray = dot(color.rgb, vec3(0.299, 0.587, 0.114));
color.rgb = mix(vec3(gray), color.rgb, saturation);
```

Color space conversions—RGB to HSV, to Lab, to YUV—enable different kinds of adjustments. Hue rotation is natural in HSV. Perceptual uniformity comes from Lab. Luminance-chrominance separation lives in YUV. The GPU handles these conversions per-pixel without breaking a sweat.
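The luminance-chrominance idea is easy to sketch off-GPU. Below is a small Python version of the BT.601 RGB-to-YUV transform, using the same luma weights as the shader snippets; the helper names are illustrative:

```python
# BT.601 RGB <-> YUV conversion, with the luma weights (0.299, 0.587, 0.114)
# used elsewhere in this chapter. Helper names are illustrative.
def rgb_to_yuv(r, g, b):
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance
    u = 0.492 * (b - y)                     # blue-difference chrominance
    v = 0.877 * (r - y)                     # red-difference chrominance
    return y, u, v

def yuv_to_rgb(y, u, v):
    # Invert the forward transform exactly.
    r = y + v / 0.877
    b = y + u / 0.492
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b
```

Adjusting u and v while leaving y untouched changes color without affecting perceived brightness, which is exactly the separation described above.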

Convolution Filters

Many useful filters depend not just on a pixel's own value but on its neighbors. Blur averages nearby pixels. Sharpening emphasizes differences between a pixel and its surroundings. Edge detection finds where brightness changes rapidly.

These are convolution filters. A small matrix of weights—the kernel—slides over the image. At each position, the kernel multiplies each covered pixel by the corresponding weight, sums the results, and produces the output value.

Interactive: Convolution Kernels

[Interactive demo: Kernel Weights and Result panels]

Blue = positive weights, Red = negative weights. The kernel slides over every pixel, multiplying neighbors by weights and summing.

The kernel defines the filter's character. A box blur kernel has equal weights—all neighbors contribute equally. A Gaussian blur weights the center heavily and falls off with distance, producing smoother results. Edge detection kernels have positive weights on one side and negative on the other, responding to directional changes.
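These kernel shapes are easy to inspect numerically. A small Python sketch of the three classic 3×3 kernels and the weighted sum the GPU performs at one pixel position (helper names are illustrative):

```python
# Three classic 3x3 kernels and a single convolution step.
box      = [[1/9] * 3 for _ in range(3)]   # equal weights
gaussian = [[1/16, 2/16, 1/16],
            [2/16, 4/16, 2/16],
            [1/16, 2/16, 1/16]]            # center-weighted
sobel_x  = [[-1, 0, 1],
            [-2, 0, 2],
            [-1, 0, 1]]                    # horizontal edge response

def convolve_at(img, x, y, kernel):
    """Weighted sum of the 3x3 neighborhood centered on (x, y)."""
    return sum(img[y + dy - 1][x + dx - 1] * kernel[dy][dx]
               for dy in range(3) for dx in range(3))
```

On a flat region, box and Gaussian both return the pixel's own value, while sobel_x returns zero: there is no edge to respond to.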

In shader code, convolution means sampling multiple texels:

```wgsl
@fragment
fn convolve(@location(0) uv: vec2<f32>) -> @location(0) vec4<f32> {
  let texel = 1.0 / vec2<f32>(textureDimensions(image));

  var sum = vec4(0.0);

  // 3×3 kernel
  sum += textureSample(image, s, uv + vec2(-1, -1) * texel) * kernel[0][0];
  sum += textureSample(image, s, uv + vec2( 0, -1) * texel) * kernel[0][1];
  sum += textureSample(image, s, uv + vec2( 1, -1) * texel) * kernel[0][2];
  sum += textureSample(image, s, uv + vec2(-1,  0) * texel) * kernel[1][0];
  sum += textureSample(image, s, uv + vec2( 0,  0) * texel) * kernel[1][1];
  sum += textureSample(image, s, uv + vec2( 1,  0) * texel) * kernel[1][2];
  sum += textureSample(image, s, uv + vec2(-1,  1) * texel) * kernel[2][0];
  sum += textureSample(image, s, uv + vec2( 0,  1) * texel) * kernel[2][1];
  sum += textureSample(image, s, uv + vec2( 1,  1) * texel) * kernel[2][2];

  return sum / kernel_sum;
}
```

Larger kernels (5×5, 7×7, or bigger) produce stronger effects but cost more texture samples. A 7×7 kernel requires 49 samples per pixel—not catastrophic, but worth optimizing for frequently used filters.

Separable Filters

Some kernels can be factored into two 1D passes. A 2D Gaussian, for instance, is the outer product of two 1D Gaussians. Instead of sampling a 7×7 area (49 samples), we can blur horizontally (7 samples), then blur vertically on the result (7 samples). Same visual result, 14 samples instead of 49.
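The outer-product claim is easy to verify on the CPU. A Python sketch, with radius and sigma chosen as arbitrary example values:

```python
# A 2D Gaussian kernel is the outer product of a 1D Gaussian with itself.
import math

def gaussian_1d(radius, sigma):
    w = [math.exp(-(i * i) / (2.0 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    total = sum(w)
    return [x / total for x in w]            # normalize to sum to 1

g = gaussian_1d(3, 1.5)                      # 7 taps
kernel_2d = [[a * b for b in g] for a in g]  # 7x7, also sums to 1

samples_naive     = len(g) ** 2              # 49 samples for one 2D pass
samples_separable = 2 * len(g)               # 14 samples across two 1D passes
```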

Interactive: Separable Filter Optimization

[Interactive demo: Naive (49 samples) and Horizontal Only panels]

Sample count comparison: naive 2D kernel, 7 × 7 = 49; separable (2 passes), 7 + 7 = 14.

Separable filters split a 2D operation into two 1D passes, reducing complexity from O(n²) to O(n).

The right image shows only horizontal blur. A real separable blur would apply vertical blur in a second pass.

The savings scale dramatically with kernel size. A 15×15 kernel drops from 225 to 30 samples. This is why large blurs are always implemented as separable passes.

The two-pass approach requires an intermediate texture: write the horizontal blur to a temporary buffer, then read from that buffer for the vertical pass. This ping-pong pattern appears throughout GPU image processing.

```wgsl
// Pass 1: Horizontal blur
@fragment
fn blur_h(@location(0) uv: vec2<f32>) -> @location(0) vec4<f32> {
  var sum = vec4(0.0);
  for (var i = -radius; i <= radius; i++) {
    let offset = vec2(f32(i) * texel.x, 0.0);
    sum += textureSample(input, s, uv + offset) * weights[i + radius];
  }
  return sum;
}

// Pass 2: Vertical blur on intermediate result
@fragment
fn blur_v(@location(0) uv: vec2<f32>) -> @location(0) vec4<f32> {
  var sum = vec4(0.0);
  for (var i = -radius; i <= radius; i++) {
    let offset = vec2(0.0, f32(i) * texel.y);
    sum += textureSample(intermediate, s, uv + offset) * weights[i + radius];
  }
  return sum;
}
```

Box blur is separable. Gaussian blur is separable. Many commonly used filters share this property. Learning to recognize and exploit separability is key to efficient image processing.
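One way to recognize separability numerically: a kernel is separable exactly when its weight matrix has rank 1, which an SVD can detect while also producing the two 1D factors. A Python sketch (the `separate` helper is illustrative, not a standard API):

```python
# A kernel is separable exactly when its weight matrix has rank 1.
import numpy as np

def separate(kernel, tol=1e-6):
    """Return (column, row) 1D factors if `kernel` is separable, else None."""
    u, s, vt = np.linalg.svd(np.asarray(kernel, dtype=float))
    if s[1] > tol * s[0]:        # second singular value nonzero: rank > 1
        return None
    scale = np.sqrt(s[0])
    return scale * u[:, 0], scale * vt[0, :]
```

The Sobel kernel factors into a smoothing column times a differencing row (up to scale); the Laplacian [[0, 1, 0], [1, -4, 1], [0, 1, 0]] has rank 2 and cannot be separated.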

Chaining Filters

Real image processing pipelines string multiple filters together. Color correction, noise reduction, sharpening, vignette—each filter transforms the output of the previous one.

Interactive: Filter Chain

[Interactive demo: Original and Filtered panels; default chain: 1. Grayscale]

Order matters. Grayscale then Sepia differs from Sepia then Grayscale. Each filter modifies the previous output.

Order matters. Sharpening before noise reduction amplifies noise. Desaturating before adjusting contrast produces different results than the reverse. The pipeline's sequence is part of its design.
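The non-commutativity is easy to demonstrate with the two filters from the demo, both expressible as 3×3 color matrices. The sepia weights below are commonly used values, assumed here rather than taken from this chapter:

```python
# Grayscale and sepia are both 3x3 color matrices; matrix products
# do not commute, so filter order changes the result.
import numpy as np

gray = np.tile([0.299, 0.587, 0.114], (3, 1))   # every channel becomes luma
sepia = np.array([[0.393, 0.769, 0.189],
                  [0.349, 0.686, 0.168],
                  [0.272, 0.534, 0.131]])

rgb = np.array([0.8, 0.4, 0.2])
gray_then_sepia = sepia @ (gray @ rgb)   # keeps a warm sepia tint
sepia_then_gray = gray @ (sepia @ rgb)   # ends up pure gray (all channels equal)
```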

Each filter in the chain can be a separate render pass, writing to an intermediate texture that the next filter reads. Alternatively, simple per-pixel filters can be combined into a single shader—one pass that applies brightness, then contrast, then saturation, using the same temporary variable:

```wgsl
var color = textureSample(image, s, uv);
color.rgb *= brightness;
color.rgb = (color.rgb - 0.5) * contrast + 0.5;
let gray = dot(color.rgb, vec3(0.299, 0.587, 0.114));
color.rgb = mix(vec3(gray), color.rgb, saturation);
return color;
```

This fusion eliminates intermediate texture writes for consecutive per-pixel operations. Only when a filter needs its neighbors' results (convolution) or global information about the whole image (histogram analysis) must we introduce a pass boundary.

Compute Shaders for Image Processing

Fragment shaders are natural for image filters: one invocation per pixel, automatic texture coordinate interpolation, straightforward output. But compute shaders offer advantages for certain patterns.

Shared memory enables cooperative loading. A workgroup can load a tile of pixels into shared memory once, then all threads in the workgroup can read neighbors without repeated texture fetches. For large convolution kernels, this reduces memory bandwidth significantly.
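The bandwidth saving is simple arithmetic. For a 16×16 workgroup, a 3×3 kernel, and an 18×18 tile (the same sizes as the compute-shader sketch in this section):

```python
# Texture fetches per output pixel, with and without cooperative tiling.
naive_fetches = 3 * 3                    # every thread samples its full 3x3 neighborhood
tile_fetches = (18 * 18) / (16 * 16)     # one cooperative tile load, amortized over the workgroup
# naive: 9 fetches per pixel; tiled: ~1.27 fetches per pixel
```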

Compute shaders also handle non-standard output patterns. Histogram computation, where many pixels contribute to a small output, does not fit the fragment shader model. Scatter operations, where one input affects multiple outputs, require atomics or clever data structures—tasks better suited to compute.
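The histogram's scatter shape is clearest on the CPU, where the GPU's atomicAdd becomes a plain loop of increments (illustrative helper):

```python
# A histogram is a scatter: many input pixels increment a few output bins.
def luma_histogram(pixels, bins=8):
    hist = [0] * bins
    for p in pixels:                           # each pixel lands in exactly one bin
        hist[min(int(p * bins), bins - 1)] += 1
    return hist
```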

```wgsl
var<workgroup> tile: array<array<vec4<f32>, 18>, 18>;

@compute @workgroup_size(16, 16)
fn convolve_tiled(@builtin(local_invocation_id) local_id: vec3<u32>,
                  @builtin(workgroup_id) group_id: vec3<u32>) {
  // Cooperatively load 18×18 tile (16×16 + 1-pixel border)
  let tile_origin = vec2<i32>(group_id.xy * 16u) - 1;
  // ... loading logic ...

  workgroupBarrier();

  // Each thread convolves from shared memory
  var sum = vec4(0.0);
  for (var dy = 0u; dy < 3u; dy++) {
    for (var dx = 0u; dx < 3u; dx++) {
      sum += tile[local_id.y + dy][local_id.x + dx] * kernel[dy][dx];
    }
  }

  // Write output
  textureStore(output, vec2<i32>(group_id.xy * 16u + local_id.xy), sum);
}
```

The tiled approach shines for large kernels and repeated operations. For simple filters, fragment shaders remain simpler and equally fast.

Key Takeaways

  • Images on the GPU are textures; processing is texture-to-texture transformation
  • Per-pixel operations (brightness, contrast, saturation) are trivially parallel and fast
  • Convolution filters sample neighbors using a weighted kernel
  • Separable filters split 2D kernels into two 1D passes, reducing samples from O(n²) to O(n)
  • Filter chains apply multiple transformations in sequence; order affects results
  • Fuse consecutive per-pixel operations into one pass; separate when neighbors are needed
  • Compute shaders with shared memory optimize large kernels via cooperative tile loading