Image Operations - The Framework Behind Vision Models

Core image operations that form the framework between backbone architectures and specialized applications


Introduction

Between the raw backbone (CNN, ViT) and the specialized application (YOLO for detection, LoFTR for matching, ArcFace for recognition) lies a framework of image operations that every modern vision model relies on. These operations are neither the architecture itself nor the task-specific head—they are the connective tissue: how images are diagnosed, transformed, scaled, normalized, and combined.

Just as time series analysis has the Dickey-Fuller test for stationarity and ARMA diagnostic statistics, computer vision has its own diagnostic tests, contrast definitions, and quality metrics. Understanding these operations and their statistical foundations matters because they determine what every downstream task can achieve.


0. Image Representation Foundations

Before any operation, an image must be represented numerically.

Pixel Space

A digital image is a tensor $I \in \mathbb{R}^{H \times W \times C}$:

  • $H$: height (rows)
  • $W$: width (columns)
  • $C$: channels (typically 3 for RGB, 1 for grayscale)

Each pixel value is typically in $[0, 255]$ (uint8) or $[0, 1]$ (float32 after normalization).

Color Spaces

| Color Space | Channels | Properties | Use Case |
|---|---|---|---|
| RGB | Red, Green, Blue | Additive, perceptually non-uniform | Default for neural networks |
| HSV | Hue, Saturation, Value | Decouples color from brightness | Color-based filtering |
| LAB | Lightness, A, B | Perceptually uniform | Color matching, segmentation |
| YUV | Luminance, Chrominance | Compression-friendly | Video processing |
| Grayscale | Single intensity | No color information | Edge detection, OCR |

Practical Tip: Train on RGB unless the task is color-invariant. For augmentation diversity, convert to HSV temporarily, modify hue/saturation, and convert back.

Input Normalization

Raw pixel values in $[0, 255]$ are poorly scaled for gradient-based optimization. Standard normalization:

\[\hat{x} = \frac{x - \mu}{\sigma}\]

where $\mu, \sigma$ are dataset statistics (e.g., ImageNet: $\mu = [0.485, 0.456, 0.406]$, $\sigma = [0.229, 0.224, 0.225]$).
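
As a concrete example, a minimal torchvision preprocessing pipeline that applies exactly this normalization (assuming PIL images as input and the ImageNet statistics quoted above):

```python
from torchvision import transforms

# Typical preprocessing: ToTensor() maps a HxWxC uint8 image in [0, 255] to a
# CxHxW float tensor in [0, 1]; Normalize() then applies (x - mu) / sigma per channel.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```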


1. Image Statistics and Diagnostic Tests

Just as time series has Dickey-Fuller for stationarity, computer vision has diagnostic tests for image quality, distribution, and properties. These tests determine whether an image is suitable for a task before any model is trained.

1.1 Contrast: Mathematical Definitions

Contrast measures how distinguishable elements are from their surroundings. Multiple formal definitions exist, each suited to different scenarios.

Michelson Contrast (for periodic patterns, e.g., gratings):

\[C_M = \frac{L_{max} - L_{min}}{L_{max} + L_{min}}\]

Range: $[0, 1]$. Used for sinusoidal patterns and visual perception studies.

Weber Contrast (for small features on uniform background):

\[C_W = \frac{L - L_b}{L_b}\]

where $L$ is feature luminance, $L_b$ is background luminance. Range: $[-1, \infty)$.

RMS Contrast (for natural images):

\[C_{RMS} = \sqrt{\frac{1}{HW}\sum_{i=1}^H \sum_{j=1}^W (I[i,j] - \bar{I})^2}\]

This is the standard deviation of pixel intensities. Most useful for general image analysis.

Threshold Table for RMS Contrast (8-bit grayscale):

| RMS Contrast | Interpretation | Visual Quality | Recommended Action |
|---|---|---|---|
| < 10 | Very low | Washed out, near uniform | Histogram stretching or equalization |
| 10 – 30 | Low | Faded, low detail | Contrast enhancement (CLAHE) |
| 30 – 60 | Normal | Good visibility | Proceed with standard processing |
| 60 – 90 | High | Sharp, detailed | Possibly over-saturated |
| > 90 | Very high | Harsh transitions | Tone-mapping may be needed |
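
A minimal NumPy sketch of the RMS definition and a triage helper based on the thresholds above (the function names are illustrative; an 8-bit grayscale input is assumed):

```python
import numpy as np

def rms_contrast(gray: np.ndarray) -> float:
    """RMS contrast = standard deviation of pixel intensities (8-bit grayscale)."""
    return float(gray.astype(np.float64).std())

def classify_contrast(c: float) -> str:
    # Thresholds taken from the table above.
    if c < 10:  return "very low - stretch or equalize"
    if c < 30:  return "low - consider CLAHE"
    if c < 60:  return "normal"
    if c < 90:  return "high - check for over-saturation"
    return "very high - consider tone-mapping"
```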

1.2 Histogram-Based Statistics

The histogram $h(k)$ counts pixel occurrences at intensity level $k$. From this:

Mean (average brightness): \(\mu = \frac{1}{HW}\sum_{i,j} I[i,j]\)

Variance (spread of intensities): \(\sigma^2 = \frac{1}{HW}\sum_{i,j} (I[i,j] - \mu)^2\)

Skewness (asymmetry of distribution): \(\gamma_1 = \frac{1}{HW \sigma^3}\sum_{i,j} (I[i,j] - \mu)^3\)

Kurtosis (tail heaviness; the $-3$ term makes this the excess kurtosis, zero for a Gaussian): \(\gamma_2 = \frac{1}{HW \sigma^4}\sum_{i,j} (I[i,j] - \mu)^4 - 3\)

Entropy (information content): \(H = -\sum_{k=0}^{255} p(k) \log_2 p(k)\)

where $p(k) = h(k)/(HW)$ is the probability of intensity $k$.

Diagnostic Threshold Table:

| Metric | Range | Interpretation | Action |
|---|---|---|---|
| Mean ($\mu$) | < 50 | Underexposed | Increase brightness or apply gamma < 1 |
| Mean ($\mu$) | 100 – 150 | Well-exposed | Standard processing |
| Mean ($\mu$) | > 200 | Overexposed | Decrease exposure or apply gamma > 1 |
| Variance ($\sigma^2$) | < 500 | Low contrast | Apply contrast enhancement |
| Variance ($\sigma^2$) | 500 – 5000 | Normal | Standard pipeline |
| Skewness ($\gamma_1$) | $\lvert\gamma_1\rvert > 1$ | Asymmetric | Consider gamma correction |
| Kurtosis ($\gamma_2$) | > 3 | Heavy tails (saturation) | Tone-mapping needed |
| Entropy ($H$) | < 5 bits | Low information | Image may be mostly uniform |
| Entropy ($H$) | 6 – 7 bits | Typical natural image | Proceed normally |
| Entropy ($H$) | > 7.5 bits | High information / noise | Check for noise; denoise if needed |
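
These diagnostics are straightforward to compute directly; a sketch for an 8-bit grayscale array (the helper name and return format are illustrative):

```python
import numpy as np

def image_statistics(gray: np.ndarray) -> dict:
    """Histogram-based diagnostics for an 8-bit grayscale image."""
    x = gray.astype(np.float64).ravel()
    mu, sigma = x.mean(), x.std()
    skew = ((x - mu) ** 3).mean() / (sigma ** 3 + 1e-12)
    kurt = ((x - mu) ** 4).mean() / (sigma ** 4 + 1e-12) - 3.0   # excess kurtosis
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))              # bits
    return {"mean": mu, "variance": sigma ** 2, "skewness": skew,
            "kurtosis": kurt, "entropy": entropy}
```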

1.3 Image Quality Metrics

When comparing two images (original vs. processed, or model output vs. ground truth):

PSNR (Peak Signal-to-Noise Ratio):

\[\text{PSNR} = 10 \log_{10}\left(\frac{\text{MAX}_I^2}{\text{MSE}}\right)\]

where $\text{MAX}_I = 255$ for 8-bit, and $\text{MSE} = \frac{1}{HW}\sum (I_1 - I_2)^2$.

SSIM (Structural Similarity Index):

\[\text{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}\]

Range: $[-1, 1]$, where 1 = identical.

LPIPS (Learned Perceptual Image Patch Similarity):

\[\text{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l}\sum_{h,w} \|w_l \odot (\phi_l(x)_{hw} - \phi_l(y)_{hw})\|_2^2\]

where $\phi_l$ are features from a pretrained network (e.g., VGG).

Quality Threshold Table:

| Metric | Excellent | Good | Acceptable | Poor | Failure |
|---|---|---|---|---|---|
| PSNR (dB) | > 40 | 30 – 40 | 25 – 30 | 20 – 25 | < 20 |
| SSIM | > 0.95 | 0.85 – 0.95 | 0.70 – 0.85 | 0.50 – 0.70 | < 0.50 |
| LPIPS | < 0.05 | 0.05 – 0.15 | 0.15 – 0.30 | 0.30 – 0.50 | > 0.50 |
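
A minimal sketch of these metrics: PSNR follows the formula above directly, SSIM is delegated to scikit-image, and LPIPS is omitted because it requires a pretrained feature network (e.g., the lpips package):

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for 8-bit images."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def ssim(ref: np.ndarray, test: np.ndarray) -> float:
    # Grayscale 8-bit inputs assumed; pass channel_axis=-1 for RGB arrays.
    return structural_similarity(ref, test, data_range=255)
```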

1.4 Distribution Comparison Tests

To test if two image sets come from the same distribution (training vs. test, real vs. synthetic):

Chi-Square Test on Histograms:

\[\chi^2 = \sum_{k=0}^{255} \frac{(h_1(k) - h_2(k))^2}{h_1(k) + h_2(k) + \epsilon}\]

Bhattacharyya Distance:

\[D_B = -\ln\left(\sum_{k=0}^{255} \sqrt{p_1(k) p_2(k)}\right)\]

Kolmogorov-Smirnov Test:

\[D_{KS} = \max_k |F_1(k) - F_2(k)|\]

where $F$ is the cumulative distribution.

Distribution Test Threshold Table:

| Test | Same Distribution | Slight Shift | Major Shift |
|---|---|---|---|
| $\chi^2$ (normalized) | < 0.1 | 0.1 – 0.5 | > 0.5 |
| Bhattacharyya | < 0.1 | 0.1 – 0.4 | > 0.4 |
| KS-statistic | < 0.05 | 0.05 – 0.20 | > 0.20 |

Practical Tip: Run distribution tests between training and validation sets before training. A KS statistic > 0.2 indicates dataset shift, and the model will likely overfit to the training distribution. Either rebalance the data or use domain adaptation.
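
A sketch of the three tests on pooled intensity histograms. Here the chi-square is computed on normalized histograms (probabilities), matching the "normalized" row of the table, and the inputs are assumed to be two lists of 8-bit grayscale arrays:

```python
import numpy as np

def histogram_shift_tests(img_set_a, img_set_b, eps: float = 1e-10) -> dict:
    """Compare the intensity distributions of two image sets."""
    h1, _ = np.histogram(np.concatenate([im.ravel() for im in img_set_a]),
                         bins=256, range=(0, 256))
    h2, _ = np.histogram(np.concatenate([im.ravel() for im in img_set_b]),
                         bins=256, range=(0, 256))
    p1, p2 = h1 / h1.sum(), h2 / h2.sum()

    chi2 = np.sum((p1 - p2) ** 2 / (p1 + p2 + eps))      # normalized chi-square
    bhatt = -np.log(np.sum(np.sqrt(p1 * p2)) + eps)       # Bhattacharyya distance
    ks = np.max(np.abs(np.cumsum(p1) - np.cumsum(p2)))    # KS statistic on CDFs
    return {"chi2": chi2, "bhattacharyya": bhatt, "ks": ks}
```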

1.5 Spatial Autocorrelation (Vision Analog of Time-Series Autocorrelation)

Just as time series has $R_{xx}[n]$, images have spatial autocorrelation measuring self-similarity at different offsets.

2D Autocorrelation:

\[R[u, v] = \sum_{i,j} I[i,j] \cdot I[i+u, j+v]\]

Moran’s I (global spatial autocorrelation):

\[I_M = \frac{N}{\sum_{i,j} w_{ij}} \cdot \frac{\sum_{i,j} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}\]

where $x_i$ are pixel intensities, $N$ is the number of pixels, and $w_{ij}$ is a spatial weight (e.g., 1 for adjacent pixels, 0 otherwise).

Range: $[-1, 1]$ where:

  • $I_M > 0$: Clustered (similar values nearby)
  • $I_M = 0$: Random
  • $I_M < 0$: Dispersed (alternating pattern)

Moran’s I Threshold Table:

| Moran's I | Interpretation | Image Type |
|---|---|---|
| > 0.7 | Highly clustered | Smooth gradients, large objects |
| 0.3 – 0.7 | Moderately clustered | Natural images, textures |
| -0.1 – 0.3 | Weak structure | Noisy or random |
| < -0.1 | Anti-correlated | Checkerboard patterns, fine textures |
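
A sketch of Moran's I assuming the simplest weight matrix (rook contiguity: $w_{ij} = 1$ for horizontally or vertically adjacent pixels, 0 otherwise):

```python
import numpy as np

def morans_i(gray: np.ndarray) -> float:
    """Moran's I with a 4-neighbour (rook contiguity) weight matrix."""
    z = gray.astype(np.float64) - gray.mean()
    # Sum of z_i * z_j over horizontally and vertically adjacent pairs,
    # doubled so each pair is counted in both directions (symmetric weights).
    num = 2 * (np.sum(z[:, :-1] * z[:, 1:]) + np.sum(z[:-1, :] * z[1:, :]))
    w_sum = 2 * (z[:, :-1].size + z[:-1, :].size)   # total weight sum
    n = z.size
    return (n / w_sum) * (num / np.sum(z ** 2))
```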

2. Preprocessing and Postprocessing Transformations

Preprocessing transforms images before they enter the model. Postprocessing transforms model outputs. Both rely on transformations whose effects can be analyzed mathematically and visualized.

2.1 Histogram Equalization

Maps pixel intensities to spread the histogram uniformly across the range.

Mathematical Definition:

\[T(k) = \text{floor}\left((L-1) \cdot \text{CDF}(k)\right)\]

where $L = 256$ for 8-bit and $\text{CDF}(k) = \sum_{i=0}^{k} p(i)$.

Effect on histogram:

Before:  ▁▁▁▆█▇▅▂▁▁▁▁▁▁    Concentrated mid-range
         0    128         255

After:   ▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃    Uniform across full range
         0    128         255

Use case: Low-contrast images (e.g., medical imaging, dark scenes).

Limitation: Globally enhances contrast even where it shouldn’t (e.g., already-saturated regions become noise).

CLAHE (Contrast Limited Adaptive Histogram Equalization): Applies equalization in tiles with a contrast limit. Better for images with both bright and dark regions.
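
With OpenCV, both operations are one-liners (the file path is hypothetical; both expect a single-channel 8-bit image):

```python
import cv2

gray = cv2.imread("low_contrast.png", cv2.IMREAD_GRAYSCALE)

# Global histogram equalization: spreads the CDF over the full [0, 255] range.
equalized = cv2.equalizeHist(gray)

# CLAHE: equalization per 8x8 tile, with each tile's histogram clipped at
# clipLimit before computing the CDF, which limits noise amplification.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)
```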

2.2 Gamma Correction

Non-linear pixel transformation:

\[I'[i,j] = 255 \cdot \left(\frac{I[i,j]}{255}\right)^\gamma\]

Transformation Curve:

Output (I')
   255 ┤        ..••••••      γ < 1  (brightens: curve bows above the diagonal)
       │     .•        ╱
   192 ┤   .•        ╱
       │  .•       ╱          γ = 1  (identity: straight diagonal)
   128 ┤ .•      ╱       ▄▄
       │ •     ╱      ▄▀
    64 ┤•    ╱     ▄▀         γ > 1  (darkens: curve bows below the diagonal)
       │•  ╱    ▄▀
     0 ┴╱▄▄▀──────────────
       0      128      255   Input (I)

Gamma Selection Threshold Table:

| Gamma ($\gamma$) | Effect | Use Case |
|---|---|---|
| 0.4 – 0.6 | Strong brightening | Recover dark images |
| 0.7 – 0.9 | Mild brightening | Slight under-exposure |
| 1.0 | Identity | No change |
| 1.1 – 1.5 | Mild darkening | Slight over-exposure |
| 1.5 – 2.5 | Strong darkening | Recover bright, washed-out images |
| 2.2 | sRGB → linear | Color-accurate processing |
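
A lookup-table implementation of the gamma formula above (the synthetic input is only for illustration):

```python
import numpy as np

def gamma_correct(img: np.ndarray, gamma: float) -> np.ndarray:
    """Apply I' = 255 * (I / 255)^gamma to an 8-bit image via a lookup table."""
    lut = (255.0 * (np.arange(256) / 255.0) ** gamma).astype(np.uint8)
    return lut[img]

dark_image = (np.random.rand(480, 640) * 80).astype(np.uint8)  # synthetic under-exposed image
brightened = gamma_correct(dark_image, 0.5)                    # gamma < 1 brightens shadows
```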

2.3 Color Channel Analysis

Each channel can be analyzed independently. Plot channel histograms to detect issues.

Color Cast Detection:

If the channel means $\bar{R}$, $\bar{G}$, $\bar{B}$ differ significantly, the image has a color cast.

\[\text{Cast Score} = \frac{\max(\bar{R}, \bar{G}, \bar{B}) - \min(\bar{R}, \bar{G}, \bar{B})}{\bar{R} + \bar{G} + \bar{B}}\]

Color Cast Threshold Table:

| Cast Score | Interpretation | Action |
|---|---|---|
| < 0.05 | Neutral / balanced | None |
| 0.05 – 0.15 | Mild cast (tungsten, daylight) | Optional white balance |
| 0.15 – 0.30 | Strong cast (incandescent) | Apply white balance |
| > 0.30 | Severe cast (color filter, broken WB) | Strong correction needed |

White Balance Correction (Gray World Assumption):

\[I'_c[i,j] = I_c[i,j] \cdot \frac{\bar{G}}{\bar{c}}, \quad c \in \{R, G, B\}\]

The assumption: the average of all pixels should be gray (equal RGB values).
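
A sketch of the cast score and gray-world correction for an $H \times W \times 3$ RGB array, using the green channel mean as the reference as in the formula above:

```python
import numpy as np

def cast_score(img: np.ndarray) -> float:
    """Cast score from the per-channel means of an RGB image."""
    means = img.reshape(-1, 3).mean(axis=0)
    return float((means.max() - means.min()) / means.sum())

def gray_world_balance(img: np.ndarray) -> np.ndarray:
    """Scale each channel so its mean matches the green channel's mean."""
    x = img.astype(np.float64)
    means = x.reshape(-1, 3).mean(axis=0)    # [R_mean, G_mean, B_mean]
    gains = means[1] / means                 # G_mean / c_mean per channel
    return np.clip(x * gains, 0, 255).astype(np.uint8)
```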

2.4 Color Distribution Visualization

A 2D color distribution plot reveals image characteristics.

Hue Histogram (HSV space):

\[h_H(k) = \sum_{i,j} \mathbb{1}[H[i,j] = k]\]

Reveals dominant colors. Useful for:

  • Detecting color-themed images (sunset = red/orange dominant)
  • Quality control (expected color distribution)

Saturation-Value Joint Distribution:

A 2D plot of $(S, V)$ values reveals image character:

     V (Value / Brightness)
   1 ┤  ░░░░        ████   ← high S, high V: vivid, healthy image
     │  ░░░░      ████
 0.5 ┤  ░░░░    ▓▓▓▓       ← low S: faded or grayscale-like
     │  ░░░░  ▓▓▓▓
     │  ░░░░▓▓▓▓           ← high S, low V: dark but saturated
   0 ┴──────────────────
     0      0.5        1
         S (Saturation)

Interpretation:

  • High S, high V: Vivid, healthy image
  • Low S, any V: Faded or grayscale-like
  • High S, low V: Dark, saturated (often noisy)

2.5 Postprocessing: Output Refinement

Model outputs often need refinement before deployment.

Non-Maximum Suppression (NMS) for Detection:

For overlapping bounding boxes, keep the highest-confidence box:

\[\text{IoU}(B_1, B_2) = \frac{|B_1 \cap B_2|}{|B_1 \cup B_2|}\]

If $\text{IoU} > \tau$ (typically 0.5), suppress the lower-confidence box.

Soft-NMS decays scores instead of suppressing:

\[s_i = s_i \cdot e^{-\text{IoU}(M, B_i)^2 / \sigma}\]

NMS Configuration Threshold Table:

| IoU Threshold ($\tau$) | Effect | Use Case |
|---|---|---|
| 0.3 | Aggressive suppression | Sparse objects, no overlap expected |
| 0.5 | Standard | General object detection |
| 0.7 | Loose suppression | Crowded scenes, partial occlusions |
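
A minimal greedy NMS sketch in NumPy, with boxes in $(x_1, y_1, x_2, y_2)$ format (function names are illustrative; in practice a library implementation such as torchvision.ops.nms is typically used):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes: np.ndarray, scores: np.ndarray, tau: float = 0.5):
    """Greedy NMS: keep the best box, drop remaining boxes with IoU > tau."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = rest[iou(boxes[best], boxes[rest]) <= tau]
    return keep
```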

Probability Calibration (Temperature Scaling):

Model probabilities are often miscalibrated. Apply:

\[p_{calibrated}(c) = \frac{e^{z_c / T}}{\sum_{c'} e^{z_{c'} / T}}\]

where $T$ is learned on a validation set. $T > 1$ softens probabilities; $T < 1$ sharpens them.

Expected Calibration Error (ECE):

\[\text{ECE} = \sum_{m=1}^M \frac{|B_m|}{N} |\text{acc}(B_m) - \text{conf}(B_m)|\]

where $B_m$ are confidence bins.

| ECE | Interpretation | Action |
|---|---|---|
| < 0.02 | Well calibrated | Deploy as-is |
| 0.02 – 0.05 | Acceptable | Monitor in production |
| 0.05 – 0.10 | Miscalibrated | Apply temperature scaling |
| > 0.10 | Severely miscalibrated | Recalibrate or retrain |

Practical Tip: After training, always check ECE on validation set. If ECE > 0.05, your probabilities don’t reflect true accuracy—apply temperature scaling before deployment.
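
A sketch of ECE over equal-width bins and of temperature-scaled softmax (bin count and function names are assumptions; $T$ would be fit on held-out logits, e.g., by minimizing NLL):

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins: int = 15) -> float:
    """ECE over equal-width confidence bins, as defined above."""
    confidences = np.asarray(confidences)
    correct = np.asarray(predictions) == np.asarray(labels)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():   # |B_m| / N  *  |acc - conf|
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)

def apply_temperature(logits: np.ndarray, T: float) -> np.ndarray:
    """Softmax with temperature T applied to an (N, classes) logit array."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```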


3. Feature Extraction Operations

Feature extraction transforms pixels into learned representations. The framework consists of three fundamental operation types.

3.1 Local Operations (Convolution)

Convolution applies a learned filter to local neighborhoods:

\[y[i,j] = \sum_{a,b} K[a,b] \cdot x[i+a, j+b]\]

Properties:

  • Translation equivariant (object moves → feature moves)
  • Parameter sharing (same filter across image)
  • Local receptive field (sees neighborhood only)

Convolution Variants:

| Operation | Receptive Field | Parameters | Use Case |
|---|---|---|---|
| Standard Conv | $k \times k$ | $k^2 C_{in} C_{out}$ | General feature extraction |
| Depthwise Conv | $k \times k$ per channel | $k^2 C$ | Mobile architectures |
| Pointwise Conv (1×1) | $1 \times 1$ | $C_{in} C_{out}$ | Channel mixing |
| Dilated Conv | $k \times k$ with gaps | $k^2 C_{in} C_{out}$ | Larger receptive field, same parameters |
| Transposed Conv | Inverse mapping | $k^2 C_{in} C_{out}$ | Upsampling |
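
For example, the depthwise and pointwise rows combine into a depthwise-separable block; a minimal PyTorch sketch:

```python
import torch.nn as nn

def depthwise_separable(c_in: int, c_out: int, k: int = 3) -> nn.Sequential:
    """Depthwise conv (one k x k filter per channel) followed by a 1x1 pointwise conv."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, kernel_size=k, padding=k // 2, groups=c_in, bias=False),
        nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),
    )
# Parameter count: k^2 * C_in + C_in * C_out, versus k^2 * C_in * C_out for a standard conv.
```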

3.2 Global Operations (Self-Attention)

Self-attention computes relationships between all spatial positions:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Trade-off vs Convolution:

| Aspect | Convolution | Self-Attention |
|---|---|---|
| Receptive field | Local, grows with depth | Global from layer 1 |
| Complexity | $O(HW \cdot k^2 C)$ | $O((HW)^2 C)$ |
| Inductive bias | Strong (locality) | Weak (learns relationships) |
| Data efficiency | High | Low (needs large pre-training) |
| Hardware efficiency | Optimized (cuDNN) | Less optimized |

3.3 Hybrid Operations

Modern architectures combine both:

  • Early layers: Convolution (local features, hardware efficient)
  • Late layers: Attention (global reasoning)
  • Examples: ConvNeXt, CoAtNet, MobileViT

4. Spatial Operations

Spatial operations control how features are scaled and arranged in space.

4.1 Downsampling

Reduce spatial resolution while increasing semantic depth.

| Method | Function | Information Loss | Learnable |
|---|---|---|---|
| Max pooling | $\max$ | Keeps strongest activation | No |
| Average pooling | $\text{mean}$ | Smooths features | No |
| Strided convolution | Learned filter | Minimal (learned) | Yes |
| Blur pooling | Gaussian + subsample | Anti-aliased | No |

4.2 Upsampling

Increase spatial resolution—essential for segmentation, generation.

| Method | How | Quality | Cost |
|---|---|---|---|
| Nearest neighbor | Repeat pixels | Blocky | Free |
| Bilinear | Linear interpolation | Smooth | Cheap |
| Bicubic | Cubic interpolation | Smoother | Moderate |
| Transposed convolution | Learned filter | Best (with checkerboard risk) | Expensive |
| Pixel shuffle | Channel-to-space rearrangement | Best (no checkerboard) | Cheap |

Practical Tip: Use pixel shuffle (sub-pixel convolution) over transposed convolution. It avoids checkerboard artifacts and is more parameter-efficient.
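
A minimal PyTorch sketch of sub-pixel upsampling (layer sizes are illustrative):

```python
import torch.nn as nn

def subpixel_upsample(c_in: int, c_out: int, r: int = 2) -> nn.Sequential:
    """Conv produces r^2 * c_out channels; PixelShuffle rearranges them into an
    r-times larger feature map, avoiding transposed-conv checkerboard artifacts."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out * r * r, kernel_size=3, padding=1),
        nn.PixelShuffle(r),
    )
```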

4.3 Multi-Scale Processing (Feature Pyramid Network)

Real-world objects appear at different scales. Multi-scale operations handle this.

\[P_l = \text{Conv}(C_l + \text{Upsample}(P_{l+1}))\]

This produces features at multiple resolutions, each combining:

  • Top-down path: semantically strong, low-resolution features from deep layers, progressively upsampled
  • Lateral connections (bottom-up path): spatially precise, high-resolution features from earlier layers

Used by: Object detection (small + large objects), segmentation (fine + coarse boundaries).
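
A minimal sketch of the top-down pathway in PyTorch (the channel counts are assumptions typical of a ResNet backbone; the full FPN adds further details such as extra stride-2 levels):

```python
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Lateral 1x1 convs + nearest-neighbour upsampling + addition + 3x3 smoothing."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):            # feats = [C2, C3, C4, C5], high-res to low-res
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):            # top-down merge
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]  # [P2, P3, P4, P5]
```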


5. Channel Operations

Channels carry semantic information. Operations on the channel dimension control what features are emphasized.

5.1 Channel Mixing (1×1 Convolution)

A 1×1 convolution mixes channels at each spatial position:

\[y_c[i,j] = \sum_{c'} W[c, c'] \cdot x_{c'}[i,j]\]

Use cases:

  • Bottleneck: Reduce channels before expensive operation
  • Projection: Match channel dimensions for residual connections

5.2 Channel Attention (Squeeze-and-Excitation)

Compute per-channel importance weights:

\[z_c = \text{GlobalAvgPool}(x_c), \quad s = \sigma(W_2 \cdot \text{ReLU}(W_1 z)), \quad y_c = s_c \cdot x_c\]

Cost: Few parameters, ~1% computation overhead, often 1-2% accuracy gain.
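
A minimal PyTorch sketch of an SE block (the reduction ratio of 16 follows the common default):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pool -> bottleneck MLP -> sigmoid gates."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        s = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)   # per-channel weights s_c
        return x * s                                            # y_c = s_c * x_c
```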

5.3 Spatial Attention

Complement to channel attention—learn which spatial positions matter:

\[s[i,j] = \sigma(\text{Conv}([\text{MaxPool}_{ch}(x), \text{AvgPool}_{ch}(x)]))\]

Together, channel + spatial attention forms CBAM.


6. Normalization Operations

Normalization stabilizes activations and gradients during training.

6.1 The Normalization Family

Different operations normalize over different dimensions of the activation tensor $x \in \mathbb{R}^{N \times C \times H \times W}$:

| Operation | Normalizes Over | Dependence | Use Case |
|---|---|---|---|
| BatchNorm | $(N, H, W)$ per channel | Batch size | Standard CNNs, large batch |
| LayerNorm | $(C, H, W)$ per sample | Independent | Transformers |
| InstanceNorm | $(H, W)$ per channel, per sample | Independent | Style transfer |
| GroupNorm | $(H, W)$ and channels within a group, per sample | Independent | Small batch, detection |

6.2 Batch Size Threshold Table

BatchNorm degrades with small batches because batch statistics become noisy:

| Batch Size | Recommended | Why |
|---|---|---|
| 1 – 4 | LayerNorm or GroupNorm | Batch stats unreliable |
| 8 – 16 | GroupNorm | Compromise stability |
| 32 – 128 | BatchNorm | Standard, well-tuned |
| 128+ | BatchNorm | Optimal, large-scale training |

Practical Tip: When fine-tuning with small batches on a model trained with BatchNorm, freeze the batch statistics rather than recomputing them.
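
A sketch of this freezing step in PyTorch (note that calling model.train() switches BatchNorm back to batch statistics, so this helper should be re-applied after each such call):

```python
import torch.nn as nn

def freeze_batchnorm(model: nn.Module) -> None:
    """Keep pretrained running statistics when fine-tuning with small batches."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()                             # use running mean/var, not batch stats
            if m.affine:                         # optionally freeze the affine params too
                m.weight.requires_grad_(False)
                m.bias.requires_grad_(False)
```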


7. Augmentation Operations

Augmentation expands the training distribution by transforming inputs.

7.1 Geometric Augmentations

Modify spatial structure: random crop, flip, rotation, scale, affine.

7.2 Photometric Augmentations

Modify pixel values: brightness, contrast, saturation, hue shift, noise injection.

7.3 Mixing Augmentations

MixUp: Linear blend of inputs and labels: \(x' = \lambda x_i + (1-\lambda) x_j, \quad y' = \lambda y_i + (1-\lambda) y_j\), with $\lambda \sim \text{Beta}(\alpha, \alpha)$.

CutMix: Spatial paste of one image into another.

CutOut: Random masking with zeros.
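
A minimal MixUp sketch on a batch (assumes one-hot or soft labels so they can be blended; $\lambda$ is drawn from a Beta distribution as in the formulation above):

```python
import numpy as np
import torch

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    """Blend each sample with a randomly permuted partner; y must be one-hot / soft labels."""
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y + (1.0 - lam) * y[perm]
    return x_mixed, y_mixed
```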

Augmentation Strength Table:

| Method | Diversity | Realism | When to Use |
|---|---|---|---|
| Geometric only | Low | High | Small dataset, simple task |
| Geometric + Photometric | Medium | High | Standard training |
| + MixUp/CutMix | High | Medium | Strong regularization needed |
| AutoAugment / RandAugment | Highest | Variable | Large compute budget |

8. Loss Design Operations

Loss functions translate task requirements into gradients.

8.1 Pixel-Level Losses (Regression)

| Loss | Formula | Property |
|---|---|---|
| L1 | $\lvert y - \hat{y}\rvert$ | Robust to outliers, sharp |
| L2 (MSE) | $(y - \hat{y})^2$ | Smooth, sensitive to outliers |
| Smooth L1 | Hybrid | Robust + smooth |
| Charbonnier | $\sqrt{(y-\hat{y})^2 + \epsilon^2}$ | Smooth approximation of L1 |

8.2 Classification Losses

| Loss | Use Case |
|---|---|
| Cross-Entropy | Standard classification |
| Focal Loss | Class imbalance |
| Label Smoothing CE | Reduce overconfidence |
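
As an example of the focal loss row, a minimal binary focal loss sketch in PyTorch (the defaults $\gamma = 2$, $\alpha = 0.25$ follow common practice; targets are float 0/1 tensors):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Binary focal loss: down-weights easy, well-classified examples."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                                  # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```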

8.3 Metric Learning Losses

| Loss | Property |
|---|---|
| Contrastive | Pull positives, push negatives |
| Triplet | Anchor + positive + negative |
| ArcFace / CosFace | Angular margin on hypersphere |
| InfoNCE | Self-supervised contrastive |

8.4 Multi-Task Losses

Real systems combine losses:

\[L = \lambda_1 L_{cls} + \lambda_2 L_{box} + \lambda_3 L_{seg} + ...\]

The weights $\lambda_i$ matter as much as the losses themselves.


Why This Framework Matters

The operations above are not optional details—they determine what’s achievable:

  • Image diagnostic tests (contrast, distribution shift) reveal problems before training, saving compute
  • Feature extraction quality sets a ceiling for every downstream task
  • Spatial operations determine handling of scale, resolution, and locality
  • Channel operations decide what semantic information is emphasized
  • Normalization controls whether training is stable or diverges
  • Augmentation sets the effective data distribution the model learns from
  • Loss design translates task goals into gradients—wrong loss, wrong learning
  • Postprocessing (NMS, calibration) determines deployment quality

Specialized models (YOLO, LoFTR, ArcFace) are combinations of these operations tuned for specific tasks. Understanding the framework lets you read any vision paper, modify any architecture, and design new systems by composing operations rather than copying recipes.



This post is licensed under CC BY 4.0 by the author.