
Encoding in Machine Learning: Designing Categorical Geometry

A geometric and statistical exploration of encoder families — from categorical mappings to latent spaces, positional signals, and semantic retrieval encoders


Introduction

Before a model learns, before a loss is minimized, before a gradient flows, a quieter decision is made.

How do we represent the world?

A categorical variable looks innocent. A column of regions. Product IDs. User segments. Device types. We call them “features” as if they were naturally numeric, as if the model were merely waiting for them to be formatted correctly.

But a category is not a number. It is a partition. It is a declaration that some observations are equivalent under a certain abstraction.

The moment we transform it, we decide something far more consequential than data formatting. We decide whether two categories are neighbors or strangers. Whether they lie on a line or float in orthogonal isolation. Whether identity matters more than frequency. Whether behavior matters more than structure. Whether similarity is imposed or allowed to emerge.

And once we decide, the model never questions that choice.

What is distance between “France” and “Germany”? Should “Premium” be twice “Standard”? Is rarity itself meaningful, or only correlation with outcome? When we collapse identity into expectation, are we modeling behavior — or leaking it? When we embed categories in dense vectors, are we discovering structure — or inventing it?

Only at the end do we name the act: to encode — from Latin in- (“into”) and codex (“a book of rules”) — is to inscribe something into a system. In machine learning, it is the transformation of information into numerical form so a model can process it.

Encoding is the first act of modeling. It is where epistemology becomes geometry.

Everything that follows — bias, variance, generalization, fairness, stability — is downstream of that act.


1. A Taxonomy of Encoder Families

Encoding is not one technique. It is a family of techniques, each answering a different question about what structure to impose and from where.

| Family | What it encodes | Geometry imposed | Typical domain |
|---|---|---|---|
| Categorical | discrete identity | scalar or orthogonal basis | tabular ML |
| Latent Space | compressed structure | learned manifold | generative models |
| Positional | sequence order | frequency-based or learned offsets | Transformers, LLMs |
| Semantic | meaning of objects or pairs | dense vector or scalar score | NLP, retrieval |

Each family operates on a different input type, serves a different modeling objective, and imposes a fundamentally different geometry.

The choice of encoder family is not a preprocessing detail — it defines what the model is allowed to know about the world.

Encoding determines whether categories become collinear scalars, orthogonal axes, empirical expectations, or learned manifolds.


2. Categorical Encoders

Categorical encoders address the most classical problem: a finite set of labels must become numbers.

Formally, they define a mapping from a discrete set $\mathcal{C}$ into some Euclidean space:

\[f : \mathcal{C} \rightarrow \mathbb{R}^k\]

The choice of $f$ determines adjacency, distance, and orientation. Once applied, the model interacts only with the induced geometry — never again with the original set.

| Encoder | Geometric form | Key assumption |
|---|---|---|
| Label | 1D ordered axis | Total ordering exists |
| One-Hot | Orthogonal basis | Categories fully independent |
| Ordinal | Ordered scalar | Uniform rank spacing |
| Target | Conditional mean | Outcome tendency defines identity |
| Frequency | Density scalar | Prevalence is predictive |

2.1 Label Encoding

Label encoding maps each category to an integer:

\[\{c_1, \dots, c_k\} \mapsto \{0, 1, \dots, k-1\}\]
The induced metric is the absolute difference $d(c_i, c_j) = |f(c_i) - f(c_j)|$, implying uniform spacing and a total order.

Two arbitrary label assignments for the same five regions produce different tree splits at threshold $f(c) \leq 1$:

| Region | Encoding A | Encoding B |
|---|---|---|
| North | 0 | 3 |
| South | 1 | 0 |
| East | 2 | 1 |
| West | 3 | 4 |
| Central | 4 | 2 |
  • Split A (f(c) ≤ 1): {North, South} vs {East, West, Central}
  • Split B (f(c) ≤ 1): {South, East} vs {North, West, Central}

Same model, same threshold — entirely different groupings, determined solely by encoding order.

Label encoding introduces artificial ordinal structure unless rank is intrinsic to the domain.
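The split behavior above can be reproduced in a few lines. This is a minimal pure-Python sketch using the illustrative region names and encodings from the table, not a real dataset:

```python
# Two arbitrary label assignments for the same five regions.
regions = ["North", "South", "East", "West", "Central"]
encoding_a = {"North": 0, "South": 1, "East": 2, "West": 3, "Central": 4}
encoding_b = {"North": 3, "South": 0, "East": 1, "West": 4, "Central": 2}

def tree_split(encoding, threshold=1):
    """Group regions the way a tree split on f(c) <= threshold would."""
    left = [r for r in regions if encoding[r] <= threshold]
    right = [r for r in regions if encoding[r] > threshold]
    return left, right

left_a, _ = tree_split(encoding_a)   # ['North', 'South']
left_b, _ = tree_split(encoding_b)   # ['South', 'East']
```

The threshold never changed; only the arbitrary integer assignment did, and the partition the model can express changed with it.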

2.2 One-Hot Encoding

One-hot encoding maps categories to canonical basis vectors $c_i \mapsto e_i \in \mathbb{R}^k$, where $e_i$ has a 1 in position $i$ and 0 elsewhere.

The Euclidean distance between any two categories is constant: $\|e_i - e_j\|_2 = \sqrt{2}$ for $i \neq j$. No category is closer to another — the embedding assumes complete independence.

Low cardinality (3 categories — manageable):

| Observation | is_France | is_Germany | is_UK |
|---|---|---|---|
| obs_1 | 1 | 0 | 0 |
| obs_2 | 0 | 1 | 0 |
| obs_3 | 0 | 0 | 1 |

High cardinality (6 categories — sparse, mostly zeros):

| Observation | is_FR | is_DE | is_UK | is_ES | is_IT | is_PL |
|---|---|---|---|---|---|---|
| obs_1 | 1 | 0 | 0 | 0 | 0 | 0 |
| obs_2 | 0 | 0 | 0 | 0 | 1 | 0 |
| obs_3 | 0 | 0 | 0 | 1 | 0 | 0 |

With $k = 500$ product IDs, each row is 499 zeros and 1 one. Memory scales with $n \times k$. For identifiability with an intercept, one dimension must be dropped to avoid perfect collinearity.

High-cardinality one-hot encoding inflates parameter dimensionality and increases estimator variance for rare categories.
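The geometry is easy to verify directly. A minimal numpy sketch with a hypothetical six-country column (real pipelines would typically use something like scikit-learn's `OneHotEncoder` instead):

```python
import numpy as np

categories = ["FR", "DE", "UK", "ES", "IT", "PL"]
index = {c: i for i, c in enumerate(categories)}

def one_hot(values):
    """Map each value to its canonical basis vector e_i in R^k."""
    X = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        X[row, index[v]] = 1.0
    return X

X = one_hot(["FR", "IT", "ES"])          # shape (3, 6), exactly one 1 per row

# Every pair of distinct categories sits at the same distance sqrt(2):
E = np.eye(len(categories))
dist = float(np.linalg.norm(E[0] - E[3]))

# Drop one column to avoid perfect collinearity with an intercept:
X_identifiable = X[:, 1:]
```

Every distinct pair gives the same `dist`; that constant-distance geometry is the "complete independence" assumption made concrete.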

2.3 Target Encoding

Target encoding replaces identity with empirical expectation:

\[c \mapsto \mathbb{E}[Y \mid C = c]\]

The category becomes a sufficient statistic for outcome tendency. To control variance under small sample sizes, shrinkage is applied:

\[\hat{\mu}_c = \frac{n_c \mu_c + \alpha \mu}{n_c + \alpha}\]

where $\mu$ is the global mean and $\alpha$ controls regularization. This is equivalent to empirical Bayes shrinkage under a conjugate prior.

The critical risk is leakage: if $\hat{\mu}_c$ is computed using full training data, each observation’s $y$ contributes to its own encoding.

Naïve (leaky): each observation’s $y$ is in its own $\hat{\mu}_c$.

| obs | category | y | $\hat{\mu}_c$ (full data) | self-contribution |
|---|---|---|---|---|
| 1 | A | 1.00 | 0.75 | yes |
| 2 | A | 0.50 | 0.75 | yes |
| 3 | B | 0.20 | 0.20 | yes |

Out-of-fold (correct): obs 1’s encoding is computed on folds that exclude obs 1.

| obs | category | y | $\hat{\mu}_c$ (excl. self) | self-contribution |
|---|---|---|---|---|
| 1 | A | 1.00 | 0.625 | no |
| 2 | A | 0.50 | 0.583 | no |
| 3 | B | 0.20 | — (no other B rows; falls back toward the global mean) | no |

Target encoding must be computed out-of-fold to prevent target leakage.
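An out-of-fold scheme with shrinkage can be sketched in numpy. The fold assignment, `alpha`, and the toy data below are illustrative choices, not a canonical implementation:

```python
import numpy as np

def oof_target_encode(cats, y, n_splits=5, alpha=10.0, seed=0):
    """Out-of-fold target encoding with additive shrinkage.

    Each observation's encoding is computed from the other folds only,
    so its own y never leaks into its own feature value.
    """
    cats = np.asarray(cats)
    y = np.asarray(y, dtype=float)
    folds = np.random.default_rng(seed).integers(0, n_splits, size=len(y))
    encoded = np.empty(len(y))
    for f in range(n_splits):
        train = folds != f
        mu_global = y[train].mean()
        for c in np.unique(cats):
            in_cat = cats == c
            n_c = int((train & in_cat).sum())
            mu_c = y[train & in_cat].mean() if n_c else mu_global
            # Shrink the category mean toward the fold's global mean.
            encoded[(folds == f) & in_cat] = (n_c * mu_c + alpha * mu_global) / (n_c + alpha)
    return encoded

# Toy demonstration (illustrative data, not the rows from the tables above):
cats = ["A", "A", "A", "B", "B", "C"] * 5
y = [1, 0, 1, 1, 0, 0] * 5
enc = oof_target_encode(cats, y)
```

Because each fold's statistics exclude the fold being encoded, an observation's own outcome never contributes to its own feature value, and rare categories are pulled toward the global mean by `alpha`.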

2.4 Frequency Encoding

Frequency encoding maps categories to empirical prevalence:

\[c \mapsto \frac{n_c}{N}\]

The geometry reflects statistical mass, not semantic similarity. Categories with equal frequency collapse to identical representations. Unlike target encoding, it introduces no leakage — it is independent of $Y$.
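A minimal sketch, including the collapse of equal-frequency categories:

```python
from collections import Counter

def frequency_encode(values):
    """Map each category to its empirical prevalence n_c / N."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

col = ["A", "A", "A", "B", "B", "C"]
enc = frequency_encode(col)                # [0.5, 0.5, 0.5, 1/3, 1/3, 1/6]

# Categories with equal frequency collapse to identical representations:
collapsed = frequency_encode(["X", "Y"])   # [0.5, 0.5] -- X and Y merge
```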


3. Latent Space Encoders

Latent space encoders do not map a predefined set to a predefined space. They learn the encoding itself from data, optimizing for a task — reconstruction, generation, or classification.

The common structure is a bottleneck: high-dimensional input is compressed into a lower-dimensional latent representation $z$, which the model must use to accomplish its objective.

| Encoder type | Latent $z$ | Training signal | Key property |
|---|---|---|---|
| Autoencoder | deterministic point | reconstruction loss | unsupervised compression |
| VAE | distribution $(\mu, \sigma)$ | reconstruction + KL | structured, traversable latent space |
| VQ-VAE | discrete codebook index | reconstruction + commitment loss | discrete latent space |

3.1 Autoencoders

An autoencoder frames representation learning as a reconstruction problem. It is composed of two parametric functions trained jointly:

  • Encoder $f_\theta : \mathcal{X} \rightarrow \mathcal{Z}$ — typically a stack of dense or convolutional layers that progressively reduces dimensionality down to the latent space $\mathcal{Z}$
  • Decoder $g_\phi : \mathcal{Z} \rightarrow \mathcal{X}$ — a mirrored architecture that maps back up from $\mathcal{Z}$ to the original input space

The encoder maps $x$ to a latent code $z$, the decoder reconstructs $\hat{x}$ from $z$, and the whole system is trained end-to-end to minimize reconstruction error:

\[z = f_\theta(x), \qquad \hat{x} = g_\phi(z), \qquad \mathcal{L} = \|x - g_\phi(f_\theta(x))\|^2\]

The information bottleneck — $\dim(z) \ll \dim(x)$ — is the key constraint. Since the decoder must recover $x$ from $z$ alone, the encoder must retain the most statistically informative structure and discard redundancy. This is conceptually related to truncated PCA, but the encoder is nonlinear, so the compression can exploit higher-order structure that linear projections miss.

| Layer | Dimension | Role |
|---|---|---|
| Input $x$ | 784 (e.g. 28×28 image) | Raw observation |
| Encoder hidden layers | decreasing | Progressive dimensionality reduction |
| Bottleneck $z$ | 32 | Compressed representation |
| Decoder hidden layers | increasing | Progressive reconstruction |
| Output $\hat{x}$ | 784 | Reconstruction |

No prior is placed on $z$. The latent space is organized however minimizes the loss — which makes autoencoders effective for anomaly detection (out-of-distribution inputs reconstruct poorly) but poorly suited for generation (arbitrary samples from $\mathcal{Z}$ decode to noise, since there is no guarantee the space between encoded points is meaningful).

Autoencoder encodings minimize reconstruction error, not predictive accuracy. A good reconstruction embedding is not necessarily a good predictive embedding.
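The bottleneck dynamic can be sketched with a linear autoencoder trained by plain gradient descent in numpy. This is a didactic toy: the dimensions, learning rate, and step count are arbitrary choices, and a real autoencoder would use a deep-learning framework with nonlinear layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 20, 3                       # samples, input dim, bottleneck dim
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))   # data with rank-k structure

W_enc = 0.1 * rng.normal(size=(d, k))      # encoder f_theta (linear)
W_dec = 0.1 * rng.normal(size=(k, d))      # decoder g_phi (linear)

def loss():
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

initial_loss = loss()
lr = 0.01
for _ in range(2000):
    Z = X @ W_enc                          # z = f_theta(x)
    err = Z @ W_dec - X                    # x_hat - x
    grad_dec = Z.T @ err / n               # dL/dW_dec (up to a constant factor)
    grad_enc = X.T @ (err @ W_dec.T) / n   # dL/dW_enc
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = loss()    # should fall well below initial_loss, since the
                       # rank-3 data fits through a 3-dimensional bottleneck
```

Because the data here has exactly rank-$k$ structure, the bottleneck loses almost nothing; shrink `k` below the data's intrinsic rank and reconstruction error becomes irreducible.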

3.2 Variational Autoencoders

A plain autoencoder has no prior over $\mathcal{Z}$. Two observations that are semantically similar may land in distant, unrelated regions of latent space, because nothing in the objective penalizes that. Sampling an arbitrary point from $\mathcal{Z}$ and decoding it produces incoherent outputs, because the decoder has only been trained on points that are direct outputs of the encoder — the complement of that set is effectively out-of-distribution.

A Variational Autoencoder (VAE) addresses this by reformulating the problem as variational inference. Rather than learning a deterministic encoding, the encoder learns a posterior distribution over latent codes. The architecture remains encoder–decoder, but the encoder now outputs the parameters of a Gaussian:

  • Encoder $f_\theta : \mathcal{X} \rightarrow (\mu, \sigma)$ — outputs a mean vector and a standard deviation vector, both in $\mathbb{R}^d$
  • Sampling — $z$ is drawn from $q_\theta(z \mid x) = \mathcal{N}(\mu_\theta(x),\, \sigma_\theta^2(x))$
  • Decoder $g_\phi : \mathcal{Z} \rightarrow \mathcal{X}$ — same role as in a plain autoencoder

Encoding to a distribution rather than a point means every input occupies a region in $\mathcal{Z}$. When the regularization term enforces overlap between those regions, the complement of the training encodings is no longer out-of-distribution — the latent space becomes dense enough that arbitrary samples decode coherently.

The reparameterization trick makes this trainable. Sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ is not differentiable — gradients cannot flow back through a stochastic node to reach $\mu$ and $\sigma$. The trick externalizes the randomness: sample auxiliary noise $\varepsilon \sim \mathcal{N}(0, I)$ independently of the parameters, then construct $z$ deterministically as:

\[z = \mu_\theta(x) + \sigma_\theta(x) \cdot \varepsilon\]

From the optimizer’s perspective, $\varepsilon$ is a fixed constant. $\mu$ and $\sigma$ are differentiable functions of the input, so gradients propagate through the entire encoder normally.
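The construction is easy to check numerically. A small sketch (the `mu` and `sigma` values are arbitrary illustrations):

```python
import numpy as np

# Reparameterization: all randomness enters through eps ~ N(0, I);
# z is then a deterministic, differentiable function of (mu, sigma),
# with dz/dmu = 1 and dz/dsigma = eps.
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
sigma = np.array([0.5, 1.5])

eps = rng.standard_normal(size=(100_000, 2))   # parameter-free noise
z = mu + sigma * eps                           # distributed as N(mu, sigma^2)

sample_mean = z.mean(axis=0)   # close to mu
sample_std = z.std(axis=0)     # close to sigma
```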

The training objective is the negative Evidence Lower Bound (ELBO): minimizing the loss below is equivalent to maximizing the ELBO. It decomposes into two terms:

\[\mathcal{L} = \underbrace{\|x - \hat{x}\|^2}_{\text{reconstruction}} \;+\; \underbrace{D_{\text{KL}}\!\left(q_\theta(z \mid x) \;\|\; \mathcal{N}(0,I)\right)}_{\text{latent regularization}}\]

The reconstruction term keeps the encoding informative — same pressure as a plain autoencoder. The KL term penalizes how much the encoder’s posterior $q_\theta(z \mid x)$ deviates from the prior $\mathcal{N}(0, I)$. It is zero when the posterior matches the prior exactly, and grows as they diverge. Its effect is to anchor every encoding region near the origin with bounded spread, so the latent space cannot collapse into isolated islands or grow unbounded.
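For a diagonal Gaussian posterior, this KL term has a closed form, which is what VAE implementations compute in practice (per dimension, $\tfrac{1}{2}(\sigma^2 + \mu^2 - 1 - \ln \sigma^2)$):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

# Zero exactly when the posterior matches the prior ...
kl_match = kl_to_standard_normal(np.zeros(4), np.ones(4))      # 0.0
# ... and growing as the posterior drifts away from it.
kl_far = kl_to_standard_normal(np.array([2.0, 0.0]), np.array([1.0, 0.5]))
```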

The two terms are in fundamental tension: reconstruction pushes the encoder toward sharp, concentrated posteriors (more information preserved); the KL term pushes toward diffuse posteriors that all resemble the prior (more regularity). The model settles at a balance determined by the relative weighting — in practice, a scalar $\beta > 1$ on the KL term ($\beta$-VAE) can increase disentanglement at the cost of reconstruction fidelity.

| Property | Autoencoder | VAE |
|---|---|---|
| Latent $z$ | deterministic point | sampled from $(\mu, \sigma)$ |
| Latent structure | unregularized | regularized toward $\mathcal{N}(0,I)$ |
| Gradient through $z$ | direct | via reparameterization |
| Sampling new points | not meaningful | meaningful interpolation |
| Objective | reconstruction only | reconstruction + KL divergence |

Because the latent space is regularized, points sampled between two known encodings decode into plausible observations. The geometry is smooth and traversable.

The VAE encodes not a point but a region of uncertainty. This enables controlled generation and interpolation — plain autoencoders cannot do this reliably.


4. Positional Encoders

Attention mechanisms are permutation-invariant. A Transformer has no built-in notion of sequence order — it treats a sentence and a shuffled version of that sentence identically.

Positional encoders solve this by injecting a position-dependent signal into each token representation:

\[x'_t = x_t + PE(t)\]

| Variant | How position enters | Learnable | Extrapolates to longer sequences |
|---|---|---|---|
| Sinusoidal (Vaswani 2017) | additive, fixed | no | yes (by design) |
| Learned absolute | additive lookup table | yes | no |
| RoPE | multiplicative rotation in attention | partially | yes |
| ALiBi | additive bias on attention scores | no | yes |

4.1 Sinusoidal Positional Encoding

Each position $t$ is mapped to a $d$-dimensional vector by applying sine and cosine at geometrically decreasing frequencies across dimension pairs:

\[PE_{(t,\, 2i)} = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \qquad PE_{(t,\, 2i+1)} = \cos\!\left(\frac{t}{10000^{2i/d}}\right)\]

The design is a multi-scale decomposition of position. Low-index dimensions ($i$ small) have high frequency — they oscillate rapidly and resolve fine-grained positional differences between adjacent tokens. High-index dimensions ($i$ large) have low frequency — they change slowly and encode coarse positional structure across long spans. Together the $d$ dimensions uniquely identify any position up to the sequence length the frequencies can represent.

Using both sine and cosine at each frequency is deliberate: any phase shift $PE(t + \Delta)$ can be expressed as a linear transformation of $PE(t)$, making relative position a linear operation in the encoding space.

| Position | dim 0 ($\sin$, high freq) | dim 1 ($\cos$, high freq) | dim 2 ($\sin$, low freq) | dim 3 ($\cos$, low freq) |
|---|---|---|---|---|
| 0 | 0.000 | 1.000 | 0.000 | 1.000 |
| 1 | 0.841 | 0.540 | 0.010 | 1.000 |
| 2 | 0.909 | −0.416 | 0.020 | 1.000 |
| 3 | 0.141 | −0.990 | 0.030 | 1.000 |
| 4 | −0.757 | −0.654 | 0.040 | 0.999 |

High-frequency dimensions (0–1) vary substantially across adjacent positions — they carry the fine-grained signal. Low-frequency dimensions (2–3) are nearly constant at these short distances and only differentiate positions at much longer range.

The inner product $PE(t)^\top PE(t')$ depends only on the offset $t - t'$, not on absolute position. This gives the model a structural inductive bias toward relative position: the similarity between two positional encodings is a function of their distance, not their location in the sequence.
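Both properties can be checked numerically. A sketch following the Vaswani et al. formula (`max_len` and `d` are arbitrary):

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """Sinusoidal positional encoding: sin/cos pairs at geometric frequencies."""
    t = np.arange(max_len)[:, None]        # positions
    i = np.arange(d // 2)[None, :]         # frequency index per sin/cos pair
    angles = t / (10000.0 ** (2 * i / d))
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

PE = sinusoidal_pe(128, 64)

# The inner product depends only on the offset t - t', because per pair
# sin(a)sin(b) + cos(a)cos(b) = cos(a - b):
near = float(PE[10] @ PE[14])
far = float(PE[50] @ PE[54])   # same offset of 4, different absolute positions
```

Here `near` and `far` agree to floating-point precision, which is the relative-position bias stated above.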

4.2 Relative and Rotary Variants

Modern large language models move position into attention rather than into token representations.

RoPE (Rotary Position Embedding) encodes position as a rotation applied to query and key vectors before computing attention. The rotation angle depends on the position difference, making attention scores inherently position-relative.

ALiBi (Attention with Linear Biases) adds a fixed negative slope to attention scores as a function of key-query distance — no vector modification required.

Both variants improve length generalization: a model trained on sequences of length 512 can often be applied to sequences of length 2048 with far less degradation than learned absolute embeddings allow.
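The relative-position property behind RoPE can be illustrated in two dimensions. This uses a single hypothetical frequency `theta`; real RoPE applies many such rotations, one per dimension pair:

```python
import numpy as np

def rotate(v, pos, theta=0.1):
    """Rotate a 2-D vector by the position-dependent angle pos * theta."""
    a = pos * theta
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    return R @ v

q = np.array([1.0, 0.3])    # query vector before rotation
k = np.array([0.4, -0.8])   # key vector before rotation

# (R_a q) . (R_b k) = q . (R_{b-a} k): the score depends only on b - a.
score_near = float(rotate(q, 7) @ rotate(k, 3))      # positions (7, 3)
score_far = float(rotate(q, 104) @ rotate(k, 100))   # same offset, shifted by 97
```

Rotations compose by subtraction of angles, so shifting both positions by the same amount leaves the attention score unchanged.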

Positional encodings are not learned from labels — they are geometric injections. The model learns to use them; it does not learn what they are.


5. Semantic Encoders

Semantic encoders map natural language — or structured objects — into representations that reflect meaning, not just identity.

The central question they answer: how similar is A to B?

| Encoder type | Input | Output | Similarity computation |
|---|---|---|---|
| Bi-encoder | single text | dense vector | $f(q)^\top f(d)$ (dot product) |
| Cross-encoder | text pair $(q, d)$ | scalar score | full attention over the pair |
| Poly-encoder | query + multiple candidates | weighted sum | intermediate between the two |

5.1 Bi-Encoders

A bi-encoder encodes query and document independently:

\[s(q, d) = f_\theta(q)^\top f_\theta(d)\]

Because $f_\theta(d)$ can be precomputed for all documents, retrieval scales to billions of candidates via approximate nearest neighbor search.

The constraint is strong: each representation must be independently sufficient. Cross-document reasoning is impossible — the model cannot attend to tokens in $d$ while encoding $q$.
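Brute-force retrieval over precomputed document vectors can be sketched as follows. Random unit vectors stand in for a trained encoder; a production system would swap the matrix product for an approximate nearest neighbor index such as FAISS:

```python
import numpy as np

rng = np.random.default_rng(0)

# f(d) for 1,000 documents, precomputed offline and L2-normalized.
doc_vecs = rng.normal(size=(1000, 64))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

# A query whose embedding happens to lie near document 42.
query_vec = doc_vecs[42] + 0.05 * rng.normal(size=64)
query_vec /= np.linalg.norm(query_vec)

scores = doc_vecs @ query_vec            # s(q, d) = f(q)^T f(d), all docs at once
top5 = np.argsort(scores)[::-1][:5]      # candidate set for a reranker
best = int(top5[0])
```

The document matrix never changes at query time; only one encoder forward pass and one (approximate) dot-product search are needed per query, which is what makes the bi-encoder scalable.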

5.2 Cross-Encoders

A cross-encoder concatenates query and document and processes the pair jointly:

\[s = f_\theta\bigl([q \,;\, \text{[SEP]} \,;\, d]\bigr)\]

Full attention operates over both inputs simultaneously, allowing every token in $q$ to interact with every token in $d$. The output is a scalar relevance score.

| Property | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Encoding | $f(q)$, $f(d)$ separately | $f([q; d])$ jointly |
| Precomputation | yes — index documents offline | no — must recompute per pair |
| Latency at query time | fast (ANN lookup) | slow (full forward pass per pair) |
| Expressivity | limited | high |
| Typical role | first-stage retrieval | second-stage reranking |

In production retrieval systems, both are used in sequence: a bi-encoder retrieves a candidate set, a cross-encoder reranks it.

Cross-encoders do not produce embeddings — they produce scores. The encoding is not a point in space; it is a function of a pair.


6. Bias–Variance Across Encoder Families

The bias–variance decomposition applies not only to model parameters but to the encoding function itself. Each encoding choice makes structural assumptions about the data — those assumptions are a source of bias. Each encoding introduces estimation uncertainty — that uncertainty is a source of variance. The tradeoff manifests differently across families.

Categorical encoders operate on a discrete set and must impose geometry where none exists. The bias–variance exposure is determined by how aggressively they do so.

Label encoding is a zero-variance transformation — it is deterministic — but it introduces maximal structural bias for nominal data by imposing a total order. The model is forced to reason about ordinality that does not exist.

One-hot encoding carries near-zero structural bias: no ordering is imposed, all categories are equidistant. The cost is variance: the parameter count scales with $k$, and for rare categories with $n_c \ll n$, the corresponding parameters are estimated from few observations.

Target encoding compresses the feature to a single dimension at the cost of conflating identity with outcome statistics. The naive estimator has high variance when $n_c$ is small; regularization via shrinkage reduces this at the cost of pulling category estimates toward the global mean, introducing bias.

Frequency encoding is a low-variance, high-bias estimator when categorical identity carries signal — it substitutes prevalence for meaning.
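The rare-category variance argument can be made concrete with a small simulation. The numbers are toy choices, and the true category mean is deliberately set equal to the global mean, so shrinkage here is pure variance reduction with no bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n_c, alpha, mu_global, trials = 5, 10.0, 0.5, 20_000

# Binary outcomes for a rare category observed n_c times, repeated many times.
y = rng.binomial(1, mu_global, size=(trials, n_c))

naive = y.mean(axis=1)                                      # raw category mean
shrunk = (n_c * naive + alpha * mu_global) / (n_c + alpha)  # shrinkage estimate

# std(naive) ~ sqrt(0.25 / 5) ~ 0.224; shrinkage scales it by n_c / (n_c + alpha).
naive_std = float(naive.std())
shrunk_std = float(shrunk.std())
```

When the true category mean differs from the global mean, the same scaling that suppresses variance pulls the estimate away from the truth; that is the bias side of the tradeoff.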

Latent space encoders do not expose bias–variance in terms of a downstream predictive task directly. Their tradeoff is internal to the representation objective.

An autoencoder minimizes reconstruction error $\mathbb{E}[|x - \hat{x}|^2]$, not predictive loss. The two are only aligned when reconstructive dimensions coincide with discriminative dimensions — which is not guaranteed. This constitutes a form of inductive bias mismatch: the model may retain high-variance but low-signal dimensions because they contribute to reconstruction, while discarding low-variance but high-signal dimensions because they are easily reconstructed from context. A VAE adds explicit regularization via the KL term, which introduces additional bias by pulling the posterior toward the prior, but reduces geometric variance in $\mathcal{Z}$ — the latent space becomes more predictable and smooth across the data manifold.

Positional encoders are, in principle, fixed functions rather than estimated parameters. Sinusoidal PE has zero variance by construction — it is a deterministic mapping with no trainable components. Its bias is determined entirely by the design choice: the functional form assumes that position structure is well-captured by a fixed multi-frequency basis. Learned absolute PE introduces variance proportional to the number of trainable position vectors and cannot generalize beyond the maximum sequence length seen during training. RoPE and ALiBi encode position as a relative offset directly in attention scores, which reduces both the bias from absolute position assumptions and the variance from length generalization failures.

Semantic encoders face a fundamental expressivity–scalability tension that maps directly onto the bias–variance axis.

A bi-encoder imposes a strong independence constraint: the query and document representations are computed separately, so no token-level interaction between them is modeled at encoding time. This is a structural bias — the model cannot represent relationships between query terms and document terms that only emerge in context. In exchange, document representations are precomputable, which eliminates variance from repeated inference. A cross-encoder removes the independence constraint entirely, allowing full attention across the concatenated pair. This eliminates the structural bias but makes per-pair inference mandatory, and the model has higher sensitivity to query–document distributional shift.

| Encoder | Primary source of bias | Primary source of variance |
|---|---|---|
| Label | Artificial ordinal structure | None (deterministic) |
| One-Hot | None | Parameter count scales with $k$ |
| Target | Shrinkage toward global mean | Rare-category estimation |
| Frequency | Conflates prevalence with signal | None (stable estimator) |
| Autoencoder | Reconstruction ≠ prediction | Latent space not regularized |
| VAE | KL pull toward prior | Controlled by $\beta$ weighting |
| Sinusoidal PE | Fixed frequency basis | None (deterministic) |
| Learned PE | None | Fails beyond training length |
| Bi-encoder | Independence assumption | None (offline precomputation) |
| Cross-encoder | None | Sensitive to input distribution |

Encoding selection is a modeling decision with statistical consequences — not a preprocessing step that precedes modeling.


7. Model–Encoding Coupling

The appropriate encoder family depends on the model and task:

| Model class | Encoding family | Reason |
|---|---|---|
| Linear model | Categorical (one-hot, target) | Magnitude is interpreted directly |
| Tree model | Categorical (label, target) | Thresholds partition the axis |
| MLP | Categorical + learned embeddings | Dense input preferred |
| Transformer | Positional + semantic | Attention requires position signal |
| Generative model | Latent space | Bottleneck defines generation |
| Retrieval system | Semantic (bi + cross) | Scalability vs accuracy tradeoff |

Encoding and model architecture form a coupled system. The embedding defines the geometry; the model defines transformations over that geometry.

Once geometry is fixed, learning is constrained within it.


Conclusion

Most engineers think the model is where intelligence lives.

It isn’t.

It lives in the representation.

Once a categorical variable has been embedded into $\mathbb{R}^k$, the model can only reason within that geometry. It cannot undo an imposed order. It cannot rediscover identity that was collapsed. It cannot separate categories that were merged by frequency. It cannot remove leakage that was baked into expectation.

The hypothesis space is shaped long before training begins.

And here is the uncomfortable part:

If you cannot precisely articulate the geometry your encoding imposes — the metric it defines, the assumptions it encodes, the bias it injects, the variance it amplifies — then you are not controlling your model.

You are guessing at its world.

Encoding is where domain semantics become statistical structure. If you do not understand that structure deeply, you are building systems whose reasoning you cannot fully explain — systems that make decisions about credit, hiring, medical triage, fraud detection, recommendation — based on geometries you never examined.

And if that does not make you uneasy, it should.

Because the model is not misunderstanding the data.

It is faithfully executing the geometry you gave it.

This post is licensed under CC BY 4.0 by the author.