Encoding in Machine Learning: Designing Categorical Geometry
A geometric and statistical exploration of encoder families — from categorical mappings to latent spaces, positional signals, and semantic retrieval encoders
Introduction
Before a model learns, before a loss is minimized, before a gradient flows, a quieter decision is made.
How do we represent the world?
A categorical variable looks innocent. A column of regions. Product IDs. User segments. Device types. We call them “features” as if they were naturally numeric, as if the model were merely waiting for them to be formatted correctly.
But a category is not a number. It is a partition. It is a declaration that some observations are equivalent under a certain abstraction.
The moment we transform it, we decide something far more consequential than data formatting. We decide whether two categories are neighbors or strangers. Whether they lie on a line or float in orthogonal isolation. Whether identity matters more than frequency. Whether behavior matters more than structure. Whether similarity is imposed or allowed to emerge.
And once we decide, the model never questions that choice.
What is distance between “France” and “Germany”? Should “Premium” be twice “Standard”? Is rarity itself meaningful, or only correlation with outcome? When we collapse identity into expectation, are we modeling behavior — or leaking it? When we embed categories in dense vectors, are we discovering structure — or inventing it?
Only at the end do we name the act: to encode — from Latin in- (“into”) and codex (“a book of rules”) — is to inscribe something into a system. In machine learning, it is the transformation of information into numerical form so a model can process it.
Encoding is the first act of modeling. It is where epistemology becomes geometry.
Everything that follows — bias, variance, generalization, fairness, stability — is downstream of that act.
1. A Taxonomy of Encoder Families
Encoding is not one technique. It is a family of techniques, each answering a different question about what structure to impose and from where.
| Family | What it encodes | Geometry imposed | Typical domain |
|---|---|---|---|
| Categorical | discrete identity | scalar or orthogonal basis | tabular ML |
| Latent Space | compressed structure | learned manifold | generative models |
| Positional | sequence order | frequency-based or learned offsets | Transformers, LLMs |
| Semantic | meaning of objects or pairs | dense vector or scalar score | NLP, retrieval |
Each family operates on a different input type, serves a different modeling objective, and imposes a fundamentally different geometry.
The choice of encoder family is not a preprocessing detail — it defines what the model is allowed to know about the world.
Encoding determines whether categories become collinear scalars, orthogonal axes, empirical expectations, or learned manifolds.
2. Categorical Encoders
Categorical encoders address the most classical problem: a finite set of labels must become numbers.
Formally, they define a mapping from a discrete set $\mathcal{C}$ into some Euclidean space:
\[f : \mathcal{C} \rightarrow \mathbb{R}^k\]The choice of $f$ determines adjacency, distance, and orientation. Once applied, the model interacts only with the induced geometry — never again with the original set.
| Encoder | Geometric form | Key assumption |
|---|---|---|
| Label | 1D ordered axis | Total ordering exists |
| One-Hot | Orthogonal basis | Categories fully independent |
| Ordinal | Ordered scalar | Uniform rank spacing |
| Target | Conditional mean | Outcome tendency defines identity |
| Frequency | Density scalar | Prevalence is predictive |
2.1 Label Encoding
Label encoding maps each category to an integer:
\[\{c_1, \dots, c_k\} \mapsto \{0, 1, \dots, k-1\}\]| The induced metric is absolute difference $d(c_i, c_j) = | f(c_i) - f(c_j) | $, implying uniform spacing and total order. |
Two arbitrary label assignments for the same five regions produce different tree splits at threshold $f(c) \leq 1$:
| Region | Encoding A | Encoding B |
|---|---|---|
| North | 0 | 3 |
| South | 1 | 0 |
| East | 2 | 1 |
| West | 3 | 4 |
| Central | 4 | 2 |
- Split A (
f(c) ≤ 1): {North, South} vs {East, West, Central} - Split B (
f(c) ≤ 1): {South, East} vs {North, West, Central}
Same model, same threshold — entirely different groupings, determined solely by encoding order.
Label encoding introduces artificial ordinal structure unless rank is intrinsic to the domain.
2.2 One-Hot Encoding
One-hot encoding maps categories to canonical basis vectors $c_i \mapsto e_i \in \mathbb{R}^k$, where $e_i$ has a 1 in position $i$ and 0 elsewhere.
The Euclidean distance between any two categories is constant: $|e_i - e_j|_2 = \sqrt{2}$ for $i \neq j$. No category is closer to another — the embedding assumes complete independence.
Low cardinality (3 categories — manageable):
| Observation | is_France | is_Germany | is_UK |
|---|---|---|---|
| obs_1 | 1 | 0 | 0 |
| obs_2 | 0 | 1 | 0 |
| obs_3 | 0 | 0 | 1 |
High cardinality (6 categories — sparse, mostly zeros):
| Observation | is_FR | is_DE | is_UK | is_ES | is_IT | is_PL |
|---|---|---|---|---|---|---|
| obs_1 | 1 | 0 | 0 | 0 | 0 | 0 |
| obs_2 | 0 | 0 | 0 | 0 | 1 | 0 |
| obs_3 | 0 | 0 | 0 | 1 | 0 | 0 |
With $k = 500$ product IDs, each row is 499 zeros and 1 one. Memory scales with $n \times k$. For identifiability with an intercept, one dimension must be dropped to avoid perfect collinearity.
High-cardinality one-hot encoding inflates parameter dimensionality and increases estimator variance for rare categories.
2.3 Target Encoding
Target encoding replaces identity with empirical expectation:
\[c \mapsto \mathbb{E}[Y \mid C = c]\]The category becomes a sufficient statistic for outcome tendency. To control variance under small sample sizes, shrinkage is applied:
\[\hat{\mu}_c = \frac{n_c \mu_c + \alpha \mu}{n_c + \alpha}\]where $\mu$ is the global mean and $\alpha$ controls regularization. This is equivalent to empirical Bayes shrinkage under a conjugate prior.
The critical risk is leakage: if $\hat{\mu}_c$ is computed using full training data, each observation’s $y$ contributes to its own encoding.
Naïve (leaky): each observation’s $y$ is in its own $\hat{\mu}_c$.
| obs | category | y | $\hat{\mu}_c$ (full data) | self-contribution |
|---|---|---|---|---|
| 1 | A | 1.00 | 0.75 | yes |
| 2 | A | 0.50 | 0.75 | yes |
| 3 | B | 0.20 | 0.20 | yes |
Out-of-fold (correct): obs 1’s encoding is computed on folds that exclude obs 1.
| obs | category | y | $\hat{\mu}_c$ (excl. self) | self-contribution |
|---|---|---|---|---|
| 1 | A | 1.00 | 0.625 | no |
| 2 | A | 0.50 | 0.583 | no |
| 3 | B | 0.20 | — | no |
Target encoding must be computed out-of-fold to prevent target leakage.
2.4 Frequency Encoding
Frequency encoding maps categories to empirical prevalence:
\[c \mapsto \frac{n_c}{N}\]The geometry reflects statistical mass, not semantic similarity. Categories with equal frequency collapse to identical representations. Unlike target encoding, it introduces no leakage — it is independent of $Y$.
3. Latent Space Encoders
Latent space encoders do not map a predefined set to a predefined space. They learn the encoding itself from data, optimizing for a task — reconstruction, generation, or classification.
The common structure is a bottleneck: high-dimensional input is compressed into a lower-dimensional latent representation $z$, which the model must use to accomplish its objective.
| Encoder type | Latent $z$ | Training signal | Key property |
|---|---|---|---|
| Autoencoder | deterministic point | reconstruction loss | unsupervised compression |
| VAE | distribution $(\mu, \sigma)$ | reconstruction + KL | structured, traversable latent space |
| VQ-VAE | discrete codebook index | reconstruction + commitment loss | discrete latent space |
3.1 Autoencoders
An autoencoder frames representation learning as a reconstruction problem. It is composed of two parametric functions trained jointly:
- Encoder $f_\theta : \mathcal{X} \rightarrow \mathcal{Z}$ — typically a stack of dense or convolutional layers that progressively reduces dimensionality down to the latent space $\mathcal{Z}$
- Decoder $g_\phi : \mathcal{Z} \rightarrow \mathcal{X}$ — a mirrored architecture that maps back up from $\mathcal{Z}$ to the original input space
The encoder maps $x$ to a latent code $z$, the decoder reconstructs $\hat{x}$ from $z$, and the whole system is trained end-to-end to minimize reconstruction error:
\[z = f_\theta(x), \qquad \hat{x} = g_\phi(z), \qquad \mathcal{L} = \|x - g_\phi(f_\theta(x))\|^2\]The information bottleneck — $\dim(z) \ll \dim(x)$ — is the key constraint. Since the decoder must recover $x$ from $z$ alone, the encoder must retain the most statistically informative structure and discard redundancy. This is conceptually related to truncated PCA, but the encoder is nonlinear, so the compression can exploit higher-order structure that linear projections miss.
| Layer | Dimension | Role |
|---|---|---|
| Input $x$ | 784 (e.g. 28×28 image) | Raw observation |
| Encoder hidden layers | decreasing | Progressive dimensionality reduction |
| Bottleneck $z$ | 32 | Compressed representation |
| Decoder hidden layers | increasing | Progressive reconstruction |
| Output $\hat{x}$ | 784 | Reconstruction |
No prior is placed on $z$. The latent space is organized however minimizes the loss — which makes autoencoders effective for anomaly detection (out-of-distribution inputs reconstruct poorly) but poorly suited for generation (arbitrary samples from $\mathcal{Z}$ decode to noise, since there is no guarantee the space between encoded points is meaningful).
Autoencoder encodings minimize reconstruction error, not predictive accuracy. A good reconstruction embedding is not necessarily a good predictive embedding.
3.2 Variational Autoencoders
A plain autoencoder has no prior over $\mathcal{Z}$. Two observations that are semantically similar may land in distant, unrelated regions of latent space, because nothing in the objective penalizes that. Sampling an arbitrary point from $\mathcal{Z}$ and decoding it produces incoherent outputs, because the decoder has only been trained on points that are direct outputs of the encoder — the complement of that set is effectively out-of-distribution.
A Variational Autoencoder (VAE) addresses this by reformulating the problem as variational inference. Rather than learning a deterministic encoding, the encoder learns a posterior distribution over latent codes. The architecture remains encoder–decoder, but the encoder now outputs the parameters of a Gaussian:
- Encoder $f_\theta : \mathcal{X} \rightarrow (\mu, \sigma)$ — outputs a mean vector and a standard deviation vector, both in $\mathbb{R}^d$
- Sampling — $z$ is drawn from $q_\theta(z \mid x) = \mathcal{N}(\mu_\theta(x),\, \sigma_\theta^2(x))$
- Decoder $g_\phi : \mathcal{Z} \rightarrow \mathcal{X}$ — same role as in a plain autoencoder
Encoding to a distribution rather than a point means every input occupies a region in $\mathcal{Z}$. When the regularization term enforces overlap between those regions, the complement of the training encodings is no longer out-of-distribution — the latent space becomes dense enough that arbitrary samples decode coherently.
The reparameterization trick makes this trainable. Sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ is not differentiable — gradients cannot flow back through a stochastic node to reach $\mu$ and $\sigma$. The trick externalizes the randomness: sample auxiliary noise $\varepsilon \sim \mathcal{N}(0, I)$ independently of the parameters, then construct $z$ deterministically as:
\[z = \mu_\theta(x) + \sigma_\theta(x) \cdot \varepsilon\]From the optimizer’s perspective, $\varepsilon$ is a fixed constant. $\mu$ and $\sigma$ are differentiable functions of the input, so gradients propagate through the entire encoder normally.
The training objective is the Evidence Lower BOund (ELBO), written as two terms:
\[\mathcal{L} = \underbrace{\|x - \hat{x}\|^2}_{\text{reconstruction}} \;+\; \underbrace{D_{\text{KL}}\!\left(q_\theta(z \mid x) \;\|\; \mathcal{N}(0,I)\right)}_{\text{latent regularization}}\]The reconstruction term keeps the encoding informative — same pressure as a plain autoencoder. The KL term penalizes how much the encoder’s posterior $q_\theta(z \mid x)$ deviates from the prior $\mathcal{N}(0, I)$. It is zero when the posterior matches the prior exactly, and grows as they diverge. Its effect is to anchor every encoding region near the origin with bounded spread, so the latent space cannot collapse into isolated islands or grow unbounded.
The two terms are in fundamental tension: reconstruction pushes the encoder toward sharp, concentrated posteriors (more information preserved); the KL term pushes toward diffuse posteriors that all resemble the prior (more regularity). The model settles at a balance determined by the relative weighting — in practice, a scalar $\beta > 1$ on the KL term ($\beta$-VAE) can increase disentanglement at the cost of reconstruction fidelity.
| Property | Autoencoder | VAE |
|---|---|---|
| Latent $z$ | deterministic point | sampled from $(\mu, \sigma)$ |
| Latent structure | unregularized | regularized toward $\mathcal{N}(0,I)$ |
| Gradient through $z$ | direct | via reparameterization |
| Sampling new points | not meaningful | meaningful interpolation |
| Objective | reconstruction only | reconstruction + KL divergence |
Because the latent space is regularized, points sampled between two known encodings decode into plausible observations. The geometry is smooth and traversable.
The VAE encodes not a point but a region of uncertainty. This enables controlled generation and interpolation — plain autoencoders cannot do this reliably.
4. Positional Encoders
Attention mechanisms are permutation-invariant. A Transformer has no built-in notion of sequence order — it treats a sentence and a shuffled version of that sentence identically.
Positional encoders solve this by injecting a position-dependent signal into each token representation:
\[x'_t = x_t + PE(t)\]| Variant | How position enters | Learnable | Extrapolates to longer sequences |
|---|---|---|---|
| Sinusoidal (Vaswani 2017) | additive, fixed | no | yes (by design) |
| Learned absolute | additive lookup table | yes | no |
| RoPE | multiplicative rotation in attention | partially | yes |
| ALiBi | additive bias on attention scores | no | yes |
4.1 Sinusoidal Positional Encoding
Each position $t$ is mapped to a $d$-dimensional vector by applying sine and cosine at geometrically decreasing frequencies across dimension pairs:
\[PE_{(t,\, 2i)} = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \qquad PE_{(t,\, 2i+1)} = \cos\!\left(\frac{t}{10000^{2i/d}}\right)\]The design is a multi-scale decomposition of position. Low-index dimensions ($i$ small) have high frequency — they oscillate rapidly and resolve fine-grained positional differences between adjacent tokens. High-index dimensions ($i$ large) have low frequency — they change slowly and encode coarse positional structure across long spans. Together the $d$ dimensions uniquely identify any position up to the sequence length the frequencies can represent.
Using both sine and cosine at each frequency is deliberate: any phase shift $PE(t + \Delta)$ can be expressed as a linear transformation of $PE(t)$, making relative position a linear operation in the encoding space.
| Position | dim 0 ($\sin$, high freq) | dim 1 ($\cos$, high freq) | dim 2 ($\sin$, low freq) | dim 3 ($\cos$, low freq) |
|---|---|---|---|---|
| 0 | 0.000 | 1.000 | 0.000 | 1.000 |
| 1 | 0.841 | 0.540 | 0.010 | 1.000 |
| 2 | 0.909 | −0.416 | 0.020 | 1.000 |
| 3 | 0.141 | −0.990 | 0.030 | 1.000 |
| 4 | −0.757 | −0.654 | 0.040 | 0.999 |
High-frequency dimensions (0–1) vary substantially across adjacent positions — they carry the fine-grained signal. Low-frequency dimensions (2–3) are nearly constant at these short distances and only differentiate positions at much longer range.
The inner product $PE(t)^\top PE(t’)$ depends only on the offset $t - t’$, not on absolute position. This gives the model a structural inductive bias toward relative position: the similarity between two positional encodings is a function of their distance, not their location in the sequence.
4.2 Relative and Rotary Variants
Modern large language models move position into attention rather than into token representations.
RoPE (Rotary Position Embedding) encodes position as a rotation applied to query and key vectors before computing attention. The rotation angle depends on the position difference, making attention scores inherently position-relative.
ALiBi (Attention with Linear Biases) adds a fixed negative slope to attention scores as a function of key-query distance — no vector modification required.
Both variants improve length generalization: a model trained on sequences of length 512 can be applied to sequences of length 2048 without degradation.
Positional encodings are not learned from labels — they are geometric injections. The model learns to use them; it does not learn what they are.
5. Semantic Encoders
Semantic encoders map natural language — or structured objects — into representations that reflect meaning, not just identity.
The central question they answer: how similar is A to B?
| Encoder type | Input | Output | Similarity computation |
|---|---|---|---|
| Bi-encoder | single text | dense vector | $f(q)^\top f(d)$ (dot product) |
| Cross-encoder | text pair $(q, d)$ | scalar score | full attention over the pair |
| Poly-encoder | query + multiple candidates | weighted sum | intermediate between the two |
5.1 Bi-Encoders
A bi-encoder encodes query and document independently:
\[s(q, d) = f_\theta(q)^\top f_\theta(d)\]Because $f_\theta(d)$ can be precomputed for all documents, retrieval scales to billions of candidates via approximate nearest neighbor search.
The constraint is strong: each representation must be independently sufficient. Cross-document reasoning is impossible — the model cannot attend to tokens in $d$ while encoding $q$.
5.2 Cross-Encoders
A cross-encoder concatenates query and document and processes the pair jointly:
\[s = f_\theta\bigl([q \,;\, \text{[SEP]} \,;\, d]\bigr)\]Full attention operates over both inputs simultaneously, allowing every token in $q$ to interact with every token in $d$. The output is a scalar relevance score.
| Property | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Encoding | $f(q)$, $f(d)$ separately | $f([q; d])$ jointly |
| Precomputation | yes — index documents offline | no — must recompute per pair |
| Latency at query time | fast (ANN lookup) | slow (full forward pass per pair) |
| Expressivity | limited | high |
| Typical role | first-stage retrieval | second-stage reranking |
In production retrieval systems, both are used in sequence: a bi-encoder retrieves a candidate set, a cross-encoder reranks it.
Cross-encoders do not produce embeddings — they produce scores. The encoding is not a point in space; it is a function of a pair.
6. Bias–Variance Across Encoder Families
The bias–variance decomposition applies not only to model parameters but to the encoding function itself. Each encoding choice makes structural assumptions about the data — those assumptions are a source of bias. Each encoding introduces estimation uncertainty — that uncertainty is a source of variance. The tradeoff manifests differently across families.
Categorical encoders operate on a discrete set and must impose geometry where none exists. The bias–variance exposure is determined by how aggressively they do so.
Label encoding is a zero-variance transformation — it is deterministic — but it introduces maximal structural bias for nominal data by imposing a total order. The model is forced to reason about ordinality that does not exist. One-hot encoding carries near-zero structural bias: no ordering is imposed, all categories are equidistant. The cost is variance: the parameter count scales with $k$, and for rare categories with $n_c \ll n$, the corresponding parameters are estimated from few observations. Target encoding compresses the feature to a single dimension at the cost of conflating identity with outcome statistics. The naive estimator has high variance when $n_c$ is small; regularization via shrinkage reduces this at the cost of pulling category estimates toward the global mean, introducing bias. Frequency encoding is a low-variance, high-bias estimator when categorical identity carries signal — it substitutes prevalence for meaning.
Latent space encoders do not expose bias–variance in terms of a downstream predictive task directly. Their tradeoff is internal to the representation objective.
An autoencoder minimizes reconstruction error $\mathbb{E}[|x - \hat{x}|^2]$, not predictive loss. The two are only aligned when reconstructive dimensions coincide with discriminative dimensions — which is not guaranteed. This constitutes a form of inductive bias mismatch: the model may retain high-variance but low-signal dimensions because they contribute to reconstruction, while discarding low-variance but high-signal dimensions because they are easily reconstructed from context. A VAE adds explicit regularization via the KL term, which introduces additional bias by pulling the posterior toward the prior, but reduces geometric variance in $\mathcal{Z}$ — the latent space becomes more predictable and smooth across the data manifold.
Positional encoders are, in principle, fixed functions rather than estimated parameters. Sinusoidal PE has zero variance by construction — it is a deterministic mapping with no trainable components. Its bias is determined entirely by the design choice: the functional form assumes that position structure is well-captured by a fixed multi-frequency basis. Learned absolute PE introduces variance proportional to the number of trainable position vectors and cannot generalize beyond the maximum sequence length seen during training. RoPE and ALiBi encode position as a relative offset directly in attention scores, which reduces both the bias from absolute position assumptions and the variance from length generalization failures.
Semantic encoders face a fundamental expressivity–scalability tension that maps directly onto the bias–variance axis.
A bi-encoder imposes a strong independence constraint: the query and document representations are computed separately, so no token-level interaction between them is modeled at encoding time. This is a structural bias — the model cannot represent relationships between query terms and document terms that only emerge in context. In exchange, document representations are precomputable, which eliminates variance from repeated inference. A cross-encoder removes the independence constraint entirely, allowing full attention across the concatenated pair. This eliminates the structural bias but makes per-pair inference mandatory, and the model has higher sensitivity to query–document distributional shift.
| Encoder | Primary source of bias | Primary source of variance |
|---|---|---|
| Label | Artificial ordinal structure | None (deterministic) |
| One-Hot | None | Parameter count scales with $k$ |
| Target | Shrinkage toward global mean | Rare-category estimation |
| Frequency | Conflates prevalence with signal | None (stable estimator) |
| Autoencoder | Reconstruction ≠ prediction | Latent space not regularized |
| VAE | KL pull toward prior | Controlled by $\beta$ weighting |
| Sinusoidal PE | Fixed frequency basis | None (deterministic) |
| Learned PE | None | Fails beyond training length |
| Bi-encoder | Independence assumption | None (offline precomputation) |
| Cross-encoder | None | Sensitive to input distribution |
Encoding selection is a modeling decision with statistical consequences — not a preprocessing step that precedes modeling.
7. Model–Encoding Coupling
The appropriate encoder family depends on the model and task:
| Model class | Encoding family | Reason |
|---|---|---|
| Linear model | Categorical (one-hot, target) | Magnitude is interpreted directly |
| Tree model | Categorical (label, target) | Thresholds partition the axis |
| MLP | Categorical + learned embeddings | Dense input preferred |
| Transformer | Positional + semantic | Attention requires position signal |
| Generative model | Latent space | Bottleneck defines generation |
| Retrieval system | Semantic (bi + cross) | Scalability vs accuracy tradeoff |
Encoding and model architecture form a coupled system. The embedding defines the geometry; the model defines transformations over that geometry.
Once geometry is fixed, learning is constrained within it.
Conclusion
Most engineers think the model is where intelligence lives.
It isn’t.
It lives in the representation.
Once a categorical variable has been embedded into $\mathbb{R}^k$, the model can only reason within that geometry. It cannot undo an imposed order. It cannot rediscover identity that was collapsed. It cannot separate categories that were merged by frequency. It cannot remove leakage that was baked into expectation.
The hypothesis space is shaped long before training begins.
And here is the uncomfortable part:
If you cannot precisely articulate the geometry your encoding imposes — the metric it defines, the assumptions it encodes, the bias it injects, the variance it amplifies — then you are not controlling your model.
You are guessing at its world.
Encoding is where domain semantics become statistical structure. If you do not understand that structure deeply, you are building systems whose reasoning you cannot fully explain — systems that make decisions about credit, hiring, medical triage, fraud detection, recommendation — based on geometries you never examined.
And if that does not make you uneasy, it should.
Because the model is not misunderstanding the data.
It is faithfully executing the geometry you gave it.
