Built from first principles — attention implemented from scratch, LoRA fine-tuning on Mistral-7B, self-hosted inference. Every layer of the stack, explained.
Investment banks process thousands of financial contracts every quarter. Hidden inside the legal language are clauses with serious risk — and no consistent standard for catching them.
Athenium replaces that process. A contract clause is submitted to the API and returned with a structured risk classification — LOW, MEDIUM, HIGH, or CRITICAL — in under 200 milliseconds. The response includes a per-class probability distribution, a calibrated confidence score, and an attribution map identifying the specific tokens that drove the decision.
The system is not a wrapper around an external API. It is a self-hosted, fine-tuned transformer trained on a proprietary dataset of 18,000 labelled contract clauses. Contract data never leaves the institution's infrastructure.
This page documents how it was built — from the mathematics of a single attention operation, through fine-tuning, normalisation layers, GPU memory management, and into production serving. Every layer, explained.
Before any computation can happen, the raw contract text must be converted into a form the model can work with. This is a two-step process, and the second step is less obvious than the first.
Tokenisation breaks the raw string into subword units using a SentencePiece vocabulary of 32,000 entries. The clause "The Borrower shall not declare any Event of Default" becomes a sequence of integer IDs. Subword tokenisation means the model never encounters a truly unknown word — even rare legal terms are decomposed into recognised units.
Embedding lookup converts each token ID into a dense 4,096-dimensional vector by indexing into a learned table of shape (32,000 × 4,096). At this point the vectors are context-free — default has the same embedding whether it appears in "event of default" or "the default setting." The transformer blocks fix this.
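The two-step pipeline above can be sketched with toy sizes. The vocabulary, dimensions, and token IDs below are illustrative stand-ins, not Mistral's actual values:

```python
import numpy as np

# Toy version of the two-step input pipeline. Real sizes (32,000 x 4,096)
# are shrunk so the shapes are easy to inspect; the token IDs are
# illustrative, not Mistral's actual SentencePiece IDs.
rng = np.random.default_rng(0)
vocab_size, d_model = 100, 8
embedding_table = rng.normal(size=(vocab_size, d_model))  # learned in practice

token_ids = np.array([5, 17, 42, 17])  # note: token 17 appears twice
x = embedding_table[token_ids]         # plain row lookup, no computation yet

print(x.shape)                  # (4, 8): one context-free vector per token
print(np.allclose(x[1], x[3]))  # True: identical IDs, identical vectors
```

The final print makes the "context-free" point concrete: the same ID always produces the same vector, and only the transformer blocks that follow differentiate the two occurrences.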
The word order problem. This is the part that surprises engineers encountering transformers for the first time. The attention computation treats the input as an unordered set. Rearrange the tokens and you get the same output, rearranged identically. "The dog bit the man" and "the man bit the dog" look identical to the bare mathematics. This is resolved by positional encoding — a unique signal added to each token's embedding before the first transformer layer.
The original sinusoidal approach (Vaswani et al., 2017) adds a fixed, parameter-free signal. Each position gets a unique fingerprint across 4,096 dimensions, constructed from sin/cos waves of varying frequency. The encoding works for any sequence length, including lengths not seen during training.
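A minimal implementation of the sinusoidal scheme, at toy dimensions. The `10000` base is the constant from the original paper:

```python
import numpy as np

def sinusoidal_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Fixed, parameter-free position fingerprints (Vaswani et al., 2017)."""
    pos = np.arange(num_positions)[:, None]        # (P, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d/2) dimension pairs
    angles = pos / (10000 ** (i / d_model))        # one frequency per pair
    enc = np.zeros((num_positions, d_model))
    enc[:, 0::2] = np.sin(angles)                  # even dims: sine
    enc[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return enc

pe = sinusoidal_encoding(512, 64)
print(pe.shape)     # (512, 64): works for any length, no training required
print(pe[0, :4])    # position 0: [0. 1. 0. 1.], i.e. sin(0), cos(0) pairs
```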
Athenium's backbone — Mistral-7B — uses Rotary Position Embedding (RoPE). Instead of adding position to the embeddings, RoPE rotates the Query and Key vectors inside each attention head. The critical property: the dot product between two rotated vectors depends only on their relative distance, not their absolute positions. The model develops a genuine sense of how far apart two tokens are in the sequence, which generalises better to long documents.
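The relative-distance property can be verified directly with a single-vector sketch: a query at position 3 scored against a key at position 7 gives the same result as positions 103 and 107, because only the gap of 4 matters.

```python
import numpy as np

def rope_rotate(vec: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive (even, odd) dimension pairs by position-scaled angles."""
    d = vec.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # per-pair frequency
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    out = np.empty_like(vec)
    out[0::2] = vec[0::2] * cos - vec[1::2] * sin
    out[1::2] = vec[0::2] * sin + vec[1::2] * cos
    return out

rng = np.random.default_rng(1)
q, k = rng.normal(size=8), rng.normal(size=8)

# Score depends only on the gap (4 positions), not on absolute position:
near = rope_rotate(q, 3) @ rope_rotate(k, 7)
far = rope_rotate(q, 103) @ rope_rotate(k, 107)
print(np.isclose(near, far))  # True
```

This works because two rotations compose: R(m)q · R(n)k equals q · R(n−m)k, so only n−m survives in the dot product.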
The central question attention answers for each token in the sequence: given everything else in this document, what should I be attending to — at this moment, in this position — to understand my own meaning?
Each token simultaneously plays three roles, defined by three learned weight matrices applied to its embedding vector:

- Query (Q): what this token is looking for elsewhere in the sequence.
- Key (K): what this token offers, the signal by which other tokens find it.
- Value (V): the information this token contributes once it is attended to.
The dot product Q·Kᵀ scores the alignment between every query and every key — a compatibility matrix. Position (i, j) is high when token i is looking for exactly what token j is offering. These scores are divided by √d_k before softmax.
Why √d_k? Without scaling, dot product variance grows with dimension size. At d_k = 128 (Athenium's per-head size), raw scores become very large. The softmax saturates: one position approaches weight 1.0, all others approach 0.0. Gradients through a saturated softmax are near-zero — training stops. Dividing by √128 keeps the variance around 1.0 regardless of dimension.
Why softmax? It converts the raw scores into a probability distribution. Each row sums to exactly 1.0 — this is token i's attention budget, distributed across every position in the sequence. The final output is a weighted sum of all value vectors, weighted by this budget.
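The whole scoring pipeline, dot products, √d_k scaling, softmax, weighted sum, fits in a few lines. The tensors here are random toys; a real head would derive Q, K, V from the same embeddings via the learned projections described above:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # compatibility matrix, scaled by sqrt(d_k)
    weights = softmax(scores)         # each row is token i's attention budget
    return weights @ V, weights       # weighted sum of all value vectors

rng = np.random.default_rng(0)
seq_len, d_k = 5, 128                 # d_k = 128 matches the per-head size
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out, w = attention(Q, K, V)
print(out.shape)                        # (5, 128)
print(np.allclose(w.sum(axis=1), 1.0))  # True: every budget sums to 1.0
```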
The word default in "event of default" attends strongly to event and declared. The same word in "default setting" attends to parameter and value. Same token, completely different context — that is what attention makes computable.
Attention routes information between tokens. The feed-forward network transforms each token independently. Together, wrapped in residual connections and normalisation layers, they form one transformer block — Athenium's Mistral-7B backbone repeats this structure 32 times.
The complete data flow through one transformer block: input → LayerNorm → multi-head attention → add residual → LayerNorm → feed-forward network → add residual → output.
Residual connections are what make deep networks trainable. Each block learns only the delta — the modification to the representation — rather than the full representation from scratch. Gradients flow through the addition operator without transformation, reaching early layers cleanly. Without residuals, gradients in 32-layer networks vanish exponentially.
Pre-norm vs post-norm. The original Vaswani et al. (2017) paper placed LayerNorm after the residual addition (post-norm). Modern large-model training uses pre-norm: LayerNorm is applied before each sublayer, before the residual addition. Pre-norm maintains controlled activation magnitudes at every depth, making deep networks stable without careful learning rate warmup schedules. Mistral, LLaMA, and GPT-2 all use pre-norm.
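A minimal pre-norm block in NumPy. The sublayers here are stand-in linear maps rather than real attention and FFN weights, and LayerNorm's learned gain and bias are omitted for brevity:

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # Per-token normalisation over the feature axis (learned gain/bias omitted).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def transformer_block(x, attn, ffn):
    # Pre-norm: normalise before each sublayer, add the residual after.
    x = x + attn(layer_norm(x))  # the sublayer learns only the delta
    x = x + ffn(layer_norm(x))
    return x

rng = np.random.default_rng(0)
d = 16
W_attn = rng.normal(size=(d, d)) * 0.1  # stand-in for the attention sublayer
W_ffn = rng.normal(size=(d, d)) * 0.1   # stand-in for the feed-forward sublayer
x = rng.normal(size=(4, d))

y = x
for _ in range(32):  # the backbone repeats this structure 32 times
    y = transformer_block(y, lambda h: h @ W_attn, lambda h: h @ W_ffn)
print(y.shape)  # (4, 16): shape is preserved through every block
```

Note how the residual additions sit on the raw, un-normalised path: gradients flow straight through the `+` operators from output to input.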
The feed-forward network is two linear layers with a nonlinearity between them, applied independently to each token position. It uses SwiGLU activation (Mistral) — a gated activation unit that outperforms ReLU on language tasks. The FFN is not just a compression layer — research has shown these layers act as key-value memories, storing factual associations learned during pretraining (Geva et al., 2021). After fine-tuning, they also encode domain-specific patterns.
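A sketch of the SwiGLU feed-forward path at toy dimensions; Mistral-7B's actual projections are 4096 → 14336 → 4096:

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    return x / (1.0 + np.exp(-x))  # SiLU / swish nonlinearity

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Gated FFN: the gate path decides, per hidden unit, how much of the
    # up-projection to let through before projecting back down.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64  # toy sizes; Mistral-7B uses 4096 -> 14336 -> 4096
x = rng.normal(size=(4, d_model))
w_gate = rng.normal(size=(d_model, d_hidden)) * 0.1
w_up = rng.normal(size=(d_model, d_hidden)) * 0.1
w_down = rng.normal(size=(d_hidden, d_model)) * 0.1

print(swiglu_ffn(x, w_gate, w_up, w_down).shape)  # (4, 16): applied per token
```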
Normalisation is what makes deep networks trainable. Without it, activations drift in scale across layers — gradients explode or vanish. The choice of which normalisation strategy to use is not arbitrary: BatchNorm and LayerNorm operate on fundamentally different dimensions of the data.
Batch Normalisation (Ioffe & Szegedy, 2015) normalises across the batch dimension. For each feature dimension, it computes the mean and variance across all samples in the current mini-batch, then normalises using those statistics: x̂ = (x − μ_B) / √(σ²_B + ε), where μ_B and σ²_B are the per-feature batch mean and variance.
BatchNorm requires a running mean and variance tracked during training, used at inference time since a single sample has no meaningful batch statistics. This creates a train/eval discrepancy. At small batch sizes, the estimates become noisy. For variable-length padded sequences, padded positions contaminate the statistics for real tokens. These problems make BatchNorm unsuitable for transformer architectures.
Layer Normalisation (Ba et al., 2016) normalises across the feature dimension — for each individual token, independently of every other token and every other sample: x̂ = (x − μ) / √(σ² + ε), with μ and σ² computed over that token's own 4,096 features.
LayerNorm has no running statistics. The same computation runs at training and inference time — no discrepancy. It works at batch size = 1. Different sequence positions are normalised completely independently, which is exactly the right behaviour for a sequence model. Athenium uses LayerNorm with ε = 1e-12, placed before each sublayer in the pre-norm configuration.
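The axis difference is easiest to see in code: same tensor, two normalisation directions.

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(32, 10, 64))  # (batch, seq, features)

# BatchNorm direction: one mean/variance per feature, pooled across the whole
# mini-batch, so each sample's normalisation depends on its batchmates.
bn_mean = x.mean(axis=(0, 1))
print(bn_mean.shape)  # (64,)

# LayerNorm direction: one mean/variance per token, over its own features,
# so batch size and padding are irrelevant and train == inference.
ln = (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + 1e-12)
print(np.allclose(ln.mean(-1), 0.0, atol=1e-7))  # True: each token on its own
```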
A pretrained language model already understands legal structure, financial terminology, and logical dependencies. The challenge is adapting that general knowledge to the specific task of risk classification — efficiently, and without the infrastructure required for full fine-tuning.
Full fine-tuning of Mistral-7B in fp32 requires approximately 108 GB of GPU memory — two A100 80GB cards running in parallel. LoRA makes this trainable on a single NVIDIA A10G (24 GB).
The mathematics of LoRA. For each weight matrix W₀ of shape (d × k), LoRA injects a parallel low-rank update. Rather than modifying W₀ directly, two new small matrices are introduced: A of shape (r × k) and B of shape (d × r), with rank r ≪ min(d, k). Their product B·A has the same shape as W₀, and the effective weight becomes W₀ + (α/r)·B·A.
W₀ is frozen forever. The original pretrained weights accumulate no gradients. Only A and B are trained. At initialisation, B·A = 0, so the model starts from exactly the pretrained distribution. After training, the adapters are merged: W_merged = W₀ + (α/r)·B·A — zero overhead at inference time.
For Athenium: r = 16, target modules = q_proj and v_proj (the query and value projection matrices in each attention head). This yields 8.4 million trainable parameters — 0.116% of Mistral-7B's 7.24 billion total.
QLoRA additionally quantises the frozen base model weights to 4-bit NormalFloat (NF4) format, cutting them to roughly a quarter of their bf16 footprint (an eighth of fp32). The adapter weights remain in bf16. Gradients and Adam optimiser states are computed only for the 8.4M LoRA parameters — a tiny fraction of the total.
| Rank r | Trainable Params | Macro F1 | Train Time |
|---|---|---|---|
| r=4 | 2.1M | 0.912 | 1h 20m |
| r=8 | 4.2M | 0.947 | 2h 10m |
| r=16 (selected) | 8.4M | 0.971 | 3h 45m |
| r=32 | 16.8M | 0.974 | 7h 30m |
Every decision about model training starts with memory. Training a large language model requires far more GPU memory than storing its weights alone — optimiser states, gradients, and activations all compete for the same pool. Understanding exactly where every gigabyte goes is prerequisite to designing trainable systems.
Training a 7.24B parameter language model in full fp32 precision with the Adam optimiser requires four distinct memory allocations of equal size: the weights themselves, their gradients, Adam's first-moment estimate (m), and its second-moment estimate (v). Each allocation holds one fp32 value per parameter.
The four-times rule: full fp32 training always costs approximately 4× the raw weight memory. For a 7.24B parameter model: weights are 27.0 GB; total is ~108 GB — requiring two A100 80GB GPUs.
Why Adam states must stay in fp32. Adam's first moment (m) and second moment (v) are running averages that accumulate small updates over thousands of gradient steps. In fp16 or bf16, tiny increments underflow to zero — after several thousand steps, the moments become meaningless and training diverges. Adam optimiser states are always kept in fp32, even in mixed-precision training regimes. This is the "master weights" pattern.
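The underflow is easy to demonstrate in half precision. NumPy has no native bfloat16, so fp16 stands in; bf16 has even fewer mantissa bits, so the effect there is worse:

```python
import numpy as np

# A moment update that survives in fp32 silently vanishes in half precision.
m16 = np.float16(1.0)
print(m16 + np.float16(1e-4) == m16)  # True: the increment underflows away

m32 = np.float32(1.0)
print(m32 + np.float32(1e-4) == m32)  # False: fp32 retains the update
```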
What gradients cost. One gradient tensor per trainable parameter, same dtype as the compute precision. For full fine-tuning in fp32, this doubles the weight memory alone.
QLoRA's solution. Quantise the base model to 4-bit NF4: weights from 27.0 GB → 3.6 GB. Then train only the 8.4M LoRA adapter parameters. Their gradients and Adam states cost less than 0.5 GB combined. Total peak VRAM: ~18.4 GB — one A10G card.
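The accounting above reduces to back-of-envelope arithmetic, shown here in binary GiB. The adapter estimate assumes bf16 weights and gradients plus two fp32 Adam moments, and excludes activations and CUDA overhead, which make up the rest of the ~18.4 GB peak:

```python
# Back-of-envelope VRAM accounting for the figures in the text (binary GiB).
params = 7.24e9
GiB = 2**30

fp32_weights = params * 4 / GiB       # 4 bytes per parameter
full_ft = fp32_weights * 4            # weights + grads + Adam m + Adam v
print(round(fp32_weights, 1), round(full_ft, 1))  # 27.0 107.9

nf4_weights = params * 0.5 / GiB      # 4 bits per frozen base parameter
lora_params = 8.4e6
# bf16 weights + bf16 grads + two fp32 Adam moments = 12 bytes per parameter:
adapter = lora_params * (2 + 2 + 4 + 4) / GiB
print(round(nf4_weights, 1), round(adapter, 3))   # 3.4 0.094
```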
Stratified by risk level and instrument type. No test-set contamination. Metrics chosen to reflect what actually matters in production: class balance, confidence reliability, and cost to the institution of misclassification.
Macro F1 = 0.971. Accuracy alone is insufficient — a model predicting LOW for every contract achieves 61% accuracy and is completely useless. Macro F1 treats all four risk classes equally, regardless of their frequency in the dataset. A high macro F1 means the model is performing well across LOW, MEDIUM, HIGH, and CRITICAL clauses — not just the most common class.
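The degenerate-classifier point is easy to reproduce on a toy test set mirroring the 61% LOW skew described above. The class counts here are illustrative, not the actual test set:

```python
import numpy as np

def macro_f1(y_true, y_pred, classes):
    f1s = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))  # every class weighted equally

classes = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]
y_true = np.array(["LOW"] * 61 + ["MEDIUM"] * 20 + ["HIGH"] * 12 + ["CRITICAL"] * 7)
always_low = np.full_like(y_true, "LOW")  # the degenerate classifier

print((always_low == y_true).mean())                    # 0.61 accuracy...
print(round(macro_f1(y_true, always_low, classes), 3))  # ...but only 0.189 macro F1
```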
Expected Calibration Error (ECE) = 0.031. A model saying "92% confident" should be correct 92% of the time. ECE bins predictions by confidence and measures the average gap between confidence and actual accuracy. Athenium's ECE of 0.031 means confidence scores are reliable enough to drive automated escalation decisions — when confidence falls below 0.85, the clause is routed to human review.
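A minimal ECE implementation with equal-width bins (the 10-bin choice is a common convention, not something stated above), checked against a synthetically calibrated model:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Population-weighted mean gap between confidence and accuracy per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of predictions in bin
    return float(ece)

rng = np.random.default_rng(0)
conf = np.full(10_000, 0.9)         # model always claims 90% confidence
correct = rng.random(10_000) < 0.9  # ...and is right ~90% of the time

ece = expected_calibration_error(conf, correct)
print(round(ece, 3))  # near 0.0: well calibrated up to sampling noise
```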
88.3% auto-processed at a confidence threshold of 0.85. The remaining 11.7% are escalated with the full probability distribution and attribution map. Critically: zero CRITICAL-risk contracts were misclassified as LOW or MEDIUM in the auto-processed set.
| Metric | GPT-4o | Athenium | Δ |
|---|---|---|---|
| Macro F1 | 0.831 | 0.971 | +14 pts |
| P95 Latency | ~820ms | 118ms | 7× faster |
| Cost / 1K docs | $14.20 | $0.18 | 79× cheaper |
| Data residency | External API | Self-hosted | ✓ |
The repository is organised so that every engineering concept maps to a specific, annotated source file. The README explains each module in the context of the full system.