Built from first principles — attention implemented from scratch, LoRA fine-tuning on Mistral-7B, self-hosted inference. Every layer of the stack, explained.
Investment banks process thousands of financial contracts every quarter. Hidden inside the legal language are clauses with serious risk — and no consistent standard for catching them.
Athenium replaces that process. A contract clause is submitted to the API and returned with a structured risk classification — LOW, MEDIUM, HIGH, or CRITICAL — in under 200 milliseconds. The response includes a per-class probability distribution, a calibrated confidence score, and an attribution map identifying the specific tokens that drove the decision.
The system is not a wrapper around an external API. It is a self-hosted, fine-tuned transformer trained on a proprietary dataset of 18,000 labelled contract clauses. Contract data never leaves the institution's infrastructure.
This page documents how it was built — from the mathematics of a single attention operation, through fine-tuning, normalisation layers, GPU memory management, and into production serving. Every layer, explained.
Before any computation can happen, the raw contract text must be converted into a form the model can work with. This is a two-step process, and the second step is less obvious than the first.
Tokenisation breaks the raw string into subword units using a SentencePiece vocabulary of 32,000 entries. The clause "The Borrower shall not declare any Event of Default" becomes a sequence of integer IDs. Subword tokenisation means the model never encounters a truly unknown word — even rare legal terms are decomposed into recognised units.
Embedding lookup converts each token ID into a dense 4,096-dimensional vector by indexing into a learned table of shape (32,000 × 4,096). At this point the vectors are context-free — default has the same embedding whether it appears in "event of default" or "the default setting." The transformer blocks fix this.
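The two-step pipeline above can be sketched with toy sizes. The vocabulary, dimensions, and token IDs below are illustrative stand-ins, not Mistral's actual values:

```python
import numpy as np

# Toy version of the two-step input pipeline. Real sizes (32,000 x 4,096)
# are shrunk so the shapes are easy to inspect; the token IDs are
# illustrative, not Mistral's actual SentencePiece IDs.
rng = np.random.default_rng(0)
vocab_size, d_model = 100, 8
embedding_table = rng.normal(size=(vocab_size, d_model))  # learned in practice

token_ids = np.array([5, 17, 42, 17])  # note: token 17 appears twice
x = embedding_table[token_ids]         # plain row lookup, no computation yet

print(x.shape)                  # (4, 8): one context-free vector per token
print(np.allclose(x[1], x[3]))  # True: identical IDs, identical vectors
```

The final print makes the "context-free" point concrete: the same ID always produces the same vector, and only the transformer blocks that follow differentiate the two occurrences.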
The word order problem. This is the part that surprises engineers encountering transformers for the first time. The attention computation treats the input as an unordered set. Rearrange the tokens and you get the same output, rearranged identically. "The dog bit the man" and "the man bit the dog" look identical to the bare mathematics. This is resolved by positional encoding — a unique signal added to each token's embedding before the first transformer layer.
The original sinusoidal approach (Vaswani et al., 2017) adds a fixed, parameter-free signal. Each position gets a unique fingerprint across 4,096 dimensions, constructed from sin/cos waves of varying frequency. The encoding works for any sequence length, including lengths not seen during training.
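A minimal implementation of the sinusoidal scheme, at toy dimensions. The `10000` base is the constant from the original paper:

```python
import numpy as np

def sinusoidal_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Fixed, parameter-free position fingerprints (Vaswani et al., 2017)."""
    pos = np.arange(num_positions)[:, None]        # (P, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d/2) dimension pairs
    angles = pos / (10000 ** (i / d_model))        # one frequency per pair
    enc = np.zeros((num_positions, d_model))
    enc[:, 0::2] = np.sin(angles)                  # even dims: sine
    enc[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return enc

pe = sinusoidal_encoding(512, 64)
print(pe.shape)     # (512, 64): works for any length, no training required
print(pe[0, :4])    # position 0: [0. 1. 0. 1.], i.e. sin(0), cos(0) pairs
```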
Athenium's backbone — Mistral-7B — uses Rotary Position Embedding (RoPE). Instead of adding position to the embeddings, RoPE rotates the Query and Key vectors inside each attention head. The critical property: the dot product between two rotated vectors depends only on their relative distance, not their absolute positions. The model develops a genuine sense of how far apart two tokens are in the sequence, which generalises better to long documents.
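The relative-distance property can be verified directly with a single-vector sketch: a query at position 3 scored against a key at position 7 gives the same result as positions 103 and 107, because only the gap of 4 matters.

```python
import numpy as np

def rope_rotate(vec: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive (even, odd) dimension pairs by position-scaled angles."""
    d = vec.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # per-pair frequency
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    out = np.empty_like(vec)
    out[0::2] = vec[0::2] * cos - vec[1::2] * sin
    out[1::2] = vec[0::2] * sin + vec[1::2] * cos
    return out

rng = np.random.default_rng(1)
q, k = rng.normal(size=8), rng.normal(size=8)

# Score depends only on the gap (4 positions), not on absolute position:
near = rope_rotate(q, 3) @ rope_rotate(k, 7)
far = rope_rotate(q, 103) @ rope_rotate(k, 107)
print(np.isclose(near, far))  # True
```

This works because two rotations compose: R(m)q · R(n)k equals q · R(n−m)k, so only n−m survives in the dot product.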
The central question attention answers for each token in the sequence: given everything else in this document, what should I be attending to — at this moment, in this position — to understand my own meaning?
Each token simultaneously plays three roles, defined by three learned weight matrices applied to its embedding vector:

- Query (Q): what this token is looking for elsewhere in the sequence.
- Key (K): what this token offers, the signal by which other tokens find it.
- Value (V): the information this token contributes once it is attended to.
The dot product Q·Kᵀ scores the alignment between every query and every key — a compatibility matrix. Position (i, j) is high when token i is looking for exactly what token j is offering. These scores are divided by √d_k before softmax.
Why √d_k? Without scaling, dot product variance grows with dimension size. At d_k = 128 (Athenium's per-head size), raw scores become very large. The softmax saturates: one position approaches weight 1.0, all others approach 0.0. Gradients through a saturated softmax are near-zero — training stops. Dividing by √128 keeps the variance around 1.0 regardless of dimension.
Why softmax? It converts the raw scores into a probability distribution. Each row sums to exactly 1.0 — this is token i's attention budget, distributed across every position in the sequence. The final output is a weighted sum of all value vectors, weighted by this budget.
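The whole scoring pipeline, dot products, √d_k scaling, softmax, weighted sum, fits in a few lines. The tensors here are random toys; a real head would derive Q, K, V from the same embeddings via the learned projections described above:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # compatibility matrix, scaled by sqrt(d_k)
    weights = softmax(scores)         # each row is token i's attention budget
    return weights @ V, weights       # weighted sum of all value vectors

rng = np.random.default_rng(0)
seq_len, d_k = 5, 128                 # d_k = 128 matches the per-head size
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out, w = attention(Q, K, V)
print(out.shape)                        # (5, 128)
print(np.allclose(w.sum(axis=1), 1.0))  # True: every budget sums to 1.0
```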
The word default in "event of default" attends strongly to event and declared. The same word in "default setting" attends to parameter and value. Same token, completely different context — that is what attention makes computable.
Attention routes information between tokens. The feed-forward network transforms each token independently. Together, wrapped in residual connections and normalisation layers, they form one transformer block — Athenium's Mistral-7B backbone repeats this structure 32 times.
The complete data flow through one transformer block: input → LayerNorm → multi-head attention → add residual → LayerNorm → feed-forward network → add residual → output.
Residual connections are what make deep networks trainable. Each block learns only the delta — the modification to the representation — rather than the full representation from scratch. Gradients flow through the addition operator without transformation, reaching early layers cleanly. Without residuals, gradients in 32-layer networks vanish exponentially.
Pre-norm vs post-norm. The original Vaswani et al. (2017) paper placed LayerNorm after the residual addition (post-norm). Modern large-model training uses pre-norm: LayerNorm is applied before each sublayer, before the residual addition. Pre-norm maintains controlled activation magnitudes at every depth, making deep networks stable without careful learning rate warmup schedules. Mistral, LLaMA, and GPT-2 all use pre-norm.
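A minimal pre-norm block in NumPy. The sublayers here are stand-in linear maps rather than real attention and FFN weights, and LayerNorm's learned gain and bias are omitted for brevity:

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # Per-token normalisation over the feature axis (learned gain/bias omitted).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def transformer_block(x, attn, ffn):
    # Pre-norm: normalise before each sublayer, add the residual after.
    x = x + attn(layer_norm(x))  # the sublayer learns only the delta
    x = x + ffn(layer_norm(x))
    return x

rng = np.random.default_rng(0)
d = 16
W_attn = rng.normal(size=(d, d)) * 0.1  # stand-in for the attention sublayer
W_ffn = rng.normal(size=(d, d)) * 0.1   # stand-in for the feed-forward sublayer
x = rng.normal(size=(4, d))

y = x
for _ in range(32):  # the backbone repeats this structure 32 times
    y = transformer_block(y, lambda h: h @ W_attn, lambda h: h @ W_ffn)
print(y.shape)  # (4, 16): shape is preserved through every block
```

Note how the residual additions sit on the raw, un-normalised path: gradients flow straight through the `+` operators from output to input.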
The feed-forward network is two linear layers with a nonlinearity between them, applied independently to each token position. It uses SwiGLU activation (Mistral) — a gated activation unit that outperforms ReLU on language tasks. The FFN is not just a compression layer — research has shown these layers act as key-value memories, storing factual associations learned during pretraining (Geva et al., 2021). After fine-tuning, they also encode domain-specific patterns.
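A sketch of the SwiGLU feed-forward path at toy dimensions; Mistral-7B's actual projections are 4096 → 14336 → 4096:

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    return x / (1.0 + np.exp(-x))  # SiLU / swish nonlinearity

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Gated FFN: the gate path decides, per hidden unit, how much of the
    # up-projection to let through before projecting back down.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64  # toy sizes; Mistral-7B uses 4096 -> 14336 -> 4096
x = rng.normal(size=(4, d_model))
w_gate = rng.normal(size=(d_model, d_hidden)) * 0.1
w_up = rng.normal(size=(d_model, d_hidden)) * 0.1
w_down = rng.normal(size=(d_hidden, d_model)) * 0.1

print(swiglu_ffn(x, w_gate, w_up, w_down).shape)  # (4, 16): applied per token
```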
Normalisation is what makes deep networks trainable. Without it, activations drift in scale across layers — gradients explode or vanish. The choice of which normalisation strategy to use is not arbitrary: BatchNorm and LayerNorm operate on fundamentally different dimensions of the data.
Batch Normalisation (Ioffe & Szegedy, 2015) normalises across the batch dimension. For each feature dimension, it computes the mean and variance across all samples in the current mini-batch, then normalises using those statistics: x̂ = (x − μ_B) / √(σ²_B + ε), where μ_B and σ²_B are the per-feature batch mean and variance.
BatchNorm requires a running mean and variance tracked during training, used at inference time since a single sample has no meaningful batch statistics. This creates a train/eval discrepancy. At small batch sizes, the estimates become noisy. For variable-length padded sequences, padded positions contaminate the statistics for real tokens. These problems make BatchNorm unsuitable for transformer architectures.
Layer Normalisation (Ba et al., 2016) normalises across the feature dimension — for each individual token, independently of every other token and every other sample: x̂ = (x − μ) / √(σ² + ε), with μ and σ² computed over that token's own 4,096 features.
LayerNorm has no running statistics. The same computation runs at training and inference time — no discrepancy. It works at batch size = 1. Different sequence positions are normalised completely independently, which is exactly the right behaviour for a sequence model. Athenium uses LayerNorm with ε = 1e-12, placed before each sublayer in the pre-norm configuration.
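The axis difference is easiest to see in code: same tensor, two normalisation directions.

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(32, 10, 64))  # (batch, seq, features)

# BatchNorm direction: one mean/variance per feature, pooled across the whole
# mini-batch, so each sample's normalisation depends on its batchmates.
bn_mean = x.mean(axis=(0, 1))
print(bn_mean.shape)  # (64,)

# LayerNorm direction: one mean/variance per token, over its own features,
# so batch size and padding are irrelevant and train == inference.
ln = (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + 1e-12)
print(np.allclose(ln.mean(-1), 0.0, atol=1e-7))  # True: each token on its own
```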
A pretrained language model already understands legal structure, financial terminology, and logical dependencies. The challenge is adapting that general knowledge to the specific task of risk classification — efficiently, and without the infrastructure required for full fine-tuning.
Full fine-tuning of Mistral-7B in fp32 requires approximately 108 GB of GPU memory — two A100 80GB cards running in parallel. LoRA makes this trainable on a single NVIDIA A10G (24 GB).
The mathematics of LoRA. For each weight matrix W₀ of shape (d × k), LoRA injects a parallel low-rank update. Rather than modifying W₀ directly, two new small matrices are introduced: A of shape (r × k) and B of shape (d × r), with rank r ≪ min(d, k). Their product B·A has the same shape as W₀, and the effective weight becomes W₀ + (α/r)·B·A.
W₀ is frozen forever. The original pretrained weights accumulate no gradients. Only A and B are trained. At initialisation, B·A = 0, so the model starts from exactly the pretrained distribution. After training, the adapters are merged: W_merged = W₀ + (α/r)·B·A — zero overhead at inference time.
For Athenium: r = 16, target modules = q_proj and v_proj (the query and value projection matrices in each attention head). This yields 8.4 million trainable parameters — 0.116% of Mistral-7B's 7.24 billion total.
QLoRA additionally quantises the frozen base model weights to 4-bit NormalFloat (NF4) format, cutting them to roughly a quarter of their bf16 footprint (an eighth of fp32). The adapter weights remain in bf16. Gradients and Adam optimiser states are computed only for the 8.4M LoRA parameters — a tiny fraction of the total.
| Rank r | Trainable Params | Macro F1 | Train Time |
|---|---|---|---|
| r=4 | 2.1M | 0.912 | 1h 20m |
| r=8 | 4.2M | 0.947 | 2h 10m |
| r=16 (selected) | 8.4M | 0.971 | 3h 45m |
| r=32 | 16.8M | 0.974 | 7h 30m |
Every decision about model training starts with memory. Training a large language model requires far more GPU memory than storing its weights alone — optimiser states, gradients, and activations all compete for the same pool. Understanding exactly where every gigabyte goes is prerequisite to designing trainable systems.
Training a 7.24B parameter language model in full fp32 precision with the Adam optimiser requires four distinct memory allocations of equal size: the weights themselves, their gradients, Adam's first-moment estimate (m), and its second-moment estimate (v). Each allocation holds one fp32 value per parameter.
The four-times rule: full fp32 training always costs approximately 4× the raw weight memory. For a 7.24B parameter model: weights are 27.0 GB; total is ~108 GB — requiring two A100 80GB GPUs.
Why Adam states must stay in fp32. Adam's first moment (m) and second moment (v) are running averages that accumulate small updates over thousands of gradient steps. In fp16 or bf16, tiny increments underflow to zero — after several thousand steps, the moments become meaningless and training diverges. Adam optimiser states are always kept in fp32, even in mixed-precision training regimes. This is the "master weights" pattern.
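The underflow is easy to demonstrate in half precision. NumPy has no native bfloat16, so fp16 stands in; bf16 has even fewer mantissa bits, so the effect there is worse:

```python
import numpy as np

# A moment update that survives in fp32 silently vanishes in half precision.
m16 = np.float16(1.0)
print(m16 + np.float16(1e-4) == m16)  # True: the increment underflows away

m32 = np.float32(1.0)
print(m32 + np.float32(1e-4) == m32)  # False: fp32 retains the update
```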
What gradients cost. One gradient tensor per trainable parameter, same dtype as the compute precision. For full fine-tuning in fp32, this doubles the weight memory alone.
QLoRA's solution. Quantise the base model to 4-bit NF4: weights from 27.0 GB → 3.6 GB. Then train only the 8.4M LoRA adapter parameters. Their gradients and Adam states cost less than 0.5 GB combined. Total peak VRAM: ~18.4 GB — one A10G card.
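The accounting above reduces to back-of-envelope arithmetic, shown here in binary GiB. The adapter estimate assumes bf16 weights and gradients plus two fp32 Adam moments, and excludes activations and CUDA overhead, which make up the rest of the ~18.4 GB peak:

```python
# Back-of-envelope VRAM accounting for the figures in the text (binary GiB).
params = 7.24e9
GiB = 2**30

fp32_weights = params * 4 / GiB       # 4 bytes per parameter
full_ft = fp32_weights * 4            # weights + grads + Adam m + Adam v
print(round(fp32_weights, 1), round(full_ft, 1))  # 27.0 107.9

nf4_weights = params * 0.5 / GiB      # 4 bits per frozen base parameter
lora_params = 8.4e6
# bf16 weights + bf16 grads + two fp32 Adam moments = 12 bytes per parameter:
adapter = lora_params * (2 + 2 + 4 + 4) / GiB
print(round(nf4_weights, 1), round(adapter, 3))   # 3.4 0.094
```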
Stratified by risk level and instrument type. No test-set contamination. Metrics chosen to reflect what actually matters in production: class balance, confidence reliability, and cost to the institution of misclassification.
Macro F1 = 0.971. Accuracy alone is insufficient — a model predicting LOW for every contract achieves 61% accuracy and is completely useless. Macro F1 treats all four risk classes equally, regardless of their frequency in the dataset. A high macro F1 means the model is performing well across LOW, MEDIUM, HIGH, and CRITICAL clauses — not just the most common class.
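The degenerate-classifier point is easy to reproduce on a toy test set mirroring the 61% LOW skew described above. The class counts here are illustrative, not the actual test set:

```python
import numpy as np

def macro_f1(y_true, y_pred, classes):
    f1s = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))  # every class weighted equally

classes = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]
y_true = np.array(["LOW"] * 61 + ["MEDIUM"] * 20 + ["HIGH"] * 12 + ["CRITICAL"] * 7)
always_low = np.full_like(y_true, "LOW")  # the degenerate classifier

print((always_low == y_true).mean())                    # 0.61 accuracy...
print(round(macro_f1(y_true, always_low, classes), 3))  # ...but only 0.189 macro F1
```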
Expected Calibration Error (ECE) = 0.031. A model saying "92% confident" should be correct 92% of the time. ECE bins predictions by confidence and measures the average gap between confidence and actual accuracy. Athenium's ECE of 0.031 means confidence scores are reliable enough to drive automated escalation decisions — when confidence falls below 0.85, the clause is routed to human review.
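A minimal ECE implementation with equal-width bins (the 10-bin choice is a common convention, not something stated above), checked against a synthetically calibrated model:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Population-weighted mean gap between confidence and accuracy per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of predictions in bin
    return float(ece)

rng = np.random.default_rng(0)
conf = np.full(10_000, 0.9)         # model always claims 90% confidence
correct = rng.random(10_000) < 0.9  # ...and is right ~90% of the time

ece = expected_calibration_error(conf, correct)
print(round(ece, 3))  # near 0.0: well calibrated up to sampling noise
```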
88.3% auto-processed at a confidence threshold of 0.85. The remaining 11.7% are escalated with the full probability distribution and attribution map. Critically: zero CRITICAL-risk contracts were misclassified as LOW or MEDIUM in the auto-processed set.
| Metric | GPT-4o | Athenium | Δ |
|---|---|---|---|
| Macro F1 | 0.831 | 0.971 | +14 pts |
| P95 Latency | ~820ms | 118ms | 7× faster |
| Cost / 1K docs | $14.20 | $0.18 | 79× cheaper |
| Data residency | External API | Self-hosted | ✓ |
The repository is organised so that every engineering concept maps to a specific, annotated source file. The README explains each module in the context of the full system.