Anomaly Detection
Spotting the unusual
Automatically finding rare, unexpected examples in data — the backbone of fraud detection, server monitoring, health telemetry, and network intrusion detection.
Anomaly detection focuses on rare, off-pattern items in mostly-normal data. Unlike standard classification problems, the abnormal class is extremely rare (sometimes under 0.1%), labels are unreliable, and you cannot enumerate every kind of anomaly in advance, so specialized approaches are needed.
Four strategy families:
- Statistical: z-scores, modified z-scores, IQR rules. Univariate, simple, explainable; fails in high dimensions (see the sketch after this list).
- Unsupervised ML: Isolation Forest, One-Class SVM, LOF, DBSCAN noise points. Great for high-dimensional data, no labels required.
- Semi/supervised: if some labels exist, use class weighting, SMOTE, or focal loss with gradient boosting.
- Deep learning: autoencoder reconstruction error, VAEs, GAN-based approaches. Strong for images, audio, sequences.
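A minimal sketch of the statistical family, assuming a 1-D NumPy array of metric values (the data and cutoffs here are illustrative):

import numpy as np

def modified_z_scores(x):
    """Robust z-scores: median and MAD instead of mean and std."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / mad  # 0.6745 scales MAD to ~sigma under normality

def iqr_outliers(x, k=1.5):
    """Tukey's fence: flag points more than k * IQR outside the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

x = np.append(np.random.normal(50, 5, 1000), [95.0])    # inject one spike
print(np.where(np.abs(modified_z_scores(x)) > 3.5)[0])  # 3.5 is a common cutoff
print(np.where(iqr_outliers(x))[0])

Both rules look at one feature at a time, which is exactly why they break down when anomalies only show up as unusual combinations of features.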
Crucially: "anomaly" is context-dependent. CPU at 95% is an anomaly for a web server, normal for a Friday-night game server. Without business input there's no threshold.
A customs officer at a port. 50,000 containers pass through per day; most come from expected shippers carrying expected goods. The officer notices the odd one: a container marked "cotton" weighing twice the norm; a "furniture" shipment from a port with no prior record; a route with three unusual hops. They don't know all the rules ahead of time — they recognize deviations from expectation. Anomaly detection is the math behind that intuition.
A cloud provider collects metrics from 100,000 servers every second: CPU, memory, disk, network, error rate. Which server is about to fail?
First attempt: threshold rules (CPU > 90%, errors > 100/h). Result: 5,000 alarms a day, 95% false. Alarm fatigue — everyone ignores them.
Better: summarize each server as an 8-feature vector and score it with an Isolation Forest, which assigns a high score when a server sits far from the historical clusters. Add per-server baselines so an anomaly means deviation from this server's normal, not an absolute value. Down to 50 alarms a day, 80% real. Trust restored.
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# X: one row per server, 8 telemetry features (CPU, memory, disk, ...)
X_scaled = StandardScaler().fit_transform(X)

iso = IsolationForest(
    n_estimators=200,
    contamination=0.01,  # ~1% anomalies expected
    random_state=42,
)
iso.fit(X_scaled)

labels = iso.predict(X_scaled)        # -1 anomaly, 1 normal
scores = iso.score_samples(X_scaled)  # lower = more anomalous
worst = scores.argsort()[:100]  # indices of the 100 most anomalous servers
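In practice, contamination is the main knob: scikit-learn uses it to set the decision threshold, so roughly that fraction of points gets flagged. Start from the business's expected anomaly rate rather than the default. For richer data (images, sequences, many correlated metrics), a deep-learning alternative is an autoencoder trained on normal data only: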
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, dim=20, latent=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 16), nn.ReLU(), nn.Linear(16, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 16), nn.ReLU(), nn.Linear(16, dim))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z)

# Train on NORMAL data only; anomalies will then reconstruct poorly
model = AE().train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for x in normal_loader:  # normal_loader: DataLoader over normal-only samples
    recon = model(x)
    loss = nn.functional.mse_loss(recon, x)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    # X_test: tensor of held-out samples; per-sample mean squared reconstruction error
    errors = ((model(X_test) - X_test) ** 2).mean(dim=1)
threshold = errors.quantile(0.99)
anomalies = errors > threshold

Use anomaly detection when:
- Positive class is very rare (<1%), so supervised models struggle
- Few or no labels — must learn 'normal' from data
- Server telemetry, fraud, network traffic, sensor data
- Need to flag novel anomaly types you haven't seen
Avoid it when:
- Classes are balanced — straight classification is more efficient
- The definition of 'anomaly' isn't clear — start with a business definition
- Strict explainability is required and the chosen model is too opaque
Threshold without business input
The 99th percentile of scores isn't 'anomalous' — it's just a percentile. Without weighing the cost of a false alarm against the cost of a miss, no threshold is defensible.
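One way to make that trade-off explicit, sketched under the assumption that you have a small labeled validation sample and rough per-alarm and per-miss costs (all numbers illustrative; for IsolationForest's score_samples, negate the scores first since lower means more anomalous):

import numpy as np

def best_threshold(scores, labels, alarm_cost=1.0, miss_cost=50.0):
    """Pick the cutoff on anomaly scores (higher = more anomalous) that
    minimizes expected cost on a labeled sample (labels: 1 = true anomaly)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_t, best_cost = None, float("inf")
    for t in np.unique(scores):
        flagged = scores >= t
        cost = (alarm_cost * np.sum(flagged & (labels == 0))    # false alarms
                + miss_cost * np.sum(~flagged & (labels == 1))) # missed anomalies
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t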
Baseline drift
What counts as 'normal' in training shifts in production: holidays, releases, and seasonality all move the baseline. Retrain regularly.
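A common mitigation is retraining on a sliding window. A minimal sketch, assuming recent per-server feature snapshots and some external scheduler such as cron (load_features is a hypothetical loader):

from sklearn.ensemble import IsolationForest

def retrain_on_window(feature_window):
    """feature_window: array of per-server features from the last 30 days.
    Refitting lets 'normal' track the current baseline."""
    model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
    model.fit(feature_window)
    return model

# e.g. run weekly: model = retrain_on_window(load_features(days=30))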
Single global threshold
One threshold across 100K servers is meaningless — each server has its own normal. Per-entity baselines plus deviation scores are far more accurate.
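A minimal sketch of per-entity baselining, assuming a pandas DataFrame df with columns server_id and cpu (column names illustrative): score each reading against that server's own history instead of a global cutoff.

import numpy as np
import pandas as pd

def per_server_deviation(df, metric="cpu"):
    """Robust z-score of each reading against its own server's history."""
    med = df.groupby("server_id")[metric].transform("median")
    mad = (df[metric] - med).abs().groupby(df["server_id"]).transform("median")
    # servers with zero MAD get NaN (no variation to compare against)
    return 0.6745 * (df[metric] - med) / mad.replace(0, np.nan)

df["cpu_dev"] = per_server_deviation(df)
alerts = df[df["cpu_dev"].abs() > 3.5]  # same CPU value, different verdict per server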