Unsupervised Learning
Finding structure without labels
A learning style where the model finds hidden structure, clusters, or patterns in unlabeled data on its own.
In unsupervised learning, nobody tells the model the "right answer". The model is given raw data and asked to find structure within it: "how similar are these users?", "which transaction stands out?", "how many themes show up in this corpus?".
Three main use cases:
- Clustering groups similar items: customer segmentation, document organization.
- Dimensionality reduction compresses high-dimensional data into the 2–3 dimensions a human can grasp: PCA, t-SNE, UMAP (see the sketch below).
- Anomaly detection picks out the odd one in a crowd: fraud, sensor failure.
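For the dimensionality-reduction case, a minimal sketch using scikit-learn's PCA; the 200×20 random matrix is a made-up stand-in for real high-dimensional data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))  # 200 samples, 20 features: too many axes to eyeball

# Project onto the 2 directions that preserve the most variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                     # (200, 2), ready for a scatter plot
print(pca.explained_variance_ratio_)  # how much variance the 2 axes retain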
Its strength is working without labels, and in the real world the vast majority of data is unlabeled. Its weakness is that the model finds structure but never tells you what it means. It says "here are three clusters"; deciding that they correspond to "VIPs", "discount-hunters", and "one-off buyers" is on you.
You walk into a new library. Books are stacked randomly, no labels. By cover color, thickness, font, language, you start grouping them yourself. Yellow novels pile up here, red textbooks there. No one told you the categories — you derived structure from the data. Later, when a new book arrives, you can place it instantly.
An e-commerce company knows a lot about its 2M customers: age, spend, what they buy, how often they visit, which campaigns they respond to. But there is no "type" label per customer. The marketing team can't manage 200 segments — it wants a small number of meaningful groups.
K-means clusters the customers into 5 groups by similarity. Analysis follows: cluster one is "monthly visitor, high basket, premium items", cluster two is "weekly bargain hunters", cluster three is "twice-a-year big spenders", and so on. Each gets its own campaign, tone, and channel. There were no labels: the model found the structure, and humans gave it meaning.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np
# Unlabeled customer data: spend, visit frequency, basket size
X = np.array([
[1200, 4, 350], [80, 1, 25],
[950, 3, 280], [60, 1, 30],
[1100, 5, 400], [90, 2, 35],
# ... thousands more rows
])
# Bring features to the same scale (critical for distance-based algos)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Ask K-means for 3 clusters (kept small for this toy sample; the scenario above used 5)
model = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = model.fit_predict(X_scaled)
# Every customer now has a cluster number
# Interpreting them is the human's job: cluster 0 = VIP, etc.
print(labels)
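One way to start interpreting the clusters is to map the centers back into original units; a minimal continuation of the script above, relying on the scaler variable kept there:

# Cluster centers live in scaled space; undo the scaling to read them in real units
centers = scaler.inverse_transform(model.cluster_centers_)
for i, (spend, visits, basket) in enumerate(centers):
    print(f"cluster {i}: spend≈{spend:.0f}, visits≈{visits:.1f}, basket≈{basket:.0f}")
# Naming the segments (VIP, bargain hunter, ...) is still a human call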
When to use it
- Labels are missing or expensive to obtain
- You're exploring: 'what's in this data?'
- Customer segmentation, document grouping, anomaly detection
- Pre-processing for supervised models: feature extraction, dimensionality reduction
When not to use it
- Success depends on a specific target (label): use supervised learning instead
- Explainability is mandatory — cluster interpretation is subjective
- Data is scarce — discovery needs volume
Picking the wrong number of clusters
In K-means, you decide how many clusters to look for. A wrong choice produces meaningless splits. The elbow method and the silhouette score let the data suggest a sensible K.
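A minimal sketch of a silhouette sweep, assuming the scaled matrix X_scaled from the customer example above (with its full thousands of rows; silhouette needs more samples than clusters):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Higher silhouette (closer to 1) means tighter, better-separated clusters
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    print(f"K={k}: silhouette={silhouette_score(X_scaled, labels):.3f}")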
Scale differences
If one feature spans 0–1 and another 0–100,000, the larger-scale one dominates clustering. Always normalize with StandardScaler or MinMaxScaler before fitting distance-based algorithms.
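A tiny numeric illustration with made-up values: two customers whose visit habits differ wildly look nearly identical once spend, on a scale of tens of thousands, enters the Euclidean distance.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two customers: spend (large scale) vs. monthly visits (small scale)
X = np.array([[50_000.0, 1.0],
              [50_100.0, 30.0]])
print(np.linalg.norm(X[0] - X[1]))  # ≈104: the spend gap drowns out the visit gap

X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # ≈2.8: both features now count equally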
Interpretation illusion
If the model finds 5 clusters, it is tempting to assume 5 meaningful segments exist; maybe 2 of them are random noise. Validate clusters against business context: don't run a campaign on imaginary segments.
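One quantitative first check, sketched under the assumption that X_scaled and labels come from the K-means example above: average the per-sample silhouette within each cluster and look at cluster sizes.

import numpy as np
from sklearn.metrics import silhouette_samples

sil = silhouette_samples(X_scaled, labels)
for c in np.unique(labels):
    mask = labels == c
    print(f"cluster {c}: size={mask.sum()}, mean silhouette={sil[mask].mean():.3f}")
# Tiny clusters or near-zero silhouettes deserve suspicion before any campaign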