Clustering
Grouping by similarity, no labels
An unsupervised technique that groups similar items into clusters without any labels — used for customer segmentation, document grouping, anomaly detection.
Clustering automatically pulls structure out of unlabeled data: similar items end up in the same group, dissimilar items in different groups. No labels are involved; the algorithm reasons only from features. It produces structural answers like "these three user groups look alike" or "these five products belong together".
Several families exist. Partitioning algorithms (K-means, K-medoids) require you to specify the cluster count and assign each point to the nearest center. Hierarchical clustering builds a tree you can cut at any level. Density-based methods (DBSCAN, HDBSCAN) treat dense regions as clusters and sparse points as noise — no need to set K. Graph-based methods (spectral clustering) find communities in a similarity graph.
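The difference between families is easiest to see on data that is not convex. A minimal sketch with scikit-learn's make_moons (the eps and min_samples values are illustrative, not tuned):

```python
# Two interleaved half-moons: K-means assumes roughly convex clusters and
# splits them poorly, while density-based DBSCAN follows the shapes and
# marks sparse points as noise (label -1).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X_demo, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_demo)  # K chosen up front
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_demo)                    # no K needed

print("K-means label set:", np.unique(kmeans_labels))
print("DBSCAN label set :", np.unique(dbscan_labels))  # -1, if present, is noise
```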
Clustering quality often depends more on the features you choose than on the algorithm. There is no mathematically "right" clustering; success is judged by the business interpretation.
A music festival with 50,000 attendees. A drone overhead spots natural crowd densities: a black-clad pack at the rock stage, families at the food court, early arrivals near the main gate. No one labeled them "groups"; location, dress, behavior pushed them into clusters. Clustering algorithms do what the drone does — surface the structure. Naming the categories is up to you.
A journal has 80,000 published articles. Editors want them grouped by topic, but reading 80,000 abstracts manually is impossible. You embed each abstract (e.g. with sentence-transformers), then run HDBSCAN.
The algorithm finds 47 clusters. Inspection shows one is "quantum computing", another "climate modeling", another "antibiotic resistance". Editors name them; the journal site publishes a topic map. There were no labels — the structure emerged from the data, and the user experience improved.
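A rough sketch of that pipeline, assuming the sentence-transformers library and scikit-learn's HDBSCAN (version 1.3+); the model name and min_cluster_size are illustrative, and load_abstracts is a hypothetical helper:

```python
# Embed unlabeled abstracts, then let a density-based algorithm find topics.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import HDBSCAN  # or the standalone hdbscan package

abstracts = load_abstracts()  # hypothetical helper returning a list of abstract strings

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
embeddings = model.encode(abstracts, normalize_embeddings=True)

labels = HDBSCAN(min_cluster_size=50).fit_predict(embeddings)  # -1 = noise / unassigned

n_topics = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_topics} topic clusters found; editors still name them by inspection")
```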
For partitioning methods, a typical workflow is to standardize the features, then sweep K and compare silhouette scores:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import numpy as np

# X is your raw feature matrix of shape (n_samples, n_features)
X_scaled = StandardScaler().fit_transform(X)

# Search for the best K via silhouette score (higher = better separated)
scores = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    scores.append((k, silhouette_score(X_scaled, labels)))

for k, s in scores:
    print(f"K={k} silhouette={s:.3f}")

best_k = max(scores, key=lambda x: x[1])[0]
print(f"Best K: {best_k}")
```

Use clustering when:
- You want to discover structure in unlabeled data
- Customer segmentation, document/product grouping
- Anomaly detection — small isolated clusters or noise points flagged as outliers (see the DBSCAN sketch after these lists)
- Feature engineering for downstream supervised models
Avoid it when:
- You have explicit target labels — supervised models are more accurate
- You can't interpret the output and have no human time to review
- Data is tiny or too uniform to actually cluster
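For the anomaly-detection case above, a hedged sketch: let a density-based method mark points in sparse regions as noise and treat those as outlier candidates. The synthetic data and the eps/min_samples values are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(0, 1, size=(500, 2)),    # dense "normal" behaviour
    rng.uniform(-8, 8, size=(10, 2)),   # a few scattered outliers
])

points_scaled = StandardScaler().fit_transform(points)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points_scaled)

outliers = np.where(labels == -1)[0]
print(f"{len(outliers)} points flagged as noise / potential anomalies")
```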
Three pitfalls come up repeatedly.

Wrong distance metric
K-means uses Euclidean distance. For categorical data or text embeddings, cosine similarity is usually the better fit. A mismatched metric produces meaningless clusters.
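One common workaround, sketched under the assumption that the features are dense embeddings: L2-normalize the vectors first. For unit vectors, squared Euclidean distance equals 2 − 2·(cosine similarity), so plain K-means on normalized data behaves like a cosine-based (spherical) K-means.

```python
# Normalize rows to unit length so Euclidean K-means tracks cosine similarity.
# `embeddings` is a random stand-in for real text embeddings.
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

embeddings = np.random.rand(1000, 384)           # stand-in: 1000 vectors, 384 dims
unit_vectors = normalize(embeddings, norm="l2")  # each row now has length 1

labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(unit_vectors)
print(labels[:10])
```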
Curse of dimensionality
In hundreds of dimensions, distances between points become nearly uniform and lose meaning. Reducing dimensions with PCA or UMAP first often improves clustering quality dramatically.
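A minimal sketch of that preprocessing step. The 300-dimensional random input and the choice of 50 components are stand-ins; UMAP (from the separate umap-learn package) is a common nonlinear alternative.

```python
# Standardize, project to a lower-dimensional space with PCA, then cluster.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X_raw = np.random.rand(2000, 300)  # stand-in for high-dimensional features
X_std = StandardScaler().fit_transform(X_raw)

X_reduced = PCA(n_components=50, random_state=42).fit_transform(X_std)
labels = KMeans(n_clusters=8, n_init=10, random_state=42).fit_predict(X_reduced)
print(labels[:10])
```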
Picking a single K and stopping
Different K values tell different business stories. Compare candidates with the silhouette score, the gap statistic, or the elbow method, and gather feedback from product stakeholders.
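The elbow heuristic can complement the silhouette search shown earlier. This sketch assumes X_scaled is the standardized feature matrix from that code block and looks for the K where inertia (within-cluster sum of squares) stops dropping sharply.

```python
from sklearn.cluster import KMeans

# Sweep K and record K-means inertia; the "elbow" in this curve is a
# candidate K, to be weighed against silhouette scores and product feedback.
inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias.append((k, km.inertia_))

for k, inertia in inertias:
    print(f"K={k} inertia={inertia:.1f}")
```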