Clustering
Grouping by similarity, no labels
An unsupervised technique that groups similar items into clusters without any labels — used for customer segmentation, document grouping, anomaly detection.
Clustering automatically pulls structure out of unlabeled data: similar items end up in the same group, dissimilar items in different groups. No labels are involved; the algorithm reasons only from features. It produces structural answers like "these three user groups look alike" or "these five products belong together".
Several families exist. Partitioning algorithms (K-means, K-medoids) require you to specify the cluster count and assign each point to the nearest center. Hierarchical clustering builds a tree you can cut at any level. Density-based methods (DBSCAN, HDBSCAN) treat dense regions as clusters and sparse points as noise — no need to set K. Graph-based methods (spectral clustering) find communities in a similarity graph.
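The difference between families is easiest to see on data that is not convex. A minimal sketch with scikit-learn's make_moons (the eps and min_samples values are illustrative, not tuned):

```python
# Two interleaved half-moons: K-means assumes roughly convex clusters and
# splits them poorly, while density-based DBSCAN follows the shapes and
# marks sparse points as noise (label -1).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X_demo, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_demo)  # K chosen up front
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_demo)                    # no K needed

print("K-means label set:", np.unique(kmeans_labels))
print("DBSCAN label set :", np.unique(dbscan_labels))  # -1, if present, is noise
```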
Clustering quality often depends more on the features you choose than on the algorithm. There is no mathematically "right" clustering; success is judged by the business interpretation.
A music festival with 50,000 attendees. A drone overhead spots natural crowd densities: a black-clad pack at the rock stage, families at the food court, early arrivals near the main gate. No one labeled them "groups"; location, dress, behavior pushed them into clusters. Clustering algorithms do what the drone does — surface the structure. Naming the categories is up to you.
A journal has 80,000 published articles. Editors want them grouped by topic, but reading 80,000 abstracts manually is impossible. You embed each abstract (e.g. with sentence-transformers), then run HDBSCAN.
The algorithm finds 47 clusters. Inspection shows one is "quantum computing", another "climate modeling", another "antibiotic resistance". Editors name them; the journal site publishes a topic map. There were no labels — the structure emerged from the data, and the user experience improved.
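A rough sketch of that pipeline, assuming the sentence-transformers library and scikit-learn's HDBSCAN (version 1.3+); the model name and min_cluster_size are illustrative, and load_abstracts is a hypothetical helper:

```python
# Embed unlabeled abstracts, then let a density-based algorithm find topics.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import HDBSCAN  # or the standalone hdbscan package

abstracts = load_abstracts()  # hypothetical helper returning a list of abstract strings

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
embeddings = model.encode(abstracts, normalize_embeddings=True)

labels = HDBSCAN(min_cluster_size=50).fit_predict(embeddings)  # -1 = noise / unassigned

n_topics = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_topics} topic clusters found; editors still name them by inspection")
```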
For partitioning methods, a typical workflow is to standardize the features, then sweep K and compare silhouette scores:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import numpy as np

# X is your raw feature matrix of shape (n_samples, n_features)
X_scaled = StandardScaler().fit_transform(X)

# Search for the best K via silhouette score (higher = better separated)
scores = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    scores.append((k, silhouette_score(X_scaled, labels)))

for k, s in scores:
    print(f"K={k} silhouette={s:.3f}")

best_k = max(scores, key=lambda x: x[1])[0]
print(f"Best K: {best_k}")
```

Use clustering when:
- You want to discover structure in unlabeled data
- Customer segmentation, document/product grouping
- Anomaly detection — small isolated clusters or noise points flagged as outliers (see the DBSCAN sketch after these lists)
- Feature engineering for downstream supervised models
Avoid it when:
- You have explicit target labels — supervised models are more accurate
- You can't interpret the output and have no human time to review
- Data is tiny or too uniform to actually cluster
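For the anomaly-detection case above, a hedged sketch: let a density-based method mark points in sparse regions as noise and treat those as outlier candidates. The synthetic data and the eps/min_samples values are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(0, 1, size=(500, 2)),    # dense "normal" behaviour
    rng.uniform(-8, 8, size=(10, 2)),   # a few scattered outliers
])

points_scaled = StandardScaler().fit_transform(points)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points_scaled)

outliers = np.where(labels == -1)[0]
print(f"{len(outliers)} points flagged as noise / potential anomalies")
```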
Three pitfalls come up repeatedly.

Wrong distance metric
K-means uses Euclidean distance. For categorical data or text embeddings, cosine similarity is usually the better fit. A mismatched metric produces meaningless clusters.
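One common workaround, sketched under the assumption that the features are dense embeddings: L2-normalize the vectors first. For unit vectors, squared Euclidean distance equals 2 − 2·(cosine similarity), so plain K-means on normalized data behaves like a cosine-based (spherical) K-means.

```python
# Normalize rows to unit length so Euclidean K-means tracks cosine similarity.
# `embeddings` is a random stand-in for real text embeddings.
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

embeddings = np.random.rand(1000, 384)           # stand-in: 1000 vectors, 384 dims
unit_vectors = normalize(embeddings, norm="l2")  # each row now has length 1

labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(unit_vectors)
print(labels[:10])
```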
Curse of dimensionality
In hundreds of dimensions, distances between points become nearly uniform and lose meaning. Reducing dimensions with PCA or UMAP first often improves clustering quality dramatically.
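A minimal sketch of that preprocessing step. The 300-dimensional random input and the choice of 50 components are stand-ins; UMAP (from the separate umap-learn package) is a common nonlinear alternative.

```python
# Standardize, project to a lower-dimensional space with PCA, then cluster.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X_raw = np.random.rand(2000, 300)  # stand-in for high-dimensional features
X_std = StandardScaler().fit_transform(X_raw)

X_reduced = PCA(n_components=50, random_state=42).fit_transform(X_std)
labels = KMeans(n_clusters=8, n_init=10, random_state=42).fit_predict(X_reduced)
print(labels[:10])
```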
Picking a single K and stopping
Different K values tell different business stories. Compare candidates with the silhouette score, the gap statistic, or the elbow method, and gather feedback from product stakeholders.
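The elbow heuristic can complement the silhouette search shown earlier. This sketch assumes X_scaled is the standardized feature matrix from that code block and looks for the K where inertia (within-cluster sum of squares) stops dropping sharply.

```python
from sklearn.cluster import KMeans

# Sweep K and record K-means inertia; the "elbow" in this curve is a
# candidate K, to be weighed against silhouette scores and product feedback.
inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias.append((k, km.inertia_))

for k, inertia in inertias:
    print(f"K={k} inertia={inertia:.1f}")
```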