AI Atlas
Beginner · ~2 min read · #clustering #unsupervised-learning #segmentation

Clustering

Grouping by similarity, no labels

An unsupervised technique that groups similar items into clusters without any labels — used for customer segmentation, document grouping, and anomaly detection.

[Diagram: points separated into Cluster 1, Cluster 2, and Cluster 3 — clustering groups similar items without any labels.]
Definition

Clustering automatically pulls structure out of unlabeled data: similar items end up in the same group, dissimilar items in different groups. No labels are involved; the algorithm reasons only from features. It produces structural answers like "these three user groups look alike" or "these five products belong together".

Several families exist. Partitioning algorithms (K-means, K-medoids) require you to specify the cluster count and assign each point to the nearest center. Hierarchical clustering builds a tree you can cut at any level. Density-based methods (DBSCAN, HDBSCAN) treat dense regions as clusters and sparse points as noise — no need to set K. Graph-based methods (spectral clustering) find communities in a similarity graph.
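The contrast between the first two families above can be sketched on toy data: K-means needs the cluster count up front, while a density-based method like DBSCAN infers it (the `eps` and `min_samples` values here are illustrative, not defaults to copy).

```python
# Minimal sketch: two clustering families on the same synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Partitioning: K must be chosen by the user
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Density-based: cluster count emerges; sparse points get the label -1 ("noise")
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print("K-means clusters:", len(set(km_labels)))
print("DBSCAN clusters:", len(set(db_labels) - {-1}))
```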

Clustering quality often depends more on the features you choose than on the algorithm. There is no mathematically "right" clustering; success is judged by the business interpretation.

Analogy

A music festival with 50,000 attendees. A drone overhead spots natural crowd densities: a black-clad pack at the rock stage, families at the food court, early arrivals near the main gate. No one labeled them "groups"; location, dress, behavior pushed them into clusters. Clustering algorithms do what the drone does — surface the structure. Naming the categories is up to you.

Real-world example

A journal has 80,000 published articles. Editors want them grouped by topic, but reading 80,000 abstracts manually is impossible. You embed each abstract (e.g. with sentence-transformers), then run HDBSCAN.

The algorithm finds 47 clusters. Inspection shows one is "quantum computing", another "climate modeling", another "antibiotic resistance". Editors name them; the journal site publishes a topic map. There were no labels — the structure emerged from the data, and the user experience improved.

Code examples
scikit-learn · K-means + silhouette (Python)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Toy data; in practice X is your (n_samples, n_features) feature matrix
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Search for the best K via silhouette (higher = better separated)
scores = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    scores.append((k, silhouette_score(X_scaled, labels)))

for k, s in scores:
    print(f"K={k}  silhouette={s:.3f}")

best_k = max(scores, key=lambda x: x[1])[0]
print(f"Best K: {best_k}")
When to use
  • You want to discover structure in unlabeled data
  • Customer segmentation, document/product grouping
  • Anomaly detection — small isolated clusters flagged as outliers
  • Feature engineering for downstream supervised models
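The anomaly-detection use above can be sketched with DBSCAN, which assigns the label -1 to points that fall in no dense region (synthetic data and illustrative parameters):

```python
# Density-based outlier flagging: points labeled -1 are anomaly candidates.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(7)
inliers = rng.normal(loc=0.0, scale=0.5, size=(200, 2))      # one dense blob
outliers = np.array([[5.0, 5.0], [-6.0, 4.0], [6.0, -5.0]])  # isolated points
X = np.vstack([inliers, outliers])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
anomaly_idx = np.where(labels == -1)[0]
print("Flagged as anomalies:", anomaly_idx)
```

The three injected isolated points (rows 200–202) end up flagged; a few extreme inliers may be flagged too, which is why such output is usually reviewed rather than acted on automatically.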
When not to use
  • You have explicit target labels — supervised models are more accurate
  • You can't interpret the output and have no human time to review
  • Data is tiny or too uniform to actually cluster
Common pitfalls

Wrong distance metric

K-means uses Euclidean distance. For categorical data or text embeddings, cosine similarity is usually the better choice. A mismatched metric produces meaningless clusters.
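One common workaround (a sketch, not the only option): L2-normalize the vectors first. For unit vectors, squared Euclidean distance equals 2 · (1 − cosine similarity), so K-means on normalized rows groups by direction rather than magnitude.

```python
# Approximating cosine-based clustering with Euclidean K-means.
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))   # stand-in for text embeddings
X_unit = normalize(X)            # L2-normalize each row

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_unit)

# Verify the identity ||a - b||^2 == 2 * (1 - cos(a, b)) on one pair of rows
a, b = X_unit[0], X_unit[1]
assert np.isclose(np.sum((a - b) ** 2), 2 * (1 - a @ b))
```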

Curse of dimensionality

In hundreds of dimensions, distances concentrate and lose discriminative power. Reducing dimensionality with PCA or UMAP first often improves clustering quality substantially.
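A sketch of the reduce-then-cluster pattern, using PCA (UMAP is a separate library and a common nonlinear alternative). On this clean synthetic data both variants score well; the point is the mechanics, not the numbers.

```python
# Reduce dimensionality before clustering, then compare silhouettes.
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 500-dimensional data with 4 hidden groups
X, _ = make_blobs(n_samples=400, centers=4, n_features=500, random_state=1)

# Project to 10 dimensions before clustering
X_low = PCA(n_components=10, random_state=1).fit_transform(X)

for name, data in [("raw 500-D", X), ("PCA 10-D", X_low)]:
    labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(data)
    print(f"{name}: silhouette = {silhouette_score(data, labels):.3f}")
```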

Picking a single K and stopping

Different K values tell different business stories. Use silhouette, the gap statistic, or the elbow method to compare candidates, and gather feedback from product stakeholders before settling on one.