K-means
Partition data into K clusters
Splits the data into K clusters by assigning each point to the nearest centroid and iteratively updating centroids — fast and simple.
K-means is the workhorse of clustering. You supply K. The algorithm picks K random points as initial centroids, assigns each example to the nearest one, recomputes each cluster's mean, reassigns, and repeats until the centroids stop moving. This iteration is known as Lloyd's algorithm.
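A minimal from-scratch sketch of Lloyd's algorithm in NumPy, assuming `X` is an (n_samples, n_features) array; in practice you'd reach for scikit-learn's `KMeans` rather than this:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick K random points from the data as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged: centroids stopped moving
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels
```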
The objective being minimized is the within-cluster sum of squared distances (WCSS, exposed as inertia in scikit-learn). Minimizing squared Euclidean distance implicitly assumes clusters are roughly spherical and similarly sized; long, thin, or irregular clusters can't be captured. That's why density-based (DBSCAN) and hierarchical alternatives exist.
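The objective in code, as a sanity check, assuming `X` is a NumPy array of your data: scikit-learn's `inertia_` is exactly the WCSS, and you can recompute it by hand:

```python
import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
# WCSS: squared distance from every point to its assigned centroid, summed
wcss = np.sum((X - km.cluster_centers_[km.labels_]) ** 2)
assert np.isclose(wcss, km.inertia_)
```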
K-means is sensitive to initialization. K-means++ initialization (which picks initial centers spread apart) and n_init (running several seeds and keeping the best) are now standard.
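A quick way to see the effect, again assuming some data `X`: compare a single random start against the k-means++ plus multi-start combination; the latter rarely loses on inertia.

```python
from sklearn.cluster import KMeans

one_random_start = KMeans(n_clusters=5, init="random", n_init=1, random_state=0).fit(X)
standard = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)
# The k-means++ / multi-start combination should match or beat the single start
print(one_random_start.inertia_, standard.inertia_)
```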
You moved to a new city and want to open K=4 cafés. You start with four random spots; every customer goes to the nearest. Then you move each café to the geographic center of its customers. Customers redistribute, you move again. After a few rounds, the locations stabilize — the city is split into four service zones. That's K-means.
An e-commerce shop runs RFM analysis: Recency, Frequency, Monetary. 1.5M customers in a 3D space. Marketing says "give us 5 segments".
K=5 K-means yields:
- Champions: recent, frequent, high-spend → don't oversell
- Loyal: moderate frequency and value → loyalty program
- At risk: last purchase 90+ days ago → win-back campaign
- New: joined last week → welcome flow
- Sleeping: dormant 1+ year → low-effort reactivation
No labels were provided; the algorithm surfaced the structure, and marketing named it.
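A sketch of how that pipeline might look; the `orders` table and its columns (customer_id, order_date, amount) are assumptions for illustration, not from the original story:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# orders: one row per purchase, with hypothetical columns
# customer_id, order_date (datetime), amount
now = orders["order_date"].max()
rfm = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (now - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)

X_rfm = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_rfm)
print(rfm.groupby("segment").mean())  # marketing names the segments from these profiles
```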
The full recipe: scale, sweep K, score with silhouette, refit with the best K.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import numpy as np

# Scale first: K-means uses Euclidean distance, so unscaled features dominate
X_scaled = StandardScaler().fit_transform(X)

# Elbow (inertia) + silhouette to pick K
results = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    results.append({
        "k": k,
        "inertia": km.inertia_,
        "silhouette": silhouette_score(X_scaled, labels),
    })

for r in results:
    print(f"K={r['k']:2d} inertia={r['inertia']:.0f} silhouette={r['silhouette']:.3f}")

best_k = max(results, key=lambda r: r["silhouette"])["k"]
final = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X_scaled)
```

Use K-means when:
- You have a sense of the right K from the business
- Clusters are roughly spherical and similarly sized
- Speed matters at scale — K-means is one of the fastest clustering algorithms
- You need repeated, low-latency clustering in production
Avoid K-means when:
- Clusters are irregular shapes or very different densities — use DBSCAN (see the sketch after this list)
- Outliers wreck the fit — K-means has no notion of noise and tries to swallow them into clusters
- K is unknown and you need the data to suggest it — try DBSCAN/HDBSCAN
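To make the first point concrete, here's a sketch on scikit-learn's two-moons toy data, where K-means cuts straight through the crescents while DBSCAN follows their shape; the eps value is an assumption you'd tune for your data:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

# K-means splits the crescents in half; DBSCAN recovers each crescent whole
print("K-means clusters:", set(km_labels))
print("DBSCAN clusters: ", set(db_labels))  # -1, if present, marks noise points
```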
Skipping scaling
If features are on different scales, the largest one dominates the distance computation. Apply StandardScaler before everything else.
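A small illustration with made-up features: income in dollars swamps age in years until both are standardized.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on wildly different scales
age = rng.uniform(18, 70, size=(1000, 1))               # tens
income = rng.uniform(20_000, 200_000, size=(1000, 1))   # tens of thousands
X_demo = np.hstack([age, income])

raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)
scaled = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    StandardScaler().fit_transform(X_demo)
)
# On raw data the centroids differ almost only in income; age is ignored
print(raw.cluster_centers_)
print(scaled.cluster_centers_)
```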
Single random start
K-means is sensitive to initialization. Set n_init to 10 or more so multiple starts are tried and the best result is kept.
Eyeballing K
Picking K=5 because it's a nice number is meaningless. Use the elbow method, silhouette score, or gap statistic to choose.