K-means
Partition data into K clusters
Splits the data into K clusters by assigning each point to the nearest centroid and iteratively updating centroids — fast and simple.
K-means is the workhorse of clustering. You supply K. The algorithm picks K random points as initial centroids, assigns each example to the nearest one, recomputes each cluster's mean, reassigns, and repeats until the centroids stop moving. This iteration is known as Lloyd's algorithm.
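A minimal from-scratch sketch of Lloyd's algorithm in NumPy, assuming `X` is an (n_samples, n_features) array; in practice you'd reach for scikit-learn's `KMeans` rather than this:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick K random points from the data as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged: centroids stopped moving
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels
```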
The objective being minimized is the within-cluster sum of squared distances (WCSS, exposed as inertia in scikit-learn). Minimizing squared Euclidean distance implicitly assumes clusters are roughly spherical and similarly sized; long, thin, or irregular clusters can't be captured. That's why density-based (DBSCAN) and hierarchical alternatives exist.
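The objective in code, as a sanity check, assuming `X` is a NumPy array of your data: scikit-learn's `inertia_` is exactly the WCSS, and you can recompute it by hand:

```python
import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
# WCSS: squared distance from every point to its assigned centroid, summed
wcss = np.sum((X - km.cluster_centers_[km.labels_]) ** 2)
assert np.isclose(wcss, km.inertia_)
```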
K-means is sensitive to initialization. K-means++ initialization (which picks initial centers spread apart) and n_init (running several seeds and keeping the best) are now standard.
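A quick way to see the effect, again assuming some data `X`: compare a single random start against the k-means++ plus multi-start combination; the latter rarely loses on inertia.

```python
from sklearn.cluster import KMeans

one_random_start = KMeans(n_clusters=5, init="random", n_init=1, random_state=0).fit(X)
standard = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)
# The k-means++ / multi-start combination should match or beat the single start
print(one_random_start.inertia_, standard.inertia_)
```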
You moved to a new city and want to open K=4 cafés. You start with four random spots; every customer goes to the nearest. Then you move each café to the geographic center of its customers. Customers redistribute, you move again. After a few rounds, the locations stabilize — the city is split into four service zones. That's K-means.
An e-commerce shop runs RFM analysis: Recency, Frequency, Monetary. 1.5M customers in a 3D space. Marketing says "give us 5 segments".
K=5 K-means yields:
- Champions: recent, frequent, high-spend → don't oversell
- Loyal: moderate frequency and value → loyalty program
- At risk: last purchase 90+ days ago → win-back campaign
- New: joined last week → welcome flow
- Sleeping: dormant 1+ year → low-effort reactivation
No labels were provided; the algorithm surfaced the structure, and marketing named it.
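A sketch of how that pipeline might look; the `orders` table and its columns (customer_id, order_date, amount) are assumptions for illustration, not from the original story:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# orders: one row per purchase, with hypothetical columns
# customer_id, order_date (datetime), amount
now = orders["order_date"].max()
rfm = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (now - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)

X_rfm = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_rfm)
print(rfm.groupby("segment").mean())  # marketing names the segments from these profiles
```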
The full recipe: scale, sweep K, score with silhouette, refit with the best K.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import numpy as np

# Scale first: K-means uses Euclidean distance, so unscaled features dominate
X_scaled = StandardScaler().fit_transform(X)

# Elbow (inertia) + silhouette to pick K
results = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    results.append({
        "k": k,
        "inertia": km.inertia_,
        "silhouette": silhouette_score(X_scaled, labels),
    })

for r in results:
    print(f"K={r['k']:2d} inertia={r['inertia']:.0f} silhouette={r['silhouette']:.3f}")

best_k = max(results, key=lambda r: r["silhouette"])["k"]
final = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X_scaled)
```

Use K-means when:
- You have a sense of the right K from the business
- Clusters are roughly spherical and similarly sized
- Speed matters at scale — K-means is one of the fastest clustering algorithms
- You need repeated, low-latency clustering in production
Avoid K-means when:
- Clusters are irregular shapes or very different densities — use DBSCAN (see the sketch after this list)
- Outliers wreck the fit — K-means has no notion of noise and tries to swallow them into clusters
- K is unknown and you need the data to suggest it — try DBSCAN/HDBSCAN
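To make the first point concrete, here's a sketch on scikit-learn's two-moons toy data, where K-means cuts straight through the crescents while DBSCAN follows their shape; the eps value is an assumption you'd tune for your data:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

# K-means splits the crescents in half; DBSCAN recovers each crescent whole
print("K-means clusters:", set(km_labels))
print("DBSCAN clusters: ", set(db_labels))  # -1, if present, marks noise points
```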
Skipping scaling
If features are on different scales, the largest one dominates the distance computation. Apply StandardScaler before everything else.
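A small illustration with made-up features: income in dollars swamps age in years until both are standardized.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on wildly different scales
age = rng.uniform(18, 70, size=(1000, 1))               # tens
income = rng.uniform(20_000, 200_000, size=(1000, 1))   # tens of thousands
X_demo = np.hstack([age, income])

raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)
scaled = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    StandardScaler().fit_transform(X_demo)
)
# On raw data the centroids differ almost only in income; age is ignored
print(raw.cluster_centers_)
print(scaled.cluster_centers_)
```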
Single random start
K-means is sensitive to initialization. Set n_init to 10 or more so multiple starts are tried and the best result is kept.
Eyeballing K
Picking K=5 because it's a nice number is meaningless. Use the elbow method, silhouette score, or gap statistic to choose.