Support Vector Machine
SVM · Maximum-margin classifier
Separates classes with a hyperplane that maximizes the margin between them, and via the kernel trick can capture nonlinear boundaries.
An SVM looks for the boundary that not only separates the classes but also leaves the widest possible margin on either side. The points closest to that boundary are called support vectors. A wide margin makes the classifier more robust to new data, so SVM explicitly maximizes it.
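After fitting, scikit-learn exposes those critical points directly. A minimal sketch on synthetic two-class data (the dataset and parameters are illustrative assumptions, not taken from the text):
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
# Illustrative two-class toy data; any reasonably separable set behaves the same way
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
# Only the support vectors define the boundary; the remaining points could be dropped
print(clf.support_vectors_.shape)  # (n_support_vectors, n_features)
print(clf.n_support_)              # support vectors per class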
SVM truly shines when data isn't linearly separable, thanks to the kernel trick. Instead of explicitly mapping data into a higher-dimensional space, kernels (RBF, polynomial, sigmoid) compute inner products in that space directly. The result: a curved boundary in the original space, linear in the lifted one.
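A minimal sketch of that effect on a synthetic "two moons" dataset (an illustrative assumption, not data from the text): same estimator, two kernels, and only the RBF one can follow the curve.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
# Two interleaving half-moons: not linearly separable
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
# Same model, two kernels; the RBF kernel bends around the nonlinear boundary
for kernel in ("linear", "rbf"):
    score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(f"{kernel}: {score:.2f}")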
SVM is highly effective on small-to-medium datasets, especially when feature count is comparable to or larger than the sample count (genomics, text classification). For very large datasets training time blows up; gradient boosting or deep learning dominate at that scale.
Imagine separating fans of two rival teams with a barrier. You don't wedge it right up against either crowd; you place it so both sides get breathing room. If a few fans on each side stand right at the line, the barrier is "supported" by them — those are the critical observations. SVM places the barrier with exactly that logic: a small number of points define the entire model.
A bioinformatics lab analyzes 20,000 genes from 200 patients to classify "cancer present / not". Features outnumber samples by 100×, the classic p ≫ n setting. Random forest is too data-hungry, and gradient boosting struggles as well.
Train a linear-kernel SVM and the support vectors collapse to about 20 patients: they define the boundary while the other 180 sit in the background. Test accuracy comes in at 91%, with a clean ranking of the most influential genes. Small data, high dimensions, and SVM still leads.
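That gene ranking falls straight out of the linear kernel's weight vector. A minimal sketch, using a synthetic stand-in for the 200 × 20,000 gene matrix (the lab's actual data isn't reproduced here, and the parameters are illustrative):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# Synthetic stand-in for the p >> n setting: 200 samples, 20,000 features
X, y = make_classification(n_samples=200, n_features=20000, n_informative=50, random_state=0)
X = StandardScaler().fit_transform(X)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
# With a linear kernel, coef_ holds one weight per feature;
# the largest magnitudes mark the most influential "genes"
ranking = np.argsort(np.abs(clf.coef_.ravel()))[::-1]
print("top 10 feature indices:", ranking[:10])
print("support vectors:", len(clf.support_), "of", len(y), "samples")
The full scaled-and-tuned pipeline for a setup like this: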
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
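# X_train, y_train: feature matrix and labels, assumed to be defined elsewhere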
pipe = Pipeline([
("scaler", StandardScaler()),
("svm", SVC(probability=True)),
])
# C: margin width vs error tolerance; gamma: RBF curvature
param_grid = {
"svm__kernel": ["linear", "rbf"],
"svm__C": [0.1, 1, 10, 100],
"svm__gamma": ["scale", 0.01, 0.1],
}
gs = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_macro", n_jobs=-1)
gs.fit(X_train, y_train)
print(f"Best: {gs.best_params_}, F1={gs.best_score_:.3f}")- Small-to-medium datasets (up to ~10K rows)
- Feature count near or exceeding sample count
- Nonlinear but clean class boundary (with a kernel)
- Text classification, genomics, image data — high dimensional
When to avoid
- Large data (>100K rows) — training scales quadratically or worse
- Calibrated probability output is critical — SVM needs Platt scaling (see the calibration sketch after this list)
- Explainability is mandatory and you're using a nonlinear kernel
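If probabilities are needed anyway, the usual workaround is Platt scaling wrapped around the margin classifier. A minimal sketch with scikit-learn's CalibratedClassifierCV (synthetic data, illustrative settings):
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
# Platt scaling: fit a sigmoid on held-out decision values to get probabilities
clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
clf.fit(X_tr, y_tr)
print(clf.predict_proba(X_te)[:3])
This is essentially what SVC(probability=True) in the pipeline above does internally.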
Skipping feature scaling
SVM is distance-based. If one feature spans 0–1 and another 0–1M, the larger one dominates the distance entirely. StandardScaler first.
Picking C and gamma blindly
Too high C overfits, too low underfits. Gamma sets RBF curvature. Grid search is essentially mandatory.
Ignoring the size warning
Above ~100K rows, kernel SVM training can take hours. Linear kernel? Use LinearSVC (liblinear) or SGDClassifier. Nonlinear at scale? Try the Nyström approximation; a sketch follows below.
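A minimal sketch of that large-scale route, pairing a Nyström RBF approximation with an SGD-trained linear SVM (sample count, gamma, and component count are illustrative):
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Stand-in for a dataset too large for an exact kernel SVM
X, y = make_classification(n_samples=200_000, n_features=50, random_state=0)
approx_svm = Pipeline([
    ("scaler", StandardScaler()),
    # Nystroem approximates the RBF feature map with a few hundred landmark points
    ("feature_map", Nystroem(kernel="rbf", gamma=0.1, n_components=300, random_state=0)),
    # hinge loss makes SGDClassifier a linear SVM, trained in roughly linear time
    ("svm", SGDClassifier(loss="hinge")),
])
approx_svm.fit(X, y)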