Sentiment Analysis
Inferring tone from text
Automatically determining the emotional tone (positive, negative, neutral) of text — the foundation of customer review, social media, and feedback analysis.
Sentiment analysis quantifies the writer's attitude. The simplest form is three-way classification (positive / negative / neutral); richer setups score intensity (-1 to +1), multiple dimensions (joy, anger, surprise), or aspect-based sentiment (positive about delivery, negative about price within the same review).
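For example, a single review can carry opposite signals per aspect; a hypothetical aspect-level result might look like this (the structure is illustrative, not any particular library's output):

```python
# Hypothetical aspect-based output for one review (structure is illustrative).
review = "Delivery was fast, but the price is outrageous."
aspect_sentiment = {
    "delivery": {"label": "positive", "intensity": 0.8},
    "price":    {"label": "negative", "intensity": -0.7},
}
```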
Approaches evolved:
- Lexicon-based (VADER, AFINN): each word has a preset sentiment score; the sentence score is a sum or average. Fast, explainable; weak on negation, irony, context.
- Classical ML (Naive Bayes, logistic regression + TF-IDF): a classifier trained on labeled reviews. Solid baseline.
- Pre-transformer deep learning (LSTM, CNN): captures context; needs data.
- Transformer-based (BERT, RoBERTa, multilingual): today's default. Take a pre-trained model and fine-tune on 1K–10K labels for production-grade accuracy.
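As a concrete illustration of the lexicon approach above, a minimal sketch with the `vaderSentiment` package; the ±0.05 thresholds on the compound score are the convention from VADER's documentation:

```python
# Lexicon-based scoring with VADER (pip install vaderSentiment).
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("The food was great, but service was painfully slow.")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

# 'compound' is normalized to [-1, +1]; the documented convention:
label = ("positive" if scores["compound"] >= 0.05
         else "negative" if scores["compound"] <= -0.05
         else "neutral")
print(label)
```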
Open models exist for many languages. Label quality is what drives real performance. Morphology, irony, slang, and regional usage are perennial challenges in sentiment.
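The fine-tuning step itself, sketched with the Hugging Face Trainer API; the model choice, hyperparameters, and the `train_ds`/`eval_ds` datasets are placeholder assumptions, not a recipe:

```python
# Sketch: fine-tune a pre-trained encoder on your own labels (3-way here).
# Assumes train_ds / eval_ds are `datasets` Datasets with "text" and "label" columns.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def tokenize(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=128)

# train_ds = train_ds.map(tokenize, batched=True)
# eval_ds = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="sentiment-ft",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```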
Like a restaurant owner walking the floor after dinner service. From customers' faces, leftovers, snippets servers caught, and tip size, they form an overall sense of the night. Nobody explicitly says "it was good/bad" — they read cues. Sentiment analysis quantifies those cues in text.
A shop processes 50K reviews a day; manual reading is impossible. Three models:
| Model | Accuracy | 1K reviews | Notes |
|-------|----------|------------|-------|
| VADER (lexicon) | 72% | <1 s | Misses irony |
| Logistic + TF-IDF | 81% | 2 s | Decent baseline |
| Fine-tuned BERT | 91% | 30 s | Catches context, negation |
Decision: BERT for the daily dashboard, VADER for the live tooltip in customer service — speed traded for quality. Reviews where the star count contradicts the text (3 stars but mostly positive language) go to a manual queue; data quality matters more than tool choice.
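That contradiction check can be a few lines. The thresholds and label names below are illustrative, assuming a classifier that emits POSITIVE/NEGATIVE with a confidence score, like the pipeline that follows:

```python
# Route star/text contradictions to the manual queue.
# Thresholds are assumptions; tune them on your own data.
def needs_manual_review(stars: int, label: str, score: float) -> bool:
    if score < 0.8:                 # low-confidence text signal: don't flag
        return False
    if stars >= 4:
        return label == "NEGATIVE"  # glowing stars, negative text
    if stars <= 2:
        return label == "POSITIVE"  # low stars, positive text
    return True                     # 3 stars with a confident text signal: worth a look

print(needs_manual_review(stars=3, label="POSITIVE", score=0.95))  # True
```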
Quick start with a pre-trained checkpoint:

```python
from transformers import pipeline

# Pre-trained binary sentiment classifier (POSITIVE / NEGATIVE), fine-tuned on SST-2
clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

reviews = [
    "Loved the product, fast shipping too.",
    "Awful, would never recommend.",
    "It's okay, not as good as I hoped.",
]

for r in reviews:
    out = clf(r)[0]  # top prediction: {'label': ..., 'score': ...}
    print(f"{out['label']:8s} ({out['score']:.2f}) → {r}")
```

For comparison, the classical TF-IDF + logistic regression baseline; the top coefficients double as an explainability check:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),   # bigrams catch simple negation ("not good")
        min_df=2,
        max_features=50000,
    )),
    ("clf", LogisticRegression(max_iter=1000, C=1.0)),
])

# texts_train / labels_train: your labeled data, prepared elsewhere
pipe.fit(texts_train, labels_train)
preds = pipe.predict(texts_test)
print(classification_report(labels_test, preds))

# Which n-grams push predictions positive? A useful sanity check.
feats = pipe.named_steps["tfidf"].get_feature_names_out()
coefs = pipe.named_steps["clf"].coef_[0]
top_pos = coefs.argsort()[-15:][::-1]
print("Most positive words:", [feats[i] for i in top_pos])
```

When to use
- Scaling reviews / social / feedback analysis
- Brand monitoring, early crisis detection
- Content moderation (toxic vs neutral)
- Measuring emotional response in A/B tests
When not to use
- Few one-off texts — read them yourself
- Sarcasm / irony heavy — even modern models struggle
- Highly domain-specific (legal, medical) — generic models miss nuance
Pitfalls

Losing negation
'not bad' is positive; 'not good' is negative. Lexicon methods often miss this; transformers handle it better but verify on your data.
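One cheap verification: a hand-written probe set run through the classifier (here `clf`, the pipeline from the example above; the expected labels are intuitions, not ground truth):

```python
# Negation probes: eyeball whether the model flips polarity correctly.
probes = [
    ("not bad at all", "positive-ish"),
    ("not good", "negative"),
    ("I can't say I loved it", "negative"),
    ("no complaints whatsoever", "positive"),
]
for text, expected in probes:
    out = clf(text)[0]
    print(f"{out['label']:8s} ({out['score']:.2f}) expected ~{expected}: {text!r}")
```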
Poor label quality
If sarcasm was labeled 'positive', the model learns that mistake. Track inter-annotator agreement; refine guidelines if it's low.
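Cohen's kappa is the standard two-annotator agreement measure, and scikit-learn provides it directly (toy labels below):

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two annotators on the same sample of reviews (toy data).
annotator_a = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "pos"]
annotator_b = ["pos", "neg", "pos", "pos", "neg", "neu", "neu", "pos"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # common rule of thumb: below ~0.6, tighten guidelines
```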
Single-number summaries
An overall 78% positive can hide a feature with 30% positive / 50% negative. Aspect-based sentiment surfaces which parts of the product need attention.
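A minimal per-aspect rollup, assuming (aspect, label) pairs were already extracted upstream; the extraction itself is the hard part and out of scope for this sketch:

```python
from collections import Counter, defaultdict

# (aspect, sentiment) pairs per review, assumed extracted upstream (toy data).
pairs = [
    ("delivery", "positive"), ("delivery", "positive"), ("delivery", "neutral"),
    ("price", "negative"), ("price", "negative"), ("price", "positive"),
]

by_aspect = defaultdict(Counter)
for aspect, label in pairs:
    by_aspect[aspect][label] += 1

for aspect, counts in by_aspect.items():
    total = sum(counts.values())
    shares = {lbl: f"{n / total:.0%}" for lbl, n in counts.items()}
    print(aspect, shares)  # price skews negative even if the overall number looks fine
```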