Sentiment Analysis of COVID-19 Post-Vaccination Discourse in Bangladesh
Classical ML, ensemble classification and LSTM with Word2Vec embeddings applied to Bengali / English-mixed social-media chatter — best 78.8% accuracy.
Public sentiment moves vaccination rates. We collected ~42k social-media comments posted in Bangladesh between January and December 2022 — a language-mixed stream of Bengali, romanised Bengali (Banglish), and English — and asked a simple question: can we measure it cleanly enough that a policymaker would act on it?
The corpus
Comments were scraped from public threads on Facebook, YouTube and Twitter using the official APIs, then filtered to those mentioning at least one of 18 vaccine-related keywords. Each comment was hand-labelled by three annotators into three classes: positive, negative, neutral. Inter-annotator agreement (Krippendorff’s α) was 0.71: workable, but not great, which foreshadows the upper bound we hit.
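For readers who want to reproduce an agreement figure of this kind, here is a minimal sketch of Krippendorff's α for nominal labels. It assumes every comment carries at least two labels. The function name and the toy labels are illustrative only and are not taken from the project code.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: list of label lists, one inner list per comment."""
    # Coincidence counts: each ordered pair of labels inside a unit
    # contributes 1 / (m - 1), where m is the number of labels in the unit.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - observed / expected


# Toy example with two annotators and four comments (made-up labels).
toy = [["pos", "pos"], ["pos", "neg"], ["neg", "neg"], ["neg", "neg"]]
alpha = krippendorff_alpha_nominal(toy)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so 0.71 sits in the "usable with caution" range.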
| Class | Count | Share |
|---|---|---|
| Positive | 14,872 | 35.5% |
| Negative | 16,041 | 38.3% |
| Neutral | 10,981 | 26.2% |
| Total | 41,894 | 100% |
A real challenge: 47% of comments mix two scripts within a single sentence (e.g. vaccine ta khub kharap — “the vaccine is very bad”). Standard English tokenisers fall over on these. We built a small transliteration step that normalises Banglish to one of the two source scripts before downstream processing.
Preprocessing pipeline
- Strip URLs, mentions, emoji.
- Normalise Banglish → Bengali script via a rule-based mapper.
- Tokenise (`bnlp_toolkit` for Bengali, `nltk` for English).
- Remove stop words using a custom list tuned to vaccine discourse.
- Lemmatise (English) / stem (Bengali).
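The first two steps of the pipeline above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: the regular expressions are rough, and `BANGLISH_TO_BENGALI` is a hypothetical four-word lookup standing in for the much larger rule-based mapper.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Rough emoji coverage: pictograph blocks, misc symbols, flags, variation selector.
# The zero-width joiner is deliberately left alone because Bengali script uses it.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F]"
)

# Hypothetical toy lookup; a real mapper needs rules, not a word list.
BANGLISH_TO_BENGALI = {
    "ta": "টা",
    "khub": "খুব",
    "kharap": "খারাপ",
    "bhalo": "ভালো",
}


def clean(text):
    """Strip URLs, mentions and emoji, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def normalise_banglish(text):
    """Replace known romanised tokens with Bengali script; keep the rest."""
    return " ".join(BANGLISH_TO_BENGALI.get(tok.lower(), tok) for tok in text.split())


raw = "vaccine ta khub kharap 😡 https://example.com @someone"
cleaned = clean(raw)
normalised = normalise_banglish(cleaned)
```

Unknown tokens such as "vaccine" pass through unchanged, which is the behaviour you want for genuine English words in a mixed sentence.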
Models
We tried four families and compared on the same train / test split (80/20, stratified by class).
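A stratified 80/20 split keeps each class at roughly the same share in both parts. The sketch below shows the idea in plain Python. The function name, the seed and the toy labels are assumptions for illustration. In practice scikit-learn's `train_test_split` with its `stratify` argument does the same job.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with class shares preserved in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels, not the real corpus.
labels = ["positive"] * 50 + ["negative"] * 30 + ["neutral"] * 20
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
```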
Classical baselines
A Logistic Regression on TF-IDF vectors and a Multinomial Naive Bayes — fast, interpretable, surprisingly hard to beat on short text.
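To make the feature side concrete, here is a small TF-IDF sketch in plain Python. The exact weighting used in the project is not stated, so this assumes raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalisation, which matches scikit-learn's `TfidfVectorizer` defaults. The toy documents are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalise so document length does not dominate.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [
    ["vaccine", "good"],
    ["vaccine", "bad"],
    ["vaccine", "vaccine", "safe"],
]
vectors = tfidf(docs)
```

In the first toy document "good" gets a higher weight than "vaccine" because "vaccine" appears in every document. That is the property that lets a linear model pick out sentiment words in a corpus where every comment mentions the topic.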
Ensembles
Random Forest and an XGBoost stack on the same TF-IDF features.
Deep learning
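LSTM with Word2Vec embeddings trained on a 1.4M-comment unlabelled Bengali corpus. The math is the usual one: for an input sequence $x_1, \dots, x_T$ the LSTM cell at step $t$ computes, in the standard formulation:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Here $\sigma$ is the logistic sigmoid and $\odot$ is element-wise multiplication.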
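As a rough illustration of a standard LSTM cell with a softmax head over three classes, here is a NumPy sketch. It is not the trained model: the dimensions, the random weights and the gate stacking order are all assumptions chosen for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,), gates stacked as [i, f, o, g]."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate cell state
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


rng = np.random.default_rng(0)
D, H, T = 8, 4, 5  # hypothetical embedding size, hidden size, sequence length
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
W_out = rng.normal(scale=0.1, size=(3, H))  # head over positive / negative / neutral

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(T, D)):  # stand-in for a sequence of word embeddings
    h, c = lstm_step(x_t, h, c, W, U, b)
probs = softmax(W_out @ h)
```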
The final hidden state feeds a softmax head over the three classes. Training: Adam, , dropout 0.3, early stopping on validation loss with patience 5.
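Early stopping with patience 5 means training halts once five consecutive epochs pass without a new best validation loss. A minimal sketch of that rule, with a hypothetical function name and made-up loss values:

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # `patience` epochs have passed since the last improvement.
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: the best value is at epoch 2.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74, 0.75, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(losses)
```

The weights one would keep are those from `best_epoch`, not from the epoch at which training stopped.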
Results
Per-class performance (best model)
Training curves
What the model gets wrong
Two failure modes dominate:
- Sarcasm. Comments like “great, another miracle vaccine that doesn’t work” flip on the word “great” and our model leans positive.
- Code-switching with sentiment-bearing English. A negative Bengali sentence with a positive English interjection (“vaccine ta nojor lage, but actually great move by govt”) confuses the embedding-level signal.
A transformer pretrained on Bengali (BanglaBERT) would likely close most of this gap: on a small held-out audit it reaches 82.4% on the same split. Shipping the LSTM was a deployment-cost call, since at inference time the LSTM is ~12× cheaper.
What this is good for
- Policy briefings — class share over time, by region, by topic.
- Early-warning — a 10% week-over-week negative jump on a single keyword cluster (e.g. side-effect) is a real signal worth a press response.
- Targeted public-health messaging — the negative class clusters cleanly into three sub-themes (efficacy, safety, religious objection); each wants a different counter-message.
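The first two uses above reduce to tracking class share per week and flagging sharp rises. Here is a minimal sketch. It assumes the 10% threshold means ten percentage points of negative share, which is one possible reading, and it uses invented records of the form (week, predicted label).

```python
from collections import defaultdict


def weekly_negative_share(records):
    """records: iterable of (week, label). Returns {week: share labelled 'negative'}."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for week, label in records:
        totals[week] += 1
        if label == "negative":
            negatives[week] += 1
    return {week: negatives[week] / totals[week] for week in sorted(totals)}


def flag_jumps(shares, threshold=0.10):
    """Weeks whose negative share rose by at least `threshold` (absolute) over the previous week."""
    weeks = sorted(shares)
    return [
        cur for prev, cur in zip(weeks, weeks[1:])
        if shares[cur] - shares[prev] >= threshold
    ]


# Invented predictions for one keyword cluster over three weeks.
records = (
    [(1, "negative")] * 3 + [(1, "positive")] * 7
    + [(2, "negative")] * 5 + [(2, "neutral")] * 5
    + [(3, "negative")] * 5 + [(3, "positive")] * 5
)
shares = weekly_negative_share(records)
alerts = flag_jumps(shares)
```

Running the same two functions per region or per keyword cluster gives the breakdowns a briefing would need.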
The dataset is open and the code lives on GitHub. A follow-up paper on BanglaBERT vs LSTM is in progress.