
Sentiment Analysis of COVID-19 Post-Vaccination Discourse in Bangladesh

Classical ML, ensemble classification and LSTM with Word2Vec embeddings applied to Bengali / English-mixed social-media chatter — best 78.8% accuracy.

Public sentiment moves vaccination rates. We collected ~42k social-media comments posted in Bangladesh between January and December 2022 — a language-mixed stream of Bengali, romanised Bengali (Banglish), and English — and asked a simple question: can we measure it cleanly enough that a policymaker would act on it?

The corpus

Scraped from public threads on Facebook, YouTube and Twitter using the official APIs, filtered to comments mentioning at least one of 18 vaccine-related keywords. Each comment was hand-labelled by three annotators into three classes: positive, negative, neutral. Inter-annotator agreement (Krippendorff’s $\alpha$) was 0.71: workable, but not great, which foreshadows the upper bound we hit.
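For readers who want to reproduce the agreement figure on their own labels, here is a minimal sketch of Krippendorff's $\alpha$ for nominal categories, built from the standard coincidence-matrix definition. It is not the script used for the 0.71 above; the function name and the toy labels are illustrative assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of per-comment label lists, one label per annotator.
    Comments with fewer than two labels carry no agreement information
    and are skipped.
    """
    coincidences = Counter()  # (c, k) -> weighted count of label pairs
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels from two different annotators,
        # weighted by 1 / (m - 1).
        for c, k in permutations(labels, 2):
            coincidences[(c, k)] += 1.0 / (m - 1)

    totals = Counter()  # marginal total per label
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one comment with two or more labels")

    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is ever used")
    return 1.0 - observed / expected


# Toy example (made-up labels, two annotators, three comments).
toy = [["a", "a"], ["a", "b"], ["b", "b"]]
print(round(krippendorff_alpha_nominal(toy), 3))  # 0.444
```

A value of 1.0 means perfect agreement and 0 means agreement no better than chance, which is why 0.71 reads as usable but noisy.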

| Class | Count | Share |
| --- | --- | --- |
| Positive | 14,872 | 35.5% |
| Negative | 16,041 | 38.3% |
| Neutral | 10,981 | 26.2% |
| Total | 41,894 | 100% |

A real challenge: 47% of comments mix two scripts within a single sentence (e.g. vaccine ta khub kharap — “the vaccine is very bad”). Standard English tokenisers fall over on these. We built a small transliteration step that normalises Banglish to one of the two source scripts before downstream processing.
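One simple way to flag script mixing is to check which Unicode blocks a comment draws from. The sketch below is an assumption about how such a check could look, not the detector used for the 47% figure. It only sees scripts: telling romanised Bengali apart from English inside Latin-script text needs a lexicon or a language-ID model, which is not shown here.

```python
import re

# Bengali Unicode block (U+0980 to U+09FF) and basic Latin letters.
BENGALI_RE = re.compile(r"[\u0980-\u09FF]")
LATIN_RE = re.compile(r"[A-Za-z]")


def scripts_in(text):
    """Return the set of scripts ("bengali", "latin") present in `text`."""
    found = set()
    if BENGALI_RE.search(text):
        found.add("bengali")
    if LATIN_RE.search(text):
        found.add("latin")
    return found


def is_mixed_script(text):
    """True when a comment contains both Bengali and Latin characters."""
    return len(scripts_in(text)) == 2


print(is_mixed_script("ভ্যাকসিন খুব ভালো, thanks govt"))  # True
print(is_mixed_script("vaccine ta khub kharap"))          # False (Latin only)
```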

Preprocessing pipeline

  1. Strip URLs, mentions, emoji.
  2. Normalise Banglish → Bengali script via a rule-based mapper.
  3. Tokenise (bnlp_toolkit for Bengali, nltk for English).
  4. Remove a custom stop-word list tuned to vaccine discourse.
  5. Lemmatise (English) / stem (Bengali).
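The steps above can be sketched roughly as follows, using only the standard library. Everything concrete here is an assumption for illustration: the regexes, the three-entry `BANGLISH_TO_BENGALI` map, and the `STOP_WORDS` set are hypothetical stand-ins for the real rule-based mapper and stop-word list, whitespace splitting stands in for bnlp_toolkit and nltk, and step 5 (lemmatising and stemming) is left out.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Rough emoji coverage: pictographs, misc symbols, variation selector.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
LATIN_WORD_RE = re.compile(r"[A-Za-z]+")

# Hypothetical mapping rules; a real mapper would be far larger.
BANGLISH_TO_BENGALI = {"khub": "খুব", "kharap": "খারাপ", "bhalo": "ভালো"}
# Hypothetical stop words.
STOP_WORDS = {"ta", "the", "is"}
PUNCTUATION = string.punctuation + "।"  # include the Bengali danda


def normalise_banglish(text):
    """Step 2: replace known romanised words with Bengali script."""
    return LATIN_WORD_RE.sub(
        lambda m: BANGLISH_TO_BENGALI.get(m.group().lower(), m.group()), text
    )


def preprocess(comment):
    # Step 1: strip URLs, mentions, emoji.
    text = URL_RE.sub(" ", comment)
    text = MENTION_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    # Step 2: Banglish -> Bengali script.
    text = normalise_banglish(text)
    # Step 3: placeholder tokeniser (whitespace split, trim punctuation).
    tokens = [tok.strip(PUNCTUATION).lower() for tok in text.split()]
    # Step 4: drop empty tokens and stop words.
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


print(preprocess("@user vaccine ta khub kharap 😡 https://example.com/x"))
```

Whitespace splitting is used instead of a `\w+` regex because Bengali vowel signs are combining marks, and a naive word regex would cut words apart at them.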

Models

We tried four families and compared on the same train / test split (80/20, stratified by class).
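A stratified 80/20 split keeps each class's share the same in train and test. The sketch below shows the idea in plain Python; the function name and seed are assumptions, and in practice scikit-learn's `train_test_split` with its `stratify` argument does the same job.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), split per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels, not the real corpus.
labels = ["positive"] * 50 + ["negative"] * 30 + ["neutral"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 80 20
```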

Classical baselines

A Logistic Regression on TF-IDF vectors and a Multinomial Naive Bayes — fast, interpretable, surprisingly hard to beat on short text.
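To make the Naive Bayes baseline concrete, here is a small from-scratch sketch of a multinomial model with Laplace smoothing. It works on raw token counts and made-up toy data, both of which are assumptions; the class name and parameters are illustrative, not the project's code.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token counts with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.token_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_label in self.class_counts.items():
            total = sum(self.token_counts[label].values())
            score = math.log(n_label / n_docs)  # log prior
            for tok in tokens:
                if tok not in self.vocab:
                    continue  # ignore tokens never seen in training
                count = self.token_counts[label][tok]
                score += math.log(
                    (count + self.alpha) / (total + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy training data (invented for illustration).
docs = [["vaccine", "bhalo"], ["vaccine", "kharap"], ["kal", "dose"]]
labels = ["positive", "negative", "neutral"]
model = TinyMultinomialNB().fit(docs, labels)
print(model.predict(["khub", "kharap"]))  # negative
```

Part of why this family is hard to beat on short text is visible here: each comment contributes only a handful of tokens, so a few strongly polarised words decide the class.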

Ensembles

Random Forest and an XGBoost stack on the same TF-IDF features.

Deep learning

LSTM with Word2Vec embeddings trained on a 1.4M-comment unlabelled Bengali corpus. The math is the usual one: for an input sequence $x_1, \ldots, x_T$ the LSTM cell at step $t$ computes:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The final hidden state $h_T$ feeds a softmax head over the three classes. Training: Adam, $\eta = 10^{-3}$, dropout 0.3, early stopping on validation loss with patience 5.
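As a reading aid, the forward pass can be written out in NumPy. This is a sketch of the cell equations and the softmax head only, with random untrained weights. The names `W_out` and `b_out`, the dimensions, and the initialisation are assumptions, and training (Adam, dropout, early stopping) is not shown.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(input_dim, hidden_dim, n_classes, seed=0):
    """Random parameters; shapes follow the equations in the text."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "ifoc":
        p[f"W_{gate}"] = rng.normal(0.0, 0.1, (hidden_dim, input_dim))
        p[f"U_{gate}"] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        p[f"b_{gate}"] = np.zeros(hidden_dim)
    # Softmax head (hypothetical names, not from the post).
    p["W_out"] = rng.normal(0.0, 0.1, (n_classes, hidden_dim))
    p["b_out"] = np.zeros(n_classes)
    return p


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: input, forget, output gates and cell update."""
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])
    c_t = f_t * c_prev + i_t * c_tilde  # elementwise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


def classify(sequence, p):
    """Run the sequence through the cell; softmax over the final hidden state."""
    hidden_dim = p["b_i"].shape[0]
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, p)
    logits = p["W_out"] @ h + p["b_out"]
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()


params = init_params(input_dim=8, hidden_dim=16, n_classes=3)
fake_embeddings = np.random.default_rng(1).normal(size=(5, 8))  # 5 tokens
print(classify(fake_embeddings, params))  # three class probabilities
```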

Results

Test accuracy by model

| Model | Test accuracy |
| --- | --- |
| Naive Bayes | 67.1% |
| Logistic Regression | 72.4% |
| Random Forest | 74.2% |
| XGBoost | 75.6% |
| LSTM + W2V | 78.8% |

LSTM with Word2Vec wins, but the gap over a tuned XGBoost on TF-IDF is only ~3 points — for the deployment cost, that's a real trade-off.

Per-class performance (best model)

LSTM + Word2Vec, per-class F1

| Class | F1 |
| --- | --- |
| Positive | 0.81 |
| Negative | 0.83 |
| Neutral | 0.71 |

The neutral class is the hardest. Most annotator disagreement happens here, and the model inherits that ambiguity.
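For reference, per-class F1 is the harmonic mean of precision and recall computed separately for each label. A minimal sketch, with made-up labels rather than the real test set:

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy example: both errors involve the neutral class.
y_true = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
y_pred = ["positive", "neutral", "negative", "negative", "neutral", "positive"]
print(per_class_f1(y_true, y_pred))
# {'negative': 1.0, 'neutral': 0.5, 'positive': 0.5}
```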

Training curves

[Figure: LSTM training and validation accuracy (%) by epoch, epochs 1 to 25. Series: Training, Validation.]
Validation accuracy plateaus around epoch 20; further training over-fits despite dropout.
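The early-stopping rule (patience 5 on validation loss) is what cuts training off once the plateau sets in. A minimal sketch of that rule, assuming "no improvement" means the loss did not drop below the best value seen so far; the loss values are invented:

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops at the first epoch that is `patience` epochs past the
    best validation loss; if that never happens, it runs to the end.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up validation losses: best at epoch 3, then five epochs without improvement.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74, 0.75, 0.9]
print(early_stopping_epoch(losses))  # (8, 3)
```

In a real run the weights from `best_epoch` are the ones to keep, not those from the stopping epoch.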

What the model gets wrong

Two failure modes dominate:

  • Sarcasm. Comments like “great, another miracle vaccine that doesn’t work” flip on the word “great” and our model leans positive.
  • Code-switching with sentiment-bearing English. A negative Bengali sentence with a positive English interjection (“vaccine ta nojor lage, but actually great move by govt”) confuses the embedding-level signal.

A transformer pretrained on Bengali (BanglaBERT) would likely close most of this gap; on a small held-out audit it lands at 82.4% on the same split. The decision to ship the LSTM was a deployment-cost call: at inference time the LSTM is ~12× cheaper.

What this is good for

  • Policy briefings — class share over time, by region, by topic.
  • Early-warning: a 10% week-over-week jump in negative comments on a single keyword cluster (e.g. side-effect) is a real signal worth a press response.
  • Targeted public-health messaging — the negative class clusters cleanly into three sub-themes (efficacy, safety, religious objection); each wants a different counter-message.
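The early-warning use case above can be sketched as a week-over-week check on the negative share for one keyword cluster. The post does not say whether the 10% is relative or in percentage points; this sketch assumes a relative rise in negative share, and the function names, threshold handling, and counts are all illustrative.

```python
def negative_share(counts):
    """Fraction of comments labelled negative in one week's class counts."""
    total = sum(counts.values())
    return counts.get("negative", 0) / total if total else 0.0


def weekly_alerts(weekly_counts, threshold=0.10):
    """Return indices of weeks whose negative share rose by at least
    `threshold` (relative) over the previous week."""
    alerts = []
    for week in range(1, len(weekly_counts)):
        prev = negative_share(weekly_counts[week - 1])
        cur = negative_share(weekly_counts[week])
        if prev > 0 and (cur - prev) / prev >= threshold:
            alerts.append(week)
    return alerts


# Made-up weekly class counts for a single keyword cluster.
weeks = [
    {"positive": 50, "negative": 30, "neutral": 20},
    {"positive": 49, "negative": 31, "neutral": 20},
    {"positive": 40, "negative": 40, "neutral": 20},
]
print(weekly_alerts(weeks))  # [2]
```

Switching to a percentage-point rule is a one-line change: compare `cur - prev` against the threshold instead of dividing by `prev`.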

The dataset is open and the code lives on GitHub. A follow-up paper on BanglaBERT vs LSTM is in progress.