4  Fine-tuned BERT Model for Sentiment Classification

4.1 Objective

In this section, we fine-tune a BERT model for entity-level sentiment classification on tweets.

BERT (Bidirectional Encoder Representations from Transformers) [Devlin et al., 2018] is a pre-trained transformer-based language model that captures contextual information from both directions (left and right). It has been widely adopted in various NLP tasks due to its deep understanding of sentence semantics.

During fine-tuning, we adapt the pre-trained BERT model to our specific classification task by adding a classification head on top of the [CLS] token representation. The entire model is then trained end-to-end on our labeled dataset, and a softmax output layer predicts one of the four sentiment classes (Positive, Neutral, Negative, Irrelevant).

Given the final hidden vector of the [CLS] token \(\mathbf{h}_{\text{[CLS]}}\), the probability of class \(k\) (for \(k = 1, \dots, K\), with \(K = 4\)) is computed as:

\[ P(y = k \mid \mathbf{h}_{\text{[CLS]}}) = \frac{e^{\mathbf{w}_k^\top \mathbf{h}_{\text{[CLS]}}}}{\sum_{j=1}^K e^{\mathbf{w}_j^\top \mathbf{h}_{\text{[CLS]}}}} \]
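For concreteness, here is a minimal PyTorch sketch of this classification head, assuming the 768-dimensional hidden size of bert-base and \(K = 4\) classes; the weights are randomly initialized purely for illustration (note that nn.Linear also adds a bias term not shown in the formula above).

Code
import torch
import torch.nn as nn

# Hypothetical [CLS] representation from BERT (hidden size 768 for bert-base)
h_cls = torch.randn(1, 768)

# Classification head: one linear layer producing a logit per class, then softmax
classifier = nn.Linear(768, 4)           # weight vectors w_k (plus a bias term)
logits = classifier(h_cls)               # w_k^T h_[CLS] for each class k
probs = torch.softmax(logits, dim=-1)    # P(y = k | h_[CLS])

print(probs, probs.sum())                # four probabilities summing to 1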

This model serves as our primary benchmark for comparison with the logistic regression baseline and for subsequent interpretability analysis using LIME.


4.2 Load Dataset and Preprocess

In this section, we load the training and validation datasets, remove empty tweets, and map sentiment labels to integers: Positive → 0, Neutral → 1, Negative → 2, Irrelevant → 3. This prepares the data for model training.

Code
import pandas as pd

train_df = pd.read_csv("data/twitter_training.csv", header=None, names=["id", "entity", "sentiment", "tweet"])
val_df = pd.read_csv("data/twitter_validation.csv", header=None, names=["id", "entity", "sentiment", "tweet"])

# Drop rows with missing tweets
train_df = train_df.dropna(subset=["tweet"])
val_df = val_df.dropna(subset=["tweet"])

# Drop tweets that are empty after stripping whitespace
train_df = train_df[train_df["tweet"].str.strip().astype(bool)]
val_df = val_df[val_df["tweet"].str.strip().astype(bool)]

# Map sentiment to label
label_map = {
    "Positive": 0,
    "Neutral": 1,
    "Negative": 2,
    "Irrelevant": 3
}

train_df["label"] = train_df["sentiment"].map(label_map)
val_df["label"] = val_df["sentiment"].map(label_map)

4.3 Tokenization with Hugging Face

We use Hugging Face’s AutoTokenizer to tokenize tweets into model-ready inputs, including input_ids and attention_mask. The dataset is split and preprocessed in batches using the datasets library.

Code
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["tweet"], padding="max_length", truncation=True, max_length=128)

from datasets import Dataset

# Build a Dataset from the training frame and hold out 20% for evaluation
dataset = Dataset.from_pandas(train_df[["tweet", "label"]])
dataset = dataset.train_test_split(test_size=0.2)
dataset = dataset.map(tokenize, batched=True)
dataset = dataset.map(lambda x: {"label": int(x["label"])})  # ensure integer labels
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
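To confirm the tokenizer produced the expected fields, a short illustrative check of one tokenized example (not part of the training pipeline):

Code
# Inspect a single tokenized example
example = dataset["train"][0]
print(example["input_ids"].shape)        # torch.Size([128]) due to max_length padding
print(example["attention_mask"].shape)   # torch.Size([128])
print(tokenizer.decode(example["input_ids"][:20]))  # first tokens, starting with [CLS]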

4.4 Model Setup: BERT + Classification Head

We load a pretrained bert-base-uncased model with a classification head (AutoModelForSequenceClassification) to perform four-class sentiment prediction. The model is moved to GPU or CPU depending on availability.

Code
from transformers import AutoModelForSequenceClassification
import torch
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=4, bias=True)
)

4.5 Training with Trainer API

The model is fine-tuned using Hugging Face’s Trainer API, which simplifies training and evaluation by managing data loading, loss computation, gradient updates, and metric reporting.

Code
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(pred):
    preds = pred.predictions.argmax(axis=-1)
    labels = pred.label_ids
    acc = accuracy_score(labels, preds)
    prf = precision_recall_fscore_support(labels, preds, average='macro')
    return {
        "accuracy": acc,
        "precision": prf[0],
        "recall": prf[1],
        "f1": prf[2]
    }


training_args = TrainingArguments(
    output_dir="./bert_model",
    eval_strategy="epoch", 
    save_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

# trainer.train()  # uncomment to fine-tune; the fine-tuned model is saved in Section 4.6 and reloaded in Section 4.7
C:\Users\16925\AppData\Local\Temp\ipykernel_34876\3269255061.py:35: FutureWarning:

`tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
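The FutureWarning above is triggered by passing tokenizer= to Trainer. On transformers versions that already support it, the equivalent warning-free construction would look roughly like this (only the keyword changes):

Code
# Same Trainer setup, using processing_class instead of the deprecated tokenizer argument
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
    processing_class=tokenizer,
)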

4.6 Save Model

After training, we save the fine-tuned model and tokenizer so they can be reloaded later without retraining.

Code
# trainer.evaluate()

# Save the fine-tuned model and tokenizer for later reuse
# (the Trainer saves the model via save_model; the tokenizer is saved separately)
# trainer.save_model("scripts/bert_model4")
# tokenizer.save_pretrained("scripts/bert_model4")

4.7 Reload and Evaluate Model

We reload the saved fine-tuned model and tokenizer from disk, rebuild the Trainer with the same arguments, and evaluate on the held-out split.

Code
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "scripts/bert_model4", local_files_only=True
)
tokenizer = AutoTokenizer.from_pretrained("scripts/bert_model4", local_files_only=True)

model.to("cuda")

trainer = Trainer(
    model=model,
    args=training_args,           # TrainingArguments
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

#trainer.evaluate()

Here is the evaluation result for the fine-tuned BERT model:

Metric                   Value
Evaluation loss          0.1038
Accuracy                 0.9683
Precision (macro)        0.9609
Recall (macro)           0.9697
F1-score (macro)         0.9679
Evaluation runtime (s)   19.2541
Samples per second       766.85
Steps per second         47.94

Here is a comparison of the fine-tuned BERT model against the Logistic Regression (TF-IDF) baseline:

Code
import matplotlib.pyplot as plt

# Metrics and values
metrics = ["Accuracy", "Precision", "Recall", "F1-score"]
logreg_scores = [0.74, 0.74, 0.74, 0.74]
bert_scores = [0.968, 0.961, 0.970, 0.968]  # matches the macro-averaged evaluation results above

x = range(len(metrics))
bar_width = 0.35

# Plot
plt.figure(figsize=(8, 5))
plt.bar([i - bar_width/2 for i in x], logreg_scores, width=bar_width, label="Logistic Regression (TF-IDF)")
plt.bar([i + bar_width/2 for i in x], bert_scores, width=bar_width, label="BERT (Fine-Tuned)")

plt.xticks(x, metrics)
plt.ylabel("Score")
plt.ylim(0.6, 1.0)
plt.title("Performance Comparison: Logistic Regression vs. BERT")
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

Key Findings:

Significant Accuracy Gain: BERT improves accuracy by roughly 23 percentage points over the baseline model (TF-IDF + Logistic Regression), from 0.74 to 0.968, indicating stronger predictive capability and better overall performance.

Balanced Precision and Recall: The BERT model maintains both high precision and recall, which suggests that it not only avoids false positives but also successfully identifies relevant instances. This balance is crucial for robust sentiment classification on noisy social media data.

Superior F1-Score: The improvement in F1-score reflects BERT’s ability to generalize across diverse sentiment classes. It achieves a better trade-off between precision and recall, minimizing both false positives and false negatives.
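Finally, since Section 4.1 notes that this model will feed a LIME-based interpretability analysis, here is a minimal sketch of the kind of prediction helper such an analysis typically needs: raw tweets in, class probabilities out. The function name and example tweet are hypothetical, and the sketch assumes the fine-tuned model and tokenizer reloaded in Section 4.7.

Code
import torch

def predict_proba(texts):
    # Illustrative helper: return class probabilities for a list of raw tweets
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=128, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1).cpu().numpy()

# Hypothetical usage
print(predict_proba(["I love this game, the new update is amazing!"]))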