3  TF-IDF + Logistic Regression Modeling

3.1 Objective

We aim to build a baseline text classification model that predicts the sentiment toward a given entity in a tweet.
The TF-IDF score for a term \(t\) in a document \(d\) is given by:

\[ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{\text{DF}(t)}\right) \]

where:

  • \(\text{TF}(t, d)\) : Term frequency of term \(t\) in document \(d\)
  • \(\text{DF}(t)\) : Number of documents containing term \(t\)
  • \(N\) : Total number of documents
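As a quick worked example with made-up numbers: if a term appears 3 times in a tweet and occurs in 10 out of \(N = 1000\) documents, then, using the natural logarithm,

\[ \text{TF-IDF}(t, d) = 3 \times \log\left(\frac{1000}{10}\right) = 3 \times \log(100) \approx 13.8 \]

so rare terms receive higher weights than terms that appear in many documents. (Scikit-learn's TfidfVectorizer applies a smoothed variant of the IDF term and normalizes each document vector, so its exact values differ slightly.)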

The model performs 4-class classification, identifying whether the sentiment is Positive, Negative, Neutral, or Irrelevant.

For multi-class classification, multinomial logistic regression uses the softmax function (a short sketch follows the definitions below):

\[ P(y = k \mid \mathbf{x}) = \frac{e^{\mathbf{w}_k^\top \mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^\top \mathbf{x}}} \]

Where:

  • \(\mathbf{x}\) : The input feature vector (TF-IDF transformed)
  • \(\mathbf{w}_k\) : The weight vector for class \(k\)
  • \(K\) : The number of classes (4 in our case)
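To make the formula concrete, here is a minimal, self-contained sketch (the class scores are made up purely for illustration) of how softmax turns the four per-class scores \(\mathbf{w}_k^\top \mathbf{x}\) into probabilities:

Code
import numpy as np

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

# Hypothetical raw scores for (Irrelevant, Negative, Neutral, Positive)
scores = np.array([0.5, 2.0, 0.1, 1.2])
probs = softmax(scores)
print(probs)           # approx. [0.12 0.55 0.08 0.25] -- sums to 1
print(probs.argmax())  # 1 -> the highest-probability class (Negative here)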

3.2 Baseline: Logistic Regression with TF-IDF

Scikit-learn's LogisticRegression handles multi-class problems natively: with the default lbfgs solver, recent versions fit a single multinomial (softmax) model over all classes, while older versions defaulted to a one-vs-rest strategy in which a separate binary classifier is trained for each class. Either way, the model outputs a probability for each class, and the class with the highest probability is selected as the prediction.
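If the one-vs-rest behaviour is wanted explicitly rather than relying on version-dependent defaults, Scikit-learn's OneVsRestClassifier wrapper can be used. The sketch below compares the two setups on a tiny corpus made up purely for illustration:

Code
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

texts = ["love this game", "hate the update", "just installed it", "not about games"]
labels = ["Positive", "Negative", "Neutral", "Irrelevant"]
X_toy = TfidfVectorizer().fit_transform(texts)

# Native multi-class: a single softmax model over all four classes
multinomial = LogisticRegression(max_iter=1000).fit(X_toy, labels)

# Explicit one-vs-rest: one binary classifier per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_toy, labels)

print(multinomial.predict_proba(X_toy[:1]))  # one row of four probabilities summing to 1
print(ovr.predict(X_toy[:1]))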

3.2.1 Modeling Pipeline Summary

We build a baseline sentiment classification pipeline using Scikit-learn, consisting of:

3.2.1.1 TF-IDF Vectorization

  • Remove English stopwords
  • Limit vocabulary to the top 10,000 terms
  • Use unigrams and bigrams (ngram_range=(1, 2)); see the toy example below
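As a toy illustration of these settings (the tweet is made up), stopwords are removed before the n-grams are built:

Code
from sklearn.feature_extraction.text import TfidfVectorizer

# Same settings as the pipeline below, applied to a single made-up tweet
vectorizer = TfidfVectorizer(stop_words="english", max_features=10000, ngram_range=(1, 2))
vectorizer.fit(["I love this new game"])

# "I" and "this" are dropped; unigrams and bigrams come from the remaining tokens
print(vectorizer.get_feature_names_out())
# ['game' 'love' 'love new' 'new' 'new game']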

3.2.1.2 Logistic Regression Classifier

  • A linear model used for multi-class classification
  • max_iter = 1000 gives the solver enough iterations to converge on the large, sparse feature set

3.2.1.3 Evaluation

  • Dataset split: 80/20 train-test, stratified by sentiment label
  • Evaluation via classification_report (Precision, Recall, F1-score for each class)
Code
# Load & Preprocess Data
import pandas as pd
from sklearn.model_selection import train_test_split

col_names = ["id", "entity", "sentiment", "tweet"]

train = pd.read_csv("data/twitter_training.csv", header=None, names=col_names)
train = train.dropna(subset=["tweet"])
train = train[train["tweet"].str.strip().astype(bool)]
X = train["tweet"]
y = train["sentiment"]
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

model = make_pipeline(
    TfidfVectorizer(stop_words='english',max_features=10000, ngram_range=(1,2)),
    LogisticRegression(max_iter=1000)
)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

#print(classification_report(y_test, y_pred))

Here is the evaluation result for the Logistic Regression (TF-IDF) baseline model:

Class          Precision   Recall   F1-score   Support
Irrelevant     0.75        0.60     0.67       2568
Negative       0.77        0.81     0.79       4463
Neutral        0.75        0.70     0.72       3610
Positive       0.70        0.79     0.75       4124
Accuracy                            0.74       14765
Macro avg      0.74        0.73     0.73       14765
Weighted avg   0.74        0.74     0.74       14765

3.3 Save Trained Model for Interpretation

After training, we save the entire pipeline (including the TF-IDF vectorizer and the logistic regression classifier) using joblib. This serialized model will be used later for interpretability analysis with tools like LIME or SHAP.

Code
import joblib

# Save pipeline
joblib.dump(model, "scripts/baseline_pipeline.pkl")
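To verify the round trip, the saved pipeline can be reloaded and applied directly to raw text (the example tweet is made up):

Code
import joblib

# Reload the full pipeline (TF-IDF vectorizer + classifier) from disk
loaded = joblib.load("scripts/baseline_pipeline.pkl")

# The pipeline handles vectorization internally, so raw strings go straight in
print(loaded.predict(["I absolutely love this game!"]))  # e.g. ['Positive']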