Interpreting NLP Models: A Stability and Explainability Comparison of BERT and Logistic Regression
0.1 Background and Motivation
As NLP models are increasingly deployed in applications such as sentiment analysis and public opinion tracking, understanding the reasoning behind their predictions has become critical—especially for short, informal, and noisy text like tweets.
In this project, we explore the interpretability of a fine-tuned BERT model—a high-performing yet opaque deep learning model—by comparing it to a much simpler, inherently interpretable linear model based on TF-IDF and Logistic Regression.
We apply two widely-used model explanation techniques, LIME and SHAP, to examine and contrast the reasoning processes behind these models’ predictions. Our goal is to assess whether the accuracy advantage offered by deep models like BERT comes at a significant cost to interpretability, and to what extent this trade-off manifests in real-world, noisy data. By analyzing explanation stability, word importance attribution, and human interpretability, we aim to understand the practical implications of using black-box models in socially sensitive NLP applications.
0.2 Research Questions
This project aims to answer the following research questions:
- To what extent do LIME and SHAP produce consistent and meaningful explanations for short, noisy texts such as tweets?
- How does the interpretability of a simple, transparent model (TF-IDF + Logistic Regression) compare to that of a complex, high-performing model (fine-tuned BERT)?
- Are the explanations generated by each model stable under small perturbations of the input text?
- Does the increase in predictive accuracy offered by deep models like BERT come at a significant cost to explanation quality?
- How can visualizations support the comparison of interpretability across different models and explanation methods?
0.3 Methodology Overview
Our analysis is structured into the following components, corresponding directly to the report’s chapter structure:
TF-IDF + Logistic Regression Modeling
We construct a transparent linear baseline model using TF-IDF vectorization and Logistic Regression for sentiment classification on tweets.
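A minimal sketch of this baseline is shown below, assuming the tweets and their sentiment labels have already been loaded into `texts` and `labels`; the variable names and hyperparameters are illustrative rather than the exact configuration used in the report.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# texts: list[str] of tweets, labels: list[str] of sentiment labels (assumed loaded earlier)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),  # sparse unigram/bigram features
    ("clf", LogisticRegression(max_iter=1000)),                # transparent linear classifier
])
baseline.fit(X_train, y_train)
print("held-out accuracy:", baseline.score(X_test, y_test))
```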
Fine-tuned BERT Model for Sentiment Classification
A deep contextual model is built by fine-tuning a pretrained BERT model, offering stronger performance but lower inherent interpretability.
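One way this fine-tuning can be set up with the Hugging Face `transformers` Trainer is sketched below; `train_ds` and `eval_ds` are assumed to be `datasets.Dataset` objects with a `text` column and an integer-encoded `label` column, and the hyperparameters are placeholders rather than the values used in the report.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # Positive / Neutral / Negative

def tokenize(batch):
    # Tweets are short, so a small max_length keeps fine-tuning inexpensive
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

train_tok = train_ds.map(tokenize, batched=True)  # train_ds / eval_ds: datasets.Dataset (assumed)
eval_tok = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-tweet-sentiment", num_train_epochs=2,
                         per_device_train_batch_size=32, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=train_tok, eval_dataset=eval_tok)
trainer.train()
```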
LIME Interpretability Evaluation for BERT
We apply LIME to the BERT model to extract local word-level explanations and assess how well linear approximations capture contextual semantics.
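Because LIME only needs a probability function over raw strings, the fine-tuned model can be wrapped as in the sketch below; `model` and `tokenizer` refer to the objects from the fine-tuning sketch, and the class-name ordering is an assumption that must match the model's label ids.

```python
import torch
from lime.lime_text import LimeTextExplainer

model = model.to("cpu").eval()  # run explanations on CPU for simplicity

def bert_predict_proba(texts):
    # LIME passes a list of perturbed strings; return an (n_samples, n_classes) array
    enc = tokenizer(list(texts), truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)
    return probs.numpy()

explainer = LimeTextExplainer(class_names=["Negative", "Neutral", "Positive"])
exp = explainer.explain_instance("I really love the new update!",
                                 bert_predict_proba, num_features=10, top_labels=1)
top_label = exp.available_labels()[0]
print(exp.as_list(label=top_label))  # [(word, weight), ...] for the predicted class
```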
LIME Interpretability for Logistic Regression
We use LIME on the linear baseline to validate how well it reflects token-level contributions in an inherently interpretable model.
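For the linear baseline the same interface applies directly, since the sklearn pipeline already maps raw strings to class probabilities; in the sketch below, `baseline` is the pipeline from the TF-IDF section.

```python
from lime.lime_text import LimeTextExplainer

lr_explainer = LimeTextExplainer(
    class_names=list(baseline.named_steps["clf"].classes_))  # label order taken from the model
lr_exp = lr_explainer.explain_instance("I really love the new update!",
                                       baseline.predict_proba,
                                       num_features=10, top_labels=1)
print(lr_exp.as_list(label=lr_exp.available_labels()[0]))
```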
SHAP Interpretability Evaluation for BERT
We employ SHAP to examine BERT predictions and quantify feature importance through additive token attributions based on Shapley values.
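A sketch of this step using SHAP's support for `transformers` text-classification pipelines is given below; `model` and `tokenizer` come from the fine-tuning sketch, and `top_k=None` (the newer spelling of `return_all_scores=True`) makes the pipeline return probabilities for every class, which SHAP expects for multi-output models.

```python
import shap
from transformers import pipeline

clf_pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, top_k=None)
bert_shap = shap.Explainer(clf_pipe)                      # Partition explainer over text spans
shap_values = bert_shap(["I really love the new update!"])
shap.plots.text(shap_values[0])                           # additive per-token attributions per class
```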
SHAP Interpretability for Logistic Regression
We use SHAP to explain the linear model's predictions, checking their consistency with the model coefficients and assessing feature-level impacts.
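For the linear model, one option is SHAP's model-agnostic Partition explainer with a word-level text masker, so that attributions land on individual words and can be cross-checked against the Logistic Regression coefficients; the regex tokenizer and example tweet below are illustrative.

```python
import shap

# Split the tweet into word-level features; the Partition explainer then
# attributes the pipeline's predicted probabilities to those words.
masker = shap.maskers.Text(r"\W+")
lr_shap = shap.Explainer(baseline.predict_proba, masker,
                         output_names=list(baseline.named_steps["clf"].classes_))
lr_values = lr_shap(["I really love the new update!"])
shap.plots.text(lr_values[0])

# Consistency check (qualitative): words with the largest attributions should
# correspond to large coefficients in baseline.named_steps["clf"].coef_.
```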
Perturbation Stability: Comparing LIME and SHAP for Logistic Regression and BERT
We apply synonym replacement to input tweets and measure the Jaccard similarity of the resulting explanations to evaluate how robust LIME and SHAP are under input variation.
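The stability metric itself is simple to state; the sketch below compares the top-k attributed words before and after perturbation, where `orig_expl` and `pert_expl` are placeholders for the (word, weight) lists produced by LIME or SHAP for the original and synonym-perturbed tweet.

```python
def top_k_words(word_weights, k=5):
    """word_weights: [(word, weight), ...]; keep the k words with largest |weight|."""
    ranked = sorted(word_weights, key=lambda ww: abs(ww[1]), reverse=True)
    return {word.lower() for word, _ in ranked[:k]}

def jaccard(a, b):
    # Jaccard similarity of two sets: |A ∩ B| / |A ∪ B| (defined as 1.0 if both are empty)
    return len(a & b) / len(a | b) if (a | b) else 1.0

stability = jaccard(top_k_words(orig_expl), top_k_words(pert_expl))
print(f"Jaccard@5: {stability:.2f}")  # 1.0 = identical top-5 words, 0.0 = disjoint
```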
Conclusion
We summarize key findings on interpretability, robustness, and the trade-off between performance and explainability in modern NLP pipelines.
Each section examines word-level feature attribution, visualization (bar charts and word clouds), and explanation consistency. This structure supports a clear comparison between simple and complex models, and between model-agnostic and model-aware interpretability methods, specifically in the context of short, noisy social media text.
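As an illustration of these visualizations, the (word, weight) pairs returned by any of the explainers above can be rendered as a bar chart and a word cloud; the sketch below reuses `exp` from the LIME example and assumes the `wordcloud` package is installed.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

word_weights = exp.as_list(label=exp.available_labels()[0])  # e.g., from the LIME sketch above
words, weights = zip(*word_weights)

plt.barh(words, weights)              # signed contribution of each word to the predicted class
plt.xlabel("attribution weight")
plt.tight_layout()
plt.show()

wc = WordCloud(width=600, height=400, background_color="white")
wc.generate_from_frequencies({w: abs(v) for w, v in word_weights})  # size by |importance|
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```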
0.4 Dataset
We use the Twitter Entity Sentiment Analysis dataset, in which each example consists of:
- A tweet
- A named entity mentioned in the tweet
- A sentiment label (Positive, Neutral, Negative, Irrelevant) reflecting public opinion
This is an entity-level sentiment analysis task: given a message and an entity, the goal is to classify the sentiment of the message toward that entity. Messages that are not relevant to the entity (labeled Irrelevant) are treated as Neutral in our implementation.
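A minimal sketch of this preprocessing, assuming the Kaggle CSV layout (tweet id, entity, sentiment, tweet text) with no header row; the file name and column names are assumptions and may need adjusting.

```python
import pandas as pd

df = pd.read_csv("twitter_training.csv",
                 names=["tweet_id", "entity", "sentiment", "text"])
df["text"] = df["text"].astype(str)
df["sentiment"] = df["sentiment"].replace({"Irrelevant": "Neutral"})  # fold Irrelevant into Neutral
texts, labels = df["text"].tolist(), df["sentiment"].tolist()  # the lists assumed in the baseline sketch
```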
0.5 Date
April 2025