2  Dataset Overview

2.1 Dataset Description

This project uses the Twitter Entity Sentiment Analysis dataset. It consists of two files:

  • twitter_training.csv: Main training dataset
  • twitter_validation.csv: Validation dataset

Each row contains:

  • an ID
  • a target entity
  • a sentiment label: Positive, Neutral, Negative, or Irrelevant
  • the tweet text

The task is to predict the sentiment expressed toward the entity.


2.2 Sample Records (Training Set)

We display a few sample records from the training set to get a sense of what the tweets and associated sentiment labels look like. This is helpful for qualitative understanding of the input data before preprocessing.

Code
import pandas as pd

# Define column names
col_names = ["id", "entity", "sentiment", "tweet"]

# Load CSVs with no header row
train = pd.read_csv("data/twitter_training.csv", header=None, names=col_names)
valid = pd.read_csv("data/twitter_validation.csv", header=None, names=col_names)

train.sample(5)[["tweet", "entity", "sentiment"]]
tweet entity sentiment
35197 It seems like a lot of the higher influencers ... Microsoft Positive
12070 Just trying to the spread love NBA2K Positive
9030 @CODLeague when I watch the NFL I dont see NBA... Overwatch Negative
50037 Great performances from @loucollsport learners... FIFA Irrelevant
37016 His playing day for Microsoft and... 7 years i... Microsoft Positive
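Before going further, a quick sanity check on the loaded frames can catch schema problems early. The sketch below runs the same checks on a tiny hand-made frame with the expected columns; the rows are illustrative, not taken from the dataset:

```python
import pandas as pd

# Toy frame mimicking the dataset schema (values are made up for illustration)
train = pd.DataFrame({
    "id": [1, 2, 3],
    "entity": ["Microsoft", "FIFA", "Overwatch"],
    "sentiment": ["Positive", "Irrelevant", "Negative"],
    "tweet": ["great quarter", "nice match", "servers down again"],
})

# Check the column layout and that every label is in the known vocabulary
expected_labels = {"Positive", "Neutral", "Negative", "Irrelevant"}
assert list(train.columns) == ["id", "entity", "sentiment", "tweet"]
assert set(train["sentiment"]).issubset(expected_labels)
print(train.shape)  # (3, 4)
```

On the real frames, the same assertions confirm the CSVs were parsed with the intended columns before any cleaning.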

2.3 Data Cleaning

Before training, we clean the dataset so that the models receive consistent, well-formed text.

2.3.1 Remove Missing Data

In this step, we remove rows with missing or empty tweet text to ensure clean inputs for training. Below is the count of missing values in each column.

Code
# Count missing values per column
print(train.isnull().sum())

# Remove rows with missing or whitespace-only tweets
train = train.dropna(subset=["tweet"])
train = train[train["tweet"].str.strip().astype(bool)]
id             0
entity         0
sentiment      0
tweet        686
dtype: int64
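The same two filters should also be applied to the validation set so that both splits receive identical preprocessing. A minimal sketch on toy data (the rows below are made up for illustration, covering both a missing and a whitespace-only tweet):

```python
import pandas as pd

# Toy validation frame: row 2 has a missing tweet, row 3 is whitespace-only
valid = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "entity": ["NBA2K", "FIFA", "Microsoft", "Overwatch"],
    "sentiment": ["Positive", "Neutral", "Negative", "Positive"],
    "tweet": ["love this game", None, "   ", "fun update"],
})

# Drop missing tweets, then drop tweets that are empty after stripping
valid = valid.dropna(subset=["tweet"])
valid = valid[valid["tweet"].str.strip().astype(bool)]
print(len(valid))  # 2
```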

2.3.2 Remove Emojis

In this step, we remove emojis from tweets to clean the text and ensure consistent tokenization during vectorization. Here are a few examples:

Code
import emoji

# Define cleaning function
def clean_text(text):
    no_emoji = emoji.replace_emoji(text, replace='')
    return no_emoji.encode("utf-8", "ignore").decode("utf-8", "ignore")

# Apply to training set
train["tweet"] = train["tweet"].apply(clean_text)

# Example tweets before and after cleaning
samples = [
    "I'm so happy today! 😄🎉",
    "Great job! 💯🔥",
    "This is weird... 🤔🙃",
    "Just finished my code 🐍💻"
]

max_len = max(len(s) for s in samples)

for s in samples:
    cleaned = clean_text(s)
    print(f"{s.ljust(max_len)}  →  {cleaned}")
I'm so happy today! 😄🎉    →  I'm so happy today! 
Great job! 💯🔥             →  Great job! 
This is weird... 🤔🙃       →  This is weird... 
Just finished my code 🐍💻  →  Just finished my code 
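If the `emoji` package is unavailable, a regex over emoji codepoint blocks is a rough substitute. The ranges below are an assumption covering common blocks only, not every emoji:

```python
import re

# Regex over common emoji codepoint blocks (not exhaustive)
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # misc symbols, pictographs, supplemental
    "\U00002600-\U000027BF"  # misc symbols and dingbats
    "\U0001F000-\U0001F0FF"  # mahjong, domino, and playing-card tiles
    "]+"
)

def strip_emoji(text):
    """Remove characters in the emoji blocks above."""
    return EMOJI_RE.sub("", text)

print(strip_emoji("Great job! 💯🔥"))  # Great job!
```

The `emoji` package is the more robust choice since it tracks the full Unicode emoji list; the regex is only a dependency-free fallback.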

2.4 Basic Statistics

We explore the basic statistics of the dataset, including class distributions and dataset sizes. This helps us understand potential class imbalance and verify the dataset was loaded correctly.

Code
sentiment_counts = train["sentiment"].value_counts().reset_index()
sentiment_counts.columns = ["Sentiment", "Count"]
sentiment_counts.index.name = "Index"

sentiment_counts.style.set_table_styles(
    [{"selector": "th", "props": [("text-align", "center")]}]
).set_properties(**{
    'text-align': 'center',
    'border': '1px solid lightgrey',
    'background-color': '#f9f9f9'
}).hide(axis="index")
Sentiment Count
Negative 22312
Positive 20619
Neutral 18051
Irrelevant 12842

Based on the distribution of sentiment labels, the dataset shows only mild class imbalance: the largest class (Negative, 22,312) is roughly 1.7 times the size of the smallest (Irrelevant, 12,842).
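The degree of imbalance can be quantified directly from the counts in the table above. A small sketch computing class proportions and the majority-to-minority ratio:

```python
import pandas as pd

# Class counts as reported in the table above
counts = pd.Series(
    {"Negative": 22312, "Positive": 20619, "Neutral": 18051, "Irrelevant": 12842}
)

# Share of each class and the ratio of the largest to the smallest class
proportions = counts / counts.sum()
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(f"majority/minority ratio: {imbalance_ratio:.2f}")  # 1.74
```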

2.5 Sentiment Distribution (Bar Chart)

Code
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
sns.countplot(data=train, x="sentiment")
plt.title("Sentiment Distribution in Training Set")
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.grid(True)
plt.tight_layout()
plt.show()