2  Dataset Overview

2.1 Dataset Description

This project uses the Twitter Entity Sentiment Analysis dataset. It consists of two files:

  • twitter_training.csv: Main training dataset
  • twitter_validation.csv: Validation dataset

Each row contains:

  • an ID
  • a target entity
  • a sentiment label: Positive, Neutral, Negative, or Irrelevant
  • the tweet text

The task is to predict the sentiment expressed toward the entity.


2.2 Sample Records (Training Set)

We display a few sample records from the training set to get a sense of what the tweets and associated sentiment labels look like. This is helpful for qualitative understanding of the input data before preprocessing.

Code
import pandas as pd

# Define column names
col_names = ["id", "entity", "sentiment", "tweet"]

# Load CSVs with no header row
train = pd.read_csv("data/twitter_training.csv", header=None, names=col_names)
valid = pd.read_csv("data/twitter_validation.csv", header=None, names=col_names)

train.sample(5)[["tweet", "entity", "sentiment"]]
tweet entity sentiment
35197 It seems like a lot of the higher influencers ... Microsoft Positive
12070 Just trying to the spread love NBA2K Positive
9030 @CODLeague when I watch the NFL I dont see NBA... Overwatch Negative
50037 Great performances from @loucollsport learners... FIFA Irrelevant
37016 His playing day for Microsoft and... 7 years i... Microsoft Positive
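Before going further, a quick sanity check on the loaded frames can catch schema problems early. The sketch below runs the same checks on a tiny hand-made frame with the expected columns; the rows are illustrative, not taken from the dataset:

```python
import pandas as pd

# Toy frame mimicking the dataset schema (values are made up for illustration)
train = pd.DataFrame({
    "id": [1, 2, 3],
    "entity": ["Microsoft", "FIFA", "Overwatch"],
    "sentiment": ["Positive", "Irrelevant", "Negative"],
    "tweet": ["great quarter", "nice match", "servers down again"],
})

# Check the column layout and that every label is in the known vocabulary
expected_labels = {"Positive", "Neutral", "Negative", "Irrelevant"}
assert list(train.columns) == ["id", "entity", "sentiment", "tweet"]
assert set(train["sentiment"]).issubset(expected_labels)
print(train.shape)  # (3, 4)
```

On the real frames, the same assertions confirm the CSVs were parsed with the intended columns before any cleaning.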

2.3 Data Cleaning

Before training, we clean the dataset so that the models receive consistent, well-formed text.

2.3.1 Remove Missing Data

In this step, we remove rows with missing or empty tweet text to ensure clean inputs for training. Below is the count of missing values in each column.

Code
# Count missing values per column
print(train.isnull().sum())

# Remove rows with missing or whitespace-only tweets
train = train.dropna(subset=["tweet"])
train = train[train["tweet"].str.strip().astype(bool)]
id             0
entity         0
sentiment      0
tweet        686
dtype: int64
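The same two filters should also be applied to the validation set so that both splits receive identical preprocessing. A minimal sketch on toy data (the rows below are made up for illustration, covering both a missing and a whitespace-only tweet):

```python
import pandas as pd

# Toy validation frame: row 2 has a missing tweet, row 3 is whitespace-only
valid = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "entity": ["NBA2K", "FIFA", "Microsoft", "Overwatch"],
    "sentiment": ["Positive", "Neutral", "Negative", "Positive"],
    "tweet": ["love this game", None, "   ", "fun update"],
})

# Drop missing tweets, then drop tweets that are empty after stripping
valid = valid.dropna(subset=["tweet"])
valid = valid[valid["tweet"].str.strip().astype(bool)]
print(len(valid))  # 2
```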

2.3.2 Remove Emojis

In this step, we remove emojis from tweets to clean the text and ensure consistent tokenization during vectorization. Here are a few examples:

Code
import emoji

# Define cleaning function
def clean_text(text):
    no_emoji = emoji.replace_emoji(text, replace='')
    return no_emoji.encode("utf-8", "ignore").decode("utf-8", "ignore")

# Apply to training set
train["tweet"] = train["tweet"].apply(clean_text)

# Example tweets before and after cleaning
samples = [
    "I'm so happy today! 😄🎉",
    "Great job! 💯🔥",
    "This is weird... 🤔🙃",
    "Just finished my code 🐍💻"
]

max_len = max(len(s) for s in samples)

for s in samples:
    cleaned = clean_text(s)
    print(f"{s.ljust(max_len)}  →  {cleaned}")
I'm so happy today! 😄🎉    →  I'm so happy today! 
Great job! 💯🔥             →  Great job! 
This is weird... 🤔🙃       →  This is weird... 
Just finished my code 🐍💻  →  Just finished my code 
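If the `emoji` package is unavailable, a regex over emoji codepoint blocks is a rough substitute. The ranges below are an assumption covering common blocks only, not every emoji:

```python
import re

# Regex over common emoji codepoint blocks (not exhaustive)
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # misc symbols, pictographs, supplemental
    "\U00002600-\U000027BF"  # misc symbols and dingbats
    "\U0001F000-\U0001F0FF"  # mahjong, domino, and playing-card tiles
    "]+"
)

def strip_emoji(text):
    """Remove characters in the emoji blocks above."""
    return EMOJI_RE.sub("", text)

print(strip_emoji("Great job! 💯🔥"))  # Great job!
```

The `emoji` package is the more robust choice since it tracks the full Unicode emoji list; the regex is only a dependency-free fallback.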

2.4 Basic Statistics

We explore the basic statistics of the dataset, including class distributions and dataset sizes. This helps us understand potential class imbalance and verify the dataset was loaded correctly.

Code
sentiment_counts = train["sentiment"].value_counts().reset_index()
sentiment_counts.columns = ["Sentiment", "Count"]
sentiment_counts.index.name = "Index"

sentiment_counts.style.set_table_styles(
    [{"selector": "th", "props": [("text-align", "center")]}]
).set_properties(**{
    'text-align': 'center',
    'border': '1px solid lightgrey',
    'background-color': '#f9f9f9'
}).hide(axis="index")
Sentiment Count
Negative 22312
Positive 20619
Neutral 18051
Irrelevant 12842

Based on the distribution of sentiment labels, the dataset shows only mild class imbalance: the largest class (Negative, 22,312) is roughly 1.7 times the size of the smallest (Irrelevant, 12,842).
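The degree of imbalance can be quantified directly from the counts in the table above. A small sketch computing class proportions and the majority-to-minority ratio:

```python
import pandas as pd

# Class counts as reported in the table above
counts = pd.Series(
    {"Negative": 22312, "Positive": 20619, "Neutral": 18051, "Irrelevant": 12842}
)

# Share of each class and the ratio of the largest to the smallest class
proportions = counts / counts.sum()
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(f"majority/minority ratio: {imbalance_ratio:.2f}")  # 1.74
```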

2.5 Sentiment Distribution (Bar Chart)

Code
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
sns.countplot(data=train, x="sentiment")
plt.title("Sentiment Distribution in Training Set")
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.grid(True)
plt.tight_layout()
plt.show()