Exploring NLP’s Potential: A Deep Learning Journey — Refining RoBERTa for Analyzing Sentiments in COVID Vaccine Tweets.

Israel Anaba Ayamga
13 min read · Sep 17, 2023


I set out to explore the capabilities of deep learning models in a data-driven environment, with a focus on Natural Language Processing (NLP). My objective was to document the experience and share the entire process, from model selection through deployment. So buckle up as I take you through the first chapter of this adventure.

Unveiling the Project

My project’s primary objective was clear: to harness the potential of pre-trained models. After meticulous research, I decided to employ the RoBERTa model from Hugging Face. Its prowess in understanding contextual nuances in text made it an ideal choice for what lay ahead.

The Mission: Sentiment Analysis of COVID Vaccine Tweets

The next step was defining the mission. I set out to fine-tune the RoBERTa model to predict sentiments expressed in tweets concerning the COVID-19 vaccines. This mission wasn’t just about diving into the world of NLP but also contributing to a significant global conversation.

Building the Interactive Interface

As the model gained proficiency in sentiment analysis, the journey continued. To make it accessible and user-friendly, I embedded the model into an app. For this task, I opted for Gradio, a tool that streamlined the process and ensured a seamless user experience. Below are some useful resources on Gradio, including an article I wrote covering the concept and a GitHub link to the code showing how the project was executed.

Sharing the Model with the World

The culmination of this phase marked a pivotal moment in our project. I sought to share the model and its powerful sentiment analysis capabilities with a wider audience, and Hugging Face emerged as the perfect ally in this endeavor. Hugging Face is a renowned platform recognized for its robust support of AI model hosting and sharing, making it an ideal choice for our mission.

By utilizing Hugging Face, I seamlessly uploaded and hosted my sentiment analysis model, rendering it easily accessible for a global audience. This platform not only provided the technical infrastructure for hosting our model but also ensured that my application could be effortlessly discovered and utilized by individuals, researchers, and developers alike.

With the sentiment analysis app now available on Hugging Face, users can explore and harness its capabilities by simply visiting the link below.

Containerization with Docker

Last but not least, I explored Docker, a powerful framework that allowed me to containerize the entire application. This step ensured portability and scalability, making it easy for anyone to deploy and run the model on their own system. For detailed instructions and resources on using Docker for containerization, you can refer to the following link.

Keep an eye out for the second installment of this essay, in which I’ll go deeper into the fine-tuning process and give insights into the problems and breakthroughs experienced along the road. It’s been an intellectually rewarding and profoundly impactful adventure, and I can’t wait to share it all with you.

From Machine Learning Training to Fine-Tuning Models with a Deep Learning Twist

In my past write-ups, I’ve explored the world of training machine learning models, which has been both engaging and educational. However, my latest focus on fine-tuning pre-trained models from Hugging Face stands out as a unique endeavor.

Fine-tuning these pre-trained models adds a distinct personal touch to an already remarkable foundation. It’s like crafting a unique identity within a familiar framework. Adapting a model with only a fraction of the training that a deep network would need from scratch is an efficient way to make it truly mine.

A Brief Insight into Deep Learning, Hugging Face Models, and Docker Containerization

Deep learning is a subset of machine learning that utilizes artificial neural networks inspired by the human brain. These networks consist of layers of interconnected nodes, known as neurons, which process and analyze data. One key feature of deep learning is its ability to automatically learn and extract intricate patterns and features from large datasets, which is especially valuable in tasks involving unstructured data like images, text, and audio.

Benefits of deep learning:
1. High Accuracy: Deep learning models excel at tasks requiring high precision, such as image and speech recognition, achieving state-of-the-art results.
2. Feature Extraction: They automatically discover complex features, reducing the need for manual feature engineering.
3. Scalability: Deep learning can handle vast amounts of data and scale with additional computational resources.
4. Versatility: It’s used in various domains, from healthcare (diagnosis) to finance (fraud detection) and autonomous vehicles.

In the course of this project, I evaluated several deep learning models, including ALBERT, which I trained on the same dataset. ALBERT performed noticeably worse on the sentiment analysis task, so I opted for RoBERTa, a refined variant of BERT (Bidirectional Encoder Representations from Transformers) designed for natural language understanding tasks.

The RoBERTa model, readily accessible through the Hugging Face Transformers library, emerged as an exemplary choice for our project. Its inherent capabilities in text classification, sentiment analysis, and proficiency in other NLP tasks are a testament to its extensive pre-trained knowledge. The performance exhibited by the RoBERTa model in our sentiment analysis endeavors is nothing short of commendable. It played a pivotal role in significantly augmenting the effectiveness of our project, thereby enriching the depth of insights gleaned from our analysis of COVID vaccine-related tweets.

Now, onto Docker and containerization. Docker is a platform for developing, shipping, and running applications in containers. Containers are lightweight, isolated environments that package an application and its dependencies, ensuring consistency and reproducibility across different environments. Containerizing your app means you can encapsulate everything it needs to run, making it easier to manage and deploy.

In my experience, I’ve found that using Docker to containerize your app streamlines development, testing, and deployment. It keeps your project self-contained, preventing conflicts between dependencies and simplifying collaboration with others. Plus, it enables you to easily deploy your app on various cloud platforms, which can be a significant advantage for scalability.

In summary, deep learning with neural networks like RoBERTa from Hugging Face can supercharge your project’s natural language understanding tasks. Combining this with containerization using Docker makes development and deployment more efficient, supporting the goal of delivering a powerful and scalable solution.

Project Procedure Overview:

I will guide you through the process of utilizing pre-trained models, fine-tuning them on a dataset sourced from the Zindi Challenge website, and deploying the model through an API using Docker. The aim is to leverage these models to enhance predictions.

1. Data Collection and Preparation:
Gathered the dataset from the Zindi Challenge website.
Explored and cleaned the dataset using diverse visualization techniques.
Employed text normalization methods, such as lemmatization, to preprocess the text.
Performed thorough data cleaning, including removing hashtags and irrelevant text.

2. Data Splitting:
Split the dataset into training and evaluation sets for model training and validation.

3. Tokenizer Loading:
Loaded the RoBERTa-base tokenizer, which will be used to tokenize the text data.

4. Label Preprocessing:
Preprocessed labels to make them suitable for consumption by the model.

5. Training Configuration:
Specified training arguments, carefully selecting parameter settings for fine-tuning the model.

6. Model Loading and Training:
Loaded the pre-trained model.
Trained the model, focusing on a specific evaluation metric, such as the F1 score, to optimize its performance.

7. Model Deployment:
Pushed the fine-tuned model to Hugging Face with its accompanying files.

8. App Integration:
Integrated the machine learning model into an API, enabling remote access through Gradio, and subsequently uploaded the application to Hugging Face.

9. Docker Containerization:
Utilized Docker to containerize the application, ensuring easy deployment and scalability.

I’ll now delve into the technical aspects, providing code snippets and detailed insights into each step of the procedure, aligning the process with my personal style for a concise yet meaningful explanation.

Data Collection and Preparation

Connecting to Raw Data from GitHub

In my data analysis journey, I began by establishing a connection to the raw data sourced from GitHub. This initial step laid the foundation for the subsequent phases of data processing and analysis.

Exploring the Data

Checking Shape, Info, and Missing Values

df.describe()

df.isna().sum()

df.shape

To gain a comprehensive understanding of the dataset, I conducted an exploration that involved inspecting its shape, gathering essential information about its structure, and identifying any missing values using the `isna` function. During this process, I uncovered a few missing values, which were promptly addressed.

Visualizing the Data

Assessing Label Distribution and Agreement

We can see that most tweets are between 50 and 150 characters in length, with a peak around 120 characters.
Looking at the most prevalent words helped me understand the common language used in our dataset.

As part of data preprocessing, I turned to data visualization to assess the distribution of labels and the level of agreement within the dataset. Additionally, I aimed to identify prevalent words that could offer valuable insights for our analysis.
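As a rough illustration of the two checks described above (label distribution and tweet length), here is a minimal, library-free sketch. The sample rows are made up for the example, and the helper names are hypothetical; with pandas, the equivalent calls would be df['label'].value_counts() and df['safe_text'].str.len().

```python
from collections import Counter

# Hypothetical sample rows standing in for the real dataset.
rows = [
    {"safe_text": "Vaccines save lives", "label": 1},
    {"safe_text": "Not sure about the new vaccine", "label": 0},
    {"safe_text": "I will never take it", "label": -1},
    {"safe_text": "Got my shot today", "label": 1},
]

def label_distribution(rows):
    """Count how many tweets carry each sentiment label."""
    return Counter(row["label"] for row in rows)

def length_histogram(rows, bin_width=50):
    """Bucket tweet lengths (in characters) into bins of `bin_width`."""
    bins = Counter()
    for row in rows:
        start = (len(row["safe_text"]) // bin_width) * bin_width
        bins[start] += 1
    return dict(sorted(bins.items()))

print(label_distribution(rows))
print(length_histogram(rows, bin_width=10))
```

A strongly skewed label count at this stage is a hint that a stratified split and an F1-based metric will be more informative than plain accuracy.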

Text Data Transformation

Cleaning the ‘safe_text’ Column

import re

def remove_special_characters(text):
    # Remove special characters, punctuation, and digits
    cleaned_text = re.sub(r'[^\w\s]|[\d]+', '', text)
    return cleaned_text

# Apply the function to the 'safe_text' column
df['safe_text'] = df['safe_text'].apply(remove_special_characters)

In an effort to improve the quality of the text data, I focused on the ‘safe_text’ column. Here, I removed unwanted symbols, punctuation, and digits that had the potential to disrupt the model’s performance. To ensure consistency and uniformity, I also converted the text in this column to lowercase.
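The cleaning steps mentioned in this article (lowercasing, removing hashtags and irrelevant text, stripping punctuation and digits) could be combined into one helper along these lines. This is only a sketch under assumptions: the function name clean_tweet is hypothetical, and dropping whole hashtags, mentions, and URLs is my guess at what "irrelevant text" covers, not necessarily what the project did.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, hashtags, punctuation and digits."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs (assumed step)
    text = re.sub(r"[@#]\w+", " ", text)        # drop @mentions and #hashtags (assumed step)
    text = re.sub(r"[^\w\s]|\d+", " ", text)    # drop punctuation and digits
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

print(clean_tweet("Got my second dose in 2021!!! #Vaccinated @WHO https://example.com"))
# → got my second dose in
```

Replacing matches with a space rather than an empty string avoids gluing neighbouring words together; the final substitution then collapses the extra spaces.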

Lemmatization for Text Normalization

Leveraging Lemmatization

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to perform lemmatization on a text
def lemmatize_text(text):
    words = word_tokenize(text)  # Tokenize the text into words
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]  # Lemmatize each word
    return ' '.join(lemmatized_words)  # Join the lemmatized words back into a text

Lemmatization was a critical step in text data preparation. It involved reducing words to their base or root form, enhancing the model’s ability to recognize patterns and relationships within the data. This process played a crucial role in improving the accuracy and effectiveness of our analysis.

This comprehensive data collection and preparation process served as a solid foundation for subsequent stages of analysis, ensuring that our model would work with clean, organized, and standardized data.

Data Splitting

The next crucial step was data splitting. This process involves dividing the dataset into two main sets:

  1. Training Set: Think of this as the classroom for our model. Here, the model gets to interact with the data, identify underlying patterns, and fine-tune its parameters. It’s where the learning takes place, akin to a student studying diligently.
  2. Evaluation Set: The evaluation set is like the examination room. It’s a distinct subset of the data that the model hasn’t encountered during its training phase. By subjecting the model to this unseen data, I can gauge its ability to generalize what it has learned. In essence, it’s a way to ensure the model doesn’t merely memorize the training data but truly comprehends it, just as we want a student to understand a subject rather than memorize it by rote.

from sklearn.model_selection import train_test_split

train, eval = train_test_split(df, test_size=0.2, random_state=40, stratify=df['label'])

Here, test_size=0.2 holds out 20% of the tweets for evaluation, and stratify=df['label'] keeps the proportion of each sentiment label the same in both sets.
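To illustrate what the stratify argument of train_test_split preserves, here is a small, library-free sketch of a stratified split. It is not scikit-learn’s implementation (the exact rows chosen will differ), and the label counts below are hypothetical; the point is only that each label is split in the same proportion.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=40):
    """Return (train_idx, eval_idx) with each label split in the same proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, eval_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_eval = round(len(indices) * test_size)
        eval_idx.extend(indices[:n_eval])
        train_idx.extend(indices[n_eval:])
    return sorted(train_idx), sorted(eval_idx)

# Hypothetical, imbalanced label column: 50 negative, 30 neutral, 20 positive.
labels = [-1] * 50 + [0] * 30 + [1] * 20
train_idx, eval_idx = stratified_split(labels)
print(Counter(labels[i] for i in eval_idx))
```

Without stratification, a random 20% sample of an imbalanced dataset can end up with too few examples of the rarest class, which makes the evaluation score noisy.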

Tokenizer Loading & Label Preprocessing

Tokenizer Loading: I’ve loaded the RoBERTa-base tokenizer, which will be essential for breaking down our text data into manageable pieces.

Label Preprocessing: To ensure our model can work effectively, I’ve preprocessed the labels. This involves converting them into a format that the model can understand. Specifically, I’ve transformed the labels as follows:

  • Negative sentiment (-1) is now represented as 0.
  • Neutral sentiment (0) is now represented as 1.
  • Positive sentiment (1) is now represented as 2.

This step helps align the labels with the model’s expected input.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

def transform_labels(label):
    # Map the dataset labels (-1, 0, 1) to class indices (0, 1, 2)
    label = label['label']
    num = 0
    if label == -1:
        num = 0
    elif label == 0:
        num = 1
    elif label == 1:
        num = 2
    return {'labels': num}
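The same label mapping can also be written as a lookup table, which makes the reverse direction (turning a predicted class index back into a readable sentiment) easy to keep in sync. This is an alternative sketch, not the code used in the project; the names LABEL_TO_ID, ID_TO_NAME, and decode_prediction are hypothetical.

```python
# Dataset label -> class index expected by a 3-class classification head.
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}
# Class index -> readable name (display names are an assumption).
ID_TO_NAME = {0: "Negative", 1: "Neutral", 2: "Positive"}

def transform_labels(example):
    """Map a raw dataset label to the class index used for training."""
    return {"labels": LABEL_TO_ID[example["label"]]}

def decode_prediction(class_id):
    """Map a predicted class index back to a readable sentiment name."""
    return ID_TO_NAME[class_id]

print(transform_labels({"label": -1}))  # → {'labels': 0}
print(decode_prediction(2))             # → Positive
```

One design difference worth noting: the if/elif version silently maps any unexpected label to 0 (negative), while the dictionary lookup raises a KeyError, which surfaces dirty labels early.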

Training Configuration

In order to fine-tune the model to suit our specific task, I meticulously defined the training arguments. These parameters are crucial for achieving optimal results during the training process. Below is an excerpt of the code associated with this step.

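from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sentiment_analysis",
    num_train_epochs=5,
    load_best_model_at_end=True,
    eval_steps=1000,
    save_steps=1000,
    push_to_hub=True,
    # ... remaining arguments omitted in this excerpt
)
  • output_dir: This specifies the directory where the trained model and related files will be saved.
  • num_train_epochs: It determines the number of training epochs, which is set to 5 in this case.
  • load_best_model_at_end: This setting ensures that the best checkpoint is loaded at the end of training. It requires the evaluation and save strategies to match, so that every evaluated checkpoint is also saved.
  • eval_steps and save_steps: With a step-based strategy, the model is evaluated and a checkpoint is saved every 1,000 training steps.
  • push_to_hub: This option allows pushing the model to the Hugging Face Hub, making it easily accessible and shareable.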
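To get a feel for how eval_steps=1000 relates to the length of a run, the arithmetic can be sketched as follows. The training-set size of 8,000 is a made-up figure for illustration, and the batch size of 8 is the Trainer’s default per-device value; the sketch also assumes a single device, no gradient accumulation, and a step-based evaluation strategy.

```python
import math

def training_schedule(n_train, batch_size, epochs, eval_steps):
    """Return (steps per epoch, total steps, number of step-based evaluations)."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    n_evals = total_steps // eval_steps
    return steps_per_epoch, total_steps, n_evals

# Hypothetical dataset size; batch size 8 is the Trainer default.
print(training_schedule(n_train=8000, batch_size=8, epochs=5, eval_steps=1000))
# → (1000, 5000, 5)
```

If the number of evaluations comes out very small, lowering eval_steps (and save_steps with it) gives load_best_model_at_end more checkpoints to choose from.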

Model Loading and Training

To begin the process, I loaded the pre-trained “roberta-base” model and configured it to handle our specific task, which involves classifying data into three labels.

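from transformers import AutoModelForSequenceClassification, Trainer

# Loading a pre-trained model while specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)

trainer = Trainer(
    model=model,
    args=training_args,            # the training arguments defined earlier
    train_dataset=train_dataset,   # assumed name for the tokenized training split
    eval_dataset=eval_dataset,     # assumed name for the tokenized evaluation split
    compute_metrics=compute_f1_score,
)

# Training the model
trainer.train()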

This code segment signifies the critical phase where the pre-trained model is adapted to our task through fine-tuning. The num_labels=3 parameter is tailored to our dataset, which has three distinct classification labels. Subsequently, the model is trained using the defined training arguments to optimize its performance, with a focus on improving the F1 score as the evaluation metric.

This meticulous configuration and fine-tuning process enable us to harness the power of a pre-trained language model for our specific sentiment analysis task.
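The compute_f1_score function passed to the Trainer is not shown in this article. A minimal sketch of what such a function might look like is given below; it is an assumption, not the project’s actual code. It uses macro-averaged F1 (the original may have used a different averaging), and the helper macro_f1 is a hypothetical name. The Trainer calls compute_metrics with a (logits, labels) pair and expects a dictionary of metric names to values.

```python
def macro_f1(y_true, y_pred, num_labels=3):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_labels

def compute_f1_score(eval_pred):
    """Turn (logits, labels) into a metrics dictionary."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    return {"f1": macro_f1(list(labels), predictions)}

# Tiny made-up example: four tweets, one of them misclassified.
logits = [[2.0, 0.1, 0.1], [0.1, 2.0, 0.1], [0.1, 0.1, 2.0], [0.1, 2.0, 0.1]]
labels = [0, 1, 2, 2]
print(compute_f1_score((logits, labels)))
```

In practice the same result is usually obtained with a library metric such as scikit-learn’s f1_score; the hand-written version is shown only to make the calculation explicit.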

Model Deployment

I deployed the fine-tuned model, along with its accompanying files, to Hugging Face using the code below.

trainer.push_to_hub()

This step ensures that the model is easily accessible and shareable, marking a significant milestone in our project’s journey.

App Integration

Next, I integrated the fine-tuned sentiment analysis model into an application, allowing remote access through Gradio, and subsequently uploaded the application to Hugging Face. Below is an excerpt of the code used to create the application.



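import gradio as gr
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Requirements
model_path = "IsaacSarps/sentiment_analysis"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Class order follows the label mapping used during fine-tuning
labels = ["Negative", "Neutral", "Positive"]

def sentiment_analysis(text):
    # preprocess() is the text-cleaning helper; it is not shown in this excerpt
    text = preprocess(text)

    # Tokenize and predict sentiment
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    scores = output.logits.softmax(dim=1)[0].tolist()

    # The "label" output expects a mapping of class name to confidence
    return dict(zip(labels, scores))

demo = gr.Interface(
    fn=sentiment_analysis,
    inputs=gr.Textbox(placeholder="Share your thoughts on COVID vaccines..."),
    outputs="label",
)

if __name__ == "__main__":
    demo.launch()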

This code allows for sentiment analysis through a user-friendly interface, making it accessible remotely via Gradio and hosted on Hugging Face. The link below points to resources on how to upload the app to Hugging Face.
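To make the scoring step in the app concrete, here is a small, library-free sketch of what the softmax over the model’s logits does and how the resulting probabilities can be paired with class names for Gradio’s label output. The logits are made-up numbers, and scores_to_label_dict is a hypothetical helper name; the class order assumes the 0/1/2 mapping described earlier.

```python
import math

LABELS = ["Negative", "Neutral", "Positive"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def scores_to_label_dict(logits):
    """Pair each class name with its probability."""
    return dict(zip(LABELS, softmax(logits)))

result = scores_to_label_dict([-1.2, 0.3, 2.5])  # hypothetical logits
print(max(result, key=result.get))
# → Positive
```

Returning a dictionary of class names to confidences lets the label component display the top class along with a confidence bar for each sentiment.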

This is a preview of the app on Hugging Face. The app itself is available at the link above.

Docker Containerization

In the culmination of the project, I addressed the imperative need for Docker containerization, a pivotal element that streamlines deployment and ensures scalability of our sentiment analysis application. Docker, a powerful containerization technology, enables us to package our application, its dependencies, and configurations into a single, portable container. This container can then be easily deployed across various environments, eliminating compatibility issues and simplifying the deployment process.

To illustrate this concept, let’s delve into how a simple Dockerfile is created. A Dockerfile serves as a blueprint for constructing a Docker container, specifying the exact steps required to build and configure it. Below is an example of a minimalistic Dockerfile:

FROM python:3.11.4-slim

WORKDIR /app

COPY requirements.txt /app/
RUN pip install -r requirements.txt

COPY src/ /app/

EXPOSE 7860

CMD ["python", "app.py"]

In this Dockerfile:

  1. We start with a base image (python:3.11.4-slim), which forms the foundation of our container.
  2. We set the working directory within the container to /app, ensuring a clean and organized environment for our application.
  3. We copy the requirements.txt file from our local machine to the container and use pip to install the necessary Python packages listed in this file.
  4. The application code located in the src/ directory on our local machine is copied into the /app/ directory within the container.
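  5. We expose port 7860 within the container, documenting the port on which our application will listen (Gradio’s default). EXPOSE does not publish the port by itself; that happens with the -p flag at run time. For the app to be reachable from outside the container, Gradio also needs to bind to 0.0.0.0 rather than its default of 127.0.0.1, for example with demo.launch(server_name="0.0.0.0").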
  6. Finally, we define the command to run our application, which in this case is "python app.py".

This Dockerfile encapsulates the steps needed to create a container for our sentiment analysis application. By following these instructions, we ensure that the application and its dependencies are neatly packaged, making it easy to deploy, scale, and manage in various environments.

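To build the Docker image from the Dockerfile, you can use the docker build command, and then start a container from that image with docker run. Here's how it's done:

docker build -t your-image-name:tag .

docker run -p 8080:7860 --name your-container-name your-image-name:tag
  • -t: Specifies the name and optionally a tag to identify the image.
  • your-image-name: The name you want to give to your Docker image.
  • tag: An optional tag to differentiate versions of your image.
  • .: The dot at the end sets the build context to the current directory, which is also where Docker looks for the Dockerfile by default.
  • -p 8080:7860: Maps port 8080 on the host to port 7860 inside the container.
  • --name: Assigns a name to the running container; the image to run comes last.

The first command creates a Docker image according to the instructions in your Dockerfile. The second starts a container from it, with the app’s port 7860 published on port 8080 of the host.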

Conclusion

In conclusion, our journey through refining RoBERTa for sentiment analysis in COVID vaccine tweets has been both enlightening and rewarding. Through meticulous experimentation and fine-tuning, we’ve harnessed the power of Natural Language Processing (NLP) to gain deeper insights into public sentiment during these challenging times.

As I wrap up this project, it’s worth noting that NLP’s potential is vast and continuously evolving. To delve deeper into this fascinating field and explore more resources, I encourage you to visit the links provided in this article, including the GitHub repository for the project.
