Analyzing Movie Genre Predictions through the Lens of Hugging Face Transformers and a Training Loop Approach¶
by Salomon Marquez
01/07/2024
This notebook proposes a solution to the Movie Genre Prediction competition on Hugging Face, using a BERT-based model to classify movie genres from their titles and synopses. Instead of the Hugging Face Trainer API, it implements the training loop manually; this choice gives finer control over the fine-tuning phase by setting and optimizing selected hyperparameters by hand.
The fine-tuned model obtained the following prediction scores:
- Public Score: 0.4260611
- Private Score: 0.4184444
The fine-tuning stage required 1.5 compute units on a T4 GPU and took 23 min on the provided movie dataset. The subsequent prediction stage took approximately 2.5 min.
This notebook was inspired by Anubhav's solution and the concepts acquired from the Hugging Face NLP course.
# View the infrastructure provided by Colab
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    print('Not connected to a GPU')
else:
    print(gpu_info)
Thu Jan 11 11:08:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 38C P8 10W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
# View the assigned RAM memory
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))
if ram_gb < 20:
    print('Not using a high-RAM runtime')
else:
    print('You are using a high-RAM runtime!')
Your runtime has 13.6 gigabytes of available RAM

Not using a high-RAM runtime
# Install libraries and modules
!pip install evaluate datasets transformers[sentencepiece]
!pip install accelerate -U
!pip install huggingface_hub
# Check transformers and accelerate modules version
import transformers
import accelerate
transformers.__version__, accelerate.__version__
('4.35.2', '0.26.0')
# Log in to the Hugging Face Hub to access the movie dataset
from huggingface_hub import notebook_login
notebook_login()
# Import libraries and packages
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
from datasets import load_dataset, Dataset
from collections import Counter
import evaluate
import numpy as np
import pandas as pd
from rich import print
Loading Movie Datasets¶
# Load competition datasets
raw_datasets = load_dataset("datadrivenscience/movie-genre-prediction")
raw_datasets
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: The secret `HF_TOKEN` does not exist in your Colab secrets. To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session. You will be able to reuse this secret in all of your notebooks. Please note that authentication is recommended but still optional to access public models or datasets. warnings.warn(
DatasetDict({
train: Dataset({
features: ['id', 'movie_name', 'synopsis', 'genre'],
num_rows: 54000
})
test: Dataset({
features: ['id', 'movie_name', 'synopsis', 'genre'],
num_rows: 36000
})
})
# Explore train dataset
raw_datasets["train"].features
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'genre': Value(dtype='string', id=None)}
# Explore train dataset
raw_datasets["train"][:5]
{'id': [44978, 50185, 34131, 78522, 2206],
'movie_name': ['Super Me',
'Entity Project',
'Behavioral Family Therapy for Serious Psychiatric Disorders',
'Blood Glacier',
'Apat na anino'],
'synopsis': ['A young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. Selling them makes him rich.',
'A director and her friends renting a haunted house to capture paranormal events in order to prove it and become popular.',
'This is an educational video for families and family therapists that describes the Behavioral Family Therapy approach to dealing with serious psychiatric illnesses.',
'Scientists working in the Austrian Alps discover that a glacier is leaking a liquid that appears to be affecting local wildlife.',
'Buy Day - Four Men Widely - Apart in Life - By Night Shadows United in One Fight Venting the Fire of their Fury Against the Hated Oppressors.'],
'genre': ['fantasy', 'horror', 'family', 'scifi', 'action']}
# Explore test dataset
raw_datasets["test"].features
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'genre': Value(dtype='string', id=None)}
# Explore test dataset
raw_datasets["test"][:5]
{'id': [16863, 48456, 41383, 84007, 40269],
'movie_name': ['A Death Sentence',
'Intermedio',
'30 Chua Phai Tet',
'Paranoiac',
'Ordinary Happiness'],
'synopsis': ["12 y.o. Ida's dad'll die without a DKK1,500,000 operation. Ida plans to steal the money from the bank, her mom installed alarm systems in. She'll need her climbing skills, her 2 friends and 3 go-karts.",
'A group of four teenage friends become trapped in a Mexican border tunnel where they fall prey, one-by one, to tortured ghosts who haunt it.',
"A guy left his home for 12 years till he came back to claim what's his from his father, the vast Land, just to uncover that he had to live that day, year-end Lunar day, for another 12 years.",
'A man long believed dead returns to the family estate to claim his inheritance.',
'After a deadly accident, Paolo comes back on Earth just 92 minutes more, thanks to a calculation error made in a paradise office.'],
'genre': ['action', 'action', 'action', 'action', 'action']}
What are the existing movie genres?¶
# Identifying the existing genres in the train dataset
labels = set(raw_datasets["train"]["genre"])
num_labels = len(labels)
num_labels, labels
(10,
{'action',
'adventure',
'crime',
'family',
'fantasy',
'horror',
'mystery',
'romance',
'scifi',
'thriller'})
# Counting the number of movies per genre in the train dataset
labels_count = Counter(raw_datasets['train']['genre'])
print(labels_count)
Counter({ 'fantasy': 5400, 'horror': 5400, 'family': 5400, 'scifi': 5400, 'action': 5400, 'crime': 5400, 'adventure': 5400, 'mystery': 5400, 'romance': 5400, 'thriller': 5400 })
# Counting the number of movies per genre in the test dataset
labels_count_test = Counter(raw_datasets['test']['genre'])
print(labels_count_test)
Counter({'action': 36000})
# Rename "genre" column as "labels" in the train dataset and turn into a ClassLabel type
raw_datasets = raw_datasets.rename_column('genre','labels')
raw_datasets = raw_datasets.class_encode_column('labels')
raw_datasets['train'].features
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'labels': ClassLabel(names=['action', 'adventure', 'crime', 'family', 'fantasy', 'horror', 'mystery', 'romance', 'scifi', 'thriller'], id=None)}
Answer: The train dataset contains 10 genres that are evenly distributed across the dataset (5,400 movies each). Meanwhile, the test dataset only contains 'action' as a dummy value prior to inference.
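For reference, `class_encode_column` essentially sorts the unique label names and maps each to an integer id. A rough pure-Python stand-in (toy data, not the full genre list) for what it does:

```python
# Rough stand-in for datasets' class_encode_column: sort the unique label
# names alphabetically and assign each an integer id.
genres = ['fantasy', 'horror', 'family', 'scifi', 'action']
names = sorted(set(genres))
label2id = {name: i for i, name in enumerate(names)}
encoded = [label2id[g] for g in genres]
print(names)    # ['action', 'family', 'fantasy', 'horror', 'scifi']
print(encoded)  # [2, 3, 1, 4, 0]
```

This is why the ClassLabel names above appear in alphabetical order, with 'action' mapped to 0 and 'thriller' to 9.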
Removing Duplicated Items¶
# Convert Datasets into Dataframes
raw_datasets.set_format('pandas')
# Convert Datasets into Dataframes
train_dataset = raw_datasets['train'][:]
train_dataset.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54000 entries, 0 to 53999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   id          54000 non-null  int64
 1   movie_name  54000 non-null  object
 2   synopsis    54000 non-null  object
 3   labels      54000 non-null  int64
dtypes: int64(2), object(2)
memory usage: 15.4 MB
train_dataset.head(3)
| | id | movie_name | synopsis | labels |
|---|---|---|---|---|
| 0 | 44978 | Super Me | A young scriptwriter starts bringing valuable ... | 4 |
| 1 | 50185 | Entity Project | A director and her friends renting a haunted h... | 5 |
| 2 | 34131 | Behavioral Family Therapy for Serious Psychiat... | This is an educational video for families and ... | 3 |
# Drop duplicates from the train dataframe
train_dataset = train_dataset.drop_duplicates(['movie_name', 'synopsis'])
train_dataset.info(memory_usage = 'deep')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 46344 entries, 0 to 53998
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   id          46344 non-null  int64
 1   movie_name  46344 non-null  object
 2   synopsis    46344 non-null  object
 3   labels      46344 non-null  int64
dtypes: int64(2), object(2)
memory usage: 13.8 MB
Answer: The train dataset contained 7,656 duplicates.
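The 7,656 figure is simply the difference between the row counts before and after `drop_duplicates`. The same check on a toy frame (illustrative rows, not competition data):

```python
import pandas as pd

df = pd.DataFrame({
    'movie_name': ['Super Me', 'Super Me', 'Entity Project'],
    'synopsis':   ['A young scriptwriter...', 'A young scriptwriter...', 'A director...'],
})
# Rows count as duplicates only if BOTH movie_name and synopsis match.
deduped = df.drop_duplicates(['movie_name', 'synopsis'])
n_duplicates = len(df) - len(deduped)
print(n_duplicates)  # 1
```

For the real dataset: 54,000 − 46,344 = 7,656 duplicated rows removed.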
Analyzing text movie titles and their synopses¶
This section analyzes the length of movie titles and their synopses.
train_dataset.head(3)
| | id | movie_name | synopsis | labels |
|---|---|---|---|---|
| 0 | 44978 | Super Me | A young scriptwriter starts bringing valuable ... | 4 |
| 1 | 50185 | Entity Project | A director and her friends renting a haunted h... | 5 |
| 2 | 34131 | Behavioral Family Therapy for Serious Psychiat... | This is an educational video for families and ... | 3 |
# Create a new column "synopsis_len" that contains the synopsis length
train_dataset['synopsis_len'] = train_dataset['synopsis'].apply(lambda x: len(x))
train_dataset.head(3)
| | id | movie_name | synopsis | labels | synopsis_len |
|---|---|---|---|---|---|
| 0 | 44978 | Super Me | A young scriptwriter starts bringing valuable ... | 4 | 141 |
| 1 | 50185 | Entity Project | A director and her friends renting a haunted h... | 5 | 120 |
| 2 | 34131 | Behavioral Family Therapy for Serious Psychiat... | This is an educational video for families and ... | 3 | 164 |
# Create a new column "movie_name_len" that contains the length of movie_name
train_dataset['movie_name_len'] = train_dataset['movie_name'].apply(lambda x: len(x))
train_dataset.head(3)
| | id | movie_name | synopsis | labels | synopsis_len | movie_name_len |
|---|---|---|---|---|---|---|
| 0 | 44978 | Super Me | A young scriptwriter starts bringing valuable ... | 4 | 141 | 8 |
| 1 | 50185 | Entity Project | A director and her friends renting a haunted h... | 5 | 120 | 14 |
| 2 | 34131 | Behavioral Family Therapy for Serious Psychiat... | This is an educational video for families and ... | 3 | 164 | 59 |
# Order train dataframe by synopsis_len
train_dataset.sort_values(by='synopsis_len', ascending=False)
| | id | movie_name | synopsis | labels | synopsis_len | movie_name_len |
|---|---|---|---|---|---|---|
| 52518 | 46444 | Final Destination | Alex Browning is among a group of high school ... | 5 | 400 | 17 |
| 49498 | 1468 | Bhargava Ramudu | Bhargava, an efficient, yet jobless young man ... | 0 | 395 | 15 |
| 29141 | 71309 | Krishnatulasi | Krishna is a blind young man who works as a gu... | 7 | 381 | 13 |
| 50834 | 44856 | The Sex Cycle | The Cocoa Poodle bar is the central meeting pl... | 4 | 377 | 13 |
| 53370 | 4779 | Uro | Turning his back on a delinquent past and join... | 0 | 370 | 3 |
| ... | ... | ... | ... | ... | ... | ... |
| 6891 | 71298 | Qismat 2 | Fortune 2. | 7 | 10 | 8 |
| 38284 | 5454 | Rader | Invasion. | 0 | 9 | 5 |
| 3301 | 15654 | Adventure Night | TBD | 1 | 3 | 15 |
| 34698 | 42213 | Dark Army | NA. | 4 | 3 | 9 |
| 26774 | 34314 | Prima Ballerina | TBA | 3 | 3 | 15 |
46344 rows × 6 columns
import plotly.figure_factory as ff

# Note: synopsis_len and movie_name_len measure characters, not words
fig = ff.create_distplot([train_dataset['synopsis_len']], ['length'], colors=['#2ca02c'])
fig.update_layout(title_text='Character Count Distribution of Movie Synopses')
fig.show()

fig2 = ff.create_distplot([train_dataset['movie_name_len']], ['length'], colors=['#ffa408'])
fig2.update_layout(title_text='Character Count Distribution of Movie Titles')
fig2.show()
train_dataset['movie_name_len'].max(), train_dataset['synopsis_len'].max()
(180, 400)
The average movie name length is 12 characters. For the synopsis, we see two peaks around 145 and 230 characters. The maxima are 180 characters for the movie name and 400 for the synopsis. So there won't be any truncation issues during tokenization and training: bert-base-uncased supports sequences of up to 512 tokens, and even the longest title-plus-synopsis pair tokenizes to well under that limit.
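As a quick sanity check: BERT's 512 limit is measured in WordPiece tokens, not characters. The whitespace word count gives a rough estimate (WordPiece may split words further, but each token still covers several characters on average), and even the longest synopses fall far below the limit. Using one synopsis from the dataset:

```python
# Rough token estimate: whitespace word count. WordPiece may split rare
# words into subwords, but real inputs stay well under 512 tokens --
# and truncation=True is passed at tokenization time as a safeguard anyway.
synopsis = ("Scientists working in the Austrian Alps discover that a glacier "
            "is leaking a liquid that appears to be affecting local wildlife.")
approx_tokens = len(synopsis.split())
print(len(synopsis), approx_tokens)  # 128 characters, 21 words
```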
Tokenization¶
# Convert the train_dataset Dataframe to DataSet format again
train_ds = Dataset.from_pandas(train_dataset)
train_ds.features
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'labels': Value(dtype='int64', id=None),
'synopsis_len': Value(dtype='int64', id=None),
'movie_name_len': Value(dtype='int64', id=None),
'__index_level_0__': Value(dtype='int64', id=None)}
# Turn "labels" column into ClassLabel type
train_ds = train_ds.class_encode_column('labels')
train_ds.features
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'labels': ClassLabel(names=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], id=None),
'synopsis_len': Value(dtype='int64', id=None),
'movie_name_len': Value(dtype='int64', id=None),
'__index_level_0__': Value(dtype='int64', id=None)}
# Create tokenizer
# i.e. bert-base-uncased, bert-large-uncased, bert-large-uncased-whole-word-masking
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer
BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={
0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
# Do a sample tokenization
sample_tokenized = tokenizer(train_ds['movie_name'][0], train_ds['synopsis'][0])
tokenizer.decode(sample_tokenized['input_ids'])
'[CLS] super me [SEP] a young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. selling them makes him rich. [SEP]'
sample_tokenized
{'input_ids': [101, 3565, 2033, 102, 1037, 2402, 5896, 15994, 4627, 5026, 7070, 5200, 2067, 2013, 2010, 2460, 15446, 1997, 2108, 13303, 2011, 1037, 5698, 1012, 4855, 2068, 3084, 2032, 4138, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# Split Train Dataset (train_ds) into training and test datasets
train_ds = train_ds.train_test_split(test_size=0.2, stratify_by_column="labels")
train_ds
DatasetDict({
train: Dataset({
features: ['id', 'movie_name', 'synopsis', 'labels', 'synopsis_len', 'movie_name_len', '__index_level_0__'],
num_rows: 37075
})
test: Dataset({
features: ['id', 'movie_name', 'synopsis', 'labels', 'synopsis_len', 'movie_name_len', '__index_level_0__'],
num_rows: 9269
})
})
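The split sizes above follow directly from applying the 20% test fraction to the 46,344 deduplicated rows (the exact rounding policy inside `datasets` may differ, but the arithmetic matches here):

```python
total = 46344               # deduplicated train rows
n_test = round(total * 0.2) # 20% held out for evaluation
n_train = total - n_test
print(n_train, n_test)      # 37075 9269
```

Because `stratify_by_column="labels"` is used, each genre keeps the same proportion in both splits.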
# Define a tokenize function
def tokenize(ds):
    return tokenizer(ds['movie_name'], ds['synopsis'], truncation=True)
# Tokenize train_ds
tokenized_datasets = train_ds.map(tokenize, batched=True)
tokenized_datasets
DatasetDict({
train: Dataset({
features: ['id', 'movie_name', 'synopsis', 'labels', 'synopsis_len', 'movie_name_len', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 37075
})
test: Dataset({
features: ['id', 'movie_name', 'synopsis', 'labels', 'synopsis_len', 'movie_name_len', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 9269
})
})
# Select a random sample to verify tokenization
tokenizer.decode(tokenized_datasets['train']['input_ids'][37074])
'[CLS] begum [SEP] a sheltered beauty, begum, is introduced to the enchanting world of bollywood by the enigmatic madan where she discovers true freedom and love come at the price of her passion and life. [SEP]'
Preparing data for the training stage¶
# Removing columns the model doesn't expect
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'movie_name', 'synopsis', 'synopsis_len', 'movie_name_len','__index_level_0__'])
tokenized_datasets['train'].column_names
['labels', 'input_ids', 'token_type_ids', 'attention_mask']
# Setting the datasets format so that they can return Pytorch tensors
tokenized_datasets.set_format("torch")
# Define a data_collator function for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
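DataCollatorWithPadding pads each batch only up to that batch's longest sequence, rather than to a global maximum, which keeps the tensors small. The idea can be sketched in plain Python (pad id 0 matches BERT's [PAD] token; the token ids below are illustrative):

```python
def pad_batch(sequences, pad_id=0):
    """Pad a list of token-id lists to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 3565, 2033, 102], [101, 2402, 102]])
print(ids)   # [[101, 3565, 2033, 102], [101, 2402, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

This is why, in the batch inspected below, all tensors share one per-batch length (65) rather than the model maximum of 512.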
# Defining DataLoaders
from torch.utils.data import DataLoader
train_dataloader = DataLoader(
tokenized_datasets['train'], shuffle=True, batch_size=32, collate_fn=data_collator
)
eval_dataloader = DataLoader(
tokenized_datasets['test'], batch_size=64, collate_fn=data_collator
)
# Inspecting a batch from train_dataloader
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'labels': torch.Size([32]),
'input_ids': torch.Size([32, 65]),
'token_type_ids': torch.Size([32, 65]),
'attention_mask': torch.Size([32, 65])}
# Inspecting a batch from train_dataloader
batch.input_ids
tensor([[ 101, 3019, 5320, ..., 0, 0, 0],
[ 101, 1051, 10381, ..., 0, 0, 0],
[ 101, 13970, 13278, ..., 0, 0, 0],
...,
[ 101, 14477, 9587, ..., 0, 0, 0],
[ 101, 1037, 3543, ..., 0, 0, 0],
[ 101, 15274, 1004, ..., 0, 0, 0]])
Step-by-step setting of the training stage¶
Model Instantiation¶
# Instantiate a new model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
# Passing a single batch to our model to check that everything is OK
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)
tensor(2.3496, grad_fn=<NllLossBackward0>) torch.Size([32, 10])
Note: When labels are provided, HF Transformers models return both the loss and the logits (here, one logit per genre, i.e. ten per input).
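The reported loss is the mean cross-entropy of the true labels under the softmax of the logits. A small numeric sketch with toy logits (three classes, not the model's actual outputs):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
labels = np.array([0, 1])
print(round(float(cross_entropy(logits, labels)), 4))  # 0.3851
```

With an untrained 10-class head, the expected loss is about ln(10) ≈ 2.30, which matches the 2.3496 printed above.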
outputs.logits
tensor([[-4.0785e-01, -5.9843e-01, 2.3831e-01, -3.6281e-01, 7.7003e-02,
-7.9042e-01, 3.7949e-01, 6.3509e-02, 2.5117e-01, -4.0870e-01],
[-4.7463e-01, 8.1971e-01, -5.7489e-01, 2.9883e-01, -5.2216e-03,
2.9287e-01, 1.3395e-01, -1.2713e-01, -6.2679e-03, 2.6193e-01],
[-3.2331e-01, -7.0246e-01, 3.2733e-01, -4.3661e-01, 1.2958e-02,
        ... (remaining rows of raw logit values truncated) ...
        [-6.3350e-01, -1.1693e-01, -2.6130e-01,  1.4510e-01,  6.1222e-02,
          2.0786e-01, -1.0654e-02,  8.4277e-02, -2.7730e-02,  1.3080e-01]],
       grad_fn=<AddmmBackward0>)
# Instantiate a new model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Setting an optimizer and accelerator¶
# Setting an accelerator and optimizer
from transformers import AdamW
from accelerate import Accelerator
accelerator = Accelerator()
optimizer = AdamW(model.parameters(), lr=1e-5)
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
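The FutureWarning above recommends PyTorch's own AdamW. A minimal sketch of the drop-in swap, using a dummy parameter in place of the real model's parameters:

```python
# Swap suggested by the FutureWarning: torch.optim.AdamW instead of
# transformers.AdamW. A dummy parameter stands in for model.parameters().
import torch

params = [torch.nn.Parameter(torch.zeros(2))]
optimizer = torch.optim.AdamW(params, lr=1e-5)
print(optimizer.defaults["lr"])  # -> 1e-05
```

The call signature is the same, so the rest of the loop is unchanged.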
# Prepare data for accelerator
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)
Setting a scheduler¶
# Setting a learning rate scheduler
from transformers import get_scheduler
num_epochs = 2
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
print(num_training_steps)
2318
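The printed value follows from steps = epochs × batches per epoch (2 × 1159 here), and the `"linear"` schedule then decays the learning rate from 1e-5 down to 0 over those steps. A pure-Python sketch of that decay, using this run's numbers (2318 steps, no warmup):

```python
def linear_lr(step, base_lr=1e-5, num_warmup_steps=0, num_training_steps=2318):
    # Mirrors get_scheduler("linear", ...): ramp up over the warmup steps,
    # then decay linearly to zero at the final training step.
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

print(linear_lr(0))      # full rate at the start
print(linear_lr(1159))   # half the rate at the midpoint
print(linear_lr(2318))   # zero at the end
```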
Verifying infrastructure settings¶
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device
device(type='cpu')
Setting a progress bar to track the training stage¶
# Add a progress bar
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
0%| | 0/3477 [00:00<?, ?it/s]
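The loop above follows the canonical step order: forward pass, loss, backward pass, optimizer step, scheduler step, gradient reset. The same skeleton can be sketched without torch, using 1-D gradient descent on f(w) = (w - 3)² as a stand-in "model" (illustrative only, not the notebook's actual training code):

```python
def train_skeleton(num_epochs=2, steps_per_epoch=50, base_lr=0.1):
    w = 0.0                                        # "model parameter"
    total = num_epochs * steps_per_epoch
    step = 0
    for epoch in range(num_epochs):
        for _ in range(steps_per_epoch):           # one pass over the "batches"
            grad = 2 * (w - 3.0)                   # loss.backward(): d/dw (w-3)^2
            lr = base_lr * (total - step) / total  # lr_scheduler.step(): linear decay
            w -= lr * grad                         # optimizer.step()
            step += 1
    return w

print(train_skeleton())  # converges toward the minimum at w = 3
```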
Setting the evaluation stage¶
metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch['labels'])
metric.compute()
{'accuracy': 0.4532312007767828}
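evaluate's "accuracy" metric is simply correct / total, accumulated batch by batch. A tiny stand-in with the same `add_batch` / `compute` interface (a sketch for intuition, not the real evaluate API):

```python
class SimpleAccuracy:
    """Minimal stand-in for evaluate.load("accuracy")."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # Accumulate per-batch counts, like the metric object above.
        self.correct += sum(p == r for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        return {"accuracy": self.correct / self.total}

metric = SimpleAccuracy()
metric.add_batch(predictions=[1, 2, 3, 0], references=[1, 2, 0, 0])
print(metric.compute())  # -> {'accuracy': 0.75}
```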
Full training loop with accelerate¶
from accelerate import Accelerator
from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler
# Model instantiation
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
# Setting an Optimizer, Accelerator, and Scheduler
optimizer = AdamW(model.parameters(), lr=1e-5)
accelerator = Accelerator()
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)
num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
# Verifying infrastructure settings
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device
device(type='cuda')
# Model training and evaluation
from tqdm.auto import tqdm
metric = evaluate.load("accuracy")
progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_epochs):
    # Training
    model.train()
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in eval_dl:
        with torch.no_grad():
            outputs = model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        labels = batch["labels"]
        predictions_gathered = accelerator.gather(predictions)
        labels_gathered = accelerator.gather(labels)
        metric.add_batch(predictions=predictions_gathered, references=labels_gathered)

    results = metric.compute()
    print(f"epoch {epoch}: {results['accuracy']}")
0%| | 0/3477 [00:00<?, ?it/s]
epoch 0: 0.4238860718524113
epoch 1: 0.4341352896752616
epoch 2: 0.43381163016506635
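Accuracy peaks at epoch 1 (0.4341) and dips slightly at epoch 2, so keeping the best checkpoint rather than the last one can pay off. A minimal sketch of tracking the best epoch (the actual saving step is omitted; in this notebook it would go through accelerator / `save_pretrained`):

```python
def best_epoch(accuracies):
    # Return (epoch_index, score) for the highest validation accuracy.
    best = max(range(len(accuracies)), key=accuracies.__getitem__)
    return best, accuracies[best]

print(best_epoch([0.4239, 0.4341, 0.4338]))  # -> (1, 0.4341)
```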
Preparing submission¶
Now that the classification model is fine-tuned, it's time to check how it performs on the test dataset. Since we aren't using the Trainer API, we need to preprocess the test dataset ourselves.
# Inspect the test dataset
raw_datasets['test'].features
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'labels': ClassLabel(names=['action'], id=None)}
# Convert the test dataset to a dataframe
test_raw_dataset = raw_datasets['test'][:]
test_raw_dataset.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36000 entries, 0 to 35999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   id          36000 non-null  int64
 1   movie_name  36000 non-null  object
 2   synopsis    36000 non-null  object
 3   labels      36000 non-null  int64
dtypes: int64(2), object(2)
memory usage: 10.3 MB
# Turn the test dataframe into a Dataset format again
test_ds = Dataset.from_pandas(test_raw_dataset)
test_ds.features
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'labels': Value(dtype='int64', id=None)}
# Turn "labels" column into a ClassType format
test_ds = test_ds.class_encode_column('labels')
test_ds.features
Stringifying the column: 0%| | 0/36000 [00:00<?, ? examples/s]
Casting to class labels: 0%| | 0/36000 [00:00<?, ? examples/s]
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'labels': ClassLabel(names=['0'], id=None)}
# Tokenize the test dataset
tokenized_test_ds = test_ds.map(tokenize, batched=True)
Map: 0%| | 0/36000 [00:00<?, ? examples/s]
# Inspect the content of the test dataset
tokenized_test_ds.column_names
['id', 'movie_name', 'synopsis', 'labels', 'input_ids', 'token_type_ids', 'attention_mask']
# Create a copy of the original test dataset
from copy import deepcopy
tokenized_test_ds_copy = deepcopy(tokenized_test_ds)
# Remove columns the model doesn't expect
tokenized_test_ds_copy = tokenized_test_ds_copy.remove_columns(['id', 'movie_name', 'synopsis'])
tokenized_test_ds_copy.column_names
['labels', 'input_ids', 'token_type_ids', 'attention_mask']
# Define a DataLoader for the test dataset
from torch.utils.data import DataLoader
test_dataloader = DataLoader(
    tokenized_test_ds_copy, batch_size=64, collate_fn=data_collator
)
# Inspect a batch from test_dataloader
for batch in test_dataloader:
    break
{k: v.shape for k, v in batch.items()}
{'labels': torch.Size([64]),
'input_ids': torch.Size([64, 73]),
'token_type_ids': torch.Size([64, 73]),
'attention_mask': torch.Size([64, 73])}
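The 73 in the shapes above is this batch's longest tokenized sequence: the data collator pads each batch independently to its own maximum length rather than to a global one. A pure-Python sketch of that per-batch (dynamic) padding:

```python
def pad_batch(sequences, pad_id=0):
    # Pad every sequence in the batch to the batch's own longest length.
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

batch = [[101, 2054, 2003, 102], [101, 2023, 102]]
print(pad_batch(batch))  # -> [[101, 2054, 2003, 102], [101, 2023, 102, 0]]
```

A different batch with shorter sequences would be padded to a shorter length, which keeps memory use down compared with padding everything to the dataset-wide maximum.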
# Specify a device type since we aren't using accelerate for the prediction stage
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device
device(type='cuda')
# Run the model to get predictions
num_eval_steps = len(test_dataloader)
progress_bar = tqdm(range(num_eval_steps))
predictions = []
model.eval()
for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    batch_predictions = outputs.logits.argmax(dim=-1).tolist()
    predictions.extend(batch_predictions)
    progress_bar.update(1)
0%| | 0/563 [00:00<?, ?it/s]
# Display some predictions
print(predictions[:20])
[3, 5, 4, 6, 8, 1, 9, 2, 5, 4, 0, 7, 2, 4, 5, 0, 3, 3, 9, 8]
# Convert predictions to their string representations based on the mapping defined in the 'labels' feature.
predicted_genre = raw_datasets['train'].features['labels'].int2str(predictions)
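`int2str` is essentially an index lookup into the ClassLabel's list of names. A sketch with an illustrative genre list (the real ordering comes from the training dataset's `labels` feature, not from this hard-coded list):

```python
# Illustrative genre names; the true mapping is defined by the dataset's
# ClassLabel feature, not by this list.
genre_names = ["action", "adventure", "crime", "family", "fantasy",
               "horror", "mystery", "romance", "scifi", "thriller"]

def int2str(label_ids):
    # Map each integer label to its string name, like ClassLabel.int2str.
    return [genre_names[i] for i in label_ids]

print(int2str([3, 5, 4]))  # -> ['family', 'horror', 'fantasy']
```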
# Create a dataframe specifying movie id and genre
df = pd.DataFrame({'id':tokenized_test_ds['id'], 'genre':predicted_genre})
df.head(3)
| id | genre | |
|---|---|---|
| 0 | 16863 | family |
| 1 | 48456 | horror |
| 2 | 41383 | fantasy |
# Save results to a csv file (index=False keeps the file to the id and genre columns)
df.to_csv('submission.csv', index=False)