Analyzing Movie Genre Predictions through the Lens of Hugging Face Transformers and a Training Loop Approach¶
by Salomon Marquez
01/07/2024
This notebook proposes a solution to the Movie Genre Prediction competition on Hugging Face, using a BERT-based model to classify movie genres from their titles and synopses. Instead of the Hugging Face Trainer API, it implements the training loop manually; this choice gives finer control over the fine-tuning phase by setting and optimizing selected hyperparameters by hand.
The fine-tuned model obtained the following prediction scores:
- Public Score: 0.4260611
- Private Score: 0.4184444
The fine-tuning stage required 1.5 compute units on a T4 GPU and took 23 min on the provided movie dataset. The subsequent prediction stage took approximately 2.5 min.
This notebook was inspired by Anubhav's solution and the concepts acquired from the Hugging Face NLP course.
# View the infrastructure provided by Colab
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    print('Not connected to a GPU')
else:
    print(gpu_info)
Thu Jan 11 11:08:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 38C P8 10W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
# View the assigned RAM memory
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))
if ram_gb < 20:
    print('Not using a high-RAM runtime')
else:
    print('You are using a high-RAM runtime!')
Your runtime has 13.6 gigabytes of available RAM

Not using a high-RAM runtime
# Install libraries and modules
!pip install evaluate datasets transformers[sentencepiece]
!pip install accelerate -U
!pip install huggingface_hub
# Check transformers and accelerate modules version
import transformers
import accelerate
transformers.__version__, accelerate.__version__
('4.35.2', '0.26.0')
# Log in to the Hugging Face Hub to access the movie dataset
from huggingface_hub import notebook_login
notebook_login()
# Import libraries and packages
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
from datasets import load_dataset, Dataset
from collections import Counter
import evaluate
import numpy as np
import pandas as pd
from rich import print
Loading Movie Datasets¶
# Load competition datasets
raw_datasets = load_dataset("datadrivenscience/movie-genre-prediction")
raw_datasets
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: The secret `HF_TOKEN` does not exist in your Colab secrets. To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session. You will be able to reuse this secret in all of your notebooks. Please note that authentication is recommended but still optional to access public models or datasets. warnings.warn(
DatasetDict({
train: Dataset({
features: ['id', 'movie_name', 'synopsis', 'genre'],
num_rows: 54000
})
test: Dataset({
features: ['id', 'movie_name', 'synopsis', 'genre'],
num_rows: 36000
})
})
# Explore train dataset
raw_datasets["train"].features
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'genre': Value(dtype='string', id=None)}
# Explore train dataset
raw_datasets["train"][:5]
{'id': [44978, 50185, 34131, 78522, 2206],
'movie_name': ['Super Me',
'Entity Project',
'Behavioral Family Therapy for Serious Psychiatric Disorders',
'Blood Glacier',
'Apat na anino'],
'synopsis': ['A young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. Selling them makes him rich.',
'A director and her friends renting a haunted house to capture paranormal events in order to prove it and become popular.',
'This is an educational video for families and family therapists that describes the Behavioral Family Therapy approach to dealing with serious psychiatric illnesses.',
'Scientists working in the Austrian Alps discover that a glacier is leaking a liquid that appears to be affecting local wildlife.',
'Buy Day - Four Men Widely - Apart in Life - By Night Shadows United in One Fight Venting the Fire of their Fury Against the Hated Oppressors.'],
'genre': ['fantasy', 'horror', 'family', 'scifi', 'action']}
# Explore test dataset
raw_datasets["test"].features
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'genre': Value(dtype='string', id=None)}
# Explore test dataset
raw_datasets["test"][:5]
{'id': [16863, 48456, 41383, 84007, 40269],
'movie_name': ['A Death Sentence',
'Intermedio',
'30 Chua Phai Tet',
'Paranoiac',
'Ordinary Happiness'],
'synopsis': ["12 y.o. Ida's dad'll die without a DKK1,500,000 operation. Ida plans to steal the money from the bank, her mom installed alarm systems in. She'll need her climbing skills, her 2 friends and 3 go-karts.",
'A group of four teenage friends become trapped in a Mexican border tunnel where they fall prey, one-by one, to tortured ghosts who haunt it.',
"A guy left his home for 12 years till he came back to claim what's his from his father, the vast Land, just to uncover that he had to live that day, year-end Lunar day, for another 12 years.",
'A man long believed dead returns to the family estate to claim his inheritance.',
'After a deadly accident, Paolo comes back on Earth just 92 minutes more, thanks to a calculation error made in a paradise office.'],
'genre': ['action', 'action', 'action', 'action', 'action']}
What are the existing movie genres?¶
# Identifying the existing genres in the train dataset
labels = set(raw_datasets["train"]["genre"])
num_labels = len(labels)
num_labels, labels
(10,
{'action',
'adventure',
'crime',
'family',
'fantasy',
'horror',
'mystery',
'romance',
'scifi',
'thriller'})
# Counting the number of movies per genre in the train dataset
labels_count = Counter(raw_datasets['train']['genre'])
print(labels_count)
Counter({ 'fantasy': 5400, 'horror': 5400, 'family': 5400, 'scifi': 5400, 'action': 5400, 'crime': 5400, 'adventure': 5400, 'mystery': 5400, 'romance': 5400, 'thriller': 5400 })
# Counting the number of movies per genre in the test dataset
labels_count_test = Counter(raw_datasets['test']['genre'])
print(labels_count_test)
Counter({'action': 36000})
# Rename "genre" column as "labels" in the train dataset and turn into a ClassLabel type
raw_datasets = raw_datasets.rename_column('genre','labels')
raw_datasets = raw_datasets.class_encode_column('labels')
raw_datasets['train'].features
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'labels': ClassLabel(names=['action', 'adventure', 'crime', 'family', 'fantasy', 'horror', 'mystery', 'romance', 'scifi', 'thriller'], id=None)}
Answer: The train dataset contains 10 genres that are evenly distributed across the dataset (5,400 movies each). Meanwhile, the test dataset only contains 'action' as a dummy value prior to inference.
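For reference, `class_encode_column` essentially sorts the unique label names and maps each to an integer id. A rough pure-Python stand-in (toy data, not the full genre list) for what it does:

```python
# Rough stand-in for datasets' class_encode_column: sort the unique label
# names alphabetically and assign each an integer id.
genres = ['fantasy', 'horror', 'family', 'scifi', 'action']
names = sorted(set(genres))
label2id = {name: i for i, name in enumerate(names)}
encoded = [label2id[g] for g in genres]
print(names)    # ['action', 'family', 'fantasy', 'horror', 'scifi']
print(encoded)  # [2, 3, 1, 4, 0]
```

This is why the ClassLabel names above appear in alphabetical order, with 'action' mapped to 0 and 'thriller' to 9.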
Removing Duplicated Items¶
# Convert Datasets into Dataframes
raw_datasets.set_format('pandas')
# Convert Datasets into Dataframes
train_dataset = raw_datasets['train'][:]
train_dataset.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54000 entries, 0 to 53999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   id          54000 non-null  int64
 1   movie_name  54000 non-null  object
 2   synopsis    54000 non-null  object
 3   labels      54000 non-null  int64
dtypes: int64(2), object(2)
memory usage: 15.4 MB
train_dataset.head(3)
| | id | movie_name | synopsis | labels |
|---|---|---|---|---|
| 0 | 44978 | Super Me | A young scriptwriter starts bringing valuable ... | 4 |
| 1 | 50185 | Entity Project | A director and her friends renting a haunted h... | 5 |
| 2 | 34131 | Behavioral Family Therapy for Serious Psychiat... | This is an educational video for families and ... | 3 |
# Drop duplicates from the train dataframe
train_dataset = train_dataset.drop_duplicates(['movie_name', 'synopsis'])
train_dataset.info(memory_usage = 'deep')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 46344 entries, 0 to 53998
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   id          46344 non-null  int64
 1   movie_name  46344 non-null  object
 2   synopsis    46344 non-null  object
 3   labels      46344 non-null  int64
dtypes: int64(2), object(2)
memory usage: 13.8 MB
Answer: The train dataset contained 7,656 duplicates.
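The 7,656 figure is simply the difference between the row counts before and after `drop_duplicates`. The same check on a toy frame (illustrative rows, not competition data):

```python
import pandas as pd

df = pd.DataFrame({
    'movie_name': ['Super Me', 'Super Me', 'Entity Project'],
    'synopsis':   ['A young scriptwriter...', 'A young scriptwriter...', 'A director...'],
})
# Rows count as duplicates only if BOTH movie_name and synopsis match.
deduped = df.drop_duplicates(['movie_name', 'synopsis'])
n_duplicates = len(df) - len(deduped)
print(n_duplicates)  # 1
```

For the real dataset: 54,000 − 46,344 = 7,656 duplicated rows removed.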
Analyzing text movie titles and their synopses¶
This section analyzes the length of movie titles and their synopses.
train_dataset.head(3)
| | id | movie_name | synopsis | labels |
|---|---|---|---|---|
| 0 | 44978 | Super Me | A young scriptwriter starts bringing valuable ... | 4 |
| 1 | 50185 | Entity Project | A director and her friends renting a haunted h... | 5 |
| 2 | 34131 | Behavioral Family Therapy for Serious Psychiat... | This is an educational video for families and ... | 3 |
# Create a new column "synopsis_len" that contains the synopsis length
train_dataset['synopsis_len'] = train_dataset['synopsis'].apply(lambda x: len(x))
train_dataset.head(3)
| | id | movie_name | synopsis | labels | synopsis_len |
|---|---|---|---|---|---|
| 0 | 44978 | Super Me | A young scriptwriter starts bringing valuable ... | 4 | 141 |
| 1 | 50185 | Entity Project | A director and her friends renting a haunted h... | 5 | 120 |
| 2 | 34131 | Behavioral Family Therapy for Serious Psychiat... | This is an educational video for families and ... | 3 | 164 |
# Create a new column "movie_name_len" that contains the length of movie_name
train_dataset['movie_name_len'] = train_dataset['movie_name'].apply(lambda x: len(x))
train_dataset.head(3)
| | id | movie_name | synopsis | labels | synopsis_len | movie_name_len |
|---|---|---|---|---|---|---|
| 0 | 44978 | Super Me | A young scriptwriter starts bringing valuable ... | 4 | 141 | 8 |
| 1 | 50185 | Entity Project | A director and her friends renting a haunted h... | 5 | 120 | 14 |
| 2 | 34131 | Behavioral Family Therapy for Serious Psychiat... | This is an educational video for families and ... | 3 | 164 | 59 |
# Order train dataframe by synopsis_len
train_dataset.sort_values(by='synopsis_len', ascending=False)
| | id | movie_name | synopsis | labels | synopsis_len | movie_name_len |
|---|---|---|---|---|---|---|
| 52518 | 46444 | Final Destination | Alex Browning is among a group of high school ... | 5 | 400 | 17 |
| 49498 | 1468 | Bhargava Ramudu | Bhargava, an efficient, yet jobless young man ... | 0 | 395 | 15 |
| 29141 | 71309 | Krishnatulasi | Krishna is a blind young man who works as a gu... | 7 | 381 | 13 |
| 50834 | 44856 | The Sex Cycle | The Cocoa Poodle bar is the central meeting pl... | 4 | 377 | 13 |
| 53370 | 4779 | Uro | Turning his back on a delinquent past and join... | 0 | 370 | 3 |
| ... | ... | ... | ... | ... | ... | ... |
| 6891 | 71298 | Qismat 2 | Fortune 2. | 7 | 10 | 8 |
| 38284 | 5454 | Rader | Invasion. | 0 | 9 | 5 |
| 3301 | 15654 | Adventure Night | TBD | 1 | 3 | 15 |
| 34698 | 42213 | Dark Army | NA. | 4 | 3 | 9 |
| 26774 | 34314 | Prima Ballerina | TBA | 3 | 3 | 15 |
46344 rows × 6 columns
import plotly.figure_factory as ff

# Note: synopsis_len and movie_name_len measure characters, not words
fig = ff.create_distplot([train_dataset['synopsis_len']], ['length'], colors=['#2ca02c'])
fig.update_layout(title_text='Character Count Distribution of Movie Synopses')
fig.show()

fig2 = ff.create_distplot([train_dataset['movie_name_len']], ['length'], colors=['#ffa408'])
fig2.update_layout(title_text='Character Count Distribution of Movie Titles')
fig2.show()
train_dataset['movie_name_len'].max(), train_dataset['synopsis_len'].max()
(180, 400)
The average movie name length is 12 characters. For the synopsis, we see two peaks around 145 and 230 characters. The maxima are 180 characters for the movie name and 400 for the synopsis. So there won't be any truncation issues during tokenization and training: bert-base-uncased supports sequences of up to 512 tokens, and even the longest title-plus-synopsis pair tokenizes to well under that limit.
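As a quick sanity check: BERT's 512 limit is measured in WordPiece tokens, not characters. The whitespace word count gives a rough estimate (WordPiece may split words further, but each token still covers several characters on average), and even the longest synopses fall far below the limit. Using one synopsis from the dataset:

```python
# Rough token estimate: whitespace word count. WordPiece may split rare
# words into subwords, but real inputs stay well under 512 tokens --
# and truncation=True is passed at tokenization time as a safeguard anyway.
synopsis = ("Scientists working in the Austrian Alps discover that a glacier "
            "is leaking a liquid that appears to be affecting local wildlife.")
approx_tokens = len(synopsis.split())
print(len(synopsis), approx_tokens)  # 128 characters, 21 words
```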
Tokenization¶
# Convert the train_dataset Dataframe to DataSet format again
train_ds = Dataset.from_pandas(train_dataset)
train_ds.features
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'labels': Value(dtype='int64', id=None),
'synopsis_len': Value(dtype='int64', id=None),
'movie_name_len': Value(dtype='int64', id=None),
'__index_level_0__': Value(dtype='int64', id=None)}
# Turn "labels" column into ClassLabel type
train_ds = train_ds.class_encode_column('labels')
train_ds.features
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'labels': ClassLabel(names=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], id=None),
'synopsis_len': Value(dtype='int64', id=None),
'movie_name_len': Value(dtype='int64', id=None),
'__index_level_0__': Value(dtype='int64', id=None)}
# Create tokenizer
# i.e. bert-base-uncased, bert-large-uncased, bert-large-uncased-whole-word-masking
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer
BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={
0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
# Do a sample tokenization
sample_tokenized = tokenizer(train_ds['movie_name'][0], train_ds['synopsis'][0])
tokenizer.decode(sample_tokenized['input_ids'])
'[CLS] super me [SEP] a young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. selling them makes him rich. [SEP]'
sample_tokenized
{'input_ids': [101, 3565, 2033, 102, 1037, 2402, 5896, 15994, 4627, 5026, 7070, 5200, 2067, 2013, 2010, 2460, 15446, 1997, 2108, 13303, 2011, 1037, 5698, 1012, 4855, 2068, 3084, 2032, 4138, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# Split Train Dataset (train_ds) into training and test datasets
train_ds = train_ds.train_test_split(test_size=0.2, stratify_by_column="labels")
train_ds
DatasetDict({
train: Dataset({
features: ['id', 'movie_name', 'synopsis', 'labels', 'synopsis_len', 'movie_name_len', '__index_level_0__'],
num_rows: 37075
})
test: Dataset({
features: ['id', 'movie_name', 'synopsis', 'labels', 'synopsis_len', 'movie_name_len', '__index_level_0__'],
num_rows: 9269
})
})
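The split sizes above follow directly from applying the 20% test fraction to the 46,344 deduplicated rows (the exact rounding policy inside `datasets` may differ, but the arithmetic matches here):

```python
total = 46344               # deduplicated train rows
n_test = round(total * 0.2) # 20% held out for evaluation
n_train = total - n_test
print(n_train, n_test)      # 37075 9269
```

Because `stratify_by_column="labels"` is used, each genre keeps the same proportion in both splits.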
# Define a tokenize function
def tokenize(ds):
    return tokenizer(ds['movie_name'], ds['synopsis'], truncation=True)
# Tokenize train_ds
tokenized_datasets = train_ds.map(tokenize, batched=True)
tokenized_datasets
DatasetDict({
train: Dataset({
features: ['id', 'movie_name', 'synopsis', 'labels', 'synopsis_len', 'movie_name_len', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 37075
})
test: Dataset({
features: ['id', 'movie_name', 'synopsis', 'labels', 'synopsis_len', 'movie_name_len', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 9269
})
})
# Select a random sample to verify tokenization
tokenizer.decode(tokenized_datasets['train']['input_ids'][37074])
'[CLS] begum [SEP] a sheltered beauty, begum, is introduced to the enchanting world of bollywood by the enigmatic madan where she discovers true freedom and love come at the price of her passion and life. [SEP]'
Preparing data for the training stage¶
# Removing columns the model doesn't expect
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'movie_name', 'synopsis', 'synopsis_len', 'movie_name_len','__index_level_0__'])
tokenized_datasets['train'].column_names
['labels', 'input_ids', 'token_type_ids', 'attention_mask']
# Setting the datasets format so that they can return Pytorch tensors
tokenized_datasets.set_format("torch")
# Define a data_collator function for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
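DataCollatorWithPadding pads each batch only up to that batch's longest sequence, rather than to a global maximum, which keeps the tensors small. The idea can be sketched in plain Python (pad id 0 matches BERT's [PAD] token; the token ids below are illustrative):

```python
def pad_batch(sequences, pad_id=0):
    """Pad a list of token-id lists to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 3565, 2033, 102], [101, 2402, 102]])
print(ids)   # [[101, 3565, 2033, 102], [101, 2402, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

This is why, in the batch inspected below, all tensors share one per-batch length (65) rather than the model maximum of 512.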
# Defining DataLoaders
from torch.utils.data import DataLoader
train_dataloader = DataLoader(
tokenized_datasets['train'], shuffle=True, batch_size=32, collate_fn=data_collator
)
eval_dataloader = DataLoader(
tokenized_datasets['test'], batch_size=64, collate_fn=data_collator
)
# Inspecting a batch from train_dataloader
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'labels': torch.Size([32]),
'input_ids': torch.Size([32, 65]),
'token_type_ids': torch.Size([32, 65]),
'attention_mask': torch.Size([32, 65])}
# Inspecting a batch from train_dataloader
batch.input_ids
tensor([[ 101, 3019, 5320, ..., 0, 0, 0],
[ 101, 1051, 10381, ..., 0, 0, 0],
[ 101, 13970, 13278, ..., 0, 0, 0],
...,
[ 101, 14477, 9587, ..., 0, 0, 0],
[ 101, 1037, 3543, ..., 0, 0, 0],
[ 101, 15274, 1004, ..., 0, 0, 0]])
Step-by-step setting of the training stage¶
Model Instantiation¶
# Instantiate a new model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
# Passing a single batch to our model to check that everything is OK
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)
tensor(2.3496, grad_fn=<NllLossBackward0>) torch.Size([32, 10])
Note: When labels are provided, HF Transformers models return both the loss and the logits (here, one logit per genre, i.e. ten per input).
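The reported loss is the mean cross-entropy of the true labels under the softmax of the logits. A small numeric sketch with toy logits (three classes, not the model's actual outputs):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
labels = np.array([0, 1])
print(round(float(cross_entropy(logits, labels)), 4))  # 0.3851
```

With an untrained 10-class head, the expected loss is about ln(10) ≈ 2.30, which matches the 2.3496 printed above.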
outputs.logits
tensor([[-4.0785e-01, -5.9843e-01, 2.3831e-01, -3.6281e-01, 7.7003e-02,
-7.9042e-01, 3.7949e-01, 6.3509e-02, 2.5117e-01, -4.0870e-01],
[-4.7463e-01, 8.1971e-01, -5.7489e-01, 2.9883e-01, -5.2216e-03,
2.9287e-01, 1.3395e-01, -1.2713e-01, -6.2679e-03, 2.6193e-01],
[-3.2331e-01, -7.0246e-01, 3.2733e-01, -4.3661e-01, 1.2958e-02,
        ... (remaining rows of raw logit values truncated) ...
        [-6.3350e-01, -1.1693e-01, -2.6130e-01,  1.4510e-01,  6.1222e-02,
          2.0786e-01, -1.0654e-02,  8.4277e-02, -2.7730e-02,  1.3080e-01]],
       grad_fn=<AddmmBackward0>)
# Instantiate a new model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Setting an optimizer and accelerator¶
# Setting an accelerator and optimizer
from transformers import AdamW
from accelerate import Accelerator
accelerator = Accelerator()
optimizer = AdamW(model.parameters(), lr=1e-5)
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
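The FutureWarning above recommends PyTorch's own AdamW. A minimal sketch of the drop-in swap, using a dummy parameter in place of the real model's parameters:

```python
# Swap suggested by the FutureWarning: torch.optim.AdamW instead of
# transformers.AdamW. A dummy parameter stands in for model.parameters().
import torch

params = [torch.nn.Parameter(torch.zeros(2))]
optimizer = torch.optim.AdamW(params, lr=1e-5)
print(optimizer.defaults["lr"])  # -> 1e-05
```

The call signature is the same, so the rest of the loop is unchanged.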
# Prepare data for accelerator
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)
Setting a scheduler¶
# Setting a learning rate scheduler
from transformers import get_scheduler
num_epochs = 2
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
print(num_training_steps)
2318
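The printed value follows from steps = epochs × batches per epoch (2 × 1159 here), and the `"linear"` schedule then decays the learning rate from 1e-5 down to 0 over those steps. A pure-Python sketch of that decay, using this run's numbers (2318 steps, no warmup):

```python
def linear_lr(step, base_lr=1e-5, num_warmup_steps=0, num_training_steps=2318):
    # Mirrors get_scheduler("linear", ...): ramp up over the warmup steps,
    # then decay linearly to zero at the final training step.
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

print(linear_lr(0))      # full rate at the start
print(linear_lr(1159))   # half the rate at the midpoint
print(linear_lr(2318))   # zero at the end
```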
Verifying infrastructure settings¶
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device
device(type='cpu')
Setting a progress bar to track the training stage¶
# Add a progress bar
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
0%| | 0/3477 [00:00<?, ?it/s]
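The loop above follows the canonical step order: forward pass, loss, backward pass, optimizer step, scheduler step, gradient reset. The same skeleton can be sketched without torch, using 1-D gradient descent on f(w) = (w - 3)² as a stand-in "model" (illustrative only, not the notebook's actual training code):

```python
def train_skeleton(num_epochs=2, steps_per_epoch=50, base_lr=0.1):
    w = 0.0                                        # "model parameter"
    total = num_epochs * steps_per_epoch
    step = 0
    for epoch in range(num_epochs):
        for _ in range(steps_per_epoch):           # one pass over the "batches"
            grad = 2 * (w - 3.0)                   # loss.backward(): d/dw (w-3)^2
            lr = base_lr * (total - step) / total  # lr_scheduler.step(): linear decay
            w -= lr * grad                         # optimizer.step()
            step += 1
    return w

print(train_skeleton())  # converges toward the minimum at w = 3
```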
Setting the evaluation stage¶
metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch['labels'])
metric.compute()
{'accuracy': 0.4532312007767828}
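evaluate's "accuracy" metric is simply correct / total, accumulated batch by batch. A tiny stand-in with the same `add_batch` / `compute` interface (a sketch for intuition, not the real evaluate API):

```python
class SimpleAccuracy:
    """Minimal stand-in for evaluate.load("accuracy")."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # Accumulate per-batch counts, like the metric object above.
        self.correct += sum(p == r for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        return {"accuracy": self.correct / self.total}

metric = SimpleAccuracy()
metric.add_batch(predictions=[1, 2, 3, 0], references=[1, 2, 0, 0])
print(metric.compute())  # -> {'accuracy': 0.75}
```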
Full training loop with accelerate¶
from accelerate import Accelerator
from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler
# Model instantiation
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
# Setting an Optimizer, Accelerator, and Scheduler
optimizer = AdamW(model.parameters(), lr=1e-5)
accelerator = Accelerator()
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)
num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
# Verifying infrastructure settings
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device
device(type='cuda')
# Model training and evaluation
from tqdm.auto import tqdm
metric = evaluate.load("accuracy")
progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_epochs):
    # Training
    model.train()
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in eval_dl:
        with torch.no_grad():
            outputs = model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        labels = batch["labels"]
        predictions_gathered = accelerator.gather(predictions)
        labels_gathered = accelerator.gather(labels)
        metric.add_batch(predictions=predictions_gathered, references=labels_gathered)

    results = metric.compute()
    print(f"epoch {epoch}: {results['accuracy']}")
0%| | 0/3477 [00:00<?, ?it/s]
epoch 0: 0.4238860718524113
epoch 1: 0.4341352896752616
epoch 2: 0.43381163016506635
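Accuracy peaks at epoch 1 (0.4341) and dips slightly at epoch 2, so keeping the best checkpoint rather than the last one can pay off. A minimal sketch of tracking the best epoch (the actual saving step is omitted; in this notebook it would go through accelerator / `save_pretrained`):

```python
def best_epoch(accuracies):
    # Return (epoch_index, score) for the highest validation accuracy.
    best = max(range(len(accuracies)), key=accuracies.__getitem__)
    return best, accuracies[best]

print(best_epoch([0.4239, 0.4341, 0.4338]))  # -> (1, 0.4341)
```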
Preparing submission¶
Now that the classification model is fine-tuned, it's time to check how it performs on the test dataset. Since we aren't using the Trainer API, we need to preprocess the test dataset ourselves.
# Inspect the test dataset
raw_datasets['test'].features
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'labels': ClassLabel(names=['action'], id=None)}
# Convert the test dataset to a dataframe
test_raw_dataset = raw_datasets['test'][:]
test_raw_dataset.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36000 entries, 0 to 35999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   id          36000 non-null  int64
 1   movie_name  36000 non-null  object
 2   synopsis    36000 non-null  object
 3   labels      36000 non-null  int64
dtypes: int64(2), object(2)
memory usage: 10.3 MB
# Turn the test dataframe into a Dataset format again
test_ds = Dataset.from_pandas(test_raw_dataset)
test_ds.features
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'labels': Value(dtype='int64', id=None)}
# Turn "labels" column into a ClassType format
test_ds = test_ds.class_encode_column('labels')
test_ds.features
Stringifying the column: 0%| | 0/36000 [00:00<?, ? examples/s]
Casting to class labels: 0%| | 0/36000 [00:00<?, ? examples/s]
{'id': Value(dtype='int64', id=None),
'movie_name': Value(dtype='string', id=None),
'synopsis': Value(dtype='string', id=None),
'labels': ClassLabel(names=['0'], id=None)}
# Tokenize the test dataset
tokenized_test_ds = test_ds.map(tokenize, batched=True)
Map: 0%| | 0/36000 [00:00<?, ? examples/s]
# Inspect the content of the test dataset
tokenized_test_ds.column_names
['id', 'movie_name', 'synopsis', 'labels', 'input_ids', 'token_type_ids', 'attention_mask']
# Create a copy of the original test dataset
from copy import deepcopy
tokenized_test_ds_copy = deepcopy(tokenized_test_ds)
# Remove columns the model doesn't expect
tokenized_test_ds_copy = tokenized_test_ds_copy.remove_columns(['id', 'movie_name', 'synopsis'])
tokenized_test_ds_copy.column_names
['labels', 'input_ids', 'token_type_ids', 'attention_mask']
# Define a DataLoader for the test dataset
from torch.utils.data import DataLoader
test_dataloader = DataLoader(
    tokenized_test_ds_copy, batch_size=64, collate_fn=data_collator
)
# Inspect a batch from test_dataloader
for batch in test_dataloader:
    break
{k: v.shape for k, v in batch.items()}
{'labels': torch.Size([64]),
'input_ids': torch.Size([64, 73]),
'token_type_ids': torch.Size([64, 73]),
'attention_mask': torch.Size([64, 73])}
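The 73 in the shapes above is this batch's longest tokenized sequence: the data collator pads each batch independently to its own maximum length rather than to a global one. A pure-Python sketch of that per-batch (dynamic) padding:

```python
def pad_batch(sequences, pad_id=0):
    # Pad every sequence in the batch to the batch's own longest length.
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

batch = [[101, 2054, 2003, 102], [101, 2023, 102]]
print(pad_batch(batch))  # -> [[101, 2054, 2003, 102], [101, 2023, 102, 0]]
```

A different batch with shorter sequences would be padded to a shorter length, which keeps memory use down compared with padding everything to the dataset-wide maximum.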
# Specify a device type since we aren't using accelerate for the prediction stage
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device
device(type='cuda')
# Run the model to get predictions
num_eval_steps = len(test_dataloader)
progress_bar = tqdm(range(num_eval_steps))
predictions = []
model.eval()
for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    batch_predictions = outputs.logits.argmax(dim=-1).tolist()
    predictions.extend(batch_predictions)
    progress_bar.update(1)
0%| | 0/563 [00:00<?, ?it/s]
# Display some predictions
print(predictions[:20])
[3, 5, 4, 6, 8, 1, 9, 2, 5, 4, 0, 7, 2, 4, 5, 0, 3, 3, 9, 8]
# Convert predictions to their string representations based on the mapping defined in the 'labels' feature.
predicted_genre = raw_datasets['train'].features['labels'].int2str(predictions)
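`int2str` is essentially an index lookup into the ClassLabel's list of names. A sketch with an illustrative genre list (the real ordering comes from the training dataset's `labels` feature, not from this hard-coded list):

```python
# Illustrative genre names; the true mapping is defined by the dataset's
# ClassLabel feature, not by this list.
genre_names = ["action", "adventure", "crime", "family", "fantasy",
               "horror", "mystery", "romance", "scifi", "thriller"]

def int2str(label_ids):
    # Map each integer label to its string name, like ClassLabel.int2str.
    return [genre_names[i] for i in label_ids]

print(int2str([3, 5, 4]))  # -> ['family', 'horror', 'fantasy']
```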
# Create a dataframe specifying movie id and genre
df = pd.DataFrame({'id':tokenized_test_ds['id'], 'genre':predicted_genre})
df.head(3)
| id | genre | |
|---|---|---|
| 0 | 16863 | family |
| 1 | 48456 | horror |
| 2 | 41383 | fantasy |
# Save results to a csv file (index=False keeps the file to the id and genre columns)
df.to_csv('submission.csv', index=False)