This notebook is a follow-up to the EDA of the Kaggle and Arxiv datasets. The aim is to dive deeper into the following tasks:
- Integrate all text data-related competitions (9) from the past two years into the metadata analysis of the Kaggle write-ups. In the first EDA, we only referred to five competitions.
- Analyze the Arxiv dataset in greater detail to compare the insights gained in academia with those learned from text data write-ups reported in A Journey Through Text Data Competitions.
- Take a step further by using the PKE model to extract keywords from both Kaggle write-ups and Arxiv datasets, considering n-gram candidates, stopwords, and integrating a function to compute idf weights.
- Present results using resources other than horizontal bar plots, such as stylecloud and n-gram plots.
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
# Installing Modules
!pip install git+https://github.com/boudinfl/pke.git
!pip install stylecloud wordcloud
# Library Definition
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
import string
from string import punctuation
import pke
from pke import compute_document_frequency
import stylecloud
from IPython.display import Image  # Used to display the generated stylecloud PNG (a PIL `Image` import would be shadowed by this name)
import gc
1. Analyzing Kaggle Writeups Dataset¶
# Reading the Kaggle writeups dataset
writeup_df = pd.read_csv("/kaggle/input/2023-kaggle-ai-report/kaggle_writeups_20230510.csv", parse_dates=[0,3]) # Parse the columns at positions 0 and 3 as dates
writeup_df.head(3)
writeup_df.info(memory_usage='deep')
print("Number of writeups in the dataset: " + str(len(writeup_df)))
The Kaggle writeups dataset contains a total of 3,127 writeups and uses 22 MB of memory
How many unique competitions does the Kaggle writeups dataset contain?¶
num_competitions = writeup_df["Title of Competition"].nunique()
print(f"The dataset has {num_competitions} competitions.")
Answer: The Kaggle writeups dataset includes 310 competitions with a total of 3,127 writeups
What are the earliest and latest competitions?¶
early_comp = writeup_df["Competition Launch Date"].min().strftime('%Y-%m-%d')
late_comp = writeup_df["Competition Launch Date"].max().strftime('%Y-%m-%d')
print(f"The earliest competition is {early_comp} \nThe latest competition is {late_comp}")
Answer: The dates of the Kaggle competitions range from 2010-08-03 to 2023-02-23
How many competitions are from the past two years?¶
Let's consider competitions from January 2021 onwards.
# A `month >= 1` condition is true for every date, so a single cut-off date selects the same rows
writeup_past2years_df = writeup_df[writeup_df["Competition Launch Date"] >= "2021-01-01"]
numcomp_past2years = writeup_past2years_df['Title of Competition'].nunique()
print(f"Number of competitions from the past two years is {numcomp_past2years}")
len(writeup_past2years_df)
writeup_past2years_df["Title of Competition"].unique()
Answer: There are 71 competitions held within the past two years (from January 2021 to February 2023) having 1,073 writeups.
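A note on the date filter: a condition such as `dt.month >= 1` holds for every date, so a "January 2021 onwards" selection reduces to a single cut-off comparison. The following is a minimal sketch on toy dates; the `launch` column name and the dates are illustrative and do not come from the dataset.

```python
import pandas as pd

# Toy launch dates (illustrative values only)
toy = pd.DataFrame({"launch": pd.to_datetime(["2020-12-15", "2021-01-04", "2022-07-30"])})

# Year-and-month mask: the month condition never excludes anything
mask_year_month = (toy["launch"].dt.year >= 2021) & (toy["launch"].dt.month >= 1)

# Equivalent single cut-off comparison
mask_cutoff = toy["launch"] >= pd.Timestamp("2021-01-01")

print(mask_year_month.tolist())
print(mask_cutoff.tolist())
```

A cut-off comparison also generalizes to windows that do not start in January, where combining separate year and month conditions would silently drop rows.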
What are the competitions from the past two years with the most writeups?¶
writeup_past2years_df["Title of Competition"].value_counts().head()
What are the competitions from the past two years with the fewest writeups?¶
writeup_past2years_df["Title of Competition"].value_counts().tail()
Answers: The Feedback Prize - English Language Learning and Jigsaw Rate Severity of Toxic Comments competitions take the lead (both are text-related competitions) whereas the Herbarium 2021 and Herbarium 2022 competitions have only one writeup.
What is the number of writeups corresponding to text data competitions held in the past two years?¶
For this analysis, we're going to use the following external dataset Top 3 Kaggle Text Data Competitions (2021-2023) that has identified nine competitions related to text data.
textdata_df = pd.read_csv("/kaggle/input/top-3-kaggle-text-data-competitions-2021-2023/Summary_27write-ups_AIreport - Text Data Write-ups 27.csv")
textdata_df.head(3)
textdata_df["Competition"].unique()
# Turning unique competitions into a list
list_textdata_comp = list(textdata_df["Competition"].unique())
# Correcting middle dash typos of the list
list_textdata_comp[0] = 'Feedback Prize - Predicting Effective Arguments'
list_textdata_comp[3] = 'Feedback Prize - Evaluating Student Writing'
list_textdata_comp[5] = 'chaii - Hindi and Tamil Question Answering'
list_textdata_comp[7] = 'Coleridge Initiative - Show US the Data'
list_textdata_comp[8] = 'NBME - Score Clinical Patient Notes'
list_textdata_comp
# Keeping only the text data related competitions from the writeups of the past two years
text_past2years_df = writeup_past2years_df[writeup_past2years_df["Title of Competition"].isin(list_textdata_comp)].copy()
text_past2years_df["Title of Competition"].unique()
print(f"Total writeups from the past two years: {len(writeup_past2years_df)}")
print(f"Total writeups related to text data competitions from the past two years: {len(text_past2years_df)}")
text_past2years_df = text_past2years_df.reset_index(drop=True)
text_past2years_df.sort_values(by='Competition Launch Date', ascending=True)
Response: There are 9 competitions related to text data spanning from March 2021 to May 2022 having a total of 208 writeups.
What is the number of writeups per text data competition?¶
text_past2years_df["Title of Competition"].value_counts()
Response: Jigsaw Rate Severity of Toxic Comments takes the lead with 33 writeups
2. Extracting keywords from text data writeups¶
Now that we have identified 9 competitions related to text data and their 208 writeups, let's analyze the content of the writeups using the PKE (Python Keyword Extraction) module. Before stepping into this task, it's essential to implement a text-cleaning stage.
Cleaning text data¶
Let's have a look at the format of a single writeup:
text_past2years_df["Writeup"][0]
# Creating a function that performs several text data cleaning steps
def clean_text(df, col_to_clean):
    # Remove HTML tags
    df['cleaned_text'] = df[col_to_clean].apply(lambda x: re.sub('<[^<]+?>', ' ', x))
    # Remove brackets and apostrophes from Python lists
    df['cleaned_text'] = df['cleaned_text'].apply(lambda x: re.sub(r"[\[\]'\"]"," ", x))
    # Remove change of line characters
    df['cleaned_text'] = df['cleaned_text'].str.replace("\n", " ", regex=True)
    # Remove special characters
    df['cleaned_text'] = df['cleaned_text'].str.replace("-", "", regex=False)
    df['cleaned_text'] = df['cleaned_text'].str.replace("[^a-zA-Z0-9 ]", "", regex=True)
    # Lowercase text
    df['cleaned_text'] = df['cleaned_text'].str.lower()
    return df
After applying the cleaning function, this is the outcome we obtained:
# Applying `clean_text()` function on writeups
text_past2years_df = clean_text(text_past2years_df, 'Writeup')
text_past2years_df['cleaned_text'][0]
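To make the effect of each cleaning step concrete, here is a small sketch that applies the same regular expressions to a single string. The `clean_string()` helper and the sample text are hypothetical and only mirror the steps of `clean_text()`; they are not part of the original notebook.

```python
import re

def clean_string(text):
    """Apply the same cleaning steps as clean_text(), but to one string."""
    text = re.sub('<[^<]+?>', ' ', text)       # strip HTML tags
    text = re.sub(r"[\[\]'\"]", " ", text)     # strip brackets and quotes
    text = text.replace("\n", " ")             # strip line breaks
    text = text.replace("-", "")               # drop hyphens (joins both halves)
    text = re.sub("[^a-zA-Z0-9 ]", "", text)   # drop remaining special characters
    return text.lower()

# Hypothetical writeup fragment
sample = '<p>We used DeBERTa-v3 with pseudo-labeling!</p>\n["5 folds"]'
tokens = clean_string(sample).split()
print(tokens)
# ['we', 'used', 'debertav3', 'with', 'pseudolabeling', '5', 'folds']
```

Note that removing hyphens merges hyphenated terms into a single token ("pseudo-labeling" becomes "pseudolabeling"), which matters later when counting exact phrases in the cleaned text.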
Computing the frequency of keywords in writeups¶
# Creating a list containing all writeups
lst_writeups = text_past2years_df['cleaned_text'].to_list()
The compute_document_frequency() function counts, for every n-gram (up to 3-grams), the number of writeups in which it appears. On a CPU-only session, this task takes around 7 min to complete.
#Reference1: https://github.com/boudinfl/pke/blob/master/examples/compute-df-counts.ipynb
#Reference2: https://boudinfl.github.io/pke/build/html/unsupervised.html
compute_document_frequency(
documents=lst_writeups, # List of writeups
output_file='inspec.df.gz',
language='en', # language of the input files
normalization='stemming', # use porter stemmer
stoplist=list(punctuation), # stoplist (punctuation marks)
n=3 # compute n-grams up to 3-grams
)
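Conceptually, a document-frequency table records in how many documents each n-gram appears. The sketch below shows that idea in plain Python on two toy sentences; it is not PKE's implementation (there is no stemming or stoplist here), and the helper names `ngrams()` and `document_frequency()` are illustrative.

```python
from collections import Counter

def ngrams(tokens, max_n=3):
    """Yield all 1- to max_n-grams of a token list as space-joined strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def document_frequency(docs, max_n=3):
    """Count in how many documents each n-gram occurs (once per document)."""
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc.split(), max_n)))
    return df

toy_docs = ["pseudo labeling helps", "pseudo labeling and ensembling"]
toy_df = document_frequency(toy_docs)
print(toy_df["pseudo labeling"], toy_df["helps"])
```

Using a set per document is what distinguishes document frequency from raw term frequency: an n-gram repeated ten times in one writeup still contributes one count.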
Let's have a look at the frequency of 20 keywords from the writeups collection
from pke import load_document_frequency_file
dict_freq = load_document_frequency_file(input_file='inspec.df.gz')
count = 0  # Initialize a counter
for key, value in dict_freq.items():
    if count < 20:  # Limit to the first 20 key-value pairs
        print(f'{key}: {value}')
        count += 1
    else:
        break
# Deleting unused variables to free up memory (gc was already imported above)
del writeup_df
del writeup_past2years_df
# Freeing up memory
gc.collect()
Extracting keywords from writeups¶
The keyword extraction stage is based on the TfIdf (Term Frequency-Inverse Document Frequency) method from PKE. TfIdf is a popular and effective technique for identifying keyphrases in a collection of text documents. We have created the extract_keywords() function to extract the top 5 keywords from each writeup. This function will process the 208 writeups and render a total of 1,040 keywords (208*5).
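# Load the document-frequency table once instead of on every call
df_counts = load_document_frequency_file(input_file='inspec.df.gz')

def extract_keywords(text):
    # Stoplist: punctuation marks plus English stopwords
    stoplist = list(string.punctuation)
    stoplist += pke.lang.stopwords.get('en')
    extractor = pke.unsupervised.TfIdf()
    # Note: the DF table above was built with normalization='stemming', so
    # unstemmed candidates (normalization=None) may not find their DF entry
    extractor.load_document(input=text,
                            language='en',
                            stoplist=stoplist,
                            normalization=None)
    extractor.candidate_selection(n=3)  # Select 1- to 3-grams
    extractor.candidate_weighting(df=df_counts)  # Weight candidates using the document frequencies
    keyphrases = extractor.get_n_best(n=10)
    # Keep the top 5 keywords (each item is a (keyphrase, score) pair)
    keywords = [keyword[0] for keyword in keyphrases[:5]]
    return keywords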
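The weighting idea behind TfIdf can be sketched in a few lines of plain Python: a term scores high when it is frequent in one document and rare across the collection. This is an illustration only; the smoothing used here (`log2((N + 1) / (1 + df))`) is an assumption chosen for the example, and PKE's exact formula may differ.

```python
import math
from collections import Counter

def tfidf_scores(doc_tokens, corpus):
    """Score each term of one document by tf * idf (illustrative smoothing)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    term_freq = Counter(doc_tokens)
    return {
        term: count * math.log2((n_docs + 1) / (1 + doc_freq[term]))
        for term, count in term_freq.items()
    }

# Toy corpus of three tokenized "writeups"
toy_corpus = [
    ["model", "deberta", "model"],
    ["model", "lstm"],
    ["model", "ensemble"],
]
scores = tfidf_scores(toy_corpus[0], toy_corpus)
print(scores)
```

In this toy corpus "model" appears in every document, so its score collapses to zero even though it is the most frequent term, while "deberta" keeps a positive score because it is specific to the first document.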
# Calculating a memory usage estimation of the collection of writeups to be processed.
text_past2years_df['cleaned_text'].info(memory_usage='deep')
# Creating a bar to track the keyword extraction progress
from tqdm import tqdm
with tqdm(total=len(text_past2years_df), desc="Processing") as pbar:
    def apply_with_progress(text):
        result = extract_keywords(text)
        pbar.update(1)  # Update the progress bar
        return result
    # Apply the function to the Series with progress tracking
    abstract_keywords = text_past2years_df['cleaned_text'].apply(apply_with_progress)
# Displaying the top 5 keywords of the first 10 writeups
abstract_keywords[:10]
The result is a Series of lists that contains the top 5 keywords of each writeup. Let's store these results in a new keywords_lst column.
text_past2years_df['keywords_lst'] = abstract_keywords
text_past2years_df.head(1)
# Freeing up memory
gc.collect()
Plotting extracted keywords from writeups¶
Let's count the most mentioned keywords in the collection of writeups
text_past2years_df_exploded = text_past2years_df.explode('keywords_lst')
text_past2years_df_exploded = text_past2years_df_exploded.reset_index(drop=True)
text_past2years_df_exploded['keywords_lst'].value_counts().head(30)
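As a small illustration of the step above, `explode()` turns each list element into its own row, after which `value_counts()` tallies the keywords across all writeups. The toy dataframe and its `writeup_id` column are made up for the example.

```python
import pandas as pd

# Two toy writeups with two keywords each (illustrative values)
toy = pd.DataFrame({
    "writeup_id": [1, 2],
    "keywords_lst": [["deberta", "ensemble"], ["ensemble", "pseudo labels"]],
})

# One row per (writeup, keyword) pair
exploded = toy.explode("keywords_lst").reset_index(drop=True)

# Number of writeups in which each keyword was selected
counts = exploded["keywords_lst"].value_counts()
print(counts)
```

Because each writeup contributes a keyword at most once to its own top 5, these counts can be read as "number of writeups in which the keyword made the top 5".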
# Converting the previous keyword list into a dataframe
keywords_count_serie = text_past2years_df_exploded['keywords_lst'].value_counts()
keywords_count_df = pd.DataFrame({'Keywords': keywords_count_serie.index,'Count': keywords_count_serie.values})
keywords_count_df.head(5)
# Selecting the top 50 words
keywords_count_50_df = keywords_count_df[:50]
import plotly.express as px
fig = px.bar(keywords_count_50_df, x='Count', y='Keywords', title='Top 50 keywords found in all Writeups by the TfIdf method', orientation='h', width=750, height=900, color='Keywords')
fig.show()
# Filtering out duplicated keywords
keywords_tree = text_past2years_df_exploded['keywords_lst'].to_list()
set_keywords_tree = set(keywords_tree)
lst_keywords_tree = list(set_keywords_tree)
print(f"Total keywords: {len(keywords_tree)} \nUnique keywords: {len(lst_keywords_tree)}")
# Creating a word cloud image using stylecloud
stylecloud.gen_stylecloud(
text=' '.join(lst_keywords_tree),
icon_name='fas fa-tree', # 'fas fa-cloud'; 'fas fa-eye'; ''
palette='cmocean.sequential.Matter_10',
background_color='black',
gradient='horizontal',
size=1024
)
Image(filename="./stylecloud.png", width=1024, height=768)
3. Examining popular architectures, domains, and techniques used in Kaggle writeups based on word occurrences¶
In the previous section, we extracted the top 5 keywords of every writeup and computed an empirical analysis of their occurrences in all writeups. We identified common words, including 'models,' 'competition,' 'training,' 'ensemble,' and 'different.' However, these words do not appear to offer valuable insights about the write-ups. In this section, we will formulate specific questions and provide keywords that are more likely to yield better results in understanding the techniques, text data domains, and architectures used in Kaggle's text data competitions.
What are the main architectures used in the solutions of text data competitions?¶
We considered the following 16 architectures as keywords for this question.
text_architectures_keywords = [
"fasttext", "roberta", "bert", "gpt", "rnn", "cnn", "gru", "t5", "electra", "xlnet",
"encoder", "decoder", "lstm", "transformer", "deberta", "codebert"
]
# Function that matches a list of specific words against a given column of a dataframe
def count_ocurrences_in_dataframe(df, column_name, strings_list):
    # Deduplicate the input strings before building the search pattern
    strings_set = set(strings_list)
    # Keep only the rows where 'column_name' contains any of the strings in 'strings_list'.
    # Joining with '|' builds a regular expression in which the pipe acts as an "OR" operator.
    filtered_df = df[df[column_name].str.contains('|'.join(strings_set))]
    # Create a dictionary to store the counting results
    results_dict = {'String': [], 'Occurrences': []}
    # Iterate over the strings list
    for string in strings_list:
        # Add the string and its corresponding count to the dictionary
        results_dict['String'].append(string)
        # Count the occurrences (substring matches) in the filtered dataframe
        actual_occurrences = filtered_df[column_name].str.count(string).sum()
        results_dict['Occurrences'].append(actual_occurrences)
    # Convert the dictionary to a dataframe
    counts_df = pd.DataFrame(results_dict)
    return counts_df
result = count_ocurrences_in_dataframe(text_past2years_df, 'cleaned_text', text_architectures_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result
fig, ax= plt.subplots(1, 1, figsize=(8,4))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')
ax.set_ylabel('Architectures')
ax.set_xlabel('Occurrences')
ax.set_title('Architectures used in Kaggle text data competitions', fontsize=12)
#ax.set_limits([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.xticks(rotation=90)
plt.show()
Answer: The top 3 architectures mentioned in the writeups of Kaggle text data competitions are BERT, DeBERTa, and RoBERTa.
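One caveat when reading these counts: the counting function matches substrings, so every mention of "roberta", "deberta", or "codebert" also counts as a mention of "bert" (and "encoder" is likewise counted inside "autoencoder"). A possible refinement, sketched here on toy sentences, is to match whole words with `\b` word boundaries; the `count_whole_words()` helper is hypothetical and not part of the original analysis.

```python
import re

def count_whole_words(texts, keywords):
    """Count keyword occurrences that are delimited by word boundaries."""
    counts = {}
    for kw in keywords:
        pattern = re.compile(r"\b" + re.escape(kw) + r"\b")
        counts[kw] = sum(len(pattern.findall(t)) for t in texts)
    return counts

# Toy cleaned texts (illustrative only)
toy_texts = [
    "we fine tuned deberta and roberta",
    "a bert baseline with lstm head",
]
keywords = ["bert", "roberta"]

# Substring counting, as in the analysis above
substring = {kw: sum(t.count(kw) for t in toy_texts) for kw in keywords}
# Whole-word counting
whole = count_whole_words(toy_texts, keywords)
print(substring, whole)
```

Whole-word matching has its own trade-off: after the cleaning step, a model name such as "deberta-v3" becomes "debertav3", which a strict `\bdeberta\b` pattern would miss, so the choice depends on which kind of error is more acceptable.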
Which of the following techniques is used most in the solutions of text data competitions?¶
techniques_keywords = [
"pseudo labeling",
"masked language modeling",
"adversarial weight perturbation",
"model ensembling",
"model efficiency",
"data augmentation"
]
result = count_ocurrences_in_dataframe(text_past2years_df, 'cleaned_text', techniques_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result
fig, ax= plt.subplots(1, 1, figsize=(8,4))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')
ax.set_ylabel('Techniques')
ax.set_xlabel('Occurrences')
ax.set_title('Trending NLP techniques found in Kaggle writeups')
#ax.set_limits([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.xticks(rotation=90)
plt.show()
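Answer: Pseudo labeling is the most frequently mentioned technique, followed by data augmentation. Interestingly, the phrase "model efficiency" barely appears in the writeups, which suggests that Kagglers did not emphasize optimizing their models' efficiency when describing their solutions.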
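These counts rely on exact phrases, so spelling variants are not captured. In particular, the cleaning step removes hyphens, which turns "pseudo-labeling" into "pseudolabeling", and authors may also write "pseudo labels" or "pseudo labelling". The sketch below compares an exact phrase count with a more tolerant regular expression on toy sentences; the pattern is one possible choice, not the one used in the analysis.

```python
import re

# Toy cleaned texts (illustrative only)
toy_cleaned = [
    "we used pseudolabeling on the test set",
    "pseudo labeling gave a boost",
    "pseudo labels from the ensemble",
]

# Exact phrase count, as in the analysis above
exact = sum(t.count("pseudo labeling") for t in toy_cleaned)

# Tolerant pattern: optional space, several common endings
pattern = re.compile(r"\bpseudo ?label(?:ing|ling|s|ed)?\b")
flexible = sum(len(pattern.findall(t)) for t in toy_cleaned)

print(exact, flexible)
```

A tolerant pattern trades precision for recall, so it is worth inspecting a few matches by hand before comparing techniques against each other.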
Which of the following domains is referred to most in the solutions of text data competitions?¶
domain_keywords = [
"text mining",
"text analytics",
"text preprocessing",
"text classification",
"text clustering",
"named entity recognition",
"topic modeling",
"information retrieval",
"text summarization",
"text generation",
"text similarity",
"word embeddings",
"document classification",
"text feature extraction",
"text segmentation",
"text normalization",
"text corpora",
"textual data analysis",
"question answering",
"sentiment analysis",
"language modeling"
]
result = count_ocurrences_in_dataframe(text_past2years_df, 'cleaned_text', domain_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result
fig, ax= plt.subplots(1, 1, figsize=(8,6))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')
ax.set_ylabel('Domains')
ax.set_xlabel('Occurrences')
ax.set_title('NLP domains found in Kaggle writeups')
ax.set_xlim([0, 12])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.xticks(rotation=90)
plt.show()
Answer: Question answering and text classification are the domains most often referred to in the collection of writeups.
# Freeing up memory
gc.collect()
4. Analyzing the Arxiv Dataset¶
As with the Kaggle writeups dataset, we're going to analyze the Arxiv dataset to gain more insights about the strategies followed in Academia related to text data.
# Loading the Arxiv Dataset
df_arxiv = pd.read_json(
'/kaggle/input/2023-kaggle-ai-report/arxiv_metadata_20230510.json',
lines = True,
convert_dates = True,
chunksize = 100000
)
# Reading a single chunk from the Arxiv dataset
for chunk in df_arxiv:
    break
len(chunk)
chunk.head(3)
chunk.info(memory_usage='deep')
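# Reading the remaining chunks and concatenating them into a single dataframe.
# Note: `df_arxiv` is an iterator, so the chunk already consumed above is not read again here.
chunks = []
for chunk in df_arxiv:
    chunks.append(chunk)
arxiv_df = pd.concat(chunks, ignore_index=True)  # Concatenate once instead of inside the loop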
arxiv_df.head(5)
arxiv_df.info(memory_usage='deep')
We found around 2.15 M papers in the Arxiv dataset. This dataframe has a memory usage of 4.0 GB.
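A detail worth keeping in mind with chunked reading: the object returned by `pd.read_json(..., chunksize=...)` is an iterator, so a chunk pulled out for inspection is not yielded again by a later loop. The sketch below shows, on a small in-memory JSON Lines sample, one way to peek at the first chunk and still assemble the complete dataframe. The sample records are made up, and this is a sketch rather than the loading code of this notebook.

```python
import io
import pandas as pd

# Ten toy JSON Lines records (illustrative only)
json_lines = "\n".join('{"id": %d, "title": "paper %d"}' % (i, i) for i in range(10))

reader = pd.read_json(io.StringIO(json_lines), lines=True, chunksize=4)

first = next(iter(reader))   # peek at the first chunk (rows 0-3)
rest = list(reader)          # the iterator continues with the remaining chunks

# Keep the peeked chunk explicitly and concatenate everything once
full = pd.concat([first] + rest, ignore_index=True)
print(len(first), len(full))
```

Collecting the chunks in a list and calling `pd.concat` once also avoids re-copying the growing dataframe on every iteration.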
How many distinct categories of papers are present in the ArXiv Dataset?¶
print(f"The Arxiv Dataset has {arxiv_df['categories'].nunique()} unique categories")
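Answer: The `categories` column of the Arxiv dataset has around 76 k unique values. Each value is a combination of one or more category labels, so this number counts combinations rather than individual categories.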
What are the earliest and latest papers in the Arxiv Dataset?¶
# Turning the 'update_date' column into a datetime format column
arxiv_df['update_date'] = pd.to_datetime(arxiv_df['update_date'])
arxiv_df.info()
arxiv_date_min = arxiv_df['update_date'].min().strftime('%Y-%m-%d')
arxiv_date_max = arxiv_df['update_date'].max().strftime('%Y-%m-%d')
print(f"The Arxiv Dataset includes papers from {arxiv_date_min} to {arxiv_date_max}")
Answer: The Arxiv Dataset contains papers from 2007-05-23 to 2023-05-05
How many papers from the past two years does the Arxiv Dataset contain?¶
# A `month >= 1` condition is true for every date, so a single cut-off date selects the same rows
arxiv_2years_df = arxiv_df[arxiv_df['update_date'] >= "2021-01-01"].copy()
arxiv_2years_df = arxiv_2years_df.reset_index(drop=True)
arxiv_2years_df.head(3)
print(f"The Arxiv Dataset contains {len(arxiv_2years_df)} papers from January 2021 to 2023")
Answer: The Arxiv Dataset contains around 527 k papers from January 2021 to 2023
Which categories have the most papers from the past two years?¶
arxiv_2years_df['categories'].value_counts().head(20)
Answer: Computer Vision takes the lead on number of papers followed by quantum physics.
Which categories have the fewest papers from the past two years?¶
arxiv_2years_df['categories'].value_counts().tail(20)
Answer: The least frequent values are combinations of several categories concatenated in a single row, which makes it hard to identify individual categories.
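Since the `categories` field stores all labels of a paper as one space-separated string, one way to count individual categories is to split each value before tallying. The sketch below uses a handful of made-up category strings to show the difference between counting combinations and counting labels.

```python
from collections import Counter

# Toy `categories` values (illustrative only)
toy_categories = ["cs.CL", "cs.CL cs.LG", "quant-ph", "cs.LG stat.ML cs.CL"]

# Counting the raw strings treats every combination as its own category
n_combinations = len(set(toy_categories))

# Splitting on whitespace counts each individual label
per_label = Counter(label for row in toy_categories for label in row.split())

print(n_combinations, per_label.most_common())
```

With this approach a cross-listed paper contributes to every one of its labels, so the per-label counts add up to more than the number of papers.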
What is the number of NLP-related papers from the past two years?¶
To identify the most popular categories in the NLP domain in the Arxiv Dataset, we searched for the term Natural Language Processing on arXiv for the period from 2021-01 to 2023. We then ordered the results by Announcement date (oldest first) and by Announcement date (newest first), and identified the 19 categories that researchers used most often.
# Filtering out only text data related papers from the Arxiv dataset
nlp_categories_arxiv = [
"cs.SE", # Software Engineering
"cs.CY", # Computers and Society
"cs.IR", # Information Retrieval
"cs.CL", # Computation and Language
"cs.LG", # Machine Learning
"cs.NE", # Neural and Evolutionary Computing
"cs.AI", # Artificial Intelligence
"cs.DL", # Digital Libraries
"cs.HC", # Human Computer Interaction
"cs.SI", # Social and Information Networks
"stat.ML", # Machine Learning
"cs.SD", # Sound
"cs.CR", # Cryptography and Security
"q-fin.ST", # Statistical Finance
"quant-ph", # Quantum Physics
"q-bio.OT", # Other Quantitative Biology
"physics.comp-ph", # Computational Physics
"physics.data-an", # Data Analysis, Statistics, and Probability
"cs.AR" # Hardware Architecture
]
nlp_arxiv_2years_df = arxiv_2years_df[arxiv_2years_df.categories.isin(nlp_categories_arxiv)]
nlp_arxiv_2years_df = nlp_arxiv_2years_df.reset_index(drop=True)
print(f"Number of papers from the past two years: {len(arxiv_2years_df)} \nNumber of NLP-related papers from the past two years: {len(nlp_arxiv_2years_df)}")
nlp_arxiv_2years_df.sort_values('update_date', ascending=True)
Answer: There is a total of 527 k papers from the past two years (from January 2021 to May 2023), and around 46 k of them correspond to the NLP domain.
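Note that `isin()` compares the whole `categories` string, so only papers whose value is exactly one of the listed labels are selected; a paper tagged "cs.CL cs.LG" is not. If cross-listed papers should be included, one alternative is to test whether any of a paper's labels is in the list. A minimal sketch with made-up values and a shortened label set:

```python
# Shortened label set and toy `categories` values (illustrative only)
nlp_labels = {"cs.CL", "cs.IR"}
toy_rows = ["cs.CL", "cs.CL cs.LG", "quant-ph", "math.CO cs.IR"]

# Exact match on the whole string (what an isin() filter does)
exact = [row for row in toy_rows if row in nlp_labels]

# Match if any individual label belongs to the set
any_label = [row for row in toy_rows if nlp_labels & set(row.split())]

print(exact)
print(any_label)
```

The two filters answer different questions: the exact match keeps single-category papers only, whereas the any-label match is broader and would change the paper counts reported in this notebook.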
What is the count of NLP-related papers per category from the past two years?¶
nlp_keywords_serie = nlp_arxiv_2years_df['categories'].value_counts()
nlp_keywords_serie
import plotly.express as px
keywords_count_df = pd.DataFrame({'Keywords': nlp_keywords_serie.index,'Count': nlp_keywords_serie.values})
fig = px.bar(keywords_count_df, x='Count', y='Keywords', title='NLP-related papers per category found in the Arxiv dataset', orientation='h', width=750, height=900, color='Keywords')
fig.show()
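Answer: Interestingly, Quantum Physics takes the lead among the selected categories, followed by Computation and Language and by Machine Learning. Note that the filter selects papers by category label, so not every paper counted here is necessarily about NLP.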
5. Extracting keywords from Arxiv papers¶
In this section, we analyze the content of abstracts of papers using the PKE (Python Keyword Extraction) module to identify main keywords.
Cleaning the Arxiv Dataset¶
Before stepping into this task, it's essential to implement a text-cleaning stage.
# Eliminating duplicates
nlp_arxiv_2years_cleaned_df = nlp_arxiv_2years_df.drop_duplicates(subset=['title'])
len(nlp_arxiv_2years_df), len(nlp_arxiv_2years_cleaned_df)
# Printing a sample abstract
nlp_arxiv_2years_cleaned_df['abstract'][0]
# Creating a function that performs several text data cleaning steps
def clean_text(df, col_to_clean):
    # Remove HTML tags
    df['cleaned_text'] = df[col_to_clean].apply(lambda x: re.sub('<[^<]+?>', ' ', x))
    # Remove brackets and apostrophes from Python lists
    df['cleaned_text'] = df['cleaned_text'].apply(lambda x: re.sub(r"[\[\]'\"]"," ", x))
    # Remove change of line characters
    df['cleaned_text'] = df['cleaned_text'].str.replace("\n", " ", regex=True)
    # Remove special characters
    df['cleaned_text'] = df['cleaned_text'].str.replace("-", "", regex=False)
    df['cleaned_text'] = df['cleaned_text'].str.replace("[^a-zA-Z0-9 ]", "", regex=True)
    # Lowercase text
    df['cleaned_text'] = df['cleaned_text'].str.lower()
    return df
# Applying the `clean_text()` function on abstracts
nlp_arxiv_2years_copy_df = nlp_arxiv_2years_cleaned_df.copy()
nlp_arxiv_2years_copy_df = clean_text(nlp_arxiv_2years_copy_df, 'abstract')
# Printing a cleaned abstract
nlp_arxiv_2years_copy_df['cleaned_text'][0]
nlp_arxiv_2years_copy_df.head(1)
Computing keyword extraction on the Arxiv Dataset¶
The keyword extraction stage is based on the TfIdf (Term Frequency-Inverse Document Frequency) method from PKE. TfIdf is a popular and effective technique for identifying keyphrases in a collection of text documents. We have created the extract_keywords() function to extract the top 5 keywords from each abstract.
# Cleaning up memory
import gc
del df_arxiv
del chunk
del arxiv_df
del arxiv_2years_df
del nlp_arxiv_2years_df
del nlp_arxiv_2years_cleaned_df
gc.collect()
!pip install git+https://github.com/boudinfl/pke.git
def extract_keywords(text):
    # Stoplist: punctuation marks plus English stopwords
    stoplist = list(string.punctuation)
    stoplist += pke.lang.stopwords.get('en')
    extractor = pke.unsupervised.TfIdf()
    extractor.load_document(input=text,
                            language='en',
                            stoplist=stoplist,
                            normalization=None)
    extractor.candidate_selection()  # Select n-gram candidates with the default settings
    # No document-frequency table is passed here, so PKE falls back to its default counts
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=10)
    # Keep the top 5 keywords (each item is a (keyphrase, score) pair)
    keywords = [keyword[0] for keyword in keyphrases[:5]]
    return keywords
nlp_arxiv_2years_copy_df['cleaned_text'].info(memory_usage='deep')
Important: The collection of text-related abstracts (52.5 MB) is about 250 times larger than the processed Kaggle writeups collection (208 KB). Running the keyword extraction on the full ArXiv collection could therefore exhaust the available RAM under the standard CPU settings. For this reason, we selected only 5,000 abstracts to demonstrate that the keyword extraction process works, which yields 25,000 keywords (5*5,000).
# Selecting only 5,000 abstracts to be processed
clean_abstract = nlp_arxiv_2years_copy_df['cleaned_text'][:5000]
clean_abstract.info(memory_usage='deep')
Implementing a progress bar to track the keyword extraction process¶
!pip install tqdm
from tqdm import tqdm
# Creating a tqdm progress bar to track the keyword extraction process. It takes about 6 h
cleanup_interval = 20
with tqdm(total=len(clean_abstract), desc="Processing") as pbar:
    def apply_with_progress(text):
        result = extract_keywords(text)
        pbar.update(1)  # Update the progress bar
        # Check if it's time to clean up memory
        if pbar.n % cleanup_interval == 0:
            gc.collect()
        return result
    # Apply the function to the Series with progress tracking
    abstract_keywords = clean_abstract.apply(apply_with_progress)
abstract_keywords[100:]
Plotting identified keywords from 5,000 papers¶
# Finding the most popular keywords from 5,000 papers
nlp_keywords_serie = abstract_keywords.explode()
nlp_keywords_count = nlp_keywords_serie.value_counts()
nlp_keywords_count_df = pd.DataFrame({'Keywords': nlp_keywords_count.index,'Count': nlp_keywords_count.values})
nlp_keywords_count_df.head(5)
# Selecting the top 50 words
nlp_keywords_count50_df = nlp_keywords_count_df[:50]
import plotly.express as px
fig = px.bar(nlp_keywords_count50_df, x='Count', y='Keywords', title='Top 50 keywords found in 5,000 abstracts by the TfIdf method', orientation='h', width=750, height=900, color='Keywords')
fig.show()
# Filtering out duplicated keywords
keywords_tree = nlp_keywords_serie.to_list()
set_keywords_tree = set(keywords_tree)
lst_keywords_tree = list(set_keywords_tree)
print(f"Total keywords: {len(keywords_tree)} \nUnique keywords: {len(lst_keywords_tree)}")
# Creating a word cloud image using stylecloud
stylecloud.gen_stylecloud(
text=' '.join(lst_keywords_tree),
icon_name='fas fa-tree', # 'fas fa-cloud'; 'fas fa-eye'; ''
palette='cmocean.sequential.Matter_10',
background_color='black',
gradient='horizontal',
size=1024
)
Image(filename="./stylecloud.png", width=1024, height=768)
6. Examining popular architectures, domains, and techniques in the Arxiv dataset based on word occurrences¶
As in the analysis of the Kaggle writeups dataset, in this section we ask specific questions and provide keywords to narrow down our analysis. We will focus specifically on the techniques, text data domains, and architectures most often employed in the papers of the Arxiv dataset.
What are the main architectures used in Academia?¶
We considered the following 16 architectures as keywords for this question.
text_architectures_keywords = [
"fasttext", "roberta", "bert", "gpt", "rnn", "cnn", "gru", "t5", "electra", "xlnet",
"encoder", "decoder", "lstm", "transformer", "deberta", "codebert"
]
# Function that matches a list of specific words against a column of a dataframe
def count_ocurrences_in_dataframe(df, column_name, strings_list):
    # Deduplicate the input strings before building the search pattern
    strings_set = set(strings_list)
    # Keep only the rows where 'column_name' contains any of the strings in 'strings_list'.
    # Joining with '|' builds a regular expression in which the pipe acts as an "OR" operator.
    filtered_df = df[df[column_name].str.contains('|'.join(strings_set))]
    # Create a dictionary to store the counting results
    results_dict = {'String': [], 'Occurrences': []}
    # Iterate over the strings list
    for string in strings_list:
        # Add the string and its corresponding count to the dictionary
        results_dict['String'].append(string)
        # Count the occurrences (substring matches) in the filtered dataframe
        actual_occurrences = filtered_df[column_name].str.count(string).sum()
        results_dict['Occurrences'].append(actual_occurrences)
    # Convert the dictionary to a dataframe
    counts_df = pd.DataFrame(results_dict)
    return counts_df
result = count_ocurrences_in_dataframe(nlp_arxiv_2years_copy_df, 'cleaned_text', text_architectures_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result
fig, ax= plt.subplots(1, 1, figsize=(8,4))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')
ax.set_ylabel('Architectures')
ax.set_xlabel('Occurrences')
ax.set_title('Architectures used in the Arxiv Dataset', fontsize=12)
#ax.set_limits([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.xticks(rotation=90)
plt.show()
Answer: It seems that BERT and encoder-based architectures take the lead in Academia, along with transformers.
Which of the following techniques is used most in Academia?¶
techniques_keywords = [
"pseudo labeling",
"masked language modeling",
"adversarial weight perturbation",
"model ensembling",
"model efficiency",
"data augmentation"
]
result = count_ocurrences_in_dataframe(nlp_arxiv_2years_copy_df, 'cleaned_text', techniques_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result
fig, ax= plt.subplots(1, 1, figsize=(8,4))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')
ax.set_ylabel('Techniques')
ax.set_xlabel('Occurrences')
ax.set_title('Trending NLP techniques found in Academia')
#ax.set_limits([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.xticks(rotation=90)
plt.show()
Answer: Unlike the authors of text-related Kaggle writeups, researchers are more interested in data augmentation techniques than in pseudo labeling.
Which of the following domains is referred to most in Academia?¶
text_data_keywords = [
"text mining",
"text analytics",
"text preprocessing",
"text classification",
"text clustering",
"named entity recognition",
"topic modeling",
"information retrieval",
"text summarization",
"text generation",
"text similarity",
"word embeddings",
"document classification",
"text feature extraction",
"text segmentation",
"text normalization",
"text corpora",
"textual data analysis",
"question answering",
"sentiment analysis",
"language modeling"
]
result = count_ocurrences_in_dataframe(nlp_arxiv_2years_copy_df, 'cleaned_text', text_data_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result
fig, ax= plt.subplots(1, 1, figsize=(8,6))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')
ax.set_ylabel('Domains')
ax.set_xlabel('Occurrences')
ax.set_title('NLP domains found in Academia')
ax.set_xlim([0, 800])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.xticks(rotation=90)
plt.show()
Answer: Regarding the NLP domains, the Kaggle community and Academia agree: both focus their efforts on question answering and text classification.
Conclusion¶
By considering 9 text-data-related competitions instead of 5, we identified 208 writeups (70 times more data to be analyzed than in our previous EDA). This helped us gain a better understanding of Kaggle text data competitions. We've also expanded our consideration to 19 categories that could contain NLP papers, as opposed to the previous 12, for the ArXiv dataset. Here's a general summary of our findings over the last two years:
- BERT and encoder-based architectures are the most popular in both the Kaggle community and academic contexts.
- Pseudo labeling was the most frequently referenced technique in text data writeups, while data augmentation was more prevalent in text-related papers. It appears that researchers prioritize model efficiency, whereas Kagglers might overlook it in their solutions.
- Both the Kaggle community and academia focus their efforts on the question answering and text classification domains.
Appendix¶
Finally, here are some useful tips for processing large datasets:
- Keep an eye on the RAM usage at every stage of your dataset analysis. You can:
  - Assess the memory size of dataframes using the df.info(memory_usage='deep') command.
  - Consider removing dataframes that you no longer need with del df.
  - Free up memory whenever possible using the gc.collect() command.
  - Use the following commands to assess the memory usage of your variables:
from __future__ import print_function # for Python2
import sys
local_vars = list(locals().items())
for var, obj in local_vars:
    print(var, sys.getsizeof(obj))
- Implement a progress bar when executing large processes.