In Depth Analysis of Kaggle and Arxiv Datasets¶

by Salomon Marquez

01/12/2023

This notebook is a follow-up to the EDA Kaggle and Arxiv datasets. The aim is to dive deeper into the following tasks:

- Integrate all nine text data competitions from the past two years into the metadata analysis of the Kaggle write-ups. The first EDA covered only five competitions.
- Analyze the Arxiv dataset in greater detail to compare the insights gained in academia with those learned from the text data write-ups reported in A Journey Through Text Data Competitions.
- Take a step further by using the PKE toolkit to extract keywords from both the Kaggle write-ups and the Arxiv dataset, considering n-gram candidates and stopwords, and integrating a function to compute idf weights.
- Present results using resources other than horizontal bar plots, such as stylecloud and n-gram plots.

In [ ]:

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [ ]:

# Installing Modules
!pip install git+https://github.com/boudinfl/pke.git
!pip install stylecloud wordcloud

In [ ]:

# Library Definition
import gc
import re
import string
from string import punctuation

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import pke
from pke import compute_document_frequency

import stylecloud
# IPython's Image is the one used below to display the generated PNG;
# importing PIL's Image under the same name would only be shadowed by it.
from IPython.display import Image

1. Analyzing Kaggle Writeups Dataset¶

In [ ]:

# Reading the Kaggle writeups dataset
# parse_dates=[0, 3] parses the columns at positions 0 and 3 (the first and the fourth) as dates
writeup_df = pd.read_csv("/kaggle/input/2023-kaggle-ai-report/kaggle_writeups_20230510.csv", parse_dates=[0, 3])
writeup_df.head(3)

In [ ]:

writeup_df.info(memory_usage='deep')

In [ ]:

print("Number of writeups in the dataset: " + str(len(writeup_df)))

The Kaggle writeups dataset contains a total of 3,127 writeups and uses 22 MB of memory

How many unique competitions does the Kaggle writeups dataset have?¶

In [ ]:

num_competitions = writeup_df["Title of Competition"].nunique()
print(f"The dataset has {num_competitions} competitions.")

Answer: The Kaggle writeups dataset includes 310 competitions with a total of 3,127 writeups

What are the earliest and latest competitions?¶

In [ ]:

early_comp = writeup_df["Competition Launch Date"].min().strftime('%Y-%m-%d')
late_comp = writeup_df["Competition Launch Date"].max().strftime('%Y-%m-%d')

print(f"The earliest competition is {early_comp} \nThe latest competition is {late_comp}")

Answer: The dates of the Kaggle competitions range from 2010-08-03 to 2023-02-23

How many competitions are from the past two years?¶

Let's consider competitions from January 2021 onwards.

In [ ]:

# Keep writeups whose competition launched in 2021 or later.
# A separate "month >= 1" condition is always true, so the year check alone is enough.
writeup_past2years_df = writeup_df[writeup_df["Competition Launch Date"].dt.year >= 2021]
numcomp_past2years = writeup_past2years_df['Title of Competition'].nunique()

print(f"Number of competitions from the past two years is {numcomp_past2years}")

In [ ]:

len(writeup_past2years_df)

In [ ]:

writeup_past2years_df["Title of Competition"].unique()

Answer: There are 71 competitions held within the past two years (from January 2021 to February 2023) having 1,073 writeups.

What are the competitions from the past two years with the most writeups?¶

In [ ]:

writeup_past2years_df["Title of Competition"].value_counts().head()

What are the competitions from the past two years with the fewest writeups?¶

In [ ]:

writeup_past2years_df["Title of Competition"].value_counts().tail()

Answers: The Feedback Prize - English Language Learning and Jigsaw Rate Severity of Toxic Comments competitions take the lead (both are text-related competitions), whereas the Herbarium 2021 and Herbarium 2022 competitions have only one writeup each.

What is the number of writeups corresponding to text data competitions held in the past two years?¶

For this analysis, we use the external dataset Top 3 Kaggle Text Data Competitions (2021-2023), which identifies nine competitions related to text data.

In [ ]:

textdata_df = pd.read_csv("/kaggle/input/top-3-kaggle-text-data-competitions-2021-2023/Summary_27write-ups_AIreport - Text Data Write-ups 27.csv")
textdata_df.head(3)

In [ ]:

textdata_df["Competition"].unique()

In [ ]:

# Turning unique competitions into a list 
list_textdata_comp = list(textdata_df["Competition"].unique())

In [ ]:

# Correcting middle dash typos of the list
list_textdata_comp[0] = 'Feedback Prize - Predicting Effective Arguments'
list_textdata_comp[3] = 'Feedback Prize - Evaluating Student Writing'
list_textdata_comp[5] = 'chaii - Hindi and Tamil Question Answering'
list_textdata_comp[7] = 'Coleridge Initiative - Show US the Data'
list_textdata_comp[8] = 'NBME - Score Clinical Patient Notes'

list_textdata_comp
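
The corrections above are applied by list position, which would silently target the wrong titles if the order of the external CSV changed. A minimal sketch of a position-independent alternative follows, assuming the mismatches come from Unicode dashes (en dash, em dash, and similar) typed in place of the ASCII hyphen. `normalize_dashes` is a hypothetical helper and the sample titles are illustrative, not read from the dataset.

```python
import re

# Unicode dash variants to replace with a plain ASCII hyphen
# (assumed to be the cause of the title mismatches)
DASH_PATTERN = re.compile("[\u2010\u2011\u2012\u2013\u2014\u2015\u2212]")

def normalize_dashes(title: str) -> str:
    """Replace Unicode dash variants with '-' and collapse repeated whitespace."""
    title = DASH_PATTERN.sub("-", title)
    return re.sub(r"\s+", " ", title).strip()

# Illustrative titles typed with an en dash and an em dash
raw_titles = [
    "Feedback Prize \u2013 Predicting Effective Arguments",
    "NBME \u2014 Score Clinical Patient Notes",
]
clean_titles = [normalize_dashes(t) for t in raw_titles]
print(clean_titles)
```

Applied to the real list, this would amount to something like `[normalize_dashes(t) for t in list_textdata_comp]`, provided the assumption about the dash characters holds.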

In [ ]:

# Keeping only the text data competitions among the writeups of the past two years
text_past2years_df = writeup_past2years_df[writeup_past2years_df["Title of Competition"].isin(list_textdata_comp)].copy()
text_past2years_df["Title of Competition"].unique()

In [ ]:

print(f"Total writeups from the past two years: {len(writeup_past2years_df)}")
print(f"Total writeups related to text data competitions from the past two years: {len(text_past2years_df)}")

In [ ]:

text_past2years_df = text_past2years_df.reset_index(drop=True)
text_past2years_df.sort_values(by='Competition Launch Date', ascending=True)

Response: There are 9 competitions related to text data spanning from March 2021 to May 2022 having a total of 208 writeups.

What is the number of writeups per text data competition?¶

In [ ]:

text_past2years_df["Title of Competition"].value_counts()

Response: Jigsaw Rate Severity of Toxic Comments takes the lead with 33 writeups

2. Extracting keywords from text data writeups¶

Now that we have identified 9 competitions related to text data and their 208 writeups, let's analyze the content of the writeups using the PKE (Python Keyword Extraction) module. Before doing so, the text needs to be cleaned.

Cleaning text data¶

Let's have a look at the format of a single writeup:

In [ ]:

text_past2years_df["Writeup"][0]

In [ ]:

# Creating a function that performs several text data cleaning steps
def clean_text(df, col_to_clean):

    # Replace HTML tags with spaces
    df['cleaned_text'] = df[col_to_clean].apply(lambda x: re.sub('<[^<]+?>', ' ', x))

    # Replace brackets and quotation marks (remnants of Python list formatting) with spaces
    df['cleaned_text'] = df['cleaned_text'].apply(lambda x: re.sub(r"[\[\]'\"]", " ", x))

    # Replace line breaks with spaces
    df['cleaned_text'] = df['cleaned_text'].str.replace("\n", " ", regex=True)

    # Remove hyphens (joining hyphenated words) and any remaining special characters
    df['cleaned_text'] = df['cleaned_text'].str.replace("-", "", regex=False)
    df['cleaned_text'] = df['cleaned_text'].str.replace("[^a-zA-Z0-9 ]", "", regex=True)

    # Lowercase text
    df['cleaned_text'] = df['cleaned_text'].str.lower()

    return df

After applying the cleaning function, this is the outcome we obtained:

In [ ]:

# Applying `clean_text()` function on writeups
text_past2years_df = clean_text(text_past2years_df, 'Writeup')
text_past2years_df['cleaned_text'][0]
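
To make the effect of each cleaning step concrete, here is a small sketch that applies the same sequence of substitutions to a single string. `clean_string` is a hypothetical per-string counterpart of `clean_text()`, and the sample text is invented for illustration.

```python
import re

def clean_string(text: str) -> str:
    """Apply the same cleaning steps as clean_text(), but to a single string."""
    text = re.sub(r"<[^<]+?>", " ", text)       # HTML tags -> space
    text = re.sub(r"[\[\]'\"]", " ", text)      # brackets and quotes -> space
    text = text.replace("\n", " ")              # line breaks -> space
    text = text.replace("-", "")                # hyphens removed (joins hyphenated words)
    text = re.sub(r"[^a-zA-Z0-9 ]", "", text)   # drop remaining special characters
    return text.lower()

sample = "<p>We used DeBERTa-v3 with pseudo-labeling!</p>\n<b>CV: 0.85</b>"
cleaned = clean_string(sample)
print(cleaned.split())
```

Two side effects are worth keeping in mind for the later keyword matching: hyphenated terms are joined ("pseudo-labeling" becomes "pseudolabeling") and decimal points disappear ("0.85" becomes "085").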

Computing the frequency of keywords in writeups¶

In [ ]:

# Creating a list containing all writeups
lst_writeups = text_past2years_df['cleaned_text'].to_list()

The compute_document_frequency() function counts, for every n-gram (up to 3-grams), the number of writeups in which it appears. On a CPU, this task takes around 7 minutes to complete.

In [ ]:

#Reference1: https://github.com/boudinfl/pke/blob/master/examples/compute-df-counts.ipynb
#Reference2: https://boudinfl.github.io/pke/build/html/unsupervised.html
compute_document_frequency(
    documents=lst_writeups,     # List of writeups
    output_file='inspec.df.gz',
    language='en',              # language of the input files
    normalization='stemming',   # use porter stemmer
    stoplist=list(punctuation), # stoplist (punctuation marks)
    n=3                         # compute n-grams up to 3-grams
)
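
Conceptually, a document frequency table records in how many documents each n-gram occurs. The sketch below illustrates the idea in plain Python with whitespace tokenization only. It is not PKE's implementation, which also handles tokenization, stemming, and the stoplist, and the helper names `ngrams` and `document_frequency` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, max_n=3):
    """Yield all 1- to max_n-grams of a token list as space-joined strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def document_frequency(documents, max_n=3):
    """Count, for each n-gram, the number of documents in which it appears."""
    df = Counter()
    for doc in documents:
        # A set ensures each n-gram is counted at most once per document
        df.update(set(ngrams(doc.split(), max_n)))
    return df

docs = ["deberta large model", "deberta base model", "ensemble of models"]
df_counts = document_frequency(docs)
print(df_counts["deberta"], df_counts["model"], df_counts["deberta large"])
```

Without stemming, "model" and "models" are counted as different terms, which is one reason the call above uses `normalization='stemming'`.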

Let's have a look at the frequency of 20 keywords from the writeups collection

In [ ]:

from itertools import islice
from pke import load_document_frequency_file

dict_freq = load_document_frequency_file(input_file='inspec.df.gz')

# Print the first 20 key-value pairs
for key, value in islice(dict_freq.items(), 20):
    print(f'{key}: {value}')

In [ ]:

# Deleting unused variables to free up memory
del writeup_df
del writeup_past2years_df

gc.collect()

Extracting keywords from writeups¶

The keyword extraction stage is based on the TfIdf (Term Frequency-Inverse Document Frequency) method from PKE, a popular and effective technique for identifying keyphrases in a collection of text documents. The extract_keywords() function returns the top 5 keywords of a writeup. Applied to the 208 writeups, it yields up to 1,040 keywords (208*5).

In [ ]:

# Load the document frequencies once instead of on every call
doc_freq = load_document_frequency_file(input_file='inspec.df.gz')

def extract_keywords(text):
    stoplist = list(string.punctuation)
    stoplist += pke.lang.stopwords.get('en')
    extractor = pke.unsupervised.TfIdf()
    extractor.load_document(input=text,
                            language='en',
                            stoplist=stoplist,
                            normalization=None)

    extractor.candidate_selection(n=3)  # Select 1- to 3-grams
    extractor.candidate_weighting(df=doc_freq)  # Weight candidates using document frequencies
    keyphrases = extractor.get_n_best(n=10)

    # Keep only the keyphrase text (first tuple element) of the top 5
    keywords = [keyword[0] for keyword in keyphrases[:5]]

    return keywords
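
The intuition behind TF-IDF weighting is that a term scores high when it is frequent in one document but rare across the collection. The following sketch uses a generic formulation, tf * log2(N / (1 + df)). The +1 smoothing is an assumption of this example, PKE's exact formula may differ, and `tfidf_scores` and the frequencies are hypothetical.

```python
import math
from collections import Counter

def tfidf_scores(doc_tokens, doc_freq, n_docs):
    """Score each term as tf * log2(n_docs / (1 + df)).

    The +1 smoothing is an assumption of this sketch, not necessarily PKE's formula.
    """
    tf = Counter(doc_tokens)
    return {
        term: count * math.log2(n_docs / (1 + doc_freq.get(term, 0)))
        for term, count in tf.items()
    }

# Hypothetical document frequencies over a collection of 8 documents
doc_freq_toy = {"model": 7, "deberta": 3, "awp": 1}
tokens = ["model", "model", "deberta", "awp"]
scores = tfidf_scores(tokens, doc_freq_toy, n_docs=8)
top = sorted(scores, key=scores.get, reverse=True)
print(top)
```

Here "model" is the most frequent term in the document, yet it ranks last because it occurs in almost every document of the collection.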

In [ ]:

# Calculating a memory usage estimation of the collection of writeups to be processed.
text_past2years_df['cleaned_text'].info(memory_usage='deep')

In [ ]:

# Creating a bar to track the keyword extraction progress
from tqdm import tqdm
with tqdm(total=len(text_past2years_df), desc="Processing") as pbar:
    def apply_with_progress(text):
        result = extract_keywords(text)
        pbar.update(1)  # Update the progress bar
        return result

    # Apply the function to the Series with progress tracking
    abstract_keywords = text_past2years_df['cleaned_text'].apply(apply_with_progress)

In [ ]:

# Displaying the top 5 keywords of the first 10 writeups
abstract_keywords[:10]

The result is a pandas Series of lists, each holding the top 5 keywords of one writeup. Let's store these results in a new keywords_lst column.

In [ ]:

text_past2years_df['keywords_lst'] = abstract_keywords
text_past2years_df.head(1)

In [ ]:

# Freeing up memory 
gc.collect()

Plotting extracted keywords from writeups¶

Let's count the most mentioned keywords in the collection of writeups

In [ ]:

text_past2years_df_exploded = text_past2years_df.explode('keywords_lst')
text_past2years_df_exploded = text_past2years_df_exploded.reset_index(drop=True)
text_past2years_df_exploded['keywords_lst'].value_counts().head(30)
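
For readers unfamiliar with `explode()`, this toy example shows how a column of lists is turned into one row per keyword before counting. The frame and its values are hypothetical.

```python
import pandas as pd

# Toy frame with one list of keywords per writeup (hypothetical values)
toy_df = pd.DataFrame({
    "writeup_id": [1, 2, 3],
    "keywords_lst": [["deberta", "ensemble"], ["deberta", "awp"], ["ensemble", "deberta"]],
})

# explode() emits one row per list element, repeating the other columns
exploded = toy_df.explode("keywords_lst").reset_index(drop=True)
counts = exploded["keywords_lst"].value_counts()
print(counts)
```

Three writeups with two keywords each become six rows, and `value_counts()` then tallies how many writeups list each keyword.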

In [ ]:

# Converting the previous keyword list into a dataframe
keywords_count_serie = text_past2years_df_exploded['keywords_lst'].value_counts()
keywords_count_df = pd.DataFrame({'Keywords': keywords_count_serie.index,'Count': keywords_count_serie.values})
keywords_count_df.head(5)

In [ ]:

# Selecting the top 50 keywords
keywords_count_50_df = keywords_count_df[:50]

In [ ]:

import plotly.express as px

fig = px.bar(keywords_count_50_df, x='Count', y='Keywords', title='Top 50 keywords found in all Writeups by the TfIdf method', orientation='h', width=750, height=900, color='Keywords')
fig.show()

In [ ]:

# Filtering out duplicated keywords
keywords_tree = text_past2years_df_exploded['keywords_lst'].to_list()
set_keywords_tree = set(keywords_tree)
lst_keywords_tree = list(set_keywords_tree)
print(f"Total keywords: {len(keywords_tree)} \nUnique keywords: {len(lst_keywords_tree)}")

In [ ]:

# Creating a word cloud image using stylecloud
stylecloud.gen_stylecloud(
    text=' '.join(lst_keywords_tree), 
    icon_name='fas fa-tree',                     # 'fas fa-cloud'; 'fas fa-eye'; ''
    palette='cmocean.sequential.Matter_10',
    background_color='black',
    gradient='horizontal',
    size=1024
)
Image(filename="./stylecloud.png", width=1024, height=768)

3. Examining popular architectures, domains, and techniques used in Kaggle writeups based on word occurrences¶

In the previous section, we extracted the top 5 keywords of every writeup and counted their occurrences across all writeups. The most common ones include 'models', 'competition', 'training', 'ensemble', and 'different'. However, these words offer little insight into the write-ups. In this section, we formulate specific questions and supply keywords that are more likely to reveal the techniques, text data domains, and architectures used in Kaggle's text data competitions.

What are the main architectures used in the solutions of text data competitions?¶

We considered the following 16 architectures as keywords for this question.

In [ ]:

text_architectures_keywords = [
    "fasttext", "roberta", "bert", "gpt", "rnn", "cnn", "gru", "t5", "electra", "xlnet",
    "encoder", "decoder", "lstm", "transformer", "deberta", "codebert"
]

In [ ]:

# Function that counts how often each string of a list occurs in a given column of a dataframe
def count_ocurrences_in_dataframe(df, column_name, strings_list):
    # Convert the strings_list input to a set to drop duplicated strings
    strings_set = set(strings_list)

    # Keep only the rows where 'column_name' contains any of the strings in 'strings_list'.
    # Joining the strings with '|' builds a regular expression in which the pipe acts as an "OR" operator.
    filtered_df = df[df[column_name].str.contains('|'.join(strings_set))]

    # Create a dictionary to store the counting results
    results_dict = {'String': [], 'Occurrences': []}

    # Iterate over the strings list
    for string in strings_list:
        # Add the string to the dictionary
        results_dict['String'].append(string)

        # Count the occurrences in the filtered dataframe.
        # Note: str.count() matches substrings, so 'bert' is also counted inside 'roberta'.
        actual_occurrences = filtered_df[column_name].str.count(string).sum()
        results_dict['Occurrences'].append(actual_occurrences)

    # Convert the dictionary to a dataframe
    counts_df = pd.DataFrame(results_dict)

    return counts_df

In [ ]:

result = count_ocurrences_in_dataframe(text_past2years_df, 'cleaned_text', text_architectures_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result

In [ ]:

fig, ax= plt.subplots(1, 1, figsize=(8,4))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')

ax.set_ylabel('Architectures')
ax.set_xlabel('Occurrences')
ax.set_title('Architectures used in Kaggle text data competitions', fontsize=12)
#ax.set_limits([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.xticks(rotation=90)
plt.show()

Answer: The top 3 architectures by number of mentions in Kaggle text data competitions are BERT, DeBERTa, and RoBERTa.
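
Because `str.count()` matches substrings, a keyword such as "bert" is also counted inside "roberta", "deberta", and "codebert", so the BERT total includes those mentions. The sketch below contrasts plain substring counting with a variant that skips matches preceded by a letter. The texts and the helpers `count_substring` and `count_standalone` are hypothetical, and the lookbehind is only one possible way to separate the counts.

```python
import re

texts = ["we trained roberta and deberta", "a bert baseline", "codebert helped"]

def count_substring(keyword, texts):
    """Count every occurrence of keyword, including inside longer words."""
    return sum(len(re.findall(re.escape(keyword), t)) for t in texts)

def count_standalone(keyword, texts):
    """Count occurrences not preceded by a letter, so 'bert' inside 'roberta' is skipped.

    Suffixed forms such as 'bertbase' still match, which suits text whose
    hyphens were removed during cleaning.
    """
    pattern = re.compile(r"(?<![a-z])" + re.escape(keyword))
    return sum(len(pattern.findall(t)) for t in texts)

print(count_substring("bert", texts), count_standalone("bert", texts))
```

With the toy texts, substring counting attributes four mentions to "bert", while only one of them refers to BERT itself.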

Which of the following techniques is most used in the solutions of text data competitions?¶

In [ ]:

techniques_keywords = [
    "pseudo labeling",
    "masked language modeling",
    "adversarial weight perturbation",
    "model ensembling",
    "model efficiency",
    "data augmentation"
]

In [ ]:

result = count_ocurrences_in_dataframe(text_past2years_df, 'cleaned_text', techniques_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result

In [ ]:

fig, ax= plt.subplots(1, 1, figsize=(8,4))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')

ax.set_ylabel('Techniques')
ax.set_xlabel('Occurrences')
ax.set_title('Trending NLP techniques found in Kaggle writeups')
#ax.set_limits([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.xticks(rotation=90)
plt.show()

Answer: Pseudo labeling is the most frequently mentioned technique, along with data augmentation. Interestingly, the phrase "model efficiency" hardly appears, which suggests that Kagglers paid little attention to optimizing their models' efficiency, at least under that name.
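
Exact phrase matching is sensitive to spelling. Since the cleaning step removes hyphens, "pseudo-labeling" becomes "pseudolabeling" and no longer matches the keyword "pseudo labeling". A sketch of a more tolerant pattern follows, assuming the variants below are the ones of interest. The pattern name and sample texts are hypothetical.

```python
import re

# Optional space between the two parts, and both 'labeling' and 'labelling' spellings
PSEUDO_LABEL = re.compile(r"pseudo ?labell?ing")

texts = [
    "pseudolabeling gave a big boost",   # 'pseudo-labeling' before cleaning
    "two rounds of pseudo labeling",
    "pseudo labelling on the old data",
    "no extra data used",
]
matches = [len(PSEUDO_LABEL.findall(t)) for t in texts]
print(matches)
```

Because `str.contains()` and `str.count()` interpret their argument as a regular expression by default, a pattern string like `r"pseudo ?labell?ing"` could be passed in the keyword list in place of the plain phrase.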

Which of the following domains is most referred to in the solutions of text data competitions?¶

In [ ]:

domain_keywords = [
     "text mining",
     "text analytics",
     "text preprocessing",
     "text classification",
     "text clustering",
     "named entity recognition",
     "topic modeling",
     "information retrieval",
     "text summarization",
     "text generation",
     "text similarity",
     "word embeddings",
     "document classification",
     "text feature extraction",
     "text segmentation",
     "text normalization",
     "text corpora",
     "textual data analysis",
     "question answering",
     "sentiment analysis",
     "language modeling"    
]

In [ ]:

result = count_ocurrences_in_dataframe(text_past2years_df, 'cleaned_text', domain_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result

In [ ]:

fig, ax= plt.subplots(1, 1, figsize=(8,6))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')

ax.set_ylabel('Domains')
ax.set_xlabel('Occurrences')
ax.set_title('NLP domains found in Kaggle writeups')
ax.set_xlim([0, 12])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.xticks(rotation=90)
plt.show()

Answer: Question answering and text classification are the domains most often referred to in the collection of writeups.

In [ ]:

# Freeing up memory 
gc.collect()

4. Analyzing the Arxiv Dataset¶

As with the Kaggle writeups dataset, we're going to analyze the Arxiv dataset to gain more insights into the text data strategies followed in academia.

In [ ]:

# Loading the Arxiv Dataset lazily: with chunksize set, read_json returns
# an iterator of dataframes (100,000 rows each) rather than a single dataframe
df_arxiv = pd.read_json(
    '/kaggle/input/2023-kaggle-ai-report/arxiv_metadata_20230510.json',
    lines = True, 
    convert_dates = True, 
    chunksize = 100000
)

In [ ]:

# Reading a single chunk from the Arxiv dataset 
for chunk in df_arxiv:
    break
len(chunk)

In [ ]:

chunk.head(3)

In [ ]:

chunk.info(memory_usage='deep')

In [ ]:

# Reading the remaining chunks and concatenating them into a single dataframe.
# Note: the reader is a one-pass iterator, so the chunk already consumed in the
# previous cell is not read again here; prepend `chunk` to the list to include it.
chunks = [chunk_df for chunk_df in df_arxiv]
arxiv_df = pd.concat(chunks, ignore_index=True)
arxiv_df.head(5)

In [ ]:

arxiv_df.info(memory_usage='deep')

We found around 2.15 M papers in the Arxiv dataset. This dataframe has a memory usage of 4.0 GB.
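
One caveat with the chunked loading above: when chunksize is set, pd.read_json returns a reader that can be iterated only once, so a chunk that was read for inspection is not returned again by a later loop. The sketch below illustrates this on toy in-memory data; the records and variable names are made up for illustration and are not part of the Arxiv file.

```python
import io
import pandas as pd

# Five JSON Lines records standing in for the Arxiv metadata file (toy data)
raw = "\n".join(f'{{"id": {i}, "title": "paper {i}"}}' for i in range(5))

reader = pd.read_json(io.StringIO(raw), lines=True, chunksize=2)

first_chunk = next(reader)   # peek at the first chunk (2 rows)
remaining = list(reader)     # the reader continues from the second chunk

# Concatenating only what is left skips the rows that were already read
rest_df = pd.concat(remaining, ignore_index=True)

# Prepending the inspected chunk recovers the full dataset
full_df = pd.concat([first_chunk, *remaining], ignore_index=True)

print(len(first_chunk), len(rest_df), len(full_df))
```

If the full dataset is needed, either re-create the reader before the final loop or keep the inspected chunk and concatenate it with the rest.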

How many distinct categories of papers are present in the ArXiv Dataset?¶

In [ ]:

print(f"The Arxiv Dataset has {arxiv_df['categories'].nunique()} unique categories")

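Answer: The categories column of the Arxiv dataset has around 76 k unique values. Each value is a space-separated combination of category codes, so this figure counts distinct combinations rather than individual categories.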

What are the earliest and latest papers in the Arxiv Dataset?¶

In [ ]:

# Turning the 'update_date' column into a datetime format column 
arxiv_df['update_date'] = pd.to_datetime(arxiv_df['update_date'])
arxiv_df.info()

In [ ]:

arxiv_date_min = arxiv_df['update_date'].min().strftime('%Y-%m-%d')
arxiv_date_max = arxiv_df['update_date'].max().strftime('%Y-%m-%d')
print(f"The Arxiv Dataset includes papers from {arxiv_date_min} to {arxiv_date_max}")

Answer: The Arxiv Dataset contains papers whose last update date ranges from 2007-05-23 to 2023-05-05

How many papers has the Arxiv Dataset from the past two years?¶

In [ ]:

# Keeping papers last updated on or after January 1, 2021
arxiv_2years_df = arxiv_df[arxiv_df['update_date'] >= '2021-01-01'].copy()
arxiv_2years_df = arxiv_2years_df.reset_index(drop=True)
arxiv_2years_df.head(3)

In [ ]:

print(f"The Arxiv Dataset contains {len(arxiv_2years_df)} papers from January 2021 to 2023")

Answer: The Arxiv Dataset contains around 527 k papers from January 2021 to May 2023
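
A note on the date filter: combining separate year and month conditions only gives the intended result when the cutoff month is January. The toy example below, with made-up dates and a hypothetical March 2021 cutoff, sketches how the two approaches diverge and why a single comparison against a timestamp is the safer pattern.

```python
import pandas as pd

# Toy update dates (illustrative only)
dates = pd.to_datetime(
    pd.Series(["2020-12-31", "2021-01-01", "2021-02-15", "2022-01-10"])
)
cutoff = pd.Timestamp("2021-03-01")  # hypothetical cutoff

# Year and month tested separately: wrongly drops January 2022
year_month_mask = (dates.dt.year >= 2021) & (dates.dt.month >= 3)

# Single comparison against the cutoff: keeps every date on or after it
cutoff_mask = dates >= cutoff

print(year_month_mask.sum(), cutoff_mask.sum())
```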

What are the categories with the most papers from the past two years?¶

In [ ]:

arxiv_2years_df['categories'].value_counts().head(20)

Answer: Computer Vision leads in number of papers, followed by Quantum Physics.

What are the categories with the fewest papers from the past two years?¶

In [ ]:

arxiv_2years_df['categories'].value_counts().tail(20)

Answer: The least frequent values are combinations of several categories concatenated in a single string, which makes the individual categories hard to identify.
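
One possible way to recover individual categories from the concatenated strings is to split each value on whitespace and count the resulting codes. This is a minimal sketch on a toy dataframe, assuming the codes are separated by spaces as in the sample rows; it is not the approach used in the rest of this notebook.

```python
import pandas as pd

# Toy stand-in for the 'categories' column (illustrative values)
toy_df = pd.DataFrame(
    {"categories": ["cs.CL", "cs.CL cs.LG", "quant-ph", "cs.LG stat.ML cs.CL"]}
)

# Number of distinct category combinations (what nunique() reports)
n_combinations = toy_df["categories"].nunique()

# Split each string into a list of codes, give every code its own row, then count
per_category = toy_df["categories"].str.split().explode().value_counts()

print(n_combinations)
print(per_category)
```

With this view, a paper listed under three categories contributes one count to each of them.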

What is the number of NLP-related papers from the past two years?¶

To identify the most popular categories in the NLP domain, we searched for the term Natural Language Processing in the Arxiv dataset for the period from 2021-01 to 2023. We then ordered the results by announcement date (oldest first, and then newest first) and identified the 19 categories that researchers referred to most often.

In [ ]:

# Categories used to select text-data-related papers from the Arxiv dataset
nlp_categories_arxiv = [
    "cs.SE",      # Software Engineering
    "cs.CY",      # Computers and Society
    "cs.IR",      # Information Retrieval
    "cs.CL",      # Computation and Language
    "cs.LG",      # Machine Learning
    "cs.NE",      # Neural and Evolutionary Computing
    "cs.AI",      # Artificial Intelligence
    "cs.DL",      # Digital Libraries
    "cs.HC",      # Human Computer Interaction
    "cs.SI",      # Social and Information Networks
    "stat.ML",    # Machine Learning
    "cs.SD",      # Sound
    "cs.CR",      # Cryptography and Security
    "q-fin.ST",   # Statistical Finance
    "quant-ph",   # Quantum Physics
    "q-bio.OT",   # Other Quantitative Biology
    "physics.comp-ph", # Computational Physics
    "physics.data-an", # Data Analysis, Statistics, and Probability
    "cs.AR"            # Hardware Architecture
]

In [ ]:

nlp_arxiv_2years_df = arxiv_2years_df[arxiv_2years_df.categories.isin(nlp_categories_arxiv)]
nlp_arxiv_2years_df = nlp_arxiv_2years_df.reset_index(drop=True)
print(f"Number of papers from the past two years: {len(arxiv_2years_df)} \nNumber of NLP-related papers from the past two years: {len(nlp_arxiv_2years_df)}")

In [ ]:

nlp_arxiv_2years_df.sort_values('update_date', ascending=True)

Answer: There are about 527 k papers from the past two years (January 2021 to May 2023), of which around 46 k fall under the selected NLP-related categories.

What is the count of NLP-related papers per category from the past two years?¶

In [ ]:

nlp_keywords_serie = nlp_arxiv_2years_df['categories'].value_counts()
nlp_keywords_serie

In [ ]:

import plotly.express as px

keywords_count_df = pd.DataFrame({'Keywords': nlp_keywords_serie.index,'Count': nlp_keywords_serie.values})
fig = px.bar(keywords_count_df, x='Count', y='Keywords', title='NLP-related papers per category found in the Arxiv dataset', orientation='h', width=750, height=900, color='Keywords')
fig.show()

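Answer: Interestingly, Quantum Physics takes the lead in NLP-related papers, followed by Computation and Language and by Machine Learning. Note that the isin() filter keeps only papers whose categories field is exactly one of the 19 codes, so papers listed under several categories are not counted here.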
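
For comparison, the sketch below contrasts an exact match on the categories string with a match on any of a paper's categories. It runs on a toy dataframe with a hypothetical two-code subset, so the names and counts are illustrative only; applying the second variant to the real data would change the numbers reported in this section.

```python
import pandas as pd

# Hypothetical subset of category codes used only for this example
nlp_codes = {"cs.CL", "cs.IR"}

toy_df = pd.DataFrame(
    {"categories": ["cs.CL", "cs.CL cs.LG", "quant-ph", "math.CO cs.IR"]}
)

# Exact match: the whole string must equal one of the codes
exact_mask = toy_df["categories"].isin(nlp_codes)

# Any match: at least one of the paper's codes is in the set
any_mask = toy_df["categories"].str.split().apply(
    lambda codes: not nlp_codes.isdisjoint(codes)
)

print(exact_mask.sum(), any_mask.sum())
```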

5. Extracting keywords from Arxiv papers¶

In this section, we analyze the abstracts of papers using the PKE (Python Keyphrase Extraction) module to identify their main keywords.

Cleaning the Arxiv Dataset¶

Before extracting keywords, we first need to clean the text data.

In [ ]:

# Eliminating duplicates 
nlp_arxiv_2years_cleaned_df = nlp_arxiv_2years_df.drop_duplicates(subset=['title'])
len(nlp_arxiv_2years_df), len(nlp_arxiv_2years_cleaned_df)

In [ ]:

# Printing a sample abstract 
nlp_arxiv_2years_cleaned_df['abstract'][0]

In [ ]:

# Creating a function that performs several text data cleaning steps 
def clean_text(df, col_to_clean):

    # Remove HTML tags
    df['cleaned_text'] = df[col_to_clean].apply(lambda x: re.sub('<[^<]+?>', ' ', x))
 
    # Remove brackets and apostrophes from Python lists
    df['cleaned_text'] = df['cleaned_text'].apply(lambda x: re.sub(r"[\[\]'\"]"," ", x))
    
    # Remove change of line characters 
    df['cleaned_text'] = df['cleaned_text'].str.replace("\n", " ", regex=True)
   
    # Remove special characters
    df['cleaned_text'] = df['cleaned_text'].str.replace("-", "", regex=False)
    df['cleaned_text'] = df['cleaned_text'].str.replace("[^a-zA-Z0-9 ]", "", regex=True)
     
    # Lowercase text
    df['cleaned_text'] = df['cleaned_text'].str.lower()
    
    return df

In [ ]:

# Applying the `clean_text()` function on abstracts
nlp_arxiv_2years_copy_df = nlp_arxiv_2years_cleaned_df.copy()
nlp_arxiv_2years_copy_df = clean_text(nlp_arxiv_2years_copy_df, 'abstract')

# Printing a cleaned abstract
nlp_arxiv_2years_copy_df['cleaned_text'][0]

In [ ]:

nlp_arxiv_2years_copy_df.head(1)
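
To see what the cleaning steps do to a single abstract, the same sequence of substitutions can be sketched on a plain string. The helper name clean_string and the sample sentence are made up for illustration; the steps mirror those in clean_text() above.

```python
import re

def clean_string(text):
    """Apply the cleaning steps of clean_text() to one string."""
    text = re.sub(r"<[^<]+?>", " ", text)       # replace HTML tags with a space
    text = re.sub(r"[\[\]'\"]", " ", text)      # replace brackets and quotes
    text = text.replace("\n", " ")              # replace newlines with a space
    text = text.replace("-", "")                # drop hyphens, joining the parts
    text = re.sub(r"[^a-zA-Z0-9 ]", "", text)   # drop remaining special characters
    return text.lower()

sample = "We fine-tune <b>BERT</b>\non [SQuAD]."
cleaned = clean_string(sample)
print(cleaned.split())
```

Two side effects are worth keeping in mind: hyphenated terms are merged into one token (fine-tune becomes finetune), and repeated spaces are left in place, so splitting on whitespace is the simplest way to inspect the tokens.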

Computing keyword extraction on the Arxiv Dataset¶

The keyword extraction stage is based on the TfIdf (Term Frequency-Inverse Document Frequency) method from PKE. TfIdf is a popular and effective technique for identifying keyphrases in a collection of text documents. We have created the extract_keywords() function to extract the top 5 keywords from each abstract.
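
As a rough illustration of the idea behind TF-IDF, the sketch below scores the terms of one document against a tiny corpus using a textbook formula with add-one smoothing. It is an assumption-laden toy: the function name, the corpus, and the exact formula are ours, and PKE's TfIdf implementation relies on its own candidate selection and document-frequency counts, so its scores will differ.

```python
import math
from collections import Counter

def tfidf_top_terms(doc_tokens, corpus_tokens, top_n=5):
    """Rank the terms of one document by term frequency times smoothed idf."""
    n_docs = len(corpus_tokens)
    # Document frequency: number of documents in which each term appears
    doc_freq = Counter(term for tokens in corpus_tokens for term in set(tokens))
    term_freq = Counter(doc_tokens)
    scores = {
        term: tf * math.log((1 + n_docs) / (1 + doc_freq[term]))
        for term, tf in term_freq.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

corpus = [
    ["bert", "model", "text"],
    ["model", "text", "data"],
    ["model", "quantum"],
]
top_terms = tfidf_top_terms(corpus[0], corpus, top_n=2)
print(top_terms)
```

Terms that occur in every document, such as "model" here, receive a weight of zero, which is why frequent but uninformative words drop out of the ranking.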

In [ ]:

# Cleaning up memory 
import gc

del df_arxiv
del chunk
del arxiv_df
del arxiv_2years_df
del nlp_arxiv_2years_df
del nlp_arxiv_2years_cleaned_df

gc.collect() 

In [ ]:

!pip install git+https://github.com/boudinfl/pke.git

In [ ]:

import string
import pke

def extract_keywords(text):
    # Stoplist: punctuation marks plus PKE's English stopwords
    stoplist = list(string.punctuation)
    stoplist += pke.lang.stopwords.get('en')
    extractor = pke.unsupervised.TfIdf()
    extractor.load_document(input=text,
                            language='en',
                            stoplist=stoplist,
                            normalization=None)

    extractor.candidate_selection()  # Select 1- to 3-gram candidates
    extractor.candidate_weighting()  # Weight candidates using document frequencies
    keyphrases = extractor.get_n_best(n=10)

    # Keep the text of the top 5 (keyphrase, score) pairs
    keywords = [keyword[0] for keyword in keyphrases[:5]]

    return keywords

In [ ]:

nlp_arxiv_2years_copy_df['cleaned_text'].info(memory_usage='deep')

Important: The text-related papers (52.5 MB) are about 250 times larger than the processed Kaggle writeups collection (208 KB). Running keyword extraction on all of them would therefore put heavy pressure on RAM under the standard CPU settings. For this reason, we selected only 5,000 abstracts to show that the keyword extraction process works, which yields up to 25,000 keywords (5 × 5,000).
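
The size comparison can be reproduced with Series.memory_usage(deep=True), which returns the size in bytes including the string contents. The snippet below is a small sketch on toy series (the data is made up) and also checks the ratio quoted above from the two reported sizes.

```python
import pandas as pd

# Toy series standing in for the two text collections
small = pd.Series(["short text"] * 10)
large = pd.Series(["short text"] * 1000)

# deep=True also counts the memory held by the Python string objects
small_bytes = small.memory_usage(index=False, deep=True)
large_bytes = large.memory_usage(index=False, deep=True)
print(small_bytes, large_bytes)

# Ratio of the two sizes reported in this notebook: 52.5 MB vs 208 KB
reported_ratio = 52.5e6 / 208e3
print(round(reported_ratio))
```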

In [ ]:

# Selecting only 5,000 abstracts to be processed
clean_abstract = nlp_arxiv_2years_copy_df['cleaned_text'][:5000]
clean_abstract.info(memory_usage='deep')

Implementing a progress bar to track the keyword extraction process¶

In [ ]:

!pip install tqdm

In [ ]:

from tqdm import tqdm

In [ ]:

# Creating a tqdm progress bar to track the keyword extraction process. It takes about 6 h
cleanup_interval = 20

with tqdm(total=len(clean_abstract), desc="Processing") as pbar:
    def apply_with_progress(text):
        result = extract_keywords(text)
        pbar.update(1)  # Update the progress bar
        # Check if it's time to clean up memory
        if pbar.n % cleanup_interval == 0:
            gc.collect()
        return result

    # Apply the function to the Series with progress tracking
    abstract_keywords = clean_abstract.apply(apply_with_progress)
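
As an alternative to the manual progress bar above, tqdm ships a pandas integration that adds a progress_apply method to Series and DataFrames. The sketch below shows it on a toy series with a trivial function in place of extract_keywords(); it does not include the periodic gc.collect() call, so it is a simplification rather than a drop-in replacement.

```python
import pandas as pd
from tqdm import tqdm

# Register progress_apply on pandas objects
tqdm.pandas(desc="Processing")

texts = pd.Series(["bert model", "quantum text", "data augmentation"])

# Same call pattern as Series.apply, with a progress bar shown while it runs
token_counts = texts.progress_apply(lambda text: len(text.split()))
print(token_counts.tolist())
```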

In [ ]:

abstract_keywords[100:]

Plotting identified keywords from 5,000 papers¶

In [ ]:

# Finding the most popular keywords from 5,000 papers
nlp_keywords_serie = abstract_keywords.explode()
nlp_keywords_count = nlp_keywords_serie.value_counts()
nlp_keywords_count_df = pd.DataFrame({'Keywords': nlp_keywords_count.index,'Count': nlp_keywords_count.values})
nlp_keywords_count_df.head(5)

In [ ]:

# Selecting the top 50 words
nlp_keywords_count50_df = nlp_keywords_count_df[:50]

In [ ]:

import plotly.express as px

fig = px.bar(nlp_keywords_count50_df, x='Count', y='Keywords', title='Top 50 keywords found in 5,000 abstracts by the TfIdf method', orientation='h', width=750, height=900, color='Keywords')
fig.show()

In [ ]:

# Filtering out duplicated keywords
keywords_tree = nlp_keywords_serie.to_list()
set_keywords_tree = set(keywords_tree)
lst_keywords_tree = list(set_keywords_tree)
print(f"Total keywords: {len(keywords_tree)} \nUnique keywords: {len(lst_keywords_tree)}")

In [ ]:

# Creating a word cloud image using stylecloud
stylecloud.gen_stylecloud(
    text=' '.join(lst_keywords_tree), 
    icon_name='fas fa-tree',                     # 'fas fa-cloud'; 'fas fa-eye'; ''
    palette='cmocean.sequential.Matter_10',
    background_color='black',
    gradient='horizontal',
    size=1024
)
Image(filename="./stylecloud.png", width=1024, height=768)

6. Examining popular architectures, domains, and techniques in the Arxiv dataset based on word occurrences¶

As in the analysis of the Kaggle writeup dataset, in this section we ask specific questions and provide keywords to narrow down our analysis. We focus on the techniques, text data domains, and architectures most commonly used in the papers of the Arxiv dataset.

What are the main architectures used in Academia?¶

We considered the following 16 architectures as keywords for this question.

In [ ]:

text_architectures_keywords = [
    "fasttext", "roberta", "bert", "gpt", "rnn", "cnn", "gru", "t5", "electra", "xlnet",
    "encoder", "decoder", "lstm", "transformer", "deberta", "codebert"
]

In [ ]:

# Function that counts how often each string in a list occurs in a dataframe column
def count_ocurrences_in_dataframe(df, column_name, strings_list):
    # Remove duplicated strings before building the regular expression
    strings_set = set(strings_list)

    # Keep only the rows where 'column_name' contains at least one of the strings.
    # Joining with '|' builds a regular expression in which the pipe acts as an "OR" operator.
    filtered_df = df[df[column_name].str.contains('|'.join(strings_set))]

    # Create a dictionary to store the counting results
    results_dict = {'String': [], 'Occurrences': []}

    # Iterate over the strings list
    for string in strings_list:
        # Add the string and its corresponding count to the dictionary
        results_dict['String'].append(string)

        # Count substring matches, so "bert" is also counted inside "roberta"
        actual_occurrences = filtered_df[column_name].str.count(string).sum()
        results_dict['Occurrences'].append(actual_occurrences)

    # Convert the dictionary to a dataframe
    counts_df = pd.DataFrame(results_dict)

    return counts_df

In [ ]:

result = count_ocurrences_in_dataframe(nlp_arxiv_2years_copy_df, 'cleaned_text', text_architectures_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result

In [ ]:

fig, ax= plt.subplots(1, 1, figsize=(8,4))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')

ax.set_ylabel('Architectures')
ax.set_xlabel('Occurrences')
ax.set_title('Architectures used in the Arxiv Dataset', fontsize=12)
#ax.set_limits([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.xticks(rotation=90)
plt.show()

Answer: It seems that BERT and encoder-based architectures take the lead in Academia, along with transformers.
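
When reading these counts, keep in mind that the counting function matches substrings, so short keywords also match inside longer ones. The toy example below, with made-up sentences, sketches the difference between substring counting and whole-word counting with regular-expression word boundaries; the whole-word variant is shown only for comparison and is not what produced the figures above.

```python
import re
import pandas as pd

# Toy cleaned texts (illustrative only)
texts = pd.Series(["roberta and bert", "deberta beats bert", "autoencoder"])

keyword = "bert"

# Substring counting: also matches inside "roberta" and "deberta"
substring_hits = texts.str.count(keyword).sum()

# Whole-word counting: \b requires a word boundary on both sides
whole_word_hits = texts.str.count(r"\b" + re.escape(keyword) + r"\b").sum()

print(substring_hits, whole_word_hits)
```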

Which of the following techniques is mostly used in Academia?¶

In [ ]:

techniques_keywords = [
    "pseudo labeling",
    "masked language modeling",
    "adversarial weight perturbation",
    "model ensembling",
    "model efficiency",
    "data augmentation"
]

In [ ]:

result = count_ocurrences_in_dataframe(nlp_arxiv_2years_copy_df, 'cleaned_text', techniques_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result

In [ ]:

fig, ax= plt.subplots(1, 1, figsize=(8,4))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')

ax.set_ylabel('Techniques')
ax.set_xlabel('Occurrences')
ax.set_title('Trending NLP techniques found in Academia')
#ax.set_limits([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.xticks(rotation=90)
plt.show()

Answer: Unlike in the text-related Kaggle writeups, researchers refer to data augmentation techniques more often than to pseudo labeling.

Which of the following domains is mostly referred in Academia?¶

In [ ]:

text_data_keywords = [
     "text mining",
     "text analytics",
     "text preprocessing",
     "text classification",
     "text clustering",
     "named entity recognition",
     "topic modeling",
     "information retrieval",
     "text summarization",
     "text generation",
     "text similarity",
     "word embeddings",
     "document classification",
     "text feature extraction",
     "text segmentation",
     "text normalization",
     "text corpora",
     "textual data analysis",
     "question answering",
     "sentiment analysis",
     "language modeling"    
]

In [ ]:

result = count_ocurrences_in_dataframe(nlp_arxiv_2years_copy_df, 'cleaned_text', text_data_keywords)
sorted_result = result.sort_values('Occurrences', ascending=False)
sorted_result

In [ ]:

fig, ax= plt.subplots(1, 1, figsize=(8,6))
sns.barplot(x = sorted_result['Occurrences'], y = sorted_result['String'], palette='flare')

ax.set_ylabel('Domains')
ax.set_xlabel('Occurrences')
ax.set_title('NLP domains found in Academia')
ax.set_xlim([0, 800])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.xticks(rotation=90)
plt.show()

Answer: Regarding the NLP domains, the Kaggle community and Academia agree: both focus their efforts on question answering and text classification.

Conclusion¶

By considering 9 text-data-related competitions instead of 5, we identified 208 writeups (70 times more data to be analyzed than in our previous EDA). This helped us gain a better understanding of Kaggle text data competitions. We've also expanded our consideration to 19 categories that could contain NLP papers, as opposed to the previous 12, for the ArXiv dataset. Here's a general summary of our findings over the last two years:

BERT and encoder-based architectures are the most popular in both the Kaggle community and academic contexts.
Pseudo labeling was the most frequently referenced technique in text data writeups, while data augmentation was more prevalent in text-related papers. It appears that researchers prioritize model efficiency, whereas Kagglers might overlook it in their solutions.
Both the Kaggle community and academia are increasingly focusing their efforts on Question and Answer (Q&A) and text classification domains.

Appendix¶

Finally, here are some useful tips for processing large datasets:

Keep an eye on RAM usage at every stage of your dataset analysis. You can:
- Assess the memory size of dataframes using the df.info(memory_usage='deep') command.
- Consider removing dataframes that you no longer need with del df.
- Free up memory whenever possible using the gc.collect() command.
- Use the following commands to assess the memory usage of your variables:
```
    import sys

    # sys.getsizeof reports the shallow size of each object in bytes
    local_vars = list(locals().items())
    for var, obj in local_vars:
        print(var, sys.getsizeof(obj))
```
Implement a progress bar when executing large processes
