Classification of DNA Splice Junctions¶

by Salomon Marquez

25/06/2025

Splice junctions are points within a DNA sequence where “superfluous” DNA is removed during the protein synthesis process in higher organisms.

The goal of this project is to identify, given a DNA sequence, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are cut out). In the biological community, exon-intron (EI) boundaries are referred to as acceptors, while intron-exon (IE) boundaries are known as donors.

To predict the type of splicing site, several machine learning algorithms will be implemented, including: k-Nearest Neighbour, Naive Bayes, Artificial Neural Network, Support Vector Machine, Decision Tree, and Random Forest. Finally, the following metrics will be evaluated: precision, recall, f1-score, and error, in order to determine which algorithms performed best for this task.

Visit the repository of the project to check out:

Splice junction data
Colab notebook

1 Installation of Dependencies and Setup of the Working Directory¶

In [ ]:

Copied!





import numpy as np
import pandas as pd
import os
import random
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import CategoricalNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Conv1D, Conv1DTranspose, Flatten, Dense, Reshape, Dropout

from IPython.display import Image
import numpy as np
import pandas as pd
import os
import random
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import CategoricalNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Conv1D, Conv1DTranspose, Flatten, Dense, Reshape, Dropout

from IPython.display import Image

In [ ]:

Copied!





# Fijar semilla para obtener resultados reproducibles cuando se ejecuta el notebook
# Fijar la semilla
seed_value = 123
os.environ['PYTHONHASHSEED'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)
# Fijar semilla para obtener resultados reproducibles cuando se ejecuta el notebook
# Fijar la semilla
seed_value = 123
os.environ['PYTHONHASHSEED'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

In [ ]:

Copied!





# Obtener el directorio actual
actual_wd = os.getcwd()
print("Ruta actual:", actual_wd)

# Establecer directorio de trabajo en Gdrive
# os.chdir("/content/drive/MyDrive/ASIGNATURAS/M0.163 MACHINE LEARNING/[28 MAY - 17 JUN] RETO 4/PEC4")
# Obtener el directorio actual
actual_wd = os.getcwd()
print("Ruta actual:", actual_wd)

# Establecer directorio de trabajo en Gdrive
# os.chdir("/content/drive/MyDrive/ASIGNATURAS/M0.163 MACHINE LEARNING/[28 MAY - 17 JUN] RETO 4/PEC4")

In [ ]:

Copied!

# Visualizar contenido del directorio de trabajo
!ls
# Visualizar contenido del directorio de trabajo
!ls

autoencoder_image.png	 enunciado_PEC4_2425_2.pdf    splice.csv
deep_descriptors.csv	 PEC4_Machine_Learning.html
deep_descriptors.gsheet  PEC4_Machine_Learning.ipynb

2 Data Reading and Preparation¶

In this section, we will answer some general questions about the dataset contained in the file splice.csv

2.1 Read Data¶

In [ ]:

Copied!

# Especificar el nombre del archivo de origen
file_name = "splice.csv"
file_path = os.path.join(actual_wd, file_name)
# Especificar el nombre del archivo de origen
file_name = "splice.csv"
file_path = os.path.join(actual_wd, file_name)

In [ ]:

Copied!

# Guardar en un dataframe el contenido de splice.csv
df_splice = pd.read_csv(file_path, delimiter=',')
# Guardar en un dataframe el contenido de splice.csv
df_splice = pd.read_csv(file_path, delimiter=',')

In [ ]:

Copied!

# Visualizar contenido del dataframe splice
df_splice.info()
# Visualizar contenido del dataframe splice
df_splice.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3190 entries, 0 to 3189
Columns: 482 entries, class to V480
dtypes: int64(480), object(2)
memory usage: 11.7+ MB

In [ ]:

Copied!

# Visualizar las 5 primeras filas
df_splice.head(5)
# Visualizar las 5 primeras filas
df_splice.head(5)

Out[ ]:

	class	seq_name	V1	V2	V4	...	V474	V475	V476
0	EI	ATRINS-DONOR-521	0	0	1	...	1	0	0
1	EI	ATRINS-DONOR-905	1	0	0	...	0	0	1
2	EI	BABAPOE-DONOR-30	0	1	0	...	1	0	0
3	EI	BABAPOE-DONOR-867	0	1	0	...	0	0	1
4	EI	BABAPOE-DONOR-2817	0	1	0	...	0	1	0

5 rows × 482 columns

In [ ]:

Copied!

# Visualizar las 5 últimas filas
df_splice.tail(5)
# Visualizar las 5 últimas filas
df_splice.tail(5)

Out[ ]:

	class	seq_name	V1	V2	V3	...	V473	V474	V475	V476
3185	N	ORAHBPSBD-NEG-2881	0	0	1	...	0	0	1	0
3186	N	ORAINVOL-NEG-2161	0	1	0	...	0	1	0	0
3187	N	ORARGIT-NEG-241	0	0	1	...	0	0	0	1
3188	N	TARHBB-NEG-541	1	0	0	...	1	0	0	0
3189	N	TARHBD-NEG-1981	1	0	0	...	0	0	1	0

5 rows × 482 columns

2.2 How many unique `seq_name` entries does the dataset have?¶

In [ ]:

Copied!

# Obtener el número de secuencias únicas
n_unique = df_splice['seq_name'].nunique()
n_unique
# Obtener el número de secuencias únicas
n_unique = df_splice['seq_name'].nunique()
n_unique

Out[ ]:

In [ ]:

Copied!

# Calcular el número de secuencias repetidas
print(f"El dataset contiene {len(df_splice)-n_unique} secuencia repetidas")
# Calcular el número de secuencias repetidas
print(f"El dataset contiene {len(df_splice)-n_unique} secuencia repetidas")

El dataset contiene 12 secuencia repetidas

In [ ]:

Copied!

# Alternativamente para calcular duplicados
df_splice.duplicated().sum()
# Alternativamente para calcular duplicados
df_splice.duplicated().sum()

Out[ ]:

np.int64(12)

In [ ]:

Copied!

# Identificar las secuencias repetidas
df_splice_unique = df_splice['seq_name'].value_counts()
df_splice_unique.head(15)
# Identificar las secuencias repetidas
df_splice_unique = df_splice['seq_name'].value_counts()
df_splice_unique.head(15)

Out[ ]:

	count
seq_name
HUMMYC3L-ACCEPTOR-4242	2
HUMALBGC-DONOR-17044	2
HUMMYLCA-DONOR-2559	2
HUMMYLCA-DONOR-2388	2
HUMMYLCA-DONOR-1975	2
HUMMYLCA-DONOR-952	2
HUMALBGC-ACCEPTOR-18496	2
HUMMYLCA-ACCEPTOR-924	2
HUMMYLCA-ACCEPTOR-1831	2
HUMMYLCA-ACCEPTOR-2214	2
HUMMYLCA-ACCEPTOR-2481	2
HUMMYLCA-DONOR-644	2
HUMGAPJR-NEG-961	1
HUMGBR-NEG-2521	1
HUMGALAB-NEG-901	1

dtype: int64

In [ ]:

Copied!

# Visualizar un par de secuencias repetidas
df_splice[df_splice['seq_name'].str.contains('HUMMYC3L-ACCEPTOR-4242')]
# Visualizar un par de secuencias repetidas
df_splice[df_splice['seq_name'].str.contains('HUMMYC3L-ACCEPTOR-4242')]

Out[ ]:

	class	seq_name	V1	V2	V3	V4	V5	V6	V7	V8	...	V471	V472	V473	V474	V475	V476	V477	V478	V479	V480
1316	IE	HUMMYC3L-ACCEPTOR-4242	0	0	1	0	0	0	0	0	...	0	0	0	1	0	0	0	0	0	0
1317	IE	HUMMYC3L-ACCEPTOR-4242	0	0	1	0	0	0	0	0	...	0	0	0	1	0	0	0	0	0	0

2 rows × 482 columns

In [ ]:

Copied!

# Eliminar duplicados
df_splice_n = df_splice.drop_duplicates()
print(f"El dataset originalmente contenía {len(df_splice)} registros y después de eliminar duplicados se tienen {len(df_splice_n)} registros ")
# Eliminar duplicados
df_splice_n = df_splice.drop_duplicates()
print(f"El dataset originalmente contenía {len(df_splice)} registros y después de eliminar duplicados se tienen {len(df_splice_n)} registros ")

El dataset originalmente contenía 3190 registros y después de eliminar duplicados se tienen 3178 registros

2.3 How many label types exist?¶

In [ ]:

Copied!

df_splice_label = df_splice_n['class'].value_counts()
df_splice_label
df_splice_label = df_splice_n['class'].value_counts()
df_splice_label

Out[ ]:

	count
class
N	1655
IE	762
EI	761

dtype: int64

We have an imbalanced dataset where slightly more than 50% of the sequences belong to the “no splicing” category. This must be taken into account because the classification models proposed below could become biased toward predicting "N."

2.4 Prepare the encoded sequences in the dataset¶

Here we define the variable seq_encoded, which will contain the 3,178 records and their 480 features.

In [ ]:

Copied!

# Seleccionar desde V1 hasta V480
seq_encoded = df_splice_n.iloc[:,2:]
# Seleccionar desde V1 hasta V480
seq_encoded = df_splice_n.iloc[:,2:]

In [ ]:

Copied!

# Verificar el tipo de variable
type(seq_encoded)
# Verificar el tipo de variable
type(seq_encoded)

Out[ ]:

pandas.core.frame.DataFrame
def __init__(data=None, index: Axes | None=None, columns: Axes | None=None, dtype: Dtype | None=None, copy: bool | None=None) -> None

/usr/local/lib/python3.11/dist-packages/pandas/core/frame.pyTwo-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, dataclass or list-like objects. If
    data is a dict, column order follows insertion-order. If a dict contains Series
    which have an index defined, it is aligned by its index. This alignment also
    occurs if data is a Series or a DataFrame itself. Alignment is done on
    Series/DataFrame inputs.

    If data is a list of dicts, column order follows insertion-order.

index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if
    no indexing information part of input data and no index provided.
columns : Index or array-like
    Column labels to use for resulting frame when data does not have them,
    defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels,
    will perform column selection instead.
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool or None, default None
    Copy data from inputs.
    For dict data, the default of None behaves like ``copy=True``.  For DataFrame
    or 2d ndarray input, the default of None behaves like ``copy=False``.
    If data is a dict containing one or more Series (possibly of different dtypes),
    ``copy=False`` will ensure that these inputs are not copied.

    .. versionchanged:: 1.3.0

See Also
--------
DataFrame.from_records : Constructor from tuples, also record arrays.
DataFrame.from_dict : From dicts of Series, arrays, or dicts.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_table : Read general delimited file into DataFrame.
read_clipboard : Read text from clipboard into DataFrame.

Notes
-----
Please reference the :ref:`User Guide <basics.dataframe>` for more information.

Examples
--------
Constructing DataFrame from a dictionary.

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4

Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

To enforce a single dtype:

>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

Constructing DataFrame from a dictionary including Series:

>>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
>>> pd.DataFrame(data=d, index=[0, 1, 2, 3])
   col1  col2
0     0   NaN
1     1   NaN
2     2   2.0
3     3   3.0

Constructing DataFrame from numpy ndarray:

>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
...                    columns=['a', 'b', 'c'])
>>> df2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Constructing DataFrame from a numpy ndarray that has labeled columns:

>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
...                 dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
>>> df3 = pd.DataFrame(data, columns=['c', 'a'])
...
>>> df3
   c  a
0  3  1
1  6  4
2  9  7

Constructing DataFrame from dataclass:

>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
   x  y
0  0  0
1  0  3
2  2  3

Constructing DataFrame from Series/DataFrame:

>>> ser = pd.Series([1, 2, 3], index=["a", "b", "c"])
>>> df = pd.DataFrame(data=ser, index=["a", "c"])
>>> df
   0
a  1
c  3

>>> df1 = pd.DataFrame([1, 2, 3], index=["a", "b", "c"], columns=["x"])
>>> df2 = pd.DataFrame(data=df1, index=["a", "c"])
>>> df2
   x
a  1
c  3

In [ ]:

Copied!

# Convertir datos a numpy.array
seq_encoded_array = seq_encoded.to_numpy()
seq_encoded_array[:5]
# Convertir datos a numpy.array
seq_encoded_array = seq_encoded.to_numpy()
seq_encoded_array[:5]

Out[ ]:

array([[0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0]])

2.5 Prepare the dataset labels¶

The Class variable in the splicing dataset has three categories: EI, IE, and N. No particular positive class is defined here. This variable must be converted to a numeric type so it can be used in our classification models.

In [ ]:

Copied!

labels = df_splice_n['class'].map({'EI': 0, 'IE': 1, 'N': 2})
labels = df_splice_n['class'].map({'EI': 0, 'IE': 1, 'N': 2})

In [ ]:

Copied!

# Convertir datos a numpy.array
labels_array = labels.to_numpy()
labels_array[:5]
# Convertir datos a numpy.array
labels_array = labels.to_numpy()
labels_array[:5]

Out[ ]:

array([0, 0, 0, 0, 0])

2.6 Create training and test datasets¶

In [ ]:

Copied!

# Comprobar que las variables y los labels para entrenar los modelos de clasificación sean tipo array
print(f"La secuencias codificadas por one-hot encoding son de tipo {type(seq_encoded_array)} con dimensión {seq_encoded_array.shape}\n"
      f"y la variable label es de tipo {type(labels_array)} con dimensión {labels_array.shape}")
# Comprobar que las variables y los labels para entrenar los modelos de clasificación sean tipo array
print(f"La secuencias codificadas por one-hot encoding son de tipo {type(seq_encoded_array)} con dimensión {seq_encoded_array.shape}\n"
      f"y la variable label es de tipo {type(labels_array)} con dimensión {labels_array.shape}")

La secuencias codificadas por one-hot encoding son de tipo <class 'numpy.ndarray'> con dimensión (3178, 480)
y la variable label es de tipo <class 'numpy.ndarray'> con dimensión (3178,)

In [ ]:

Copied!





# Crear train y test datasets
X_train, X_test, y_train, y_test = train_test_split(
    seq_encoded_array,
    labels_array,
    test_size = 0.33, # 33% para el dataset test como lo indica el enunciado de la PEC4
    random_state= 123 # Fijamos la semilla aleatoria en 123
)
# Crear train y test datasets
X_train, X_test, y_train, y_test = train_test_split(
    seq_encoded_array,
    labels_array,
    test_size = 0.33, # 33% para el dataset test como lo indica el enunciado de la PEC4
    random_state= 123 # Fijamos la semilla aleatoria en 123
)

In [ ]:

Copied!

# Visualizar la dimensiones de los 4 datases creados a partir de train_test_split()
X_train.shape, y_train.shape, X_test.shape, y_test.shape
# Visualizar la dimensiones de los 4 datases creados a partir de train_test_split()
X_train.shape, y_train.shape, X_test.shape, y_test.shape

Out[ ]:

((2129, 480), (2129,), (1049, 480), (1049,))

3 Implementation of the Convolutional Autoencoder¶

In this section, we build an autoencoder with the architecture shown in the figure, considering the purpose of each component as described below:

The encoder is a regular CNN composed of convolutional layers and pooling layers. It typically reduces the spatial dimensionality of the inputs (i.e., height and width) while increasing the depth (i.e., the number of feature maps). The decoder must do the reverse (upscale the image and reduce its depth back to the original dimensions), and for this you can use transpose convolutional layers (alternatively, you could combine upsampling layers with convolutional layers). Texto extraído de Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition sección Convolutional Autoencoders.

Because DNA sequences are linear — an ordered chain of nucleotides — we use Conv1D to detect local motifs (such as splicing patterns).

The only way to train the encoder to generate useful representations is to force it to reconstruct the original input — this is where the decoder comes in. Without a decoder, the encoder would not learn anything meaningful, because there would be no metric to tell whether the compression preserves useful information. Once the decoder is trained, we can use the encoder alone for classification models.

In [ ]:

Copied!

Image("autoencoder_image.png",width=600, height=400)
Image("autoencoder_image.png",width=600, height=400)

Out[ ]:

No description has been provided for this image

In [ ]:

Copied!





# Reajustar las dimensiones de X_train puesto que Conv1D requiere datos en formato 3D
m, n   = X_train.shape # Dimensiones del dataset train
p, r   = X_test.shape  # Dimensiones del dataset test

X_train_reshaped = X_train.reshape((m, 60, 8))
X_test_reshaped = X_test.reshape((p, 60, 8))
# Reajustar las dimensiones de X_train puesto que Conv1D requiere datos en formato 3D
m, n   = X_train.shape # Dimensiones del dataset train
p, r   = X_test.shape  # Dimensiones del dataset test

X_train_reshaped = X_train.reshape((m, 60, 8))
X_test_reshaped = X_test.reshape((p, 60, 8))

In [ ]:

Copied!





# AUTOENCODER
# Definir capa de entrada
input_layer = Input(shape=(60, 8))

# Definir encoder
x           = Conv1D(filters=8, kernel_size=3, activation='relu', padding='same')(input_layer)
bottleneck  = Flatten()(x)
#bottleneck  = Dense(32, activation='relu')(x)

# Definir decoder
#x             = Dense(60 * 8, activation='relu')(bottleneck)
x             = Reshape((60, 8))(bottleneck)
output_layer  = Conv1DTranspose(filters=8, kernel_size=3, activation='sigmoid', padding='same')(x)

# Definir modelo
autoencoder = Model(inputs=input_layer, outputs=output_layer)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Entrenar autoencoder
history = autoencoder.fit(X_train_reshaped, X_train_reshaped, epochs=20, batch_size=32, validation_split=0.2)
# AUTOENCODER
# Definir capa de entrada
input_layer = Input(shape=(60, 8))

# Definir encoder
x           = Conv1D(filters=8, kernel_size=3, activation='relu', padding='same')(input_layer)
bottleneck  = Flatten()(x)
#bottleneck  = Dense(32, activation='relu')(x)

# Definir decoder
#x             = Dense(60 * 8, activation='relu')(bottleneck)
x             = Reshape((60, 8))(bottleneck)
output_layer  = Conv1DTranspose(filters=8, kernel_size=3, activation='sigmoid', padding='same')(x)

# Definir modelo
autoencoder = Model(inputs=input_layer, outputs=output_layer)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Entrenar autoencoder
history = autoencoder.fit(X_train_reshaped, X_train_reshaped, epochs=20, batch_size=32, validation_split=0.2)

Epoch 1/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 2s 13ms/step - loss: 0.7044 - val_loss: 0.6261
Epoch 2/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.5937 - val_loss: 0.4684
Epoch 3/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.4215 - val_loss: 0.3057
Epoch 4/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.2815 - val_loss: 0.2242
Epoch 5/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.2105 - val_loss: 0.1758
Epoch 6/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1660 - val_loss: 0.1404
Epoch 7/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.1325 - val_loss: 0.1122
Epoch 8/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1057 - val_loss: 0.0897
Epoch 9/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0842 - val_loss: 0.0718
Epoch 10/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0673 - val_loss: 0.0577
Epoch 11/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0539 - val_loss: 0.0466
Epoch 12/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0435 - val_loss: 0.0380
Epoch 13/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0354 - val_loss: 0.0313
Epoch 14/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.0291 - val_loss: 0.0261
Epoch 15/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0241 - val_loss: 0.0220
Epoch 16/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0202 - val_loss: 0.0187
Epoch 17/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0172 - val_loss: 0.0162
Epoch 18/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0147 - val_loss: 0.0141
Epoch 19/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0127 - val_loss: 0.0124
Epoch 20/20
54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0111 - val_loss: 0.0110

In [ ]:

Copied!





# Evaluar la pérdida de reconstrucción del autoencoder
train_loss  = autoencoder.evaluate(X_train_reshaped, X_train_reshaped)
test_loss   = autoencoder.evaluate(X_test_reshaped, X_test_reshaped)

print(f"\nTrain loss: {train_loss:.4f}")
print(f"Test loss: {test_loss:.4f}")
# Evaluar la pérdida de reconstrucción del autoencoder
train_loss  = autoencoder.evaluate(X_train_reshaped, X_train_reshaped)
test_loss   = autoencoder.evaluate(X_test_reshaped, X_test_reshaped)

print(f"\nTrain loss: {train_loss:.4f}")
print(f"Test loss: {test_loss:.4f}")

67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.0101
33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0101

Train loss: 0.0103
Test loss: 0.0101

In [ ]:

Copied!





# Visualizar la curva de entrenamiento del autoencoder
plt.plot(history.history['loss'], label='Train loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Pérdida de entrenamiento vs validación')
plt.show()
# Visualizar la curva de entrenamiento del autoencoder
plt.plot(history.history['loss'], label='Train loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Pérdida de entrenamiento vs validación')
plt.show()

We observe that the autoencoder quality is good since the reconstruction loss is very low.

In [ ]:

Copied!

# Extraer la parte codificadora "encoder" para usarla en otros modelos
encoder = Model(inputs=input_layer, outputs=bottleneck)
X_train_encoded = encoder.predict(X_train_reshaped)
# Extraer la parte codificadora "encoder" para usarla en otros modelos
encoder = Model(inputs=input_layer, outputs=bottleneck)
X_train_encoded = encoder.predict(X_train_reshaped)

67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step

In [ ]:

Copied!

# Verificar las dimensiones de las nuevas coordenadas de X_train_encoded
X_train_encoded.shape, X_train.shape
# Verificar las dimensiones de las nuevas coordenadas de X_train_encoded
X_train_encoded.shape, X_train.shape

Out[ ]:

((2129, 480), (2129, 480))

In [ ]:

Copied!

# Visualizar X_train_encoded
print(X_train_encoded)
# Visualizar X_train_encoded
print(X_train_encoded)

[[0.41865867 1.1792793  0.         ... 1.0533024  1.2378863  1.4537132 ]
 [0.97194517 1.2243328  0.7337566  ... 1.6940844  0.6968196  2.2641408 ]
 [0.7622293  0.27394313 1.5169445  ... 1.1491497  1.7781353  0.6089502 ]
 ...
 [0.         0.19901627 1.6442008  ... 0.92114854 1.2606764  0.        ]
 [0.55828255 1.0124682  0.6177381  ... 1.4002684  0.5707153  1.0705616 ]
 [0.18931112 0.25097656 1.0731099  ... 2.1333911  0.10522419 1.4503322 ]]

In [ ]:

Copied!

# Obtener las nuevas coordenadas de X_test
X_test_encoded = encoder.predict(X_test_reshaped)
# Obtener las nuevas coordenadas de X_test
X_test_encoded = encoder.predict(X_test_reshaped)

33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step

In [ ]:

Copied!

# Verificar las dimensiones de las nuevas coordenadas de X_test_encoded
X_test_encoded.shape, X_test.shape
# Verificar las dimensiones de las nuevas coordenadas de X_test_encoded
X_test_encoded.shape, X_test.shape

Out[ ]:

((1049, 480), (1049, 480))

In [ ]:

Copied!

# Visualizar X_test_encoded
print(X_test_encoded)
# Visualizar X_test_encoded
print(X_test_encoded)

[[0.18931112 0.25097656 1.0731099  ... 0.8388007  1.4651122  1.0704793 ]
 [0.55828255 1.0124682  0.6177381  ... 1.4002684  0.5707153  1.0705616 ]
 [0.55828255 1.0124682  0.6177381  ... 0.8388007  1.4651122  1.0704793 ]
 ...
 [0.34228757 1.3086706  0.33683705 ... 1.0533024  1.2378863  1.4537132 ]
 [1.4116294  0.32590342 0.94585365 ... 1.0568092  1.7025597  1.0396074 ]
 [0.16956863 0.1146785  2.0411205  ... 1.7764323  0.49238378 1.1906232 ]]

4 Application of Algorithms¶

4.1 Implementing the kNN Model for k = 1, 3, 5, 7¶

To generate the kNN model predictions based on the value of k, we create a function that performs the following main tasks:

Build the model
Train the model
Make predictions
Calculate performance metrics

The function takes k, X_train_encoded, y_train, X_test_encoded, and y_test as input arguments and outputs a metrics dataframe. The model for k=3 is created first.

In [ ]:

Copied!





# Crear modelo k-nn con k = 3
k     = 3   # Valor del nearest neighbor k = [1, 3, 5, 7]
model = KNeighborsClassifier(n_neighbors=k, metric='euclidean') # Distancia euclidiana

# Entrenar modelo k-nn
model.fit(X_train_encoded, y_train)

# Hacer predicciones con el modelo k-nn entrenado usando X_test
y_pred = model.predict(X_test_encoded)

# Imprimir reporte de clasificación
# Recordar que {'EI': 0, 'IE': 1, 'N': 2}
print(classification_report(y_test, y_pred, target_names=['Clase EI', 'Clase IE', 'Clase N']))
# Crear modelo k-nn con k = 3
k     = 3   # Valor del nearest neighbor k = [1, 3, 5, 7]
model = KNeighborsClassifier(n_neighbors=k, metric='euclidean') # Distancia euclidiana

# Entrenar modelo k-nn
model.fit(X_train_encoded, y_train)

# Hacer predicciones con el modelo k-nn entrenado usando X_test
y_pred = model.predict(X_test_encoded)

# Imprimir reporte de clasificación
# Recordar que {'EI': 0, 'IE': 1, 'N': 2}
print(classification_report(y_test, y_pred, target_names=['Clase EI', 'Clase IE', 'Clase N']))

              precision    recall  f1-score   support

    Clase EI       0.61      0.93      0.74       243
    Clase IE       0.78      0.87      0.82       247
     Clase N       0.96      0.70      0.81       559

    accuracy                           0.79      1049
   macro avg       0.78      0.83      0.79      1049
weighted avg       0.83      0.79      0.79      1049

In [ ]:

Copied!





# Obtener un dataframe a partir de la matriz de confusion
report             = classification_report(y_test, y_pred, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report          = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
df_report['error'] = 1 - df_report['recall']
df_report['k']     = k
df_report['modelo'] = "kNN"
df_report.reset_index(inplace=True)
df_report.rename(columns={'index': 'clase'}, inplace=True)
df_report
# Obtener un dataframe a partir de la matriz de confusion
report             = classification_report(y_test, y_pred, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report          = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
df_report['error'] = 1 - df_report['recall']
df_report['k']     = k
df_report['modelo'] = "kNN"
df_report.reset_index(inplace=True)
df_report.rename(columns={'index': 'clase'}, inplace=True)
df_report

Out[ ]:

	clase	precision	recall	f1-score	support	error	k	modelo
0	EI	0.614754	0.925926	0.738916	243.0	0.074074	3	kNN
1	IE	0.775362	0.866397	0.818356	247.0	0.133603	3	kNN
2	N	0.955774	0.695886	0.805383	559.0	0.304114	3	kNN

In [ ]:

Copied!





# Definir función evaluar_knn_por_k()
def evaluar_knn_por_k(k, X_train, y_train, X_test, y_test):
    """
    Entrena y evalúa un modelo k-NN con valor k dado.
    Retorna un DataFrame con precision, recall, f1-score, error por clase y k.
    """
    # Entrenar modelo
    model = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    model.fit(X_train, y_train)

    # Predicción
    y_pred = model.predict(X_test)

    # Clasification report por clase
    report = classification_report(
        y_test,
        y_pred,
        output_dict=True,
        target_names=['EI', 'IE', 'N']
    )

    # Convertir a DataFrame
    df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
    df_report['error'] = 1 - df_report['recall']
    df_report['modelo'] = 'kNN' + ' k=' + str(k)
    df_report.reset_index(inplace=True)
    df_report.rename(columns={'index': 'clase'}, inplace=True)

    return df_report
# Definir función evaluar_knn_por_k()
def evaluar_knn_por_k(k, X_train, y_train, X_test, y_test):
    """
    Entrena y evalúa un modelo k-NN con valor k dado.
    Retorna un DataFrame con precision, recall, f1-score, error por clase y k.
    """
    # Entrenar modelo
    model = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    model.fit(X_train, y_train)

    # Predicción
    y_pred = model.predict(X_test)

    # Clasification report por clase
    report = classification_report(
        y_test,
        y_pred,
        output_dict=True,
        target_names=['EI', 'IE', 'N']
    )

    # Convertir a DataFrame
    df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
    df_report['error'] = 1 - df_report['recall']
    df_report['modelo'] = 'kNN' + ' k=' + str(k)
    df_report.reset_index(inplace=True)
    df_report.rename(columns={'index': 'clase'}, inplace=True)

    return df_report

In [ ]:

Copied!





# Probar con diferentes valores de k
resultados = []

for k in [1, 3, 5, 7]:
    df_k = evaluar_knn_por_k(k, X_train_encoded, y_train, X_test_encoded, y_test)
    resultados.append(df_k)

# Unir todos los resultados
df_knn_final = pd.concat(resultados, ignore_index=True)
df_knn_final
# Probar con diferentes valores de k
resultados = []

for k in [1, 3, 5, 7]:
    df_k = evaluar_knn_por_k(k, X_train_encoded, y_train, X_test_encoded, y_test)
    resultados.append(df_k)

# Unir todos los resultados
df_knn_final = pd.concat(resultados, ignore_index=True)
df_knn_final

Out[ ]:

	clase	precision	recall	f1-score	support	error	modelo
0	EI	0.638554	0.872428	0.737391	243.0	0.127572	kNN k=1
1	IE	0.667722	0.854251	0.749556	247.0	0.145749	kNN k=1
2	N	0.922693	0.661896	0.770833	559.0	0.338104	kNN k=1
3	EI	0.614754	0.925926	0.738916	243.0	0.074074	kNN k=3
4	IE	0.775362	0.866397	0.818356	247.0	0.133603	kNN k=3
5	N	0.955774	0.695886	0.805383	559.0	0.304114	kNN k=3
6	EI	0.684524	0.946502	0.794473	243.0	0.053498	kNN k=5
7	IE	0.751656	0.919028	0.826958	247.0	0.080972	kNN k=5
8	N	0.975669	0.717352	0.826804	559.0	0.282648	kNN k=5
9	EI	0.707317	0.954733	0.812609	243.0	0.045267	kNN k=7
10	IE	0.759868	0.935223	0.838475	247.0	0.064777	kNN k=7
11	N	0.980815	0.731664	0.838115	559.0	0.268336	kNN k=7

In [ ]:

Copied!





# Graficar el f1-score por modelo y clase
pivot_f1_score = df_knn_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar el f1-score por modelo y clase
pivot_f1_score = df_knn_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()

In [ ]:

Copied!





# Graficar la error por modelo y clase
pivot_error = df_knn_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
### 4.1 Implementar el modelo kNN para k = 1, 3, 5 y 7
plt.show()
# Graficar la error por modelo y clase
pivot_error = df_knn_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
### 4.1 Implementar el modelo kNN para k = 1, 3, 5 y 7
plt.show()

Comment on kNN performance¶

The evaluation strategy for the proposed models is the following:

Focus on F1-score because it balances precision and recall, which is important for multiclass classification.
Study the error.
Evaluate per-class performance to ensure the model is not just optimizing for class N.

kNN shows progressive improvement in F1-score as k increases, meaning it becomes more robust as more neighbors are considered. The error decreases accordingly.

Class-by-class F1-scores reveal that IE and N perform more consistently, while EI is harder to identify correctly. Interestingly, class N shows the highest error, suggesting room for improvement.

4.2 Implementing Naive Bayes¶

To evaluate the performance of the Naive Bayes model depending on the type of data in Python, two approaches are proposed.
First, the BernoulliNB classifier with Laplace smoothing (alpha = 1 in Python) is applied to binary data obtained through one-hot encoding.
Next, GaussianNB is used on continuous data derived from the convolutional autoencoder, since this model assumes a normal distribution of the variables.
This strategy allows us to compare the impact of different input transformations on the performance of the Naive Bayes classifier.

For more information on implementing the Naive Bayes model, refer to
BernoulliNB and
GaussianNB.

In [ ]:

Copied!





# Entrenar BernoulliNB sin suavizado (alpha=0) con datos one-hot encoded
modelNB_alpha0 = BernoulliNB(alpha=0.0)
modelNB_alpha0.fit(X_train, y_train)

# Predecir
y_pred_NB_alpha0 = modelNB_alpha0.predict(X_test)
print("Predicciones:", y_pred_NB_alpha0)
# Entrenar BernoulliNB sin suavizado (alpha=0) con datos one-hot encoded
modelNB_alpha0 = BernoulliNB(alpha=0.0)
modelNB_alpha0.fit(X_train, y_train)

# Predecir
y_pred_NB_alpha0 = modelNB_alpha0.predict(X_test)
print("Predicciones:", y_pred_NB_alpha0)

Predicciones: [0 0 0 ... 0 0 0]

/usr/local/lib/python3.11/dist-packages/sklearn/naive_bayes.py:1209: RuntimeWarning: divide by zero encountered in log
  self.feature_log_prob_ = np.log(smoothed_fc) - np.log(
/usr/local/lib/python3.11/dist-packages/sklearn/utils/extmath.py:203: RuntimeWarning: invalid value encountered in matmul
  ret = a @ b

In [ ]:

Copied!





# Obtener un dataframe a partir de la matriz de confusion
report_NB_alpha0              = classification_report(y_test, y_pred_NB_alpha0, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report_NB_alpha0           = pd.DataFrame(report_NB_alpha0 ).T.loc[['EI', 'IE', 'N']]
df_report_NB_alpha0 ['error'] = 1 - df_report_NB_alpha0 ['recall']
df_report_NB_alpha0 ['modelo']  = "Bernoulli B alpha=0"
df_report_NB_alpha0
# Obtener un dataframe a partir de la matriz de confusion
report_NB_alpha0              = classification_report(y_test, y_pred_NB_alpha0, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report_NB_alpha0           = pd.DataFrame(report_NB_alpha0 ).T.loc[['EI', 'IE', 'N']]
df_report_NB_alpha0 ['error'] = 1 - df_report_NB_alpha0 ['recall']
df_report_NB_alpha0 ['modelo']  = "Bernoulli B alpha=0"
df_report_NB_alpha0

/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

Out[ ]:

	precision	recall	f1-score	support	error	modelo
EI	0.231649	1.0	0.376161	243.0	0.0	Bernoulli B alpha=0
IE	0.000000	0.0	0.000000	247.0	1.0	Bernoulli B alpha=0
N	0.000000	0.0	0.000000	559.0	1.0	Bernoulli B alpha=0

In [ ]:

Copied!





# Entrenar BernoulliNB con suavizado (alpha=1) con datos one-hot encoded
modelNB_alpha1 = BernoulliNB(alpha=1.0)
modelNB_alpha1.fit(X_train, y_train)

# Predecir
y_pred_NB_alpha1 = modelNB_alpha0.predict(X_test)
print("Predicciones:", y_pred_NB_alpha1)
# Entrenar BernoulliNB con suavizado (alpha=1) con datos one-hot encoded
modelNB_alpha1 = BernoulliNB(alpha=1.0)
modelNB_alpha1.fit(X_train, y_train)

# Predecir
y_pred_NB_alpha1 = modelNB_alpha0.predict(X_test)
print("Predicciones:", y_pred_NB_alpha1)

Predicciones: [0 0 0 ... 0 0 0]

/usr/local/lib/python3.11/dist-packages/sklearn/utils/extmath.py:203: RuntimeWarning: invalid value encountered in matmul
  ret = a @ b

In [ ]:

Copied!





# Obtener un dataframe a partir de la matriz de confusion
report_NB_alpha1             = classification_report(y_test, y_pred_NB_alpha1, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report_NB_alpha1          = pd.DataFrame(report_NB_alpha1).T.loc[['EI', 'IE', 'N']]
df_report_NB_alpha1['error'] = 1 - df_report_NB_alpha1['recall']
df_report_NB_alpha1['modelo']  = "BernoulliNB alpha=1"
df_report_NB_alpha1
# Obtener un dataframe a partir de la matriz de confusion
report_NB_alpha1             = classification_report(y_test, y_pred_NB_alpha1, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report_NB_alpha1          = pd.DataFrame(report_NB_alpha1).T.loc[['EI', 'IE', 'N']]
df_report_NB_alpha1['error'] = 1 - df_report_NB_alpha1['recall']
df_report_NB_alpha1['modelo']  = "BernoulliNB alpha=1"
df_report_NB_alpha1

/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

Out[ ]:

	precision	recall	f1-score	support	error	modelo
EI	0.231649	1.0	0.376161	243.0	0.0	BernoulliNB alpha=1
IE	0.000000	0.0	0.000000	247.0	1.0	BernoulliNB alpha=1
N	0.000000	0.0	0.000000	559.0	1.0	BernoulliNB alpha=1

In [ ]:

Copied!





# Entrenar GaussianNB con datos del autoencoder
modelNB = GaussianNB()
modelNB.fit(X_train_encoded, y_train)

# Predecir
y_pred_NB = modelNB.predict(X_test_encoded)
print("Predicciones:", y_pred_NB)
# Entrenar GaussianNB con datos del autoencoder
modelNB = GaussianNB()
modelNB.fit(X_train_encoded, y_train)

# Predecir
y_pred_NB = modelNB.predict(X_test_encoded)
print("Predicciones:", y_pred_NB)

Predicciones: [1 1 2 ... 0 2 1]

In [ ]:

Copied!





# Obtener un dataframe a partir de la matriz de confusion
report_NB             = classification_report(y_test, y_pred_NB, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report_NB          = pd.DataFrame(report_NB).T.loc[['EI', 'IE', 'N']]
df_report_NB['error'] = 1 - df_report_NB['recall']
df_report_NB['modelo']  = "GaussianNB"
df_report_NB
# Obtener un dataframe a partir de la matriz de confusion
report_NB             = classification_report(y_test, y_pred_NB, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report_NB          = pd.DataFrame(report_NB).T.loc[['EI', 'IE', 'N']]
df_report_NB['error'] = 1 - df_report_NB['recall']
df_report_NB['modelo']  = "GaussianNB"
df_report_NB

Out[ ]:

	precision	recall	f1-score	support	error	modelo
EI	0.942623	0.946502	0.944559	243.0	0.053498	GaussianNB
IE	0.917647	0.947368	0.932271	247.0	0.052632	GaussianNB
N	0.983636	0.967800	0.975654	559.0	0.032200	GaussianNB

In [ ]:

Copied!

# Unir todos los resultados
df_naivebayes_final = pd.concat([df_report_NB_alpha0, df_report_NB_alpha1, df_report_NB]).reset_index().rename(columns={'index': 'clase'})
df_naivebayes_final
# Unir todos los resultados
df_naivebayes_final = pd.concat([df_report_NB_alpha0, df_report_NB_alpha1, df_report_NB]).reset_index().rename(columns={'index': 'clase'})
df_naivebayes_final

Out[ ]:

	clase	precision	recall	f1-score	support	error	modelo
0	EI	0.231649	1.000000	0.376161	243.0	0.000000	Bernoulli B alpha=0
1	IE	0.000000	0.000000	0.000000	247.0	1.000000	Bernoulli B alpha=0
2	N	0.000000	0.000000	0.000000	559.0	1.000000	Bernoulli B alpha=0
3	EI	0.231649	1.000000	0.376161	243.0	0.000000	BernoulliNB alpha=1
4	IE	0.000000	0.000000	0.000000	247.0	1.000000	BernoulliNB alpha=1
5	N	0.000000	0.000000	0.000000	559.0	1.000000	BernoulliNB alpha=1
6	EI	0.942623	0.946502	0.944559	243.0	0.053498	GaussianNB
7	IE	0.917647	0.947368	0.932271	247.0	0.052632	GaussianNB
8	N	0.983636	0.967800	0.975654	559.0	0.032200	GaussianNB

In [ ]:

Copied!





# Graficar el f1-score por modelo y clase
pivot_f1_score = df_naivebayes_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar el f1-score por modelo y clase
pivot_f1_score = df_naivebayes_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()

In [ ]:

Copied!





# Graficar la error por modelo y clase
pivot_error = df_naivebayes_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar la error por modelo y clase
pivot_error = df_naivebayes_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()

Comment on Naive Bayes performance¶

As mentioned at the beginning of this section, two variants of Naive Bayes were tested: BernoulliNB (to evaluate Laplace smoothing) and GaussianNB.

The BernoulliNB model, which used data encoded through one-hot encoding, did not produce useful results.
This is because the binary data derived from DNA sequences tend to be sparse and highly correlated, which makes it difficult for the model to capture relevant patterns — especially given that BernoulliNB assumes feature independence.

In contrast, the GaussianNB model, which was fed with data transformed by the encoder, showed very strong performance.
It achieved high and consistent F1-scores across all three classes, as well as very low error rates.

4.3 Implementing Artificial Neural Network (ANN)¶

In this section, an artificial neural network is built with an architecture containing two hidden layers of 100 and p nodes, exploring p = 5, 10, and 20. We start the model with p = 20 and later scale the number of nodes.

In [ ]:

Copied!

# Definir nodo
p = 20
# Definir nodo
p = 20

In [ ]:

Copied!





# Definir la arquitectura de la ANN
model = Sequential([
    Input(shape=(X_train_encoded.shape[1],)),
    Dense(100, activation='relu'),
    Dense(p, activation='relu'),
    Dense(3, activation='softmax') # Ya que se tienen más de 2 categorías
    ])
# Definir la arquitectura de la ANN
model = Sequential([
    Input(shape=(X_train_encoded.shape[1],)),
    Dense(100, activation='relu'),
    Dense(p, activation='relu'),
    Dense(3, activation='softmax') # Ya que se tienen más de 2 categorías
    ])

In [ ]:

Copied!

# Ver detalles del modelo
model.summary()
# Ver detalles del modelo
model.summary()

Model: "sequential_8"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_24 (Dense)                │ (None, 100)            │        48,100 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_25 (Dense)                │ (None, 20)             │         2,020 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_26 (Dense)                │ (None, 3)              │            63 │
└─────────────────────────────────┴────────────────────────┴───────────────┘

 Total params: 50,183 (196.03 KB)

 Trainable params: 50,183 (196.03 KB)

 Non-trainable params: 0 (0.00 B)

In [ ]:

Copied!





# Compilar modelo
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy', # Puesto que las categorías de y_train son enteras [0, 1, 2]
              metrics=['accuracy'])
# Compilar modelo
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy', # Puesto que las categorías de y_train son enteras [0, 1, 2]
              metrics=['accuracy'])

In [ ]:

Copied!





# Definir épocas, tamaño del batch y entrenar el modelo
n_batch  = 32
n_epochs = 20

mfit = model.fit(X_train_encoded, y_train,
          epochs=n_epochs,
          batch_size=n_batch)
# Definir épocas, tamaño del batch y entrenar el modelo
n_batch  = 32
n_epochs = 20

mfit = model.fit(X_train_encoded, y_train,
          epochs=n_epochs,
          batch_size=n_batch)

Epoch 1/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.5665 - loss: 0.9898
Epoch 2/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.8900 - loss: 0.3338
Epoch 3/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.9415 - loss: 0.1934
Epoch 4/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9525 - loss: 0.1493
Epoch 5/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9655 - loss: 0.1159
Epoch 6/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9739 - loss: 0.0986
Epoch 7/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9715 - loss: 0.0910
Epoch 8/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9725 - loss: 0.0824
Epoch 9/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9725 - loss: 0.0859
Epoch 10/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9717 - loss: 0.0829
Epoch 11/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9720 - loss: 0.0820
Epoch 12/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9534 - loss: 0.1180
Epoch 13/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9412 - loss: 0.1866
Epoch 14/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9617 - loss: 0.1020
Epoch 15/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9781 - loss: 0.0762
Epoch 16/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9778 - loss: 0.0704
Epoch 17/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9730 - loss: 0.0781
Epoch 18/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9680 - loss: 0.0903
Epoch 19/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9769 - loss: 0.0664
Epoch 20/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9795 - loss: 0.0596

In [ ]:

Copied!





# Evaluar modelo
loss, acc     = model.evaluate(X_test_encoded, y_test)
y_pred        = model.predict(X_test_encoded)
y_pred_labels = y_pred.argmax(axis=1)
# Evaluar modelo
loss, acc     = model.evaluate(X_test_encoded, y_test)
y_pred        = model.predict(X_test_encoded)
y_pred_labels = y_pred.argmax(axis=1)

33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8993 - loss: 0.2753
33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

In [ ]:

Copied!

# Visualizar las primeras predicciones
y_pred_labels[:100]
# Visualizar las primeras predicciones
y_pred_labels[:100]

Out[ ]:

array([1, 1, 2, 2, 2, 2, 2, 1, 0, 0, 1, 0, 0, 1, 2, 0, 2, 0, 2, 0, 2, 0,
       1, 2, 2, 2, 0, 2, 2, 2, 0, 0, 1, 0, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2,
       2, 2, 2, 1, 2, 2, 0, 0, 2, 1, 2, 2, 2, 2, 1, 1, 2, 0, 1, 2, 0, 2,
       2, 0, 2, 1, 2, 1, 1, 2, 1, 0, 2, 0, 1, 2, 2, 0, 2, 1, 2, 1, 1, 2,
       2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 0, 1])

In [ ]:

Copied!

# Visualizar los truth labels
y_test[:100]
# Visualizar los truth labels
y_test[:100]

Out[ ]:

array([1, 1, 2, 2, 2, 1, 2, 1, 0, 1, 1, 0, 0, 1, 0, 0, 2, 0, 0, 0, 2, 0,
       1, 2, 2, 2, 1, 2, 2, 2, 0, 0, 1, 0, 2, 2, 1, 2, 0, 2, 2, 1, 2, 1,
       2, 2, 2, 1, 2, 2, 0, 0, 2, 1, 1, 2, 2, 2, 1, 1, 2, 0, 1, 2, 0, 2,
       1, 0, 2, 1, 2, 1, 1, 2, 1, 0, 1, 0, 1, 2, 2, 0, 2, 1, 2, 1, 1, 2,
       1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 0, 1])

In [ ]:

Copied!





# Obtener reporte de clasificación
report              = classification_report(y_test, y_pred_labels, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report           = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
df_report['error']  = 1 - df_report['recall']
df_report['modelo'] = 'ANN'
df_report['p']      = p
df_report
# Obtener reporte de clasificación
report              = classification_report(y_test, y_pred_labels, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report           = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
df_report['error']  = 1 - df_report['recall']
df_report['modelo'] = 'ANN'
df_report['p']      = p
df_report

Out[ ]:

	precision	recall	f1-score	support	error	modelo	p
EI	0.932773	0.913580	0.923077	243.0	0.086420	ANN	20
IE	0.982659	0.688259	0.809524	247.0	0.311741	ANN	20
N	0.873041	0.996422	0.930660	559.0	0.003578	ANN	20

In [ ]:

Copied!





# Validar el modelo ANN con distintos valores de p = [5, 10, 20]

# Lista para guardar resultados
resultados_ann = []

for p in [5, 10, 20]:
    print(f"\nEntrenando red con p = {p} nodos en segunda capa oculta...")

    # Definir modelo
    model = Sequential([
        Input(shape=(X_train_encoded.shape[1],)),
        Dense(100, activation='relu'),
        Dense(p, activation='relu'),
        Dense(3, activation='softmax')  # 3 clases
    ])

    # Compilar
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # Entrenar
    model.fit(X_train_encoded, y_train, epochs=20, batch_size=32)

    # Evaluar
    loss, acc = model.evaluate(X_test_encoded, y_test)
    y_pred = model.predict(X_test_encoded)
    y_pred_labels = y_pred.argmax(axis=1)

    # Reporte de clasificación
    report = classification_report(y_test, y_pred_labels, output_dict=True, target_names=['EI', 'IE', 'N'])
    df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
    df_report['error'] = 1 - df_report['recall']
    df_report['modelo'] = 'ANN' + ' p=' + str(p)
    resultados_ann.append(df_report)
# Validar el modelo ANN con distintos valores de p = [5, 10, 20]

# Lista para guardar resultados
resultados_ann = []

for p in [5, 10, 20]:
    print(f"\nEntrenando red con p = {p} nodos en segunda capa oculta...")

    # Definir modelo
    model = Sequential([
        Input(shape=(X_train_encoded.shape[1],)),
        Dense(100, activation='relu'),
        Dense(p, activation='relu'),
        Dense(3, activation='softmax')  # 3 clases
    ])

    # Compilar
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # Entrenar
    model.fit(X_train_encoded, y_train, epochs=20, batch_size=32)

    # Evaluar
    loss, acc = model.evaluate(X_test_encoded, y_test)
    y_pred = model.predict(X_test_encoded)
    y_pred_labels = y_pred.argmax(axis=1)

    # Reporte de clasificación
    report = classification_report(y_test, y_pred_labels, output_dict=True, target_names=['EI', 'IE', 'N'])
    df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
    df_report['error'] = 1 - df_report['recall']
    df_report['modelo'] = 'ANN' + ' p=' + str(p)
    resultados_ann.append(df_report)

Entrenando red con p = 5 nodos en segunda capa oculta...
Epoch 1/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.4817 - loss: 1.0251
Epoch 2/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7174 - loss: 0.7122
Epoch 3/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7311 - loss: 0.6508
Epoch 4/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7394 - loss: 0.6104
Epoch 5/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7355 - loss: 0.5964
Epoch 6/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7436 - loss: 0.5715
Epoch 7/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7389 - loss: 0.5576
Epoch 8/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7482 - loss: 0.5414
Epoch 9/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7461 - loss: 0.5322
Epoch 10/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7450 - loss: 0.5276
Epoch 11/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7442 - loss: 0.5433
Epoch 12/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7369 - loss: 0.5390
Epoch 13/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7341 - loss: 0.5265
Epoch 14/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.7325 - loss: 0.5235
Epoch 15/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.7423 - loss: 0.5057
Epoch 16/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.7481 - loss: 0.4924
Epoch 17/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7528 - loss: 0.4844
Epoch 18/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7776 - loss: 0.3549
Epoch 19/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9420 - loss: 0.1865
Epoch 20/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9622 - loss: 0.1296
33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9115 - loss: 0.2400
33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Entrenando red con p = 10 nodos en segunda capa oculta...
Epoch 1/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.5541 - loss: 0.9439
Epoch 2/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8711 - loss: 0.3730
Epoch 3/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9430 - loss: 0.1999
Epoch 4/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9568 - loss: 0.1497
Epoch 5/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9606 - loss: 0.1249
Epoch 6/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9654 - loss: 0.1074
Epoch 7/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9657 - loss: 0.1014
Epoch 8/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9686 - loss: 0.0942
Epoch 9/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9654 - loss: 0.0940
Epoch 10/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9663 - loss: 0.0922
Epoch 11/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9588 - loss: 0.1094
Epoch 12/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9443 - loss: 0.1466
Epoch 13/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9430 - loss: 0.1606
Epoch 14/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9715 - loss: 0.0983
Epoch 15/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9747 - loss: 0.0832
Epoch 16/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9801 - loss: 0.0753
Epoch 17/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9808 - loss: 0.0748
Epoch 18/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9780 - loss: 0.0772
Epoch 19/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9783 - loss: 0.0756
Epoch 20/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9752 - loss: 0.0696
33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9418 - loss: 0.1558
33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Entrenando red con p = 20 nodos en segunda capa oculta...
Epoch 1/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.5639 - loss: 0.9263
Epoch 2/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9006 - loss: 0.3122
Epoch 3/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9286 - loss: 0.2126
Epoch 4/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9639 - loss: 0.1353
Epoch 5/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9669 - loss: 0.1155
Epoch 6/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.9678 - loss: 0.1092
Epoch 7/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9654 - loss: 0.1084
Epoch 8/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.9652 - loss: 0.1103
Epoch 9/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9578 - loss: 0.1172
Epoch 10/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9519 - loss: 0.1390
Epoch 11/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9445 - loss: 0.1590
Epoch 12/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9648 - loss: 0.1040
Epoch 13/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9733 - loss: 0.0845
Epoch 14/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9793 - loss: 0.0773
Epoch 15/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9798 - loss: 0.0695
Epoch 16/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9836 - loss: 0.0618
Epoch 17/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9857 - loss: 0.0570
Epoch 18/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9824 - loss: 0.0534
Epoch 19/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9822 - loss: 0.0551
Epoch 20/20
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9850 - loss: 0.0539
33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9065 - loss: 0.3325
33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step

In [ ]:

Copied!





# Concatenar resultados
df_ann_final = pd.concat(resultados_ann).reset_index().rename(columns={'index': 'clase'})

# Mostrar resultados
print("Resultados por clase para cada valor de p:")
df_ann_final
# Concatenar resultados
df_ann_final = pd.concat(resultados_ann).reset_index().rename(columns={'index': 'clase'})

# Mostrar resultados
print("Resultados por clase para cada valor de p:")
df_ann_final

Resultados por clase para cada valor de p:

Out[ ]:

	clase	precision	recall	f1-score	support	error	modelo
0	EI	0.954733	0.954733	0.954733	243.0	0.045267	ANN p=5
1	IE	0.877323	0.955466	0.914729	247.0	0.044534	ANN p=5
2	N	0.990689	0.951699	0.970803	559.0	0.048301	ANN p=5
3	EI	0.881481	0.979424	0.927875	243.0	0.020576	ANN p=10
4	IE	0.906504	0.902834	0.904665	247.0	0.097166	ANN p=10
5	N	0.984991	0.939177	0.961538	559.0	0.060823	ANN p=10
6	EI	0.960870	0.909465	0.934461	243.0	0.090535	ANN p=20
7	IE	0.925620	0.906883	0.916155	247.0	0.093117	ANN p=20
8	N	0.944541	0.974955	0.959507	559.0	0.025045	ANN p=20

In [ ]:

Copied!





# Graficar el f1-score por modelo y clase
pivot_f1_score = df_ann_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar el f1-score por modelo y clase
pivot_f1_score = df_ann_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()

In [ ]:

Copied!





# Graficar la error por modelo y clase
pivot_error = df_ann_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar la error por modelo y clase
pivot_error = df_ann_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()

Comment on ANN model performance¶

The artificial neural network (ANN) model shows solid overall performance, with high and consistent f1-scores across all classes. As the number of nodes in the 2nd hidden layer (p) increases, the f1-score remains steady, though without the clear improvements seen in the kNN model relative to k.

Regarding classes, class N performs best, with f1-scores above 0.96 and minimal errors. Class EI follows, while class IE shows an error increase when p rises from 5 to 10, suggesting possible overfitting.

4.4 Implement the Support Vector Machine model¶

In this section, a Support Vector Machine model is built with two configurations: linear kernel and rbf. We begin with the linear kernel implementation.

In [ ]:

Copied!

# Crear modelo SVM
model = SVC(kernel='linear')
# Crear modelo SVM
model = SVC(kernel='linear')

In [ ]:

Copied!

# Entrenar el modelo
model.fit(X_train_encoded, y_train)
# Entrenar el modelo
model.fit(X_train_encoded, y_train)

Out[ ]:

SVC(kernel='linear')

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [ ]:

Copied!

# Hacer Predicciones
y_pred = model.predict(X_test_encoded)
# Hacer Predicciones
y_pred = model.predict(X_test_encoded)

In [ ]:

Copied!





# Crear reporte de clasificación
report = classification_report(
    y_test,
    y_pred,
    output_dict=True,
    target_names=['EI', 'IE', 'N']
    )
# Crear reporte de clasificación
report = classification_report(
    y_test,
    y_pred,
    output_dict=True,
    target_names=['EI', 'IE', 'N']
    )

In [ ]:

Copied!





# Crear DataFrame con métricas por clase
df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
df_report['error'] = 1 - df_report['recall']
df_report['kernel'] = 'lineal'
df_report
# Crear DataFrame con métricas por clase
df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
df_report['error'] = 1 - df_report['recall']
df_report['kernel'] = 'lineal'
df_report

Out[ ]:

	precision	recall	f1-score	support	error	kernel
EI	0.882129	0.954733	0.916996	243.0	0.045267	lineal
IE	0.905738	0.894737	0.900204	247.0	0.105263	lineal
N	0.968635	0.939177	0.953678	559.0	0.060823	lineal

In [ ]:

Copied!





# Explorar las dos opciones de kernel lineal y rbf

# Lista para almacenar resultados
resultados_svm = []

# Definir kernels a evaluar
for kernel in ['linear', 'rbf']:
    print(f"\nEntrenando SVM con kernel = '{kernel}'...")

    # Crear modelo SVM
    model = SVC(kernel=kernel)

    # Entrenar el modelo
    model.fit(X_train_encoded, y_train)

    # Predicciones
    y_pred = model.predict(X_test_encoded)

    # Reporte de clasificación
    report = classification_report(
        y_test,
        y_pred,
        output_dict=True,
        target_names=['EI', 'IE', 'N']
    )

    # Crear DataFrame con métricas por clase
    df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
    df_report['error'] = 1 - df_report['recall']
    df_report['modelo'] = "SVM " + kernel
    resultados_svm.append(df_report)
# Explorar las dos opciones de kernel lineal y rbf

# Lista para almacenar resultados
resultados_svm = []

# Definir kernels a evaluar
for kernel in ['linear', 'rbf']:
    print(f"\nEntrenando SVM con kernel = '{kernel}'...")

    # Crear modelo SVM
    model = SVC(kernel=kernel)

    # Entrenar el modelo
    model.fit(X_train_encoded, y_train)

    # Predicciones
    y_pred = model.predict(X_test_encoded)

    # Reporte de clasificación
    report = classification_report(
        y_test,
        y_pred,
        output_dict=True,
        target_names=['EI', 'IE', 'N']
    )

    # Crear DataFrame con métricas por clase
    df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
    df_report['error'] = 1 - df_report['recall']
    df_report['modelo'] = "SVM " + kernel
    resultados_svm.append(df_report)

Entrenando SVM con kernel = 'linear'...

Entrenando SVM con kernel = 'rbf'...

In [ ]:

Copied!





# Concatenar resultados
df_svm_final = pd.concat(resultados_svm).reset_index().rename(columns={'index': 'clase'})

# Mostrar tabla final
print("Resultados por clase para cada kernel:")
df_svm_final
# Concatenar resultados
df_svm_final = pd.concat(resultados_svm).reset_index().rename(columns={'index': 'clase'})

# Mostrar tabla final
print("Resultados por clase para cada kernel:")
df_svm_final

Resultados por clase para cada kernel:

Out[ ]:

	clase	precision	recall	f1-score	support	error	modelo
0	EI	0.882129	0.954733	0.916996	243.0	0.045267	SVM linear
1	IE	0.905738	0.894737	0.900204	247.0	0.105263	SVM linear
2	N	0.968635	0.939177	0.953678	559.0	0.060823	SVM linear
3	EI	0.947791	0.971193	0.959350	243.0	0.028807	SVM rbf
4	IE	0.931727	0.939271	0.935484	247.0	0.060729	SVM rbf
5	N	0.985481	0.971377	0.978378	559.0	0.028623	SVM rbf

In [ ]:

Copied!





# Graficar el f1-score por modelo y clase
pivot_f1_score = df_svm_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar el f1-score por modelo y clase
pivot_f1_score = df_svm_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()

In [ ]:

Copied!





# Graficar la error por modelo y clase
pivot_error = df_svm_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar la error por modelo y clase
pivot_error = df_svm_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()

Comment on SVM (Linear and RBF) performance¶

The SVM model with linear kernel provides acceptable performance, with f1-scores above 0.90 across all classes. Class N performs best, while class IE has the most error.

The SVM model with RBF kernel shows better f1-scores (> 0.94) and considerably lower errors. The quality improvement is especially noticeable in EI and IE classes, while N also improves, achieving more balanced and consistent performance.

In summary, the SVM model with RBF kernel outperforms the linear kernel in all relevant metrics, showing better generalization.

4.5 Implement the Decision Tree model¶

In this section, a decision tree model is built with two configurations: boosting and no boosting.

In [ ]:

Copied!





# Definir la función evaluar_model_arbol()

def evaluar_modelo_arbol(boosting=False, max_depth=None):
    # Seleccionar modelo
    if boosting:
        model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=max_depth, random_state=42)
        nombre_modelo = 'Árbol con Boosting'
    else:
        model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
        nombre_modelo = 'Árbol de Decisión'

    # Entrenar
    model.fit(X_train_encoded, y_train)

    # Predecir
    y_pred = model.predict(X_test_encoded)

    # Reporte de clasificación
    report = classification_report(
        y_test, y_pred,
        output_dict=True,
        target_names=['EI', 'IE', 'N']
    )

    # Formatear DataFrame
    df_report               = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
    df_report['error']      = 1 - df_report['recall']
    df_report['modelo']     = nombre_modelo + ' depth=' + str(max_depth)

    return df_report
# Definir la función evaluar_model_arbol()

def evaluar_modelo_arbol(boosting=False, max_depth=None):
    # Seleccionar modelo
    if boosting:
        model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=max_depth, random_state=42)
        nombre_modelo = 'Árbol con Boosting'
    else:
        model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
        nombre_modelo = 'Árbol de Decisión'

    # Entrenar
    model.fit(X_train_encoded, y_train)

    # Predecir
    y_pred = model.predict(X_test_encoded)

    # Reporte de clasificación
    report = classification_report(
        y_test, y_pred,
        output_dict=True,
        target_names=['EI', 'IE', 'N']
    )

    # Formatear DataFrame
    df_report               = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
    df_report['error']      = 1 - df_report['recall']
    df_report['modelo']     = nombre_modelo + ' depth=' + str(max_depth)

    return df_report

In [ ]:

Copied!





# Evaluar ambos modelos
df_arbol = evaluar_modelo_arbol(boosting=False, max_depth=5)
df_boost = evaluar_modelo_arbol(boosting=True, max_depth=5)

# Combinar resultados
df_arboles_final = pd.concat([df_arbol, df_boost]).reset_index().rename(columns={'index': 'clase'})

# Mostrar
df_arboles_final
# Evaluar ambos modelos
df_arbol = evaluar_modelo_arbol(boosting=False, max_depth=5)
df_boost = evaluar_modelo_arbol(boosting=True, max_depth=5)

# Combinar resultados
df_arboles_final = pd.concat([df_arbol, df_boost]).reset_index().rename(columns={'index': 'clase'})

# Mostrar
df_arboles_final

Out[ ]:

	clase	precision	recall	f1-score	support	error	modelo
0	EI	0.893130	0.962963	0.926733	243.0	0.037037	Árbol de Decisión depth=5
1	IE	0.946429	0.858300	0.900212	247.0	0.141700	Árbol de Decisión depth=5
2	N	0.959147	0.966011	0.962567	559.0	0.033989	Árbol de Decisión depth=5
3	EI	0.932540	0.967078	0.949495	243.0	0.032922	Árbol con Boosting depth=5
4	IE	0.942857	0.935223	0.939024	247.0	0.064777	Árbol con Boosting depth=5
5	N	0.983696	0.971377	0.977498	559.0	0.028623	Árbol con Boosting depth=5

In [ ]:

Copied!





# Graficar el f1-score por modelo y clase
pivot_f1_score = df_arboles_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar el f1-score por modelo y clase
pivot_f1_score = df_arboles_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()

In [ ]:

Copied!





# Graficar la error por modelo y clase
pivot_error = df_arboles_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar la error por modelo y clase
pivot_error = df_arboles_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()

Comment on Decision Tree model¶

The decision tree model with depth 5 shows solid performance, with f1-scores above 0.90 for all classes and low errors. Class N stands out, maintaining an f1-score of 0.96, while EI and IE are stable, though IE has a higher error.

Applying Boosting improves all metrics. F1-score increases slightly across all classes, reaching values above 0.94. Errors also decrease, especially for class IE.

In summary, the Boosted Tree offers higher and more consistent performance. Combining Boosting with shallow trees seems to better capture the problem structure without overfitting.

4.6 Implement the Random Forest model with n = 50 and 100¶

In this section, a Random Forest model is built with two numbers of trees: n=50 and n=100.

In [ ]:

Copied!





# Definir la función evaluar_random_forest()
def evaluar_random_forest(n_estimators):
    # Inicializar modelo
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)

    # Entrenar
    model.fit(X_train_encoded, y_train)

    # Predecir
    y_pred = model.predict(X_test_encoded)

    # Obtener reporte por clase
    report = classification_report(
        y_test,
        y_pred,
        output_dict=True,
        target_names=['EI', 'IE', 'N']
    )

    # Formatear resultado en DataFrame
    df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
    df_report['error'] = 1 - df_report['recall']
    df_report['modelo'] = 'Random Forest' + ' n=' + str(n_estimators)

    return df_report
# Definir la función evaluar_random_forest()
def evaluar_random_forest(n_estimators):
    # Inicializar modelo
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)

    # Entrenar
    model.fit(X_train_encoded, y_train)

    # Predecir
    y_pred = model.predict(X_test_encoded)

    # Obtener reporte por clase
    report = classification_report(
        y_test,
        y_pred,
        output_dict=True,
        target_names=['EI', 'IE', 'N']
    )

    # Formatear resultado en DataFrame
    df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
    df_report['error'] = 1 - df_report['recall']
    df_report['modelo'] = 'Random Forest' + ' n=' + str(n_estimators)

    return df_report

In [ ]:

Copied!





# Evaluar Random Forest con 50 y 100 árboles
df_rf_50  = evaluar_random_forest(n_estimators=50)
df_rf_100 = evaluar_random_forest(n_estimators=100)

# Combinar resultados
df_rf_total = pd.concat([df_rf_50, df_rf_100]).reset_index().rename(columns={'index': 'clase'})

# Mostrar resultados
df_rf_total
# Evaluar Random Forest con 50 y 100 árboles
df_rf_50  = evaluar_random_forest(n_estimators=50)
df_rf_100 = evaluar_random_forest(n_estimators=100)

# Combinar resultados
df_rf_total = pd.concat([df_rf_50, df_rf_100]).reset_index().rename(columns={'index': 'clase'})

# Mostrar resultados
df_rf_total

Out[ ]:

	clase	precision	recall	f1-score	support	error	modelo
0	EI	0.925197	0.967078	0.945674	243.0	0.032922	Random Forest n=50
1	IE	0.941909	0.919028	0.930328	247.0	0.080972	Random Forest n=50
2	N	0.983755	0.974955	0.979335	559.0	0.025045	Random Forest n=50
3	EI	0.928854	0.967078	0.947581	243.0	0.032922	Random Forest n=100
4	IE	0.937759	0.914980	0.926230	247.0	0.085020	Random Forest n=100
5	N	0.980180	0.973166	0.976661	559.0	0.026834	Random Forest n=100

In [ ]:

Copied!





# Graficar el f1-score por modelo y clase
pivot_f1_score = df_rf_total.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar el f1-score por modelo y clase
pivot_f1_score = df_rf_total.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()

In [ ]:

Copied!





# Graficar la error por modelo y clase
pivot_error = df_rf_total.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar la error por modelo y clase
pivot_error = df_rf_total.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()

Comment on Random Forest model¶

The Random Forest model with 50 trees shows very competitive performance. F1-scores exceed 0.93 across all classes, particularly class N. Errors are low and fairly balanced, except for class IE, indicating good generalization.

Increasing the number of trees to 100 does not show improvements. F1-scores remain the same, and errors remain low. This suggests that increasing the number of trees could help stabilize and fine-tune predictions, although current performance appears sufficient.

5 Comparison of proposed models' performance¶

In [ ]:

Copied!





# Crear tabla comparativa de todos los modelos implementados
# Combinar resultados
df_resultado_modelos = pd.concat([df_knn_final, df_naivebayes_final, df_ann_final, df_svm_final, df_arboles_final, df_rf_total]).reset_index(drop=True)

# Mostrar resultados
df_resultado_modelos
# Crear tabla comparativa de todos los modelos implementados
# Combinar resultados
df_resultado_modelos = pd.concat([df_knn_final, df_naivebayes_final, df_ann_final, df_svm_final, df_arboles_final, df_rf_total]).reset_index(drop=True)

# Mostrar resultados
df_resultado_modelos

Out[ ]:

	clase	precision	recall	f1-score	support	error	modelo
0	EI	0.633929	0.876543	0.735751	243.0	0.123457	kNN k=1
1	IE	0.662420	0.842105	0.741533	247.0	0.157895	kNN k=1
2	N	0.919799	0.656530	0.766180	559.0	0.343470	kNN k=1
3	EI	0.630986	0.921811	0.749164	243.0	0.078189	kNN k=3
4	IE	0.763699	0.902834	0.827458	247.0	0.097166	kNN k=3
5	N	0.965174	0.694097	0.807492	559.0	0.305903	kNN k=3
6	EI	0.665698	0.942387	0.780239	243.0	0.057613	kNN k=5
7	IE	0.719136	0.943320	0.816112	247.0	0.056680	kNN k=5
8	N	0.989501	0.674419	0.802128	559.0	0.325581	kNN k=5
9	EI	0.728707	0.950617	0.825000	243.0	0.049383	kNN k=7
10	IE	0.750000	0.959514	0.841918	247.0	0.040486	kNN k=7
11	N	0.990385	0.737030	0.845128	559.0	0.262970	kNN k=7
12	EI	0.231649	1.000000	0.376161	243.0	0.000000	Bernoulli B alpha=0
13	IE	0.000000	0.000000	0.000000	247.0	1.000000	Bernoulli B alpha=0
14	N	0.000000	0.000000	0.000000	559.0	1.000000	Bernoulli B alpha=0
15	EI	0.231649	1.000000	0.376161	243.0	0.000000	BernoulliNB alpha=1
16	IE	0.000000	0.000000	0.000000	247.0	1.000000	BernoulliNB alpha=1
17	N	0.000000	0.000000	0.000000	559.0	1.000000	BernoulliNB alpha=1
18	EI	0.958506	0.950617	0.954545	243.0	0.049383	GaussianNB
19	IE	0.905882	0.935223	0.920319	247.0	0.064777	GaussianNB
20	N	0.978300	0.967800	0.973022	559.0	0.032200	GaussianNB
21	EI	0.954733	0.954733	0.954733	243.0	0.045267	ANN p=5
22	IE	0.877323	0.955466	0.914729	247.0	0.044534	ANN p=5
23	N	0.990689	0.951699	0.970803	559.0	0.048301	ANN p=5
24	EI	0.881481	0.979424	0.927875	243.0	0.020576	ANN p=10
25	IE	0.906504	0.902834	0.904665	247.0	0.097166	ANN p=10
26	N	0.984991	0.939177	0.961538	559.0	0.060823	ANN p=10
27	EI	0.960870	0.909465	0.934461	243.0	0.090535	ANN p=20
28	IE	0.925620	0.906883	0.916155	247.0	0.093117	ANN p=20
29	N	0.944541	0.974955	0.959507	559.0	0.025045	ANN p=20
30	EI	0.882129	0.954733	0.916996	243.0	0.045267	SVM linear
31	IE	0.905738	0.894737	0.900204	247.0	0.105263	SVM linear
32	N	0.968635	0.939177	0.953678	559.0	0.060823	SVM linear
33	EI	0.947791	0.971193	0.959350	243.0	0.028807	SVM rbf
34	IE	0.931727	0.939271	0.935484	247.0	0.060729	SVM rbf
35	N	0.985481	0.971377	0.978378	559.0	0.028623	SVM rbf
36	EI	0.893130	0.962963	0.926733	243.0	0.037037	Árbol de Decisión depth=5
37	IE	0.946429	0.858300	0.900212	247.0	0.141700	Árbol de Decisión depth=5
38	N	0.959147	0.966011	0.962567	559.0	0.033989	Árbol de Decisión depth=5
39	EI	0.932540	0.967078	0.949495	243.0	0.032922	Árbol con Boosting depth=5
40	IE	0.942857	0.935223	0.939024	247.0	0.064777	Árbol con Boosting depth=5
41	N	0.983696	0.971377	0.977498	559.0	0.028623	Árbol con Boosting depth=5
42	EI	0.925197	0.967078	0.945674	243.0	0.032922	Random Forest n=50
43	IE	0.941909	0.919028	0.930328	247.0	0.080972	Random Forest n=50
44	N	0.983755	0.974955	0.979335	559.0	0.025045	Random Forest n=50
45	EI	0.928854	0.967078	0.947581	243.0	0.032922	Random Forest n=100
46	IE	0.937759	0.914980	0.926230	247.0	0.085020	Random Forest n=100
47	N	0.980180	0.973166	0.976661	559.0	0.026834	Random Forest n=100

In [ ]:

Copied!





# Graficar el f1-score por modelo y clase
pivot_f1_score = df_resultado_modelos.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar el f1-score por modelo y clase
pivot_f1_score = df_resultado_modelos.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()

In [ ]:

Copied!





# Graficar el error por modelo y clase
pivot_error = df_resultado_modelos.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar el error por modelo y clase
pivot_error = df_resultado_modelos.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()

Overall comment on model selection¶

In this project, where the goal is to predict splicing sites (classes EI and IE), it is crucial to choose models that not only achieve good overall f1-scores but are also precise for these classes without confusing non-splicing regions (class N).

The best-performing models for this task are SVM RBF, Random Forest (n>50), and Boosted Tree, as they offer the best balance between EI/IE classes without sacrificing specificity for class N.

ANN models also perform very well, with high and balanced f1-scores. However, GaussianNB offers nearly the same performance with lower computational cost.

Finally, kNN models, although they improve with higher k values, have lower f1-scores than the other models. The BernoulliNB model was completely discarded as it could not capture the data complexity.

Classification of DNA Splice Junctions¶

1 Installation of Dependencies and Setup of the Working Directory¶

2 Data Reading and Preparation¶

2.1 Read Data¶

2.2 How many unique seq_name entries does the dataset have?¶

2.3 How many label types exist?¶

2.4 Prepare the encoded sequences in the dataset¶

2.5 Prepare the dataset labels¶

2.6 Create training and test datasets¶

3 Implementation of the Convolutional Autoencoder¶

4 Application of Algorithms¶

4.1 Implementing the kNN Model for k = 1, 3, 5, 7¶

Comment on kNN performance¶

4.2 Implementing Naive Bayes¶

Comment on Naive Bayes performance¶

4.3 Implementing Artificial Neural Network (ANN)¶

Comment on ANN model performance¶

4.4 Implement the Support Vector Machine model¶

Comment on SVM (Linear and RBF) performance¶

4.5 Implement the Decision Tree model¶

Comment on Decision Tree model¶

4.6 Implement the Random Forest model with n = 50 and 100¶

Comment on Random Forest model¶

5 Comparison of proposed models' performance¶

Overall comment on model selection¶

2.2 How many unique `seq_name` entries does the dataset have?¶