Classification of DNA Splice Junctions¶
by Salomon Marquez
25/06/2025
Splice junctions are points within a DNA sequence where “superfluous” DNA is removed during the protein synthesis process in higher organisms.
The goal of this project is to identify, given a DNA sequence, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are cut out). In the biological community, exon-intron (EI) boundaries are referred to as acceptors, while intron-exon (IE) boundaries are known as donors.
To predict the type of splicing site, several machine learning algorithms will be implemented, including: k-Nearest Neighbour, Naive Bayes, Artificial Neural Network, Support Vector Machine, Decision Tree, and Random Forest. Finally, the following metrics will be evaluated: precision, recall, f1-score, and error, in order to determine which algorithms performed best for this task.
Visit the repository of the project to check out:
- Splice junction data
- Colab notebook
1 Installation of Dependencies and Setup of the Working Directory¶
import numpy as np
import pandas as pd
import os
import random
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import CategoricalNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Conv1D, Conv1DTranspose, Flatten, Dense, Reshape, Dropout
from IPython.display import Image
# Fijar semilla para obtener resultados reproducibles cuando se ejecuta el notebook
# Fijar la semilla
seed_value = 123
os.environ['PYTHONHASHSEED'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)
# Obtener el directorio actual
actual_wd = os.getcwd()
print("Ruta actual:", actual_wd)
# Establecer directorio de trabajo en Gdrive
# os.chdir("/content/drive/MyDrive/ASIGNATURAS/M0.163 MACHINE LEARNING/[28 MAY - 17 JUN] RETO 4/PEC4")
# Visualizar contenido del directorio de trabajo
!ls
autoencoder_image.png enunciado_PEC4_2425_2.pdf splice.csv deep_descriptors.csv PEC4_Machine_Learning.html deep_descriptors.gsheet PEC4_Machine_Learning.ipynb
2 Data Reading and Preparation¶
In this section, we will answer some general questions about the dataset contained in the file splice.csv
2.1 Read Data¶
# Especificar el nombre del archivo de origen
file_name = "splice.csv"
file_path = os.path.join(actual_wd, file_name)
# Guardar en un dataframe el contenido de splice.csv
df_splice = pd.read_csv(file_path, delimiter=',')
# Visualizar contenido del dataframe splice
df_splice.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3190 entries, 0 to 3189 Columns: 482 entries, class to V480 dtypes: int64(480), object(2) memory usage: 11.7+ MB
# Visualizar las 5 primeras filas
df_splice.head(5)
| class | seq_name | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | ... | V471 | V472 | V473 | V474 | V475 | V476 | V477 | V478 | V479 | V480 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EI | ATRINS-DONOR-521 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | EI | ATRINS-DONOR-905 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | EI | BABAPOE-DONOR-30 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | EI | BABAPOE-DONOR-867 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | EI | BABAPOE-DONOR-2817 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
5 rows × 482 columns
# Visualizar las 5 últimas filas
df_splice.tail(5)
| class | seq_name | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | ... | V471 | V472 | V473 | V474 | V475 | V476 | V477 | V478 | V479 | V480 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3185 | N | ORAHBPSBD-NEG-2881 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3186 | N | ORAINVOL-NEG-2161 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3187 | N | ORARGIT-NEG-241 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3188 | N | TARHBB-NEG-541 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3189 | N | TARHBD-NEG-1981 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
5 rows × 482 columns
2.2 How many unique seq_name entries does the dataset have?¶
# Obtener el número de secuencias únicas
n_unique = df_splice['seq_name'].nunique()
n_unique
3178
# Calcular el número de secuencias repetidas
print(f"El dataset contiene {len(df_splice)-n_unique} secuencia repetidas")
El dataset contiene 12 secuencia repetidas
# Alternativamente para calcular duplicados
df_splice.duplicated().sum()
np.int64(12)
# Identificar las secuencias repetidas
df_splice_unique = df_splice['seq_name'].value_counts()
df_splice_unique.head(15)
| count | |
|---|---|
| seq_name | |
| HUMMYC3L-ACCEPTOR-4242 | 2 |
| HUMALBGC-DONOR-17044 | 2 |
| HUMMYLCA-DONOR-2559 | 2 |
| HUMMYLCA-DONOR-2388 | 2 |
| HUMMYLCA-DONOR-1975 | 2 |
| HUMMYLCA-DONOR-952 | 2 |
| HUMALBGC-ACCEPTOR-18496 | 2 |
| HUMMYLCA-ACCEPTOR-924 | 2 |
| HUMMYLCA-ACCEPTOR-1831 | 2 |
| HUMMYLCA-ACCEPTOR-2214 | 2 |
| HUMMYLCA-ACCEPTOR-2481 | 2 |
| HUMMYLCA-DONOR-644 | 2 |
| HUMGAPJR-NEG-961 | 1 |
| HUMGBR-NEG-2521 | 1 |
| HUMGALAB-NEG-901 | 1 |
# Visualizar un par de secuencias repetidas
df_splice[df_splice['seq_name'].str.contains('HUMMYC3L-ACCEPTOR-4242')]
| class | seq_name | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | ... | V471 | V472 | V473 | V474 | V475 | V476 | V477 | V478 | V479 | V480 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1316 | IE | HUMMYC3L-ACCEPTOR-4242 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1317 | IE | HUMMYC3L-ACCEPTOR-4242 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 rows × 482 columns
# Eliminar duplicados
df_splice_n = df_splice.drop_duplicates()
print(f"El dataset originalmente contenía {len(df_splice)} registros y después de eliminar duplicados se tienen {len(df_splice_n)} registros ")
El dataset originalmente contenía 3190 registros y después de eliminar duplicados se tienen 3178 registros
2.3 How many label types exist?¶
df_splice_label = df_splice_n['class'].value_counts()
df_splice_label
| count | |
|---|---|
| class | |
| N | 1655 |
| IE | 762 |
| EI | 761 |
We have an imbalanced dataset where slightly more than 50% of the sequences belong to the “no splicing” category. This must be taken into account because the classification models proposed below could become biased toward predicting "N."
2.4 Prepare the encoded sequences in the dataset¶
Here we define the variable seq_encoded, which will contain the 3,178 records and their 480 features.
# Seleccionar desde V1 hasta V480
seq_encoded = df_splice_n.iloc[:,2:]
# Verificar el tipo de variable
type(seq_encoded)
pandas.core.frame.DataFrame
def __init__(data=None, index: Axes | None=None, columns: Axes | None=None, dtype: Dtype | None=None, copy: bool | None=None) -> None
Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure. Parameters ---------- data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. If a dict contains Series which have an index defined, it is aligned by its index. This alignment also occurs if data is a Series or a DataFrame itself. Alignment is done on Series/DataFrame inputs. If data is a list of dicts, column order follows insertion-order. index : Index or array-like Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided. columns : Index or array-like Column labels to use for resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels, will perform column selection instead. dtype : dtype, default None Data type to force. Only a single dtype is allowed. If None, infer. copy : bool or None, default None Copy data from inputs. For dict data, the default of None behaves like ``copy=True``. For DataFrame or 2d ndarray input, the default of None behaves like ``copy=False``. If data is a dict containing one or more Series (possibly of different dtypes), ``copy=False`` will ensure that these inputs are not copied. .. versionchanged:: 1.3.0 See Also -------- DataFrame.from_records : Constructor from tuples, also record arrays. DataFrame.from_dict : From dicts of Series, arrays, or dicts. read_csv : Read a comma-separated values (csv) file into DataFrame. read_table : Read general delimited file into DataFrame. read_clipboard : Read text from clipboard into DataFrame. Notes ----- Please reference the :ref:`User Guide <basics.dataframe>` for more information. Examples -------- Constructing DataFrame from a dictionary. >>> d = {'col1': [1, 2], 'col2': [3, 4]} >>> df = pd.DataFrame(data=d) >>> df col1 col2 0 1 3 1 2 4 Notice that the inferred dtype is int64. >>> df.dtypes col1 int64 col2 int64 dtype: object To enforce a single dtype: >>> df = pd.DataFrame(data=d, dtype=np.int8) >>> df.dtypes col1 int8 col2 int8 dtype: object Constructing DataFrame from a dictionary including Series: >>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])} >>> pd.DataFrame(data=d, index=[0, 1, 2, 3]) col1 col2 0 0 NaN 1 1 NaN 2 2 2.0 3 3 3.0 Constructing DataFrame from numpy ndarray: >>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), ... columns=['a', 'b', 'c']) >>> df2 a b c 0 1 2 3 1 4 5 6 2 7 8 9 Constructing DataFrame from a numpy ndarray that has labeled columns: >>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)], ... dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")]) >>> df3 = pd.DataFrame(data, columns=['c', 'a']) ... >>> df3 c a 0 3 1 1 6 4 2 9 7 Constructing DataFrame from dataclass: >>> from dataclasses import make_dataclass >>> Point = make_dataclass("Point", [("x", int), ("y", int)]) >>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)]) x y 0 0 0 1 0 3 2 2 3 Constructing DataFrame from Series/DataFrame: >>> ser = pd.Series([1, 2, 3], index=["a", "b", "c"]) >>> df = pd.DataFrame(data=ser, index=["a", "c"]) >>> df 0 a 1 c 3 >>> df1 = pd.DataFrame([1, 2, 3], index=["a", "b", "c"], columns=["x"]) >>> df2 = pd.DataFrame(data=df1, index=["a", "c"]) >>> df2 x a 1 c 3
# Convertir datos a numpy.array
seq_encoded_array = seq_encoded.to_numpy()
seq_encoded_array[:5]
array([[0, 0, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0],
[0, 1, 0, ..., 0, 0, 0],
[0, 1, 0, ..., 0, 0, 0],
[0, 1, 0, ..., 0, 0, 0]])
2.5 Prepare the dataset labels¶
The Class variable in the splicing dataset has three categories: EI, IE, and N. No particular positive class is defined here. This variable must be converted to a numeric type so it can be used in our classification models.
labels = df_splice_n['class'].map({'EI': 0, 'IE': 1, 'N': 2})
# Convertir datos a numpy.array
labels_array = labels.to_numpy()
labels_array[:5]
array([0, 0, 0, 0, 0])
2.6 Create training and test datasets¶
# Comprobar que las variables y los labels para entrenar los modelos de clasificación sean tipo array
print(f"La secuencias codificadas por one-hot encoding son de tipo {type(seq_encoded_array)} con dimensión {seq_encoded_array.shape}\n"
f"y la variable label es de tipo {type(labels_array)} con dimensión {labels_array.shape}")
La secuencias codificadas por one-hot encoding son de tipo <class 'numpy.ndarray'> con dimensión (3178, 480) y la variable label es de tipo <class 'numpy.ndarray'> con dimensión (3178,)
# Crear train y test datasets
X_train, X_test, y_train, y_test = train_test_split(
seq_encoded_array,
labels_array,
test_size = 0.33, # 33% para el dataset test como lo indica el enunciado de la PEC4
random_state= 123 # Fijamos la semilla aleatoria en 123
)
# Visualizar la dimensiones de los 4 datases creados a partir de train_test_split()
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((2129, 480), (2129,), (1049, 480), (1049,))
3 Implementation of the Convolutional Autoencoder¶
In this section, we build an autoencoder with the architecture shown in the figure, considering the purpose of each component as described below:
The encoder is a regular CNN composed of convolutional layers and pooling layers. It typically reduces the spatial dimensionality of the inputs (i.e., height and width) while increasing the depth (i.e., the number of feature maps). The decoder must do the reverse (upscale the image and reduce its depth back to the original dimensions), and for this you can use transpose convolutional layers (alternatively, you could combine upsampling layers with convolutional layers). Texto extraído de Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition sección Convolutional Autoencoders.
Because DNA sequences are linear — an ordered chain of nucleotides — we use Conv1D to detect local motifs (such as splicing patterns).
The only way to train the encoder to generate useful representations is to force it to reconstruct the original input — this is where the decoder comes in. Without a decoder, the encoder would not learn anything meaningful, because there would be no metric to tell whether the compression preserves useful information. Once the decoder is trained, we can use the encoder alone for classification models.
Image("autoencoder_image.png",width=600, height=400)
# Reajustar las dimensiones de X_train puesto que Conv1D requiere datos en formato 3D
m, n = X_train.shape # Dimensiones del dataset train
p, r = X_test.shape # Dimensiones del dataset test
X_train_reshaped = X_train.reshape((m, 60, 8))
X_test_reshaped = X_test.reshape((p, 60, 8))
# AUTOENCODER
# Definir capa de entrada
input_layer = Input(shape=(60, 8))
# Definir encoder
x = Conv1D(filters=8, kernel_size=3, activation='relu', padding='same')(input_layer)
bottleneck = Flatten()(x)
#bottleneck = Dense(32, activation='relu')(x)
# Definir decoder
#x = Dense(60 * 8, activation='relu')(bottleneck)
x = Reshape((60, 8))(bottleneck)
output_layer = Conv1DTranspose(filters=8, kernel_size=3, activation='sigmoid', padding='same')(x)
# Definir modelo
autoencoder = Model(inputs=input_layer, outputs=output_layer)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# Entrenar autoencoder
history = autoencoder.fit(X_train_reshaped, X_train_reshaped, epochs=20, batch_size=32, validation_split=0.2)
Epoch 1/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 2s 13ms/step - loss: 0.7044 - val_loss: 0.6261 Epoch 2/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.5937 - val_loss: 0.4684 Epoch 3/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.4215 - val_loss: 0.3057 Epoch 4/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.2815 - val_loss: 0.2242 Epoch 5/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.2105 - val_loss: 0.1758 Epoch 6/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1660 - val_loss: 0.1404 Epoch 7/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.1325 - val_loss: 0.1122 Epoch 8/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1057 - val_loss: 0.0897 Epoch 9/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0842 - val_loss: 0.0718 Epoch 10/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0673 - val_loss: 0.0577 Epoch 11/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0539 - val_loss: 0.0466 Epoch 12/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0435 - val_loss: 0.0380 Epoch 13/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0354 - val_loss: 0.0313 Epoch 14/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.0291 - val_loss: 0.0261 Epoch 15/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0241 - val_loss: 0.0220 Epoch 16/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0202 - val_loss: 0.0187 Epoch 17/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0172 - val_loss: 0.0162 Epoch 18/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0147 - val_loss: 0.0141 Epoch 19/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0127 - val_loss: 0.0124 Epoch 20/20 54/54 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.0111 - val_loss: 0.0110
# Evaluar la pérdida de reconstrucción del autoencoder
train_loss = autoencoder.evaluate(X_train_reshaped, X_train_reshaped)
test_loss = autoencoder.evaluate(X_test_reshaped, X_test_reshaped)
print(f"\nTrain loss: {train_loss:.4f}")
print(f"Test loss: {test_loss:.4f}")
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.0101 33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0101 Train loss: 0.0103 Test loss: 0.0101
# Visualizar la curva de entrenamiento del autoencoder
plt.plot(history.history['loss'], label='Train loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Pérdida de entrenamiento vs validación')
plt.show()
We observe that the autoencoder quality is good since the reconstruction loss is very low.
# Extraer la parte codificadora "encoder" para usarla en otros modelos
encoder = Model(inputs=input_layer, outputs=bottleneck)
X_train_encoded = encoder.predict(X_train_reshaped)
67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
# Verificar las dimensiones de las nuevas coordenadas de X_train_encoded
X_train_encoded.shape, X_train.shape
((2129, 480), (2129, 480))
# Visualizar X_train_encoded
print(X_train_encoded)
[[0.41865867 1.1792793 0. ... 1.0533024 1.2378863 1.4537132 ] [0.97194517 1.2243328 0.7337566 ... 1.6940844 0.6968196 2.2641408 ] [0.7622293 0.27394313 1.5169445 ... 1.1491497 1.7781353 0.6089502 ] ... [0. 0.19901627 1.6442008 ... 0.92114854 1.2606764 0. ] [0.55828255 1.0124682 0.6177381 ... 1.4002684 0.5707153 1.0705616 ] [0.18931112 0.25097656 1.0731099 ... 2.1333911 0.10522419 1.4503322 ]]
# Obtener las nuevas coordenadas de X_test
X_test_encoded = encoder.predict(X_test_reshaped)
33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
# Verificar las dimensiones de las nuevas coordenadas de X_test_encoded
X_test_encoded.shape, X_test.shape
((1049, 480), (1049, 480))
# Visualizar X_test_encoded
print(X_test_encoded)
[[0.18931112 0.25097656 1.0731099 ... 0.8388007 1.4651122 1.0704793 ] [0.55828255 1.0124682 0.6177381 ... 1.4002684 0.5707153 1.0705616 ] [0.55828255 1.0124682 0.6177381 ... 0.8388007 1.4651122 1.0704793 ] ... [0.34228757 1.3086706 0.33683705 ... 1.0533024 1.2378863 1.4537132 ] [1.4116294 0.32590342 0.94585365 ... 1.0568092 1.7025597 1.0396074 ] [0.16956863 0.1146785 2.0411205 ... 1.7764323 0.49238378 1.1906232 ]]
4 Application of Algorithms¶
4.1 Implementing the kNN Model for k = 1, 3, 5, 7¶
To generate the kNN model predictions based on the value of k, we create a function that performs the following main tasks:
- Build the model
- Train the model
- Make predictions
- Calculate performance metrics
The function takes k, X_train_encoded, y_train, X_test_encoded, and y_test as input arguments and outputs a metrics dataframe. The model for k=3 is created first.
# Crear modelo k-nn con k = 3
k = 3 # Valor del nearest neighbor k = [1, 3, 5, 7]
model = KNeighborsClassifier(n_neighbors=k, metric='euclidean') # Distancia euclidiana
# Entrenar modelo k-nn
model.fit(X_train_encoded, y_train)
# Hacer predicciones con el modelo k-nn entrenado usando X_test
y_pred = model.predict(X_test_encoded)
# Imprimir reporte de clasificación
# Recordar que {'EI': 0, 'IE': 1, 'N': 2}
print(classification_report(y_test, y_pred, target_names=['Clase EI', 'Clase IE', 'Clase N']))
precision recall f1-score support
Clase EI 0.61 0.93 0.74 243
Clase IE 0.78 0.87 0.82 247
Clase N 0.96 0.70 0.81 559
accuracy 0.79 1049
macro avg 0.78 0.83 0.79 1049
weighted avg 0.83 0.79 0.79 1049
# Obtener un dataframe a partir de la matriz de confusion
report = classification_report(y_test, y_pred, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
df_report['error'] = 1 - df_report['recall']
df_report['k'] = k
df_report['modelo'] = "kNN"
df_report.reset_index(inplace=True)
df_report.rename(columns={'index': 'clase'}, inplace=True)
df_report
| clase | precision | recall | f1-score | support | error | k | modelo | |
|---|---|---|---|---|---|---|---|---|
| 0 | EI | 0.614754 | 0.925926 | 0.738916 | 243.0 | 0.074074 | 3 | kNN |
| 1 | IE | 0.775362 | 0.866397 | 0.818356 | 247.0 | 0.133603 | 3 | kNN |
| 2 | N | 0.955774 | 0.695886 | 0.805383 | 559.0 | 0.304114 | 3 | kNN |
# Definir función evaluar_knn_por_k()
def evaluar_knn_por_k(k, X_train, y_train, X_test, y_test):
"""
Entrena y evalúa un modelo k-NN con valor k dado.
Retorna un DataFrame con precision, recall, f1-score, error por clase y k.
"""
# Entrenar modelo
model = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
model.fit(X_train, y_train)
# Predicción
y_pred = model.predict(X_test)
# Clasification report por clase
report = classification_report(
y_test,
y_pred,
output_dict=True,
target_names=['EI', 'IE', 'N']
)
# Convertir a DataFrame
df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
df_report['error'] = 1 - df_report['recall']
df_report['modelo'] = 'kNN' + ' k=' + str(k)
df_report.reset_index(inplace=True)
df_report.rename(columns={'index': 'clase'}, inplace=True)
return df_report
# Probar con diferentes valores de k
resultados = []
for k in [1, 3, 5, 7]:
df_k = evaluar_knn_por_k(k, X_train_encoded, y_train, X_test_encoded, y_test)
resultados.append(df_k)
# Unir todos los resultados
df_knn_final = pd.concat(resultados, ignore_index=True)
df_knn_final
| clase | precision | recall | f1-score | support | error | modelo | |
|---|---|---|---|---|---|---|---|
| 0 | EI | 0.638554 | 0.872428 | 0.737391 | 243.0 | 0.127572 | kNN k=1 |
| 1 | IE | 0.667722 | 0.854251 | 0.749556 | 247.0 | 0.145749 | kNN k=1 |
| 2 | N | 0.922693 | 0.661896 | 0.770833 | 559.0 | 0.338104 | kNN k=1 |
| 3 | EI | 0.614754 | 0.925926 | 0.738916 | 243.0 | 0.074074 | kNN k=3 |
| 4 | IE | 0.775362 | 0.866397 | 0.818356 | 247.0 | 0.133603 | kNN k=3 |
| 5 | N | 0.955774 | 0.695886 | 0.805383 | 559.0 | 0.304114 | kNN k=3 |
| 6 | EI | 0.684524 | 0.946502 | 0.794473 | 243.0 | 0.053498 | kNN k=5 |
| 7 | IE | 0.751656 | 0.919028 | 0.826958 | 247.0 | 0.080972 | kNN k=5 |
| 8 | N | 0.975669 | 0.717352 | 0.826804 | 559.0 | 0.282648 | kNN k=5 |
| 9 | EI | 0.707317 | 0.954733 | 0.812609 | 243.0 | 0.045267 | kNN k=7 |
| 10 | IE | 0.759868 | 0.935223 | 0.838475 | 247.0 | 0.064777 | kNN k=7 |
| 11 | N | 0.980815 | 0.731664 | 0.838115 | 559.0 | 0.268336 | kNN k=7 |
# Graficar el f1-score por modelo y clase
pivot_f1_score = df_knn_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar la error por modelo y clase
pivot_error = df_knn_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
### 4.1 Implementar el modelo kNN para k = 1, 3, 5 y 7
plt.show()
Comment on kNN performance¶
The evaluation strategy for the proposed models is the following:
- Focus on F1-score because it balances precision and recall, which is important for multiclass classification.
- Study the error.
- Evaluate per-class performance to ensure the model is not just optimizing for class N.
kNN shows progressive improvement in F1-score as k increases, meaning it becomes more robust as more neighbors are considered. The error decreases accordingly.
Class-by-class F1-scores reveal that IE and N perform more consistently, while EI is harder to identify correctly. Interestingly, class N shows the highest error, suggesting room for improvement.
4.2 Implementing Naive Bayes¶
To evaluate the performance of the Naive Bayes model depending on the type of data in Python, two approaches are proposed.
First, the BernoulliNB classifier with Laplace smoothing (alpha = 1 in Python) is applied to binary data obtained through one-hot encoding.
Next, GaussianNB is used on continuous data derived from the convolutional autoencoder, since this model assumes a normal distribution of the variables.
This strategy allows us to compare the impact of different input transformations on the performance of the Naive Bayes classifier.
For more information on implementing the Naive Bayes model, refer to
BernoulliNB and
GaussianNB.
# Entrenar BernoulliNB sin suavizado (alpha=0) con datos one-hot encoded
modelNB_alpha0 = BernoulliNB(alpha=0.0)
modelNB_alpha0.fit(X_train, y_train)
# Predecir
y_pred_NB_alpha0 = modelNB_alpha0.predict(X_test)
print("Predicciones:", y_pred_NB_alpha0)
Predicciones: [0 0 0 ... 0 0 0]
/usr/local/lib/python3.11/dist-packages/sklearn/naive_bayes.py:1209: RuntimeWarning: divide by zero encountered in log self.feature_log_prob_ = np.log(smoothed_fc) - np.log( /usr/local/lib/python3.11/dist-packages/sklearn/utils/extmath.py:203: RuntimeWarning: invalid value encountered in matmul ret = a @ b
# Obtener un dataframe a partir de la matriz de confusion
report_NB_alpha0 = classification_report(y_test, y_pred_NB_alpha0, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report_NB_alpha0 = pd.DataFrame(report_NB_alpha0 ).T.loc[['EI', 'IE', 'N']]
df_report_NB_alpha0 ['error'] = 1 - df_report_NB_alpha0 ['recall']
df_report_NB_alpha0 ['modelo'] = "Bernoulli B alpha=0"
df_report_NB_alpha0
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
| precision | recall | f1-score | support | error | modelo | |
|---|---|---|---|---|---|---|
| EI | 0.231649 | 1.0 | 0.376161 | 243.0 | 0.0 | Bernoulli B alpha=0 |
| IE | 0.000000 | 0.0 | 0.000000 | 247.0 | 1.0 | Bernoulli B alpha=0 |
| N | 0.000000 | 0.0 | 0.000000 | 559.0 | 1.0 | Bernoulli B alpha=0 |
# Entrenar BernoulliNB con suavizado (alpha=1) con datos one-hot encoded
modelNB_alpha1 = BernoulliNB(alpha=1.0)
modelNB_alpha1.fit(X_train, y_train)
# Predecir
y_pred_NB_alpha1 = modelNB_alpha0.predict(X_test)
print("Predicciones:", y_pred_NB_alpha1)
Predicciones: [0 0 0 ... 0 0 0]
/usr/local/lib/python3.11/dist-packages/sklearn/utils/extmath.py:203: RuntimeWarning: invalid value encountered in matmul ret = a @ b
# Obtener un dataframe a partir de la matriz de confusion
report_NB_alpha1 = classification_report(y_test, y_pred_NB_alpha1, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report_NB_alpha1 = pd.DataFrame(report_NB_alpha1).T.loc[['EI', 'IE', 'N']]
df_report_NB_alpha1['error'] = 1 - df_report_NB_alpha1['recall']
df_report_NB_alpha1['modelo'] = "BernoulliNB alpha=1"
df_report_NB_alpha1
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
| precision | recall | f1-score | support | error | modelo | |
|---|---|---|---|---|---|---|
| EI | 0.231649 | 1.0 | 0.376161 | 243.0 | 0.0 | BernoulliNB alpha=1 |
| IE | 0.000000 | 0.0 | 0.000000 | 247.0 | 1.0 | BernoulliNB alpha=1 |
| N | 0.000000 | 0.0 | 0.000000 | 559.0 | 1.0 | BernoulliNB alpha=1 |
# Entrenar GaussianNB con datos del autoencoder
modelNB = GaussianNB()
modelNB.fit(X_train_encoded, y_train)
# Predecir
y_pred_NB = modelNB.predict(X_test_encoded)
print("Predicciones:", y_pred_NB)
Predicciones: [1 1 2 ... 0 2 1]
# Obtener un dataframe a partir de la matriz de confusion
report_NB = classification_report(y_test, y_pred_NB, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report_NB = pd.DataFrame(report_NB).T.loc[['EI', 'IE', 'N']]
df_report_NB['error'] = 1 - df_report_NB['recall']
df_report_NB['modelo'] = "GaussianNB"
df_report_NB
| precision | recall | f1-score | support | error | modelo | |
|---|---|---|---|---|---|---|
| EI | 0.942623 | 0.946502 | 0.944559 | 243.0 | 0.053498 | GaussianNB |
| IE | 0.917647 | 0.947368 | 0.932271 | 247.0 | 0.052632 | GaussianNB |
| N | 0.983636 | 0.967800 | 0.975654 | 559.0 | 0.032200 | GaussianNB |
# Unir todos los resultados
df_naivebayes_final = pd.concat([df_report_NB_alpha0, df_report_NB_alpha1, df_report_NB]).reset_index().rename(columns={'index': 'clase'})
df_naivebayes_final
| clase | precision | recall | f1-score | support | error | modelo | |
|---|---|---|---|---|---|---|---|
| 0 | EI | 0.231649 | 1.000000 | 0.376161 | 243.0 | 0.000000 | Bernoulli B alpha=0 |
| 1 | IE | 0.000000 | 0.000000 | 0.000000 | 247.0 | 1.000000 | Bernoulli B alpha=0 |
| 2 | N | 0.000000 | 0.000000 | 0.000000 | 559.0 | 1.000000 | Bernoulli B alpha=0 |
| 3 | EI | 0.231649 | 1.000000 | 0.376161 | 243.0 | 0.000000 | BernoulliNB alpha=1 |
| 4 | IE | 0.000000 | 0.000000 | 0.000000 | 247.0 | 1.000000 | BernoulliNB alpha=1 |
| 5 | N | 0.000000 | 0.000000 | 0.000000 | 559.0 | 1.000000 | BernoulliNB alpha=1 |
| 6 | EI | 0.942623 | 0.946502 | 0.944559 | 243.0 | 0.053498 | GaussianNB |
| 7 | IE | 0.917647 | 0.947368 | 0.932271 | 247.0 | 0.052632 | GaussianNB |
| 8 | N | 0.983636 | 0.967800 | 0.975654 | 559.0 | 0.032200 | GaussianNB |
# Graficar el f1-score por modelo y clase
pivot_f1_score = df_naivebayes_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar la error por modelo y clase
pivot_error = df_naivebayes_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
Comment on Naive Bayes performance¶
As mentioned at the beginning of this section, two variants of Naive Bayes were tested: BernoulliNB (to evaluate Laplace smoothing) and GaussianNB.
The BernoulliNB model, which used data encoded through one-hot encoding, did not produce useful results.
This is because the binary data derived from DNA sequences tend to be sparse and highly correlated, which makes it difficult for the model to capture relevant patterns — especially given that BernoulliNB assumes feature independence.
In contrast, the GaussianNB model, which was fed with data transformed by the encoder, showed very strong performance.
It achieved high and consistent F1-scores across all three classes, as well as very low error rates.
4.3 Implementing Artificial Neural Network (ANN)¶
In this section, an artificial neural network is built with an architecture containing two hidden layers of 100 and p nodes, exploring p = 5, 10, and 20. We start the model with p = 20 and later scale the number of nodes.
# Definir nodo
p = 20
# Definir la arquitectura de la ANN
model = Sequential([
Input(shape=(X_train_encoded.shape[1],)),
Dense(100, activation='relu'),
Dense(p, activation='relu'),
Dense(3, activation='softmax') # Ya que se tienen más de 2 categorías
])
# Ver detalles del modelo
model.summary()
Model: "sequential_8"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ dense_24 (Dense) │ (None, 100) │ 48,100 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_25 (Dense) │ (None, 20) │ 2,020 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_26 (Dense) │ (None, 3) │ 63 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 50,183 (196.03 KB)
Trainable params: 50,183 (196.03 KB)
Non-trainable params: 0 (0.00 B)
# Compilar modelo
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy', # Puesto que las categorías de y_train son enteras [0, 1, 2]
metrics=['accuracy'])
# Definir épocas, tamaño del batch y entrenar el modelo
n_batch = 32
n_epochs = 20
mfit = model.fit(X_train_encoded, y_train,
epochs=n_epochs,
batch_size=n_batch)
Epoch 1/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.5665 - loss: 0.9898 Epoch 2/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.8900 - loss: 0.3338 Epoch 3/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.9415 - loss: 0.1934 Epoch 4/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9525 - loss: 0.1493 Epoch 5/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9655 - loss: 0.1159 Epoch 6/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9739 - loss: 0.0986 Epoch 7/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9715 - loss: 0.0910 Epoch 8/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9725 - loss: 0.0824 Epoch 9/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9725 - loss: 0.0859 Epoch 10/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9717 - loss: 0.0829 Epoch 11/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9720 - loss: 0.0820 Epoch 12/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9534 - loss: 0.1180 Epoch 13/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9412 - loss: 0.1866 Epoch 14/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9617 - loss: 0.1020 Epoch 15/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9781 - loss: 0.0762 Epoch 16/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9778 - loss: 0.0704 Epoch 17/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9730 - loss: 0.0781 Epoch 18/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9680 - loss: 0.0903 Epoch 19/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9769 - loss: 0.0664 Epoch 20/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9795 - loss: 0.0596
# Evaluar modelo
loss, acc = model.evaluate(X_test_encoded, y_test)
y_pred = model.predict(X_test_encoded)
y_pred_labels = y_pred.argmax(axis=1)
33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8993 - loss: 0.2753 33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
# Visualizar las primeras predicciones
y_pred_labels[:100]
array([1, 1, 2, 2, 2, 2, 2, 1, 0, 0, 1, 0, 0, 1, 2, 0, 2, 0, 2, 0, 2, 0,
1, 2, 2, 2, 0, 2, 2, 2, 0, 0, 1, 0, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2,
2, 2, 2, 1, 2, 2, 0, 0, 2, 1, 2, 2, 2, 2, 1, 1, 2, 0, 1, 2, 0, 2,
2, 0, 2, 1, 2, 1, 1, 2, 1, 0, 2, 0, 1, 2, 2, 0, 2, 1, 2, 1, 1, 2,
2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 0, 1])
# Visualizar los truth labels
y_test[:100]
array([1, 1, 2, 2, 2, 1, 2, 1, 0, 1, 1, 0, 0, 1, 0, 0, 2, 0, 0, 0, 2, 0,
1, 2, 2, 2, 1, 2, 2, 2, 0, 0, 1, 0, 2, 2, 1, 2, 0, 2, 2, 1, 2, 1,
2, 2, 2, 1, 2, 2, 0, 0, 2, 1, 1, 2, 2, 2, 1, 1, 2, 0, 1, 2, 0, 2,
1, 0, 2, 1, 2, 1, 1, 2, 1, 0, 1, 0, 1, 2, 2, 0, 2, 1, 2, 1, 1, 2,
1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 0, 1])
# Obtener reporte de clasificación
report = classification_report(y_test, y_pred_labels, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
df_report['error'] = 1 - df_report['recall']
df_report['modelo'] = 'ANN'
df_report['p'] = p
df_report
| precision | recall | f1-score | support | error | modelo | p | |
|---|---|---|---|---|---|---|---|
| EI | 0.932773 | 0.913580 | 0.923077 | 243.0 | 0.086420 | ANN | 20 |
| IE | 0.982659 | 0.688259 | 0.809524 | 247.0 | 0.311741 | ANN | 20 |
| N | 0.873041 | 0.996422 | 0.930660 | 559.0 | 0.003578 | ANN | 20 |
# Validar el modelo ANN con distintos valores de p = [5, 10, 20]
# Lista para guardar resultados
resultados_ann = []
for p in [5, 10, 20]:
print(f"\nEntrenando red con p = {p} nodos en segunda capa oculta...")
# Definir modelo
model = Sequential([
Input(shape=(X_train_encoded.shape[1],)),
Dense(100, activation='relu'),
Dense(p, activation='relu'),
Dense(3, activation='softmax') # 3 clases
])
# Compilar
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
# Entrenar
model.fit(X_train_encoded, y_train, epochs=20, batch_size=32)
# Evaluar
loss, acc = model.evaluate(X_test_encoded, y_test)
y_pred = model.predict(X_test_encoded)
y_pred_labels = y_pred.argmax(axis=1)
# Reporte de clasificación
report = classification_report(y_test, y_pred_labels, output_dict=True, target_names=['EI', 'IE', 'N'])
df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
df_report['error'] = 1 - df_report['recall']
df_report['modelo'] = 'ANN' + ' p=' + str(p)
resultados_ann.append(df_report)
Entrenando red con p = 5 nodos en segunda capa oculta... Epoch 1/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.4817 - loss: 1.0251 Epoch 2/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7174 - loss: 0.7122 Epoch 3/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7311 - loss: 0.6508 Epoch 4/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7394 - loss: 0.6104 Epoch 5/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7355 - loss: 0.5964 Epoch 6/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7436 - loss: 0.5715 Epoch 7/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7389 - loss: 0.5576 Epoch 8/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7482 - loss: 0.5414 Epoch 9/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7461 - loss: 0.5322 Epoch 10/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7450 - loss: 0.5276 Epoch 11/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7442 - loss: 0.5433 Epoch 12/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7369 - loss: 0.5390 Epoch 13/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7341 - loss: 0.5265 Epoch 14/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.7325 - loss: 0.5235 Epoch 15/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.7423 - loss: 0.5057 Epoch 16/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.7481 - loss: 0.4924 Epoch 17/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7528 - loss: 0.4844 Epoch 18/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7776 - loss: 0.3549 Epoch 19/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9420 - loss: 0.1865 Epoch 20/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9622 - loss: 0.1296 33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9115 - loss: 0.2400 33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step Entrenando red con p = 10 nodos en segunda capa oculta... Epoch 1/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.5541 - loss: 0.9439 Epoch 2/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8711 - loss: 0.3730 Epoch 3/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9430 - loss: 0.1999 Epoch 4/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9568 - loss: 0.1497 Epoch 5/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9606 - loss: 0.1249 Epoch 6/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9654 - loss: 0.1074 Epoch 7/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9657 - loss: 0.1014 Epoch 8/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9686 - loss: 0.0942 Epoch 9/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9654 - loss: 0.0940 Epoch 10/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9663 - loss: 0.0922 Epoch 11/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9588 - loss: 0.1094 Epoch 12/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9443 - loss: 0.1466 Epoch 13/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9430 - loss: 0.1606 Epoch 14/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9715 - loss: 0.0983 Epoch 15/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9747 - loss: 0.0832 Epoch 16/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9801 - loss: 0.0753 Epoch 17/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9808 - loss: 0.0748 Epoch 18/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9780 - loss: 0.0772 Epoch 19/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9783 - loss: 0.0756 Epoch 20/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9752 - loss: 0.0696 33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9418 - loss: 0.1558 33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step Entrenando red con p = 20 nodos en segunda capa oculta... Epoch 1/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.5639 - loss: 0.9263 Epoch 2/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9006 - loss: 0.3122 Epoch 3/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9286 - loss: 0.2126 Epoch 4/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9639 - loss: 0.1353 Epoch 5/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9669 - loss: 0.1155 Epoch 6/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.9678 - loss: 0.1092 Epoch 7/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9654 - loss: 0.1084 Epoch 8/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.9652 - loss: 0.1103 Epoch 9/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9578 - loss: 0.1172 Epoch 10/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9519 - loss: 0.1390 Epoch 11/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9445 - loss: 0.1590 Epoch 12/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9648 - loss: 0.1040 Epoch 13/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9733 - loss: 0.0845 Epoch 14/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9793 - loss: 0.0773 Epoch 15/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9798 - loss: 0.0695 Epoch 16/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9836 - loss: 0.0618 Epoch 17/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9857 - loss: 0.0570 Epoch 18/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9824 - loss: 0.0534 Epoch 19/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9822 - loss: 0.0551 Epoch 20/20 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9850 - loss: 0.0539 33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9065 - loss: 0.3325 33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
# Concatenar resultados
df_ann_final = pd.concat(resultados_ann).reset_index().rename(columns={'index': 'clase'})
# Mostrar resultados
print("Resultados por clase para cada valor de p:")
df_ann_final
Resultados por clase para cada valor de p:
| clase | precision | recall | f1-score | support | error | modelo | |
|---|---|---|---|---|---|---|---|
| 0 | EI | 0.954733 | 0.954733 | 0.954733 | 243.0 | 0.045267 | ANN p=5 |
| 1 | IE | 0.877323 | 0.955466 | 0.914729 | 247.0 | 0.044534 | ANN p=5 |
| 2 | N | 0.990689 | 0.951699 | 0.970803 | 559.0 | 0.048301 | ANN p=5 |
| 3 | EI | 0.881481 | 0.979424 | 0.927875 | 243.0 | 0.020576 | ANN p=10 |
| 4 | IE | 0.906504 | 0.902834 | 0.904665 | 247.0 | 0.097166 | ANN p=10 |
| 5 | N | 0.984991 | 0.939177 | 0.961538 | 559.0 | 0.060823 | ANN p=10 |
| 6 | EI | 0.960870 | 0.909465 | 0.934461 | 243.0 | 0.090535 | ANN p=20 |
| 7 | IE | 0.925620 | 0.906883 | 0.916155 | 247.0 | 0.093117 | ANN p=20 |
| 8 | N | 0.944541 | 0.974955 | 0.959507 | 559.0 | 0.025045 | ANN p=20 |
# Graficar el f1-score por modelo y clase
pivot_f1_score = df_ann_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar la error por modelo y clase
pivot_error = df_ann_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
Comment on ANN model performance¶
The artificial neural network (ANN) model shows solid overall performance, with high and consistent f1-scores across all classes. As the number of nodes in the 2nd hidden layer (p) increases, the f1-score remains steady, though without the clear improvements seen in the kNN model relative to k.
Regarding classes, class N performs best, with f1-scores above 0.96 and minimal errors. Class EI follows, while class IE shows an error increase when p rises from 5 to 10, suggesting possible overfitting.
4.4 Implement the Support Vector Machine model¶
In this section, a Support Vector Machine model is built with two configurations: linear kernel and rbf. We begin with the linear kernel implementation.
# Crear modelo SVM
model = SVC(kernel='linear')
# Entrenar el modelo
model.fit(X_train_encoded, y_train)
SVC(kernel='linear')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
SVC(kernel='linear')
# Hacer Predicciones
y_pred = model.predict(X_test_encoded)
# Crear reporte de clasificación
report = classification_report(
y_test,
y_pred,
output_dict=True,
target_names=['EI', 'IE', 'N']
)
# Crear DataFrame con métricas por clase
df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
df_report['error'] = 1 - df_report['recall']
df_report['kernel'] = 'lineal'
df_report
| precision | recall | f1-score | support | error | kernel | |
|---|---|---|---|---|---|---|
| EI | 0.882129 | 0.954733 | 0.916996 | 243.0 | 0.045267 | lineal |
| IE | 0.905738 | 0.894737 | 0.900204 | 247.0 | 0.105263 | lineal |
| N | 0.968635 | 0.939177 | 0.953678 | 559.0 | 0.060823 | lineal |
# Explorar las dos opciones de kernel lineal y rbf
# Lista para almacenar resultados
resultados_svm = []
# Definir kernels a evaluar
for kernel in ['linear', 'rbf']:
print(f"\nEntrenando SVM con kernel = '{kernel}'...")
# Crear modelo SVM
model = SVC(kernel=kernel)
# Entrenar el modelo
model.fit(X_train_encoded, y_train)
# Predicciones
y_pred = model.predict(X_test_encoded)
# Reporte de clasificación
report = classification_report(
y_test,
y_pred,
output_dict=True,
target_names=['EI', 'IE', 'N']
)
# Crear DataFrame con métricas por clase
df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
df_report['error'] = 1 - df_report['recall']
df_report['modelo'] = "SVM " + kernel
resultados_svm.append(df_report)
Entrenando SVM con kernel = 'linear'... Entrenando SVM con kernel = 'rbf'...
# Concatenar resultados
df_svm_final = pd.concat(resultados_svm).reset_index().rename(columns={'index': 'clase'})
# Mostrar tabla final
print("Resultados por clase para cada kernel:")
df_svm_final
Resultados por clase para cada kernel:
| clase | precision | recall | f1-score | support | error | modelo | |
|---|---|---|---|---|---|---|---|
| 0 | EI | 0.882129 | 0.954733 | 0.916996 | 243.0 | 0.045267 | SVM linear |
| 1 | IE | 0.905738 | 0.894737 | 0.900204 | 247.0 | 0.105263 | SVM linear |
| 2 | N | 0.968635 | 0.939177 | 0.953678 | 559.0 | 0.060823 | SVM linear |
| 3 | EI | 0.947791 | 0.971193 | 0.959350 | 243.0 | 0.028807 | SVM rbf |
| 4 | IE | 0.931727 | 0.939271 | 0.935484 | 247.0 | 0.060729 | SVM rbf |
| 5 | N | 0.985481 | 0.971377 | 0.978378 | 559.0 | 0.028623 | SVM rbf |
# Graficar el f1-score por modelo y clase
pivot_f1_score = df_svm_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar la error por modelo y clase
pivot_error = df_svm_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
Comment on SVM (Linear and RBF) performance¶
The SVM model with linear kernel provides acceptable performance, with f1-scores above 0.90 across all classes. Class N performs best, while class IE has the most error.
The SVM model with RBF kernel shows better f1-scores (> 0.94) and considerably lower errors. The quality improvement is especially noticeable in EI and IE classes, while N also improves, achieving more balanced and consistent performance.
In summary, the SVM model with RBF kernel outperforms the linear kernel in all relevant metrics, showing better generalization.
4.5 Implement the Decision Tree model¶
In this section, a decision tree model is built with two configurations: boosting and no boosting.
# Definir la función evaluar_model_arbol()
def evaluar_modelo_arbol(boosting=False, max_depth=None):
# Seleccionar modelo
if boosting:
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=max_depth, random_state=42)
nombre_modelo = 'Árbol con Boosting'
else:
model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
nombre_modelo = 'Árbol de Decisión'
# Entrenar
model.fit(X_train_encoded, y_train)
# Predecir
y_pred = model.predict(X_test_encoded)
# Reporte de clasificación
report = classification_report(
y_test, y_pred,
output_dict=True,
target_names=['EI', 'IE', 'N']
)
# Formatear DataFrame
df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
df_report['error'] = 1 - df_report['recall']
df_report['modelo'] = nombre_modelo + ' depth=' + str(max_depth)
return df_report
# Evaluar ambos modelos
df_arbol = evaluar_modelo_arbol(boosting=False, max_depth=5)
df_boost = evaluar_modelo_arbol(boosting=True, max_depth=5)
# Combinar resultados
df_arboles_final = pd.concat([df_arbol, df_boost]).reset_index().rename(columns={'index': 'clase'})
# Mostrar
df_arboles_final
| clase | precision | recall | f1-score | support | error | modelo | |
|---|---|---|---|---|---|---|---|
| 0 | EI | 0.893130 | 0.962963 | 0.926733 | 243.0 | 0.037037 | Árbol de Decisión depth=5 |
| 1 | IE | 0.946429 | 0.858300 | 0.900212 | 247.0 | 0.141700 | Árbol de Decisión depth=5 |
| 2 | N | 0.959147 | 0.966011 | 0.962567 | 559.0 | 0.033989 | Árbol de Decisión depth=5 |
| 3 | EI | 0.932540 | 0.967078 | 0.949495 | 243.0 | 0.032922 | Árbol con Boosting depth=5 |
| 4 | IE | 0.942857 | 0.935223 | 0.939024 | 247.0 | 0.064777 | Árbol con Boosting depth=5 |
| 5 | N | 0.983696 | 0.971377 | 0.977498 | 559.0 | 0.028623 | Árbol con Boosting depth=5 |
# Graficar el f1-score por modelo y clase
pivot_f1_score = df_arboles_final.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar la error por modelo y clase
pivot_error = df_arboles_final.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
Comment on Decision Tree model¶
The decision tree model with depth 5 shows solid performance, with f1-scores above 0.90 for all classes and low errors. Class N stands out, maintaining an f1-score of 0.96, while EI and IE are stable, though IE has a higher error.
Applying Boosting improves all metrics. F1-score increases slightly across all classes, reaching values above 0.94. Errors also decrease, especially for class IE.
In summary, the Boosted Tree offers higher and more consistent performance. Combining Boosting with shallow trees seems to better capture the problem structure without overfitting.
4.6 Implement the Random Forest model with n = 50 and 100¶
In this section, a Random Forest model is built with two numbers of trees: n=50 and n=100.
# Definir la función evaluar_random_forest()
def evaluar_random_forest(n_estimators):
# Inicializar modelo
model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
# Entrenar
model.fit(X_train_encoded, y_train)
# Predecir
y_pred = model.predict(X_test_encoded)
# Obtener reporte por clase
report = classification_report(
y_test,
y_pred,
output_dict=True,
target_names=['EI', 'IE', 'N']
)
# Formatear resultado en DataFrame
df_report = pd.DataFrame(report).T.loc[['EI', 'IE', 'N']]
df_report['error'] = 1 - df_report['recall']
df_report['modelo'] = 'Random Forest' + ' n=' + str(n_estimators)
return df_report
# Evaluar Random Forest con 50 y 100 árboles
df_rf_50 = evaluar_random_forest(n_estimators=50)
df_rf_100 = evaluar_random_forest(n_estimators=100)
# Combinar resultados
df_rf_total = pd.concat([df_rf_50, df_rf_100]).reset_index().rename(columns={'index': 'clase'})
# Mostrar resultados
df_rf_total
| clase | precision | recall | f1-score | support | error | modelo | |
|---|---|---|---|---|---|---|---|
| 0 | EI | 0.925197 | 0.967078 | 0.945674 | 243.0 | 0.032922 | Random Forest n=50 |
| 1 | IE | 0.941909 | 0.919028 | 0.930328 | 247.0 | 0.080972 | Random Forest n=50 |
| 2 | N | 0.983755 | 0.974955 | 0.979335 | 559.0 | 0.025045 | Random Forest n=50 |
| 3 | EI | 0.928854 | 0.967078 | 0.947581 | 243.0 | 0.032922 | Random Forest n=100 |
| 4 | IE | 0.937759 | 0.914980 | 0.926230 | 247.0 | 0.085020 | Random Forest n=100 |
| 5 | N | 0.980180 | 0.973166 | 0.976661 | 559.0 | 0.026834 | Random Forest n=100 |
# Graficar el f1-score por modelo y clase
pivot_f1_score = df_rf_total.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar la error por modelo y clase
pivot_error = df_rf_total.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
Comment on Random Forest model¶
The Random Forest model with 50 trees shows very competitive performance. F1-scores exceed 0.93 across all classes, particularly class N. Errors are low and fairly balanced, except for class IE, indicating good generalization.
Increasing the number of trees to 100 does not show improvements. F1-scores remain the same, and errors remain low. This suggests that increasing the number of trees could help stabilize and fine-tune predictions, although current performance appears sufficient.
5 Comparison of proposed models' performance¶
# Crear tabla comparativa de todos los modelos implementados
# Combinar resultados
df_resultado_modelos = pd.concat([df_knn_final, df_naivebayes_final, df_ann_final, df_svm_final, df_arboles_final, df_rf_total]).reset_index(drop=True)
# Mostrar resultados
df_resultado_modelos
| clase | precision | recall | f1-score | support | error | modelo | |
|---|---|---|---|---|---|---|---|
| 0 | EI | 0.633929 | 0.876543 | 0.735751 | 243.0 | 0.123457 | kNN k=1 |
| 1 | IE | 0.662420 | 0.842105 | 0.741533 | 247.0 | 0.157895 | kNN k=1 |
| 2 | N | 0.919799 | 0.656530 | 0.766180 | 559.0 | 0.343470 | kNN k=1 |
| 3 | EI | 0.630986 | 0.921811 | 0.749164 | 243.0 | 0.078189 | kNN k=3 |
| 4 | IE | 0.763699 | 0.902834 | 0.827458 | 247.0 | 0.097166 | kNN k=3 |
| 5 | N | 0.965174 | 0.694097 | 0.807492 | 559.0 | 0.305903 | kNN k=3 |
| 6 | EI | 0.665698 | 0.942387 | 0.780239 | 243.0 | 0.057613 | kNN k=5 |
| 7 | IE | 0.719136 | 0.943320 | 0.816112 | 247.0 | 0.056680 | kNN k=5 |
| 8 | N | 0.989501 | 0.674419 | 0.802128 | 559.0 | 0.325581 | kNN k=5 |
| 9 | EI | 0.728707 | 0.950617 | 0.825000 | 243.0 | 0.049383 | kNN k=7 |
| 10 | IE | 0.750000 | 0.959514 | 0.841918 | 247.0 | 0.040486 | kNN k=7 |
| 11 | N | 0.990385 | 0.737030 | 0.845128 | 559.0 | 0.262970 | kNN k=7 |
| 12 | EI | 0.231649 | 1.000000 | 0.376161 | 243.0 | 0.000000 | Bernoulli B alpha=0 |
| 13 | IE | 0.000000 | 0.000000 | 0.000000 | 247.0 | 1.000000 | Bernoulli B alpha=0 |
| 14 | N | 0.000000 | 0.000000 | 0.000000 | 559.0 | 1.000000 | Bernoulli B alpha=0 |
| 15 | EI | 0.231649 | 1.000000 | 0.376161 | 243.0 | 0.000000 | BernoulliNB alpha=1 |
| 16 | IE | 0.000000 | 0.000000 | 0.000000 | 247.0 | 1.000000 | BernoulliNB alpha=1 |
| 17 | N | 0.000000 | 0.000000 | 0.000000 | 559.0 | 1.000000 | BernoulliNB alpha=1 |
| 18 | EI | 0.958506 | 0.950617 | 0.954545 | 243.0 | 0.049383 | GaussianNB |
| 19 | IE | 0.905882 | 0.935223 | 0.920319 | 247.0 | 0.064777 | GaussianNB |
| 20 | N | 0.978300 | 0.967800 | 0.973022 | 559.0 | 0.032200 | GaussianNB |
| 21 | EI | 0.954733 | 0.954733 | 0.954733 | 243.0 | 0.045267 | ANN p=5 |
| 22 | IE | 0.877323 | 0.955466 | 0.914729 | 247.0 | 0.044534 | ANN p=5 |
| 23 | N | 0.990689 | 0.951699 | 0.970803 | 559.0 | 0.048301 | ANN p=5 |
| 24 | EI | 0.881481 | 0.979424 | 0.927875 | 243.0 | 0.020576 | ANN p=10 |
| 25 | IE | 0.906504 | 0.902834 | 0.904665 | 247.0 | 0.097166 | ANN p=10 |
| 26 | N | 0.984991 | 0.939177 | 0.961538 | 559.0 | 0.060823 | ANN p=10 |
| 27 | EI | 0.960870 | 0.909465 | 0.934461 | 243.0 | 0.090535 | ANN p=20 |
| 28 | IE | 0.925620 | 0.906883 | 0.916155 | 247.0 | 0.093117 | ANN p=20 |
| 29 | N | 0.944541 | 0.974955 | 0.959507 | 559.0 | 0.025045 | ANN p=20 |
| 30 | EI | 0.882129 | 0.954733 | 0.916996 | 243.0 | 0.045267 | SVM linear |
| 31 | IE | 0.905738 | 0.894737 | 0.900204 | 247.0 | 0.105263 | SVM linear |
| 32 | N | 0.968635 | 0.939177 | 0.953678 | 559.0 | 0.060823 | SVM linear |
| 33 | EI | 0.947791 | 0.971193 | 0.959350 | 243.0 | 0.028807 | SVM rbf |
| 34 | IE | 0.931727 | 0.939271 | 0.935484 | 247.0 | 0.060729 | SVM rbf |
| 35 | N | 0.985481 | 0.971377 | 0.978378 | 559.0 | 0.028623 | SVM rbf |
| 36 | EI | 0.893130 | 0.962963 | 0.926733 | 243.0 | 0.037037 | Árbol de Decisión depth=5 |
| 37 | IE | 0.946429 | 0.858300 | 0.900212 | 247.0 | 0.141700 | Árbol de Decisión depth=5 |
| 38 | N | 0.959147 | 0.966011 | 0.962567 | 559.0 | 0.033989 | Árbol de Decisión depth=5 |
| 39 | EI | 0.932540 | 0.967078 | 0.949495 | 243.0 | 0.032922 | Árbol con Boosting depth=5 |
| 40 | IE | 0.942857 | 0.935223 | 0.939024 | 247.0 | 0.064777 | Árbol con Boosting depth=5 |
| 41 | N | 0.983696 | 0.971377 | 0.977498 | 559.0 | 0.028623 | Árbol con Boosting depth=5 |
| 42 | EI | 0.925197 | 0.967078 | 0.945674 | 243.0 | 0.032922 | Random Forest n=50 |
| 43 | IE | 0.941909 | 0.919028 | 0.930328 | 247.0 | 0.080972 | Random Forest n=50 |
| 44 | N | 0.983755 | 0.974955 | 0.979335 | 559.0 | 0.025045 | Random Forest n=50 |
| 45 | EI | 0.928854 | 0.967078 | 0.947581 | 243.0 | 0.032922 | Random Forest n=100 |
| 46 | IE | 0.937759 | 0.914980 | 0.926230 | 247.0 | 0.085020 | Random Forest n=100 |
| 47 | N | 0.980180 | 0.973166 | 0.976661 | 559.0 | 0.026834 | Random Forest n=100 |
# Graficar el f1-score por modelo y clase
pivot_f1_score = df_resultado_modelos.pivot_table(index='modelo', columns='clase', values='f1-score', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_f1_score, annot=True, cmap='inferno', fmt=".2f")
plt.title('f1-score por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
# Graficar el error por modelo y clase
pivot_error = df_resultado_modelos.pivot_table(index='modelo', columns='clase', values='error', aggfunc='mean')
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_error, annot=True, cmap='inferno', fmt=".2f")
plt.title('Error por Modelo y Clase')
plt.ylabel('Modelo')
plt.xlabel('Clase')
plt.tight_layout()
plt.show()
Overall comment on model selection¶
In this project, where the goal is to predict splicing sites (classes EI and IE), it is crucial to choose models that not only achieve good overall f1-scores but are also precise for these classes without confusing non-splicing regions (class N).
The best-performing models for this task are SVM RBF, Random Forest (n>50), and Boosted Tree, as they offer the best balance between EI/IE classes without sacrificing specificity for class N.
ANN models also perform very well, with high and balanced f1-scores. However, GaussianNB offers nearly the same performance with lower computational cost.
Finally, kNN models, although they improve with higher k values, have lower f1-scores than the other models. The BernoulliNB model was completely discarded as it could not capture the data complexity.