Used Car Price Predictions Solution for the KaggleX Skill Assessment Challenge¶
by Salomon Marquez
25/06/2024
This notebook showcases my solution for the KaggleX Skill Assessment Challenge, a prerequisite to apply for the KaggleX Fellowship Program. In this competition, I ranked 146th out of 1,846 participants.
My approach emphasized extensive feature engineering on the training dataset, augmented with additional data from the original used-cars dataset. For the model, I implemented a simple Multi-Layer Perceptron (MLP) regressor to perform inference on the test dataset.
In [59]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved
# as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/used-car-price-prediction-dataset-for-kagglex/sample_submission.csv
/kaggle/input/used-car-price-prediction-dataset-for-kagglex/used_cars.csv
/kaggle/input/used-car-price-prediction-dataset-for-kagglex/train.csv
/kaggle/input/used-car-price-prediction-dataset-for-kagglex/test.csv
In [60]:
# Library imports
import re

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from pandas.plotting import scatter_matrix
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
import category_encoders as ce
from scipy.stats import skew
from scipy.special import boxcox1p
Loading Datasets¶
In [61]:
train_df = pd.read_csv('/kaggle/input/used-car-price-prediction-dataset-for-kagglex/train.csv')
test_df = pd.read_csv('/kaggle/input/used-car-price-prediction-dataset-for-kagglex/test.csv')
dev_df = pd.read_csv('/kaggle/input/used-car-price-prediction-dataset-for-kagglex/used_cars.csv')
Joining Train and Dev Datasets¶
In [62]:
# Function to extract the numeric value from a string such as "51,000 mi." or "$4,200"
def extract_numeric_value(text):
    # Pattern to match runs of digits
    pattern = r'\d+'
    # findall() extracts all digit runs from the text; joining them drops separators like commas
    return ''.join(re.findall(pattern, text))
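A quick sanity check of the helper on strings shaped like the raw `milage` and `price` columns (the sample values below are illustrative, not taken from the dataset):

```python
import re

def extract_numeric_value(text):
    pattern = r'\d+'
    # Concatenate all digit runs, dropping thousands separators and units
    return ''.join(re.findall(pattern, text))

print(extract_numeric_value('51,000 mi.'))  # → 51000
print(extract_numeric_value('$4,200'))      # → 4200
```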
In [63]:
# Extract digits from the milage strings and convert to integers
dev_df['milage'] = dev_df['milage'].apply(extract_numeric_value)
dev_df['milage'] = dev_df['milage'].astype(int)

# Extract digits from the price strings and convert to integers
dev_df['price'] = dev_df['price'].apply(extract_numeric_value)
dev_df['price'] = dev_df['price'].astype(int)

# Define new IDs continuing after the training set
start = len(train_df)
end = len(train_df) + len(dev_df)

# Insert the ID column at the beginning of the dev dataframe
dev_df.insert(0, 'id', range(start, end))

# Vertical concatenation and index reset
train2_df = pd.concat([train_df, dev_df])
train2_df.reset_index(drop=True, inplace=True)
Feature Engineering¶
Categorical Features¶
In [64]:
# Create brandmodel feature
train2_df["brandmodel"] = train2_df['brand'] + ' ' + train2_df['model']
train2_df[['brand','model','brandmodel']].head(3)
Out[64]:
| | brand | model | brandmodel |
|---|---|---|---|
| 0 | Ford | F-150 Lariat | Ford F-150 Lariat |
| 1 | BMW | 335 i | BMW 335 i |
| 2 | Jaguar | XF Luxury | Jaguar XF Luxury |
In [65]:
# Extract horsepower, engine displacement, number of cylinders, and number of valves
def get_engine_features(text):
    # Regular expression patterns for the engine values
    hp_pattern = r'(\d+\.\d+)HP'                         # Horsepower
    ed_pattern = r'(\d+\.\d+)L'                          # Engine displacement
    cylinders_pattern = r'I(\d+)|V(\d+)|(\d+) Cylinder'  # Number of cylinders
    valves_pattern = r'(\d+)V'                           # Number of valves

    # Search for the patterns in the string
    hp_match = re.search(hp_pattern, text)
    ed_match = re.search(ed_pattern, text)
    cylinders_match = re.search(cylinders_pattern, text)
    valves_match = re.search(valves_pattern, text)

    # Extract and convert the matched values
    if hp_match:
        hp = float(hp_match.group(1))
    else:
        hp = np.nan
    if ed_match:
        ed = float(ed_match.group(1))
    else:
        ed = np.nan
    if cylinders_match:
        cylinders = int(cylinders_match.group(cylinders_match.lastindex))
    else:
        cylinders = np.nan
    if valves_match:
        valves = int(valves_match.group(1))
    else:
        valves = np.nan

    return pd.Series({
        'hp': hp,
        'ed': ed,
        'num_cylinders': cylinders,
        'num_valves': valves
    })
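The four patterns can be checked in isolation on an illustrative engine string (this example string is hypothetical, not necessarily verbatim from the dataset):

```python
import re

text = '320.0HP 3.0L Straight 6 Cylinder Engine Gasoline Fuel'

# Horsepower: a decimal number immediately followed by "HP"
print(re.search(r'(\d+\.\d+)HP', text).group(1))   # → 320.0
# Displacement: a decimal number immediately followed by "L"
print(re.search(r'(\d+\.\d+)L', text).group(1))    # → 3.0
# Cylinders: "I6"/"V8" style or "<n> Cylinder"; lastindex tells which alternative matched
m = re.search(r'I(\d+)|V(\d+)|(\d+) Cylinder', text)
print(m.group(m.lastindex))                        # → 6
# Valves: digits followed by "V"; absent here, so search returns None
print(re.search(r'(\d+)V', text))                  # → None
```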
In [66]:
# Apply extraction function
train2_df[['hp', 'ed', 'num_cylinders', 'num_valves']] = train2_df['engine'].apply(get_engine_features)
# Plot new features
train2_df[["hp", "ed", "num_cylinders", "num_valves"]].hist(bins=50, figsize=(6,4))
plt.show()
In [67]:
# Function to simplify the transmission feature
def simple_transmission(text):
    # Patterns to identify, mapped to the simplified category
    patterns = {
        r'\b(automatic|a/t|at)\b': 'automatic',
        r'\b(manual|m/t|mt)\b': 'manual',
        r'\b(variable|cvt)\b': 'cvt',
        r'\b(with|w/|at/mt)\b': 'manumatic'
    }
    for pattern, replacement in patterns.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            return replacement
    return 'others'
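Because the function returns on the first matching pattern, dictionary order matters: a string containing both "A/T" and "w/" is labeled `automatic`. A few illustrative inputs (hypothetical examples, not dataset rows):

```python
import re

def simple_transmission(text):
    # Patterns are tried in insertion order; first hit wins
    patterns = {
        r'\b(automatic|a/t|at)\b': 'automatic',
        r'\b(manual|m/t|mt)\b': 'manual',
        r'\b(variable|cvt)\b': 'cvt',
        r'\b(with|w/|at/mt)\b': 'manumatic'
    }
    for pattern, replacement in patterns.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            return replacement
    return 'others'

for raw in ['8-Speed A/T', '6-Speed M/T', 'CVT Transmission', 'F1 Gearbox']:
    print(raw, '->', simple_transmission(raw))
# → automatic, manual, cvt, others
```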
In [68]:
# Apply function
train2_df['transmission'] = train2_df['transmission'].apply(simple_transmission)
Numerical Features¶
In [69]:
train2_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58282 entries, 0 to 58281
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   id             58282 non-null  int64
 1   brand          58282 non-null  object
 2   model          58282 non-null  object
 3   model_year     58282 non-null  int64
 4   milage         58282 non-null  int64
 5   fuel_type      58112 non-null  object
 6   engine         58282 non-null  object
 7   transmission   58282 non-null  object
 8   ext_col        58282 non-null  object
 9   int_col        58282 non-null  object
 10  accident       58169 non-null  object
 11  clean_title    57686 non-null  object
 12  price          58282 non-null  int64
 13  brandmodel     58282 non-null  object
 14  hp             53415 non-null  float64
 15  ed             57280 non-null  float64
 16  num_cylinders  57081 non-null  float64
 17  num_valves     4056 non-null   float64
dtypes: float64(4), int64(4), object(10)
memory usage: 8.0+ MB
In [70]:
# Plot numerical features
train2_df[["model_year", "milage", "hp", "ed", "num_cylinders", "num_valves"]].hist(bins=50, figsize=(8,8))
plt.show()
In [71]:
# Plot scatter matrix
scatter_matrix(train2_df[["model_year", "milage", "hp", "ed", "num_cylinders", "num_valves"]], figsize=(8,8))
plt.show()
In [72]:
# Check the skew of all numerical features
skewed_feats = train2_df[["model_year", "milage", "hp", "ed", "num_cylinders", "num_valves"]].apply(lambda x: skew(x.dropna()))
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew': skewed_feats})
skewness.head(10)
Skew in numerical features:
Out[72]:
| | Skew |
|---|---|
| model_year | -0.951024 |
| milage | 0.874905 |
| hp | 0.661737 |
| ed | 0.525690 |
| num_cylinders | 0.195700 |
| num_valves | 36.235544 |
In [73]:
# Reduce skewness
train2_df['milage'] = train2_df['milage'].apply(np.sqrt)
lambda_ = 0.15
train2_df['hp'] = boxcox1p(train2_df['hp'], lambda_)
train2_df['ed'] = boxcox1p(train2_df['ed'], lambda_)
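For reference, `boxcox1p(x, lmbda)` applies the Box-Cox transform to `1 + x`, i.e. `((1 + x)**lmbda - 1) / lmbda` for `lmbda != 0` (and `log1p(x)` at `lmbda = 0`), which compresses the right tail of positive, right-skewed features. A quick check of that identity:

```python
import numpy as np
from scipy.special import boxcox1p

x = 100.0
lam = 0.15
# Manual Box-Cox of (1 + x) with lambda = 0.15
manual = ((1.0 + x) ** lam - 1.0) / lam
print(np.isclose(boxcox1p(x, lam), manual))  # → True
```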
In [74]:
# percentile
#model_year_quantile = train2_df["model_year"].quantile([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
#model_year_quantile
In [75]:
train2_df['model_year'] = pd.cut(train2_df['model_year'],
                                 bins=[1974, 2007, 2011, 2013, 2015, 2016, 2018, 2019, 2020, 2021, 2024],
                                 labels=[2007, 2011, 2013, 2015, 2016, 2018, 2019, 2020, 2021, 2024])
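Each year is mapped to the upper edge of its bin (`pd.cut` bins are open on the left, closed on the right, so a value equal to the leftmost edge, 1974, would become NaN). A small check with toy years:

```python
import pandas as pd

bins = [1974, 2007, 2011, 2013, 2015, 2016, 2018, 2019, 2020, 2021, 2024]
labels = [2007, 2011, 2013, 2015, 2016, 2018, 2019, 2020, 2021, 2024]

# 1990 falls in (1974, 2007], 2017 in (2016, 2018], 2024 in (2021, 2024]
out = pd.cut(pd.Series([1990, 2017, 2024]), bins=bins, labels=labels)
print(list(out))  # → [2007, 2018, 2024]
```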
In [76]:
# percentile
#model_year_quantile = train2_df["model_year"].quantile([0.0, 0.25, 0.5, 0.75, 1.0])
#train2_df['model_year'] = pd.cut(train2_df['model_year'],
#                                 bins=[1974, 2012, 2016, 2019, 2024],
#                                 labels=[2012, 2016, 2019, 2024])
#train2_df["model_year_skew"].hist(bins=10, figsize=(6,2))
#plt.show()
In [77]:
# Check the skew of all numerical features after the transformations
skewed_box = train2_df[["model_year", "milage", "hp", "ed", "num_cylinders", "num_valves"]].apply(lambda x: skew(x.dropna()))
print("\nSkew in numerical features: \n")
skewness_box = pd.DataFrame({'Skew': skewed_box})
skewness_box.head(10)
Skew in numerical features:
Out[77]:
| | Skew |
|---|---|
| model_year | -0.292276 |
| milage | -0.080750 |
| hp | -0.140137 |
| ed | 0.094202 |
| num_cylinders | 0.195700 |
| num_valves | 36.235544 |
In [78]:
# Plot numerical features
train2_df[["model_year", "milage", "hp", "ed", "num_cylinders", "num_valves"]].hist(bins=50, figsize=(8,8))
plt.show()
In [79]:
train2_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58282 entries, 0 to 58281
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   id             58282 non-null  int64
 1   brand          58282 non-null  object
 2   model          58282 non-null  object
 3   model_year     58276 non-null  category
 4   milage         58282 non-null  float64
 5   fuel_type      58112 non-null  object
 6   engine         58282 non-null  object
 7   transmission   58282 non-null  object
 8   ext_col        58282 non-null  object
 9   int_col        58282 non-null  object
 10  accident       58169 non-null  object
 11  clean_title    57686 non-null  object
 12  price          58282 non-null  int64
 13  brandmodel     58282 non-null  object
 14  hp             53415 non-null  float64
 15  ed             57280 non-null  float64
 16  num_cylinders  57081 non-null  float64
 17  num_valves     4056 non-null   float64
dtypes: category(1), float64(5), int64(2), object(10)
memory usage: 7.6+ MB
Principal Component Analysis¶
In [80]:
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_regression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
In [81]:
def apply_pca(X, standardize=True):
    # Standardize
    if standardize:
        X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Create principal components
    pca = PCA()
    X_pca = pca.fit_transform(X)
    # Convert to dataframe
    component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
    X_pca = pd.DataFrame(X_pca, columns=component_names)
    # Create loadings
    loadings = pd.DataFrame(
        pca.components_.T,
        columns=component_names,
        index=X.columns
    )
    return pca, X_pca, loadings

def plot_variance(pca, width=8, dpi=100):
    # Create figure
    fig, axs = plt.subplots(1, 2)
    n = pca.n_components_
    grid = np.arange(1, n + 1)
    # Explained variance
    evr = pca.explained_variance_ratio_
    axs[0].bar(grid, evr)
    axs[0].set(
        xlabel="Component", title="% Explained Variance", ylim=(0.0, 1.0)
    )
    # Cumulative variance
    cv = np.cumsum(evr)
    axs[1].plot(np.r_[0, grid], np.r_[0, cv], "o-")
    axs[1].set(
        xlabel="Component", title="% Cumulative Variance", ylim=(0.0, 1.0)
    )
    # Set up figure
    fig.set(figwidth=width, dpi=dpi)
    return axs

def make_mi_scores(X, y):
    X = X.copy()
    for colname in X.select_dtypes(["object", "category"]):
        X[colname], _ = X[colname].factorize()
    # All discrete features should now have integer dtypes
    discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

def score_dataset(X, y, model=XGBRegressor()):
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    # Score with RMSLE (Root Mean Squared Log Error)
    score = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score
In [82]:
features = [
    "milage",
    "model_year",
    "hp",
    "ed",
    "num_cylinders"
]

print("Correlation with Price:\n")
print(train2_df[features].corrwith(train2_df.price))
Correlation with Price:

milage          -0.285471
model_year       0.230884
hp               0.256449
ed               0.111935
num_cylinders    0.143659
dtype: float64
In [83]:
X = train2_df.copy()
y = X.pop("price")
X = X.loc[:, features]
X['model_year'] = X['model_year'].astype('float64')
In [84]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58282 entries, 0 to 58281
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   milage         58282 non-null  float64
 1   model_year     58276 non-null  float64
 2   hp             53415 non-null  float64
 3   ed             57280 non-null  float64
 4   num_cylinders  57081 non-null  float64
dtypes: float64(5)
memory usage: 2.2 MB
In [85]:
X.head(3)
Out[85]:
| | milage | model_year | hp | ed | num_cylinders |
|---|---|---|---|---|---|
| 0 | 272.670130 | 2018.0 | 9.558416 | 1.687259 | 6.0 |
| 1 | 282.842712 | 2007.0 | 9.025890 | 1.540963 | 6.0 |
| 2 | 302.474792 | 2011.0 | 9.025890 | 1.870411 | 8.0 |
In [86]:
# Create a pipeline for median imputation and robust scaling
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), RobustScaler())
preprocessing = ColumnTransformer([("num", num_pipeline, features)])
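`RobustScaler` centers each column on its median and scales by the interquartile range, so remaining outliers (exotic cars, extreme mileage) pull the scaling far less than they would with `StandardScaler`. A toy sketch of the impute-then-scale step on made-up values (not dataset rows):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# One column with a missing value and an outlier
X_toy = np.array([[1.0], [2.0], [np.nan], [3.0], [100.0]])

pipe = make_pipeline(SimpleImputer(strategy="median"), RobustScaler())
out = pipe.fit_transform(X_toy)

# NaN is imputed with the column median (2.5), which then maps to 0;
# the outlier only shifts the result linearly, not the center or scale
print(out.ravel())
```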
In [87]:
# Fit and transform the data
X_scaled = preprocessing.fit_transform(X)
# Create a new dataframe for the scaled data
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
In [88]:
X_scaled_df.head(3)
Out[88]:
| | milage | model_year | hp | ed | num_cylinders |
|---|---|---|---|---|---|
| 0 | 0.124850 | 0.333333 | 0.487727 | 0.000000 | 0.0 |
| 1 | 0.196527 | -1.500000 | -0.082593 | -0.344326 | 0.0 |
| 2 | 0.334855 | -0.833333 | -0.082593 | 0.431070 | 1.0 |
In [89]:
# Apply PCA
pca, X_pca, loadings = apply_pca(X_scaled_df)
print(loadings)
                    PC1       PC2       PC3       PC4       PC5
milage        -0.263490  0.614072 -0.742920  0.004903  0.039187
model_year     0.223742 -0.644155 -0.610470 -0.395939  0.074582
hp             0.551986 -0.065634 -0.251091  0.783558 -0.118297
ed             0.528445  0.327686  0.046116 -0.428746 -0.653772
num_cylinders  0.544582  0.310313  0.101114 -0.213125  0.742624
In [90]:
# Explained variance
plot_variance(pca);
In [91]:
sns.catplot(
    y="value",
    col="variable",
    data=X_pca.melt(),
    kind='boxen',
    sharey=False,
    col_wrap=3
);
/opt/conda/lib/python3.10/site-packages/seaborn/categorical.py:1794: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
In [92]:
train2_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58282 entries, 0 to 58281
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   id             58282 non-null  int64
 1   brand          58282 non-null  object
 2   model          58282 non-null  object
 3   model_year     58276 non-null  category
 4   milage         58282 non-null  float64
 5   fuel_type      58112 non-null  object
 6   engine         58282 non-null  object
 7   transmission   58282 non-null  object
 8   ext_col        58282 non-null  object
 9   int_col        58282 non-null  object
 10  accident       58169 non-null  object
 11  clean_title    57686 non-null  object
 12  price          58282 non-null  int64
 13  brandmodel     58282 non-null  object
 14  hp             53415 non-null  float64
 15  ed             57280 non-null  float64
 16  num_cylinders  57081 non-null  float64
 17  num_valves     4056 non-null   float64
dtypes: category(1), float64(5), int64(2), object(10)
memory usage: 7.6+ MB
In [93]:
# Choose a component: PC1 to PC5
component = "PC5"
idx = X_pca[component].sort_values(ascending=False).index
train2_df.loc[idx, ["price","brandmodel","fuel_type","transmission","ext_col","int_col","accident","clean_title"] + features]
Out[93]:
| | price | brandmodel | fuel_type | transmission | ext_col | int_col | accident | clean_title | milage | model_year | hp | ed | num_cylinders |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56486 | 289991 | Bentley Continental GT GT Speed | Gasoline | automatic | Granite | Porpoise | None reported | NaN | 43.806392 | 2024 | NaN | NaN | 12.0 |
| 19906 | 16500 | BMW i3 94 Ah | Gasoline | automatic | Gray | Black | None reported | Yes | 288.097206 | 2016 | 7.749778 | 0.520063 | NaN |
| 57741 | 16500 | BMW i3 94 Ah | NaN | automatic | Black | – | None reported | Yes | 207.364414 | 2018 | 7.749778 | 0.520063 | NaN |
| 57363 | 26990 | BMW i3 120Ah w/Range Extender | NaN | automatic | Black | Black | None reported | Yes | 130.288142 | 2019 | 7.749778 | 0.520063 | NaN |
| 11599 | 11499 | BMW i3 Base w/Range Extender | Gasoline | automatic | White | Black | None reported | Yes | 272.946881 | 2015 | 7.749778 | 0.520063 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 42944 | 6699 | RAM 2500 Big Horn | Diesel | automatic | White | Gray | None reported | Yes | 311.448230 | 2007 | 9.391827 | 2.388206 | 6.0 |
| 48383 | 25500 | Dodge Ram 2500 Laramie Quad Cab | Diesel | automatic | Blue | Beige | None reported | Yes | 193.963914 | 2007 | 9.391827 | 2.388206 | 6.0 |
| 52980 | 69995 | RAM 3500 Tradesman | Diesel | automatic | Blue | Gray | None reported | Yes | 54.680892 | 2007 | 9.715842 | 2.388206 | 6.0 |
| 48311 | 1950995 | Bugatti Veyron 16.4 Grand Sport | Gasoline | automatic | White | Black | None reported | Yes | 79.561297 | 2011 | NaN | 2.602594 | NaN |
| 54502 | 1950995 | Bugatti Veyron 16.4 Grand Sport | Gasoline | automatic | White | White | None reported | Yes | 79.561297 | 2011 | NaN | 2.602594 | NaN |
58282 rows × 13 columns
Dropping undesired features¶
In [94]:
train2_df.info()
(same output as the previous train2_df.info() call above)
In [95]:
# Keep the target in its own dataframe
car_labels = train2_df['price']
car_labels = car_labels.to_frame()
# Drop undesired features
train2_df.drop(['price','id','brand','model','engine','num_valves'], axis=1, inplace=True)
#train2_df.drop(['price','id','brand','model','engine','hp','ed','num_valves','num_cylinders'], axis=1, inplace=True)
In [96]:
train2_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58282 entries, 0 to 58281
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   model_year     58276 non-null  category
 1   milage         58282 non-null  float64
 2   fuel_type      58112 non-null  object
 3   transmission   58282 non-null  object
 4   ext_col        58282 non-null  object
 5   int_col        58282 non-null  object
 6   accident       58169 non-null  object
 7   clean_title    57686 non-null  object
 8   brandmodel     58282 non-null  object
 9   hp             53415 non-null  float64
 10  ed             57280 non-null  float64
 11  num_cylinders  57081 non-null  float64
dtypes: category(1), float64(4), object(7)
memory usage: 4.9+ MB
In [97]:
# Treat the '–' placeholder as a missing value so it gets imputed later
train2_df = train2_df.replace('–', np.nan)
In [98]:
# Run this cell to join the additional features from the PCA analysis
#train2_df = train2_df.join(X_pca)
In [99]:
train2_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58282 entries, 0 to 58281
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   model_year     58276 non-null  category
 1   milage         58282 non-null  float64
 2   fuel_type      57773 non-null  object
 3   transmission   58282 non-null  object
 4   ext_col        58226 non-null  object
 5   int_col        57104 non-null  object
 6   accident       58169 non-null  object
 7   clean_title    57686 non-null  object
 8   brandmodel     58282 non-null  object
 9   hp             53415 non-null  float64
 10  ed             57280 non-null  float64
 11  num_cylinders  57081 non-null  float64
dtypes: category(1), float64(4), object(7)
memory usage: 4.9+ MB
Building a Pipeline¶
In [103]:
# Preprocessing pipeline: median imputation + robust scaling for numerical features,
# most-frequent imputation + one-hot encoding for categorical features
#num_features = ['model_year', 'milage','hp','ed','num_cylinders','PC1','PC2','PC3','PC4']
num_features = ['model_year', 'milage','hp','ed','num_cylinders']
cat_features = ['brandmodel','fuel_type','transmission','ext_col', 'int_col','accident','clean_title']

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), RobustScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder(handle_unknown='ignore'))

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_features),
    ("cat", cat_pipeline, cat_features)])
Model Implementation¶
In [104]:
# MLP model; alternative configurations tried are noted in the comments
mlp_reg = MLPRegressor(
    hidden_layer_sizes=[8, 32, 8],   # also tried: [8, 32, 64, 32, 8], [8, 64, 8]
    activation='relu',
    solver='adam',
    alpha=0.001,
    batch_size=64,                   # also tried: 32, 128
    learning_rate='constant',
    max_iter=400,
    shuffle=True,
    early_stopping=True,
    validation_fraction=0.1,
    random_state=42)

pipeline = make_pipeline(preprocessing, mlp_reg)
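With `early_stopping=True`, scikit-learn holds out `validation_fraction` of the training rows and stops when the validation score stops improving, so training often finishes well before `max_iter` epochs. A quick sanity check of the same configuration on synthetic data (not the competition data):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Same style of configuration as above, on a small synthetic problem
toy_mlp = MLPRegressor(hidden_layer_sizes=[8, 32, 8], activation='relu',
                       solver='adam', batch_size=64, max_iter=400,
                       early_stopping=True, validation_fraction=0.1,
                       random_state=42)
toy_mlp.fit(X, y)

# n_iter_ records how many epochs actually ran (<= max_iter)
print(toy_mlp.n_iter_, toy_mlp.predict(X[:5]).shape)
```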
In [105]:
# ravel() passes y as a 1-D array and avoids sklearn's DataConversionWarning
pipeline.fit(train2_df, car_labels.values.ravel())
Out[105]:
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('num',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='median')),
('robustscaler',
RobustScaler())]),
['model_year', 'milage', 'hp',
'ed', 'num_cylinders']),
('cat',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='most_frequent')),
('onehotencoder',
OneHotEncoder(handle_unknown='ignore'))]),
['brandmodel', 'fuel_type',
'transmission', 'ext_col',
'int_col', 'accident',
'clean_title'])])),
('mlpregressor',
MLPRegressor(alpha=0.001, batch_size=64, early_stopping=True,
hidden_layer_sizes=[8, 32, 8], max_iter=400,
random_state=42))])
Score a Baseline¶
In [106]:
# Define the K-fold cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation (ravel() avoids sklearn's column-vector warning;
# the leading minus sign flips the negated scores back to positive RMSE)
baseline_scores = -cross_val_score(pipeline, train2_df, car_labels.values.ravel(),
                                   scoring='neg_root_mean_squared_error', cv=kf)
print("Cross-Validation RMSE scores:", baseline_scores)
print("Average RMSE:", np.mean(baseline_scores))
Cross-Validation RMSE scores: [60193.71277415 78966.61767177 77424.15910457 48134.62743974 72502.42776201]
Average RMSE: 67444.30895044885
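The minus sign in front of `cross_val_score` is needed because scikit-learn scorers follow a "higher is better" convention: `neg_root_mean_squared_error` returns values ≤ 0, and negating recovers the usual positive RMSE. A small check with a dummy model on made-up data:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = rng.normal(loc=10.0, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
raw = cross_val_score(DummyRegressor(strategy="mean"), X, y,
                      scoring='neg_root_mean_squared_error', cv=kf)

# raw scores are <= 0; negating gives the usual positive RMSE
print(raw.max() <= 0, (-raw).min() >= 0)
```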
In [107]:
pd.Series(baseline_scores).describe()
Out[107]:
count        5.000000
mean     67444.308950
std      13070.773870
min      48134.627440
25%      60193.712774
50%      72502.427762
75%      77424.159105
max      78966.617672
dtype: float64
In [108]:
# Note: without parentheses this returns the bound method itself, not the names;
# calling preprocessing.get_feature_names_out() after fitting lists the transformed columns
pipeline.get_feature_names_out
Out[108]:
<bound method Pipeline.get_feature_names_out of Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('num',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='median')),
('robustscaler',
RobustScaler())]),
['model_year', 'milage', 'hp',
'ed', 'num_cylinders']),
('cat',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='most_frequent')),
('onehotencoder',
OneHotEncoder(handle_unknown='ignore'))]),
['brandmodel', 'fuel_type',
'transmission', 'ext_col',
'int_col', 'accident',
'clean_title'])])),
('mlpregressor',
MLPRegressor(alpha=0.001, batch_size=64, early_stopping=True,
hidden_layer_sizes=[8, 32, 8], max_iter=400,
random_state=42))])>
Evaluate Model on RMSE¶
In [109]:
# Predict training-set prices (RMSE in previous runs: 67277.4197309128 | 67308.44019387408 | 67231.14714140317)
car_price_prediction = pipeline.predict(train2_df)

# Model evaluation: RMSE on the training set
lm_rmse = mean_squared_error(car_labels, car_price_prediction, squared=False)
lm_rmse
Out[109]:
67454.88485800906
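The `squared=False` argument (deprecated in newer scikit-learn in favor of `root_mean_squared_error`) simply takes the square root of the mean squared error. A quick equivalence check on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# RMSE two ways: sqrt of sklearn's MSE vs. the formula written out
rmse_via_mse = np.sqrt(mean_squared_error(y_true, y_pred))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse_via_mse, rmse_manual)
```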
Test Dataset¶
Feature Engineering of Categorical Features¶
In [121]:
test_df = pd.read_csv('/kaggle/input/used-car-price-prediction-dataset-for-kagglex/test.csv')
In [122]:
# Create the combined brandmodel feature
test_df["brandmodel"] = test_df['brand'] + ' ' + test_df['model']

# Extract additional features from the engine description
test_df[['hp', 'ed', 'num_cylinders', 'num_valves']] = test_df['engine'].apply(get_engine_features)

# Simplify the transmission feature
test_df['transmission'] = test_df['transmission'].apply(simple_transmission)
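`get_engine_features` and `simple_transmission` are defined earlier in the notebook. As a rough illustration of the kind of parsing involved, here is a hypothetical regex-based extractor for engine strings of the form `"172.0HP 1.6L 4 Cylinder ..."`; the notebook's actual function may differ:

```python
import re
import numpy as np
import pandas as pd

def parse_engine(text):
    """Hypothetical sketch: pull horsepower, displacement (L), and cylinder
    count out of a free-text engine description; NaN when a field is absent."""
    hp = re.search(r'([\d.]+)\s*HP', text)
    disp = re.search(r'([\d.]+)\s*L', text)
    cyl = re.search(r'(\d+)\s*Cylinder', text)
    return pd.Series([float(hp.group(1)) if hp else np.nan,
                      float(disp.group(1)) if disp else np.nan,
                      float(cyl.group(1)) if cyl else np.nan])

parsed = parse_engine("172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel")
print(parsed.tolist())
```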
Feature Engineering of Numerical Features¶
In [123]:
# Replace milage with its square root
test_df['milage'] = np.sqrt(test_df['milage'])

# Box-Cox transform (with +1 shift) for hp and engine displacement
lambda_ = 0.15
test_df['hp'] = boxcox1p(test_df['hp'], lambda_)
test_df['ed'] = boxcox1p(test_df['ed'], lambda_)

# Bin model_year into the same categories used for the training set
test_df['model_year'] = pd.cut(test_df['model_year'],
                               bins=[1974, 2007, 2011, 2013, 2015, 2016, 2018, 2019, 2020, 2021, 2024],
                               labels=[2007, 2011, 2013, 2015, 2016, 2018, 2019, 2020, 2021, 2024])
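`boxcox1p(x, λ)` applies the Box-Cox transform to `x + 1`, i.e. `((1 + x)^λ − 1) / λ` for λ ≠ 0, which compresses the right tail of skewed positive features such as horsepower. A quick check against the closed form:

```python
import numpy as np
from scipy.special import boxcox1p

lambda_ = 0.15
x = np.array([0.0, 100.0, 500.0])

# Closed-form Box-Cox on (1 + x), matching scipy's boxcox1p
manual = ((1.0 + x) ** lambda_ - 1.0) / lambda_
print(boxcox1p(x, lambda_), manual)
```

Note that `boxcox1p(0, λ) = 0`, so the +1 shift makes the transform safe for zero-valued entries.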
In [ ]:
# Alternative: bin model_year by quantiles
#test_df['model_year'] = pd.cut(test_df['model_year'],
#                               bins=[1974, 2012, 2016, 2019, 2024],
#                               labels=[2012, 2016, 2019, 2024])
In [124]:
# Plot the binned model_year distribution
test_df["model_year"].hist(bins=10, figsize=(6,2))
plt.show()
In [125]:
# Plot the numerical features
test_df[["model_year", "milage","hp", "ed", "num_cylinders", "num_valves"]].hist(bins=50, figsize=(8,8))
plt.show()
In [126]:
# Keep the id column for the submission file, then drop undesired features
index_test = test_df['id']
index_test = index_test.to_frame()
test_df.drop(['id','engine','brand','model','num_valves'], axis=1, inplace=True)
#test_df.drop(['id','brand','model','engine','hp','ed','num_valves','num_cylinders'], axis=1, inplace=True)
In [127]:
# Replace the '–' placeholder with NaN so the imputers can treat it as missing
test_df = test_df.replace('–', np.nan)
In [128]:
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36183 entries, 0 to 36182
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   model_year     36182 non-null  category
 1   milage         36183 non-null  float64
 2   fuel_type      35986 non-null  object
 3   transmission   36183 non-null  object
 4   ext_col        36156 non-null  object
 5   int_col        35479 non-null  object
 6   accident       36183 non-null  object
 7   clean_title    36183 non-null  object
 8   brandmodel     36183 non-null  object
 9   hp             33577 non-null  float64
 10  ed             35778 non-null  float64
 11  num_cylinders  35679 non-null  float64
dtypes: category(1), float64(4), object(7)
memory usage: 3.1+ MB
In [134]:
# Make predictions on the test set
# (the np.exp variant applies when the model is trained on log-transformed prices)
#car_price_test_prediction = np.exp(pipeline.predict(test_df))
car_price_test_prediction = pipeline.predict(test_df)
car_price_test_prediction_df = pd.DataFrame(car_price_test_prediction, columns=['price'])
results = pd.concat([index_test, car_price_test_prediction_df], axis=1)
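`pd.concat(..., axis=1)` aligns on the index, not on row position. It works here because both frames carry a fresh default `RangeIndex`; had `index_test` been sliced from a shuffled frame, it would need `reset_index(drop=True)` first. A small illustration with made-up data:

```python
import pandas as pd

ids = pd.DataFrame({"id": [10, 11, 12]}, index=[5, 6, 7])  # non-default index
preds = pd.DataFrame({"price": [1.0, 2.0, 3.0]})           # RangeIndex 0..2

# Index labels don't overlap, so the misaligned concat produces 6 NaN-padded rows
misaligned = pd.concat([ids, preds], axis=1)
aligned = pd.concat([ids.reset_index(drop=True), preds], axis=1)

print(len(misaligned), len(aligned))
```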
In [135]:
results.head(3)
Out[135]:
|   | id    | price        |
|---|-------|--------------|
| 0 | 54273 | 30081.982962 |
| 1 | 54274 | 19885.575096 |
| 2 | 54275 | 27483.291429 |
In [136]:
# Save the results to a CSV file for submission
results.to_csv('submission39-REPOFsubmission22.csv', index=False)