Used Car Price Predictions Solution for the KaggleX Skill Assessment Challenge¶
by Salomon Marquez
25/06/2024
This notebook showcases my solution for the KaggleX Skill Assessment Challenge, a prerequisite to apply for the KaggleX Fellowship Program. In this competition, I ranked 146th out of 1,846 participants.
My approach emphasized extensive feature engineering on the training dataset, augmented with additional data from the original used-cars dataset. For the model, I implemented a simple Multi-Layer Perceptron (MLP) regressor to perform inference on the test dataset.
In [59]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved
# as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/used-car-price-prediction-dataset-for-kagglex/sample_submission.csv
/kaggle/input/used-car-price-prediction-dataset-for-kagglex/used_cars.csv
/kaggle/input/used-car-price-prediction-dataset-for-kagglex/train.csv
/kaggle/input/used-car-price-prediction-dataset-for-kagglex/test.csv
In [60]:
# Library imports
import re

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from pandas.plotting import scatter_matrix
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
import category_encoders as ce
from scipy.stats import skew
from scipy.special import boxcox1p
Loading Datasets¶
In [61]:
train_df = pd.read_csv('/kaggle/input/used-car-price-prediction-dataset-for-kagglex/train.csv')
test_df = pd.read_csv('/kaggle/input/used-car-price-prediction-dataset-for-kagglex/test.csv')
dev_df = pd.read_csv('/kaggle/input/used-car-price-prediction-dataset-for-kagglex/used_cars.csv')
Joining Train and Dev Datasets¶
In [62]:
# Function to extract the numeric value from a string such as "51,000 mi." or "$4,200"
def extract_numeric_value(text):
    # Pattern to match runs of digits
    pattern = r'\d+'
    # findall() extracts all digit runs from the text; joining them drops separators like commas
    return ''.join(re.findall(pattern, text))
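A quick sanity check of the helper on strings shaped like the raw `milage` and `price` columns (the sample values below are illustrative, not taken from the dataset):

```python
import re

def extract_numeric_value(text):
    pattern = r'\d+'
    # Concatenate all digit runs, dropping thousands separators and units
    return ''.join(re.findall(pattern, text))

print(extract_numeric_value('51,000 mi.'))  # → 51000
print(extract_numeric_value('$4,200'))      # → 4200
```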
In [63]:
# Extract digits from the milage strings and convert to integers
dev_df['milage'] = dev_df['milage'].apply(extract_numeric_value)
dev_df['milage'] = dev_df['milage'].astype(int)

# Extract digits from the price strings and convert to integers
dev_df['price'] = dev_df['price'].apply(extract_numeric_value)
dev_df['price'] = dev_df['price'].astype(int)

# Define new IDs continuing after the training set
start = len(train_df)
end = len(train_df) + len(dev_df)

# Insert the ID column at the beginning of the dev dataframe
dev_df.insert(0, 'id', range(start, end))

# Vertical concatenation and index reset
train2_df = pd.concat([train_df, dev_df])
train2_df.reset_index(drop=True, inplace=True)
Feature Engineering¶
Categorical Features¶
In [64]:
# Create brandmodel feature
train2_df["brandmodel"] = train2_df['brand'] + ' ' + train2_df['model']
train2_df[['brand','model','brandmodel']].head(3)
Out[64]:
| | brand | model | brandmodel |
|---|---|---|---|
| 0 | Ford | F-150 Lariat | Ford F-150 Lariat |
| 1 | BMW | 335 i | BMW 335 i |
| 2 | Jaguar | XF Luxury | Jaguar XF Luxury |
In [65]:
# Extract horsepower, engine displacement, number of cylinders, and number of valves
def get_engine_features(text):
    # Regular expression patterns for the engine values
    hp_pattern = r'(\d+\.\d+)HP'                         # Horsepower
    ed_pattern = r'(\d+\.\d+)L'                          # Engine displacement
    cylinders_pattern = r'I(\d+)|V(\d+)|(\d+) Cylinder'  # Number of cylinders
    valves_pattern = r'(\d+)V'                           # Number of valves

    # Search for the patterns in the string
    hp_match = re.search(hp_pattern, text)
    ed_match = re.search(ed_pattern, text)
    cylinders_match = re.search(cylinders_pattern, text)
    valves_match = re.search(valves_pattern, text)

    # Extract and convert the matched values
    if hp_match:
        hp = float(hp_match.group(1))
    else:
        hp = np.nan
    if ed_match:
        ed = float(ed_match.group(1))
    else:
        ed = np.nan
    if cylinders_match:
        cylinders = int(cylinders_match.group(cylinders_match.lastindex))
    else:
        cylinders = np.nan
    if valves_match:
        valves = int(valves_match.group(1))
    else:
        valves = np.nan

    return pd.Series({
        'hp': hp,
        'ed': ed,
        'num_cylinders': cylinders,
        'num_valves': valves
    })
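The four patterns can be checked in isolation on an illustrative engine string (this example string is hypothetical, not necessarily verbatim from the dataset):

```python
import re

text = '320.0HP 3.0L Straight 6 Cylinder Engine Gasoline Fuel'

# Horsepower: a decimal number immediately followed by "HP"
print(re.search(r'(\d+\.\d+)HP', text).group(1))   # → 320.0
# Displacement: a decimal number immediately followed by "L"
print(re.search(r'(\d+\.\d+)L', text).group(1))    # → 3.0
# Cylinders: "I6"/"V8" style or "<n> Cylinder"; lastindex tells which alternative matched
m = re.search(r'I(\d+)|V(\d+)|(\d+) Cylinder', text)
print(m.group(m.lastindex))                        # → 6
# Valves: digits followed by "V"; absent here, so search returns None
print(re.search(r'(\d+)V', text))                  # → None
```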
In [66]:
# Apply extraction function
train2_df[['hp', 'ed', 'num_cylinders', 'num_valves']] = train2_df['engine'].apply(get_engine_features)
# Plot new features
train2_df[["hp", "ed", "num_cylinders", "num_valves"]].hist(bins=50, figsize=(6,4))
plt.show()
In [67]:
# Function to simplify the transmission feature
def simple_transmission(text):
    # Patterns to identify, mapped to the simplified category
    patterns = {
        r'\b(automatic|a/t|at)\b': 'automatic',
        r'\b(manual|m/t|mt)\b': 'manual',
        r'\b(variable|cvt)\b': 'cvt',
        r'\b(with|w/|at/mt)\b': 'manumatic'
    }
    for pattern, replacement in patterns.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            return replacement
    return 'others'
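Because the function returns on the first matching pattern, dictionary order matters: a string containing both "A/T" and "w/" is labeled `automatic`. A few illustrative inputs (hypothetical examples, not dataset rows):

```python
import re

def simple_transmission(text):
    # Patterns are tried in insertion order; first hit wins
    patterns = {
        r'\b(automatic|a/t|at)\b': 'automatic',
        r'\b(manual|m/t|mt)\b': 'manual',
        r'\b(variable|cvt)\b': 'cvt',
        r'\b(with|w/|at/mt)\b': 'manumatic'
    }
    for pattern, replacement in patterns.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            return replacement
    return 'others'

for raw in ['8-Speed A/T', '6-Speed M/T', 'CVT Transmission', 'F1 Gearbox']:
    print(raw, '->', simple_transmission(raw))
# → automatic, manual, cvt, others
```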
In [68]:
# Apply function
train2_df['transmission'] = train2_df['transmission'].apply(simple_transmission)
Numerical Features¶
In [69]:
train2_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58282 entries, 0 to 58281
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   id             58282 non-null  int64
 1   brand          58282 non-null  object
 2   model          58282 non-null  object
 3   model_year     58282 non-null  int64
 4   milage         58282 non-null  int64
 5   fuel_type      58112 non-null  object
 6   engine         58282 non-null  object
 7   transmission   58282 non-null  object
 8   ext_col        58282 non-null  object
 9   int_col        58282 non-null  object
 10  accident       58169 non-null  object
 11  clean_title    57686 non-null  object
 12  price          58282 non-null  int64
 13  brandmodel     58282 non-null  object
 14  hp             53415 non-null  float64
 15  ed             57280 non-null  float64
 16  num_cylinders  57081 non-null  float64
 17  num_valves     4056 non-null   float64
dtypes: float64(4), int64(4), object(10)
memory usage: 8.0+ MB
In [70]:
# Plot numerical features
train2_df[["model_year", "milage", "hp", "ed", "num_cylinders", "num_valves"]].hist(bins=50, figsize=(8,8))
plt.show()
In [71]:
# Plot scatter matrix
scatter_matrix(train2_df[["model_year", "milage", "hp", "ed", "num_cylinders", "num_valves"]], figsize=(8,8))
plt.show()
In [72]:
# Check the skew of all numerical features
skewed_feats = train2_df[["model_year", "milage", "hp", "ed", "num_cylinders", "num_valves"]].apply(lambda x: skew(x.dropna()))
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew': skewed_feats})
skewness.head(10)
Skew in numerical features:
Out[72]:
| | Skew |
|---|---|
| model_year | -0.951024 |
| milage | 0.874905 |
| hp | 0.661737 |
| ed | 0.525690 |
| num_cylinders | 0.195700 |
| num_valves | 36.235544 |
In [73]:
# Reduce skewness
train2_df['milage'] = train2_df['milage'].apply(np.sqrt)
lambda_ = 0.15
train2_df['hp'] = boxcox1p(train2_df['hp'], lambda_)
train2_df['ed'] = boxcox1p(train2_df['ed'], lambda_)
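For reference, `boxcox1p(x, lmbda)` applies the Box-Cox transform to `1 + x`, i.e. `((1 + x)**lmbda - 1) / lmbda` for `lmbda != 0` (and `log1p(x)` at `lmbda = 0`), which compresses the right tail of positive, right-skewed features. A quick check of that identity:

```python
import numpy as np
from scipy.special import boxcox1p

x = 100.0
lam = 0.15
# Manual Box-Cox of (1 + x) with lambda = 0.15
manual = ((1.0 + x) ** lam - 1.0) / lam
print(np.isclose(boxcox1p(x, lam), manual))  # → True
```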
In [74]:
# percentile
#model_year_quantile = train2_df["model_year"].quantile([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
#model_year_quantile
In [75]:
train2_df['model_year'] = pd.cut(train2_df['model_year'],
                                 bins=[1974, 2007, 2011, 2013, 2015, 2016, 2018, 2019, 2020, 2021, 2024],
                                 labels=[2007, 2011, 2013, 2015, 2016, 2018, 2019, 2020, 2021, 2024])
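Each year is mapped to the upper edge of its bin (`pd.cut` bins are open on the left, closed on the right, so a value equal to the leftmost edge, 1974, would become NaN). A small check with toy years:

```python
import pandas as pd

bins = [1974, 2007, 2011, 2013, 2015, 2016, 2018, 2019, 2020, 2021, 2024]
labels = [2007, 2011, 2013, 2015, 2016, 2018, 2019, 2020, 2021, 2024]

# 1990 falls in (1974, 2007], 2017 in (2016, 2018], 2024 in (2021, 2024]
out = pd.cut(pd.Series([1990, 2017, 2024]), bins=bins, labels=labels)
print(list(out))  # → [2007, 2018, 2024]
```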
In [76]:
# percentile
#model_year_quantile = train2_df["model_year"].quantile([0.0, 0.25, 0.5, 0.75, 1.0])
#train2_df['model_year'] = pd.cut(train2_df['model_year'],
#                                 bins=[1974, 2012, 2016, 2019, 2024],
#                                 labels=[2012, 2016, 2019, 2024])
#train2_df["model_year_skew"].hist(bins=10, figsize=(6,2))
#plt.show()
In [77]:
# Check the skew of all numerical features after the transformations
skewed_box = train2_df[["model_year", "milage", "hp", "ed", "num_cylinders", "num_valves"]].apply(lambda x: skew(x.dropna()))
print("\nSkew in numerical features: \n")
skewness_box = pd.DataFrame({'Skew': skewed_box})
skewness_box.head(10)
Skew in numerical features:
Out[77]:
| | Skew |
|---|---|
| model_year | -0.292276 |
| milage | -0.080750 |
| hp | -0.140137 |
| ed | 0.094202 |
| num_cylinders | 0.195700 |
| num_valves | 36.235544 |
In [78]:
# Plot numerical features
train2_df[["model_year", "milage", "hp", "ed", "num_cylinders", "num_valves"]].hist(bins=50, figsize=(8,8))
plt.show()
In [79]:
train2_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58282 entries, 0 to 58281
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   id             58282 non-null  int64
 1   brand          58282 non-null  object
 2   model          58282 non-null  object
 3   model_year     58276 non-null  category
 4   milage         58282 non-null  float64
 5   fuel_type      58112 non-null  object
 6   engine         58282 non-null  object
 7   transmission   58282 non-null  object
 8   ext_col        58282 non-null  object
 9   int_col        58282 non-null  object
 10  accident       58169 non-null  object
 11  clean_title    57686 non-null  object
 12  price          58282 non-null  int64
 13  brandmodel     58282 non-null  object
 14  hp             53415 non-null  float64
 15  ed             57280 non-null  float64
 16  num_cylinders  57081 non-null  float64
 17  num_valves     4056 non-null   float64
dtypes: category(1), float64(5), int64(2), object(10)
memory usage: 7.6+ MB
Principal Component Analysis¶
In [80]:
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_regression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
In [81]:
def apply_pca(X, standardize=True):
    # Standardize
    if standardize:
        X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Create principal components
    pca = PCA()
    X_pca = pca.fit_transform(X)
    # Convert to dataframe
    component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
    X_pca = pd.DataFrame(X_pca, columns=component_names)
    # Create loadings
    loadings = pd.DataFrame(
        pca.components_.T,
        columns=component_names,
        index=X.columns
    )
    return pca, X_pca, loadings

def plot_variance(pca, width=8, dpi=100):
    # Create figure
    fig, axs = plt.subplots(1, 2)
    n = pca.n_components_
    grid = np.arange(1, n + 1)
    # Explained variance
    evr = pca.explained_variance_ratio_
    axs[0].bar(grid, evr)
    axs[0].set(
        xlabel="Component", title="% Explained Variance", ylim=(0.0, 1.0)
    )
    # Cumulative variance
    cv = np.cumsum(evr)
    axs[1].plot(np.r_[0, grid], np.r_[0, cv], "o-")
    axs[1].set(
        xlabel="Component", title="% Cumulative Variance", ylim=(0.0, 1.0)
    )
    # Set up figure
    fig.set(figwidth=width, dpi=dpi)
    return axs

def make_mi_scores(X, y):
    X = X.copy()
    for colname in X.select_dtypes(["object", "category"]):
        X[colname], _ = X[colname].factorize()
    # All discrete features should now have integer dtypes
    discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

def score_dataset(X, y, model=XGBRegressor()):
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    # Score with RMSLE (Root Mean Squared Log Error)
    score = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score
In [82]:
features = [
    "milage",
    "model_year",
    "hp",
    "ed",
    "num_cylinders"
]

print("Correlation with Price:\n")
print(train2_df[features].corrwith(train2_df.price))
Correlation with Price:

milage          -0.285471
model_year       0.230884
hp               0.256449
ed               0.111935
num_cylinders    0.143659
dtype: float64
In [83]:
X = train2_df.copy()
y = X.pop("price")
X = X.loc[:, features]
X['model_year'] = X['model_year'].astype('float64')
In [84]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58282 entries, 0 to 58281
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   milage         58282 non-null  float64
 1   model_year     58276 non-null  float64
 2   hp             53415 non-null  float64
 3   ed             57280 non-null  float64
 4   num_cylinders  57081 non-null  float64
dtypes: float64(5)
memory usage: 2.2 MB
In [85]:
X.head(3)
Out[85]:
| | milage | model_year | hp | ed | num_cylinders |
|---|---|---|---|---|---|
| 0 | 272.670130 | 2018.0 | 9.558416 | 1.687259 | 6.0 |
| 1 | 282.842712 | 2007.0 | 9.025890 | 1.540963 | 6.0 |
| 2 | 302.474792 | 2011.0 | 9.025890 | 1.870411 | 8.0 |
In [86]:
# Create a pipeline for median imputation and robust scaling
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), RobustScaler())
preprocessing = ColumnTransformer([("num", num_pipeline, features)])
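`RobustScaler` centers each column on its median and scales by the interquartile range, so remaining outliers (exotic cars, extreme mileage) pull the scaling far less than they would with `StandardScaler`. A toy sketch of the impute-then-scale step on made-up values (not dataset rows):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# One column with a missing value and an outlier
X_toy = np.array([[1.0], [2.0], [np.nan], [3.0], [100.0]])

pipe = make_pipeline(SimpleImputer(strategy="median"), RobustScaler())
out = pipe.fit_transform(X_toy)

# NaN is imputed with the column median (2.5), which then maps to 0;
# the outlier only shifts the result linearly, not the center or scale
print(out.ravel())
```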
In [87]:
# Fit and transform the data
X_scaled = preprocessing.fit_transform(X)
# Create a new dataframe for the scaled data
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
In [88]:
X_scaled_df.head(3)
Out[88]:
| | milage | model_year | hp | ed | num_cylinders |
|---|---|---|---|---|---|
| 0 | 0.124850 | 0.333333 | 0.487727 | 0.000000 | 0.0 |
| 1 | 0.196527 | -1.500000 | -0.082593 | -0.344326 | 0.0 |
| 2 | 0.334855 | -0.833333 | -0.082593 | 0.431070 | 1.0 |
In [89]:
# Apply PCA
pca, X_pca, loadings = apply_pca(X_scaled_df)
print(loadings)
                    PC1       PC2       PC3       PC4       PC5
milage        -0.263490  0.614072 -0.742920  0.004903  0.039187
model_year     0.223742 -0.644155 -0.610470 -0.395939  0.074582
hp             0.551986 -0.065634 -0.251091  0.783558 -0.118297
ed             0.528445  0.327686  0.046116 -0.428746 -0.653772
num_cylinders  0.544582  0.310313  0.101114 -0.213125  0.742624
In [90]:
# Explained variance
plot_variance(pca);
In [91]:
sns.catplot(
    y="value",
    col="variable",
    data=X_pca.melt(),
    kind='boxen',
    sharey=False,
    col_wrap=3
);
/opt/conda/lib/python3.10/site-packages/seaborn/categorical.py:1794: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
In [92]:
train2_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58282 entries, 0 to 58281
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   id             58282 non-null  int64
 1   brand          58282 non-null  object
 2   model          58282 non-null  object
 3   model_year     58276 non-null  category
 4   milage         58282 non-null  float64
 5   fuel_type      58112 non-null  object
 6   engine         58282 non-null  object
 7   transmission   58282 non-null  object
 8   ext_col        58282 non-null  object
 9   int_col        58282 non-null  object
 10  accident       58169 non-null  object
 11  clean_title    57686 non-null  object
 12  price          58282 non-null  int64
 13  brandmodel     58282 non-null  object
 14  hp             53415 non-null  float64
 15  ed             57280 non-null  float64
 16  num_cylinders  57081 non-null  float64
 17  num_valves     4056 non-null   float64
dtypes: category(1), float64(5), int64(2), object(10)
memory usage: 7.6+ MB
In [93]:
# Choose a component: PC1 to PC5
component = "PC5"
idx = X_pca[component].sort_values(ascending=False).index
train2_df.loc[idx, ["price","brandmodel","fuel_type","transmission","ext_col","int_col","accident","clean_title"] + features]
Out[93]:
| | price | brandmodel | fuel_type | transmission | ext_col | int_col | accident | clean_title | milage | model_year | hp | ed | num_cylinders |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56486 | 289991 | Bentley Continental GT GT Speed | Gasoline | automatic | Granite | Porpoise | None reported | NaN | 43.806392 | 2024 | NaN | NaN | 12.0 |
| 19906 | 16500 | BMW i3 94 Ah | Gasoline | automatic | Gray | Black | None reported | Yes | 288.097206 | 2016 | 7.749778 | 0.520063 | NaN |
| 57741 | 16500 | BMW i3 94 Ah | NaN | automatic | Black | – | None reported | Yes | 207.364414 | 2018 | 7.749778 | 0.520063 | NaN |
| 57363 | 26990 | BMW i3 120Ah w/Range Extender | NaN | automatic | Black | Black | None reported | Yes | 130.288142 | 2019 | 7.749778 | 0.520063 | NaN |
| 11599 | 11499 | BMW i3 Base w/Range Extender | Gasoline | automatic | White | Black | None reported | Yes | 272.946881 | 2015 | 7.749778 | 0.520063 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 42944 | 6699 | RAM 2500 Big Horn | Diesel | automatic | White | Gray | None reported | Yes | 311.448230 | 2007 | 9.391827 | 2.388206 | 6.0 |
| 48383 | 25500 | Dodge Ram 2500 Laramie Quad Cab | Diesel | automatic | Blue | Beige | None reported | Yes | 193.963914 | 2007 | 9.391827 | 2.388206 | 6.0 |
| 52980 | 69995 | RAM 3500 Tradesman | Diesel | automatic | Blue | Gray | None reported | Yes | 54.680892 | 2007 | 9.715842 | 2.388206 | 6.0 |
| 48311 | 1950995 | Bugatti Veyron 16.4 Grand Sport | Gasoline | automatic | White | Black | None reported | Yes | 79.561297 | 2011 | NaN | 2.602594 | NaN |
| 54502 | 1950995 | Bugatti Veyron 16.4 Grand Sport | Gasoline | automatic | White | White | None reported | Yes | 79.561297 | 2011 | NaN | 2.602594 | NaN |
58282 rows × 13 columns
Dropping undesired features¶
In [94]:
train2_df.info()
(same output as the previous train2_df.info() call above)
In [95]:
# Keep the target in its own dataframe
car_labels = train2_df['price']
car_labels = car_labels.to_frame()
# Drop undesired features
train2_df.drop(['price','id','brand','model','engine','num_valves'], axis=1, inplace=True)
#train2_df.drop(['price','id','brand','model','engine','hp','ed','num_valves','num_cylinders'], axis=1, inplace=True)
In [96]:
train2_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58282 entries, 0 to 58281
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   model_year     58276 non-null  category
 1   milage         58282 non-null  float64
 2   fuel_type      58112 non-null  object
 3   transmission   58282 non-null  object
 4   ext_col        58282 non-null  object
 5   int_col        58282 non-null  object
 6   accident       58169 non-null  object
 7   clean_title    57686 non-null  object
 8   brandmodel     58282 non-null  object
 9   hp             53415 non-null  float64
 10  ed             57280 non-null  float64
 11  num_cylinders  57081 non-null  float64
dtypes: category(1), float64(4), object(7)
memory usage: 4.9+ MB
In [97]:
# Treat the '–' placeholder as a missing value so it gets imputed later
train2_df = train2_df.replace('–', np.nan)
In [98]:
# Run this cell to join the additional features from the PCA analysis
#train2_df = train2_df.join(X_pca)
In [99]:
train2_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58282 entries, 0 to 58281
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   model_year     58276 non-null  category
 1   milage         58282 non-null  float64
 2   fuel_type      57773 non-null  object
 3   transmission   58282 non-null  object
 4   ext_col        58226 non-null  object
 5   int_col        57104 non-null  object
 6   accident       58169 non-null  object
 7   clean_title    57686 non-null  object
 8   brandmodel     58282 non-null  object
 9   hp             53415 non-null  float64
 10  ed             57280 non-null  float64
 11  num_cylinders  57081 non-null  float64
dtypes: category(1), float64(4), object(7)
memory usage: 4.9+ MB
Building a Pipeline¶
In [103]:
# Preprocessing pipeline: median imputation + robust scaling for numerical features,
# most-frequent imputation + one-hot encoding for categorical features
#num_features = ['model_year', 'milage','hp','ed','num_cylinders','PC1','PC2','PC3','PC4']
num_features = ['model_year', 'milage','hp','ed','num_cylinders']
cat_features = ['brandmodel','fuel_type','transmission','ext_col', 'int_col','accident','clean_title']

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), RobustScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder(handle_unknown='ignore'))

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_features),
    ("cat", cat_pipeline, cat_features)])
Model Implementation¶
In [104]:
# MLP model; alternative configurations tried are noted in the comments
mlp_reg = MLPRegressor(
    hidden_layer_sizes=[8, 32, 8],   # also tried: [8, 32, 64, 32, 8], [8, 64, 8]
    activation='relu',
    solver='adam',
    alpha=0.001,
    batch_size=64,                   # also tried: 32, 128
    learning_rate='constant',
    max_iter=400,
    shuffle=True,
    early_stopping=True,
    validation_fraction=0.1,
    random_state=42)

pipeline = make_pipeline(preprocessing, mlp_reg)
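With `early_stopping=True`, scikit-learn holds out `validation_fraction` of the training rows and stops when the validation score stops improving, so training often finishes well before `max_iter` epochs. A quick sanity check of the same configuration on synthetic data (not the competition data):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Same style of configuration as above, on a small synthetic problem
toy_mlp = MLPRegressor(hidden_layer_sizes=[8, 32, 8], activation='relu',
                       solver='adam', batch_size=64, max_iter=400,
                       early_stopping=True, validation_fraction=0.1,
                       random_state=42)
toy_mlp.fit(X, y)

# n_iter_ records how many epochs actually ran (<= max_iter)
print(toy_mlp.n_iter_, toy_mlp.predict(X[:5]).shape)
```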
In [105]:
# ravel() passes y as a 1-D array and avoids sklearn's DataConversionWarning
pipeline.fit(train2_df, car_labels.values.ravel())
Out[105]:
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('num',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='median')),
('robustscaler',
RobustScaler())]),
['model_year', 'milage', 'hp',
'ed', 'num_cylinders']),
('cat',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='most_frequent')),
('onehotencoder',
OneHotEncoder(handle_unknown='ignore'))]),
['brandmodel', 'fuel_type',
'transmission', 'ext_col',
'int_col', 'accident',
'clean_title'])])),
('mlpregressor',
MLPRegressor(alpha=0.001, batch_size=64, early_stopping=True,
hidden_layer_sizes=[8, 32, 8], max_iter=400,
random_state=42))])
Score a Baseline¶
In [106]:
# Define the K-fold cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation (ravel() avoids sklearn's column-vector warning;
# the leading minus sign flips the negated scores back to positive RMSE)
baseline_scores = -cross_val_score(pipeline, train2_df, car_labels.values.ravel(),
                                   scoring='neg_root_mean_squared_error', cv=kf)
print("Cross-Validation RMSE scores:", baseline_scores)
print("Average RMSE:", np.mean(baseline_scores))
Cross-Validation RMSE scores: [60193.71277415 78966.61767177 77424.15910457 48134.62743974 72502.42776201]
Average RMSE: 67444.30895044885
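The minus sign in front of `cross_val_score` is needed because scikit-learn scorers follow a "higher is better" convention: `neg_root_mean_squared_error` returns values ≤ 0, and negating recovers the usual positive RMSE. A small check with a dummy model on made-up data:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = rng.normal(loc=10.0, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
raw = cross_val_score(DummyRegressor(strategy="mean"), X, y,
                      scoring='neg_root_mean_squared_error', cv=kf)

# raw scores are <= 0; negating gives the usual positive RMSE
print(raw.max() <= 0, (-raw).min() >= 0)
```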
In [107]:
pd.Series(baseline_scores).describe()
Out[107]:
count        5.000000
mean     67444.308950
std      13070.773870
min      48134.627440
25%      60193.712774
50%      72502.427762
75%      77424.159105
max      78966.617672
dtype: float64
In [108]:
# Note: without parentheses this returns the bound method itself, not the names;
# calling preprocessing.get_feature_names_out() after fitting lists the transformed columns
pipeline.get_feature_names_out
Out[108]:
<bound method Pipeline.get_feature_names_out of Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('num',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='median')),
('robustscaler',
RobustScaler())]),
['model_year', 'milage', 'hp',
'ed', 'num_cylinders']),
('cat',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='most_frequent')),
('onehotencoder',
OneHotEncoder(handle_unknown='ignore'))]),
['brandmodel', 'fuel_type',
'transmission', 'ext_col',
'int_col', 'accident',
'clean_title'])])),
('mlpregressor',
MLPRegressor(alpha=0.001, batch_size=64, early_stopping=True,
hidden_layer_sizes=[8, 32, 8], max_iter=400,
random_state=42))])>
Evaluate Model on RMSE¶
In [109]:
# Predict training-set prices (RMSE in previous runs: 67277.4197309128 | 67308.44019387408 | 67231.14714140317)
car_price_prediction = pipeline.predict(train2_df)

# Model evaluation: RMSE on the training set
lm_rmse = mean_squared_error(car_labels, car_price_prediction, squared=False)
lm_rmse
Out[109]:
67454.88485800906
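The `squared=False` argument (deprecated in newer scikit-learn in favor of `root_mean_squared_error`) simply takes the square root of the mean squared error. A quick equivalence check on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# RMSE two ways: sqrt of sklearn's MSE vs. the formula written out
rmse_via_mse = np.sqrt(mean_squared_error(y_true, y_pred))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse_via_mse, rmse_manual)
```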
Test Dataset¶
Feature Engineering of Categorical Features¶
In [121]:
test_df = pd.read_csv('/kaggle/input/used-car-price-prediction-dataset-for-kagglex/test.csv')
In [122]:
# Create the combined brandmodel feature
test_df["brandmodel"] = test_df['brand'] + ' ' + test_df['model']

# Extract additional features from the engine description
test_df[['hp', 'ed', 'num_cylinders', 'num_valves']] = test_df['engine'].apply(get_engine_features)

# Simplify the transmission feature
test_df['transmission'] = test_df['transmission'].apply(simple_transmission)
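`get_engine_features` and `simple_transmission` are defined earlier in the notebook. As a rough illustration of the kind of parsing involved, here is a hypothetical regex-based extractor for engine strings of the form `"172.0HP 1.6L 4 Cylinder ..."`; the notebook's actual function may differ:

```python
import re
import numpy as np
import pandas as pd

def parse_engine(text):
    """Hypothetical sketch: pull horsepower, displacement (L), and cylinder
    count out of a free-text engine description; NaN when a field is absent."""
    hp = re.search(r'([\d.]+)\s*HP', text)
    disp = re.search(r'([\d.]+)\s*L', text)
    cyl = re.search(r'(\d+)\s*Cylinder', text)
    return pd.Series([float(hp.group(1)) if hp else np.nan,
                      float(disp.group(1)) if disp else np.nan,
                      float(cyl.group(1)) if cyl else np.nan])

parsed = parse_engine("172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel")
print(parsed.tolist())
```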
Feature Engineering of Numerical Features¶
In [123]:
# Replace milage with its square root
test_df['milage'] = np.sqrt(test_df['milage'])

# Box-Cox transform (with +1 shift) for hp and engine displacement
lambda_ = 0.15
test_df['hp'] = boxcox1p(test_df['hp'], lambda_)
test_df['ed'] = boxcox1p(test_df['ed'], lambda_)

# Bin model_year into the same categories used for the training set
test_df['model_year'] = pd.cut(test_df['model_year'],
                               bins=[1974, 2007, 2011, 2013, 2015, 2016, 2018, 2019, 2020, 2021, 2024],
                               labels=[2007, 2011, 2013, 2015, 2016, 2018, 2019, 2020, 2021, 2024])
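`boxcox1p(x, λ)` applies the Box-Cox transform to `x + 1`, i.e. `((1 + x)^λ − 1) / λ` for λ ≠ 0, which compresses the right tail of skewed positive features such as horsepower. A quick check against the closed form:

```python
import numpy as np
from scipy.special import boxcox1p

lambda_ = 0.15
x = np.array([0.0, 100.0, 500.0])

# Closed-form Box-Cox on (1 + x), matching scipy's boxcox1p
manual = ((1.0 + x) ** lambda_ - 1.0) / lambda_
print(boxcox1p(x, lambda_), manual)
```

Note that `boxcox1p(0, λ) = 0`, so the +1 shift makes the transform safe for zero-valued entries.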
In [ ]:
# Alternative: bin model_year by quantiles
#test_df['model_year'] = pd.cut(test_df['model_year'],
#                               bins=[1974, 2012, 2016, 2019, 2024],
#                               labels=[2012, 2016, 2019, 2024])
In [124]:
# Plot the binned model_year distribution
test_df["model_year"].hist(bins=10, figsize=(6,2))
plt.show()
In [125]:
# Plot the numerical features
test_df[["model_year", "milage","hp", "ed", "num_cylinders", "num_valves"]].hist(bins=50, figsize=(8,8))
plt.show()
In [126]:
# Keep the id column for the submission file, then drop undesired features
index_test = test_df['id']
index_test = index_test.to_frame()
test_df.drop(['id','engine','brand','model','num_valves'], axis=1, inplace=True)
#test_df.drop(['id','brand','model','engine','hp','ed','num_valves','num_cylinders'], axis=1, inplace=True)
In [127]:
# Replace the '–' placeholder with NaN so the imputers can treat it as missing
test_df = test_df.replace('–', np.nan)
In [128]:
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36183 entries, 0 to 36182
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   model_year     36182 non-null  category
 1   milage         36183 non-null  float64
 2   fuel_type      35986 non-null  object
 3   transmission   36183 non-null  object
 4   ext_col        36156 non-null  object
 5   int_col        35479 non-null  object
 6   accident       36183 non-null  object
 7   clean_title    36183 non-null  object
 8   brandmodel     36183 non-null  object
 9   hp             33577 non-null  float64
 10  ed             35778 non-null  float64
 11  num_cylinders  35679 non-null  float64
dtypes: category(1), float64(4), object(7)
memory usage: 3.1+ MB
In [134]:
# Make predictions on the test set
# (the np.exp variant applies when the model is trained on log-transformed prices)
#car_price_test_prediction = np.exp(pipeline.predict(test_df))
car_price_test_prediction = pipeline.predict(test_df)
car_price_test_prediction_df = pd.DataFrame(car_price_test_prediction, columns=['price'])
results = pd.concat([index_test, car_price_test_prediction_df], axis=1)
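`pd.concat(..., axis=1)` aligns on the index, not on row position. It works here because both frames carry a fresh default `RangeIndex`; had `index_test` been sliced from a shuffled frame, it would need `reset_index(drop=True)` first. A small illustration with made-up data:

```python
import pandas as pd

ids = pd.DataFrame({"id": [10, 11, 12]}, index=[5, 6, 7])  # non-default index
preds = pd.DataFrame({"price": [1.0, 2.0, 3.0]})           # RangeIndex 0..2

# Index labels don't overlap, so the misaligned concat produces 6 NaN-padded rows
misaligned = pd.concat([ids, preds], axis=1)
aligned = pd.concat([ids.reset_index(drop=True), preds], axis=1)

print(len(misaligned), len(aligned))
```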
In [135]:
results.head(3)
Out[135]:
|   | id    | price        |
|---|-------|--------------|
| 0 | 54273 | 30081.982962 |
| 1 | 54274 | 19885.575096 |
| 2 | 54275 | 27483.291429 |
In [136]:
# Save the results to a CSV file for submission
results.to_csv('submission39-REPOFsubmission22.csv', index=False)