This is the group project for SDSC2102 – Statistical Methods and Data Analysis, which I completed in Semester B of the 2022/23 academic year (Year 3).
Presentation Slides:
Course Instructor: Prof. ZENG Li
1. Introduction (Background and Problem Formulation)
1.1 Background
A Chinese automobile company planned to penetrate the US market by building a factory within the country to produce cars locally. In order to achieve higher competitiveness against its US and European counterparts, the company intends to adapt its car designs and business strategies to meet specific price levels.
1.2 Objectives
One of the key objectives is to predict reasonable prices to make their products more appealing to consumers in the US market.
1.3 Target
The primary goal of this analysis is to understand the factors influencing car pricing in the US market. By identifying these factors, the company can make informed decisions about car designs and pricing strategies. Additionally, this study aims to compare the performance of three regression models, namely Linear Regression, Random Forest Regression, and KNN Regression, to determine the best model for predicting suitable car prices for the car company.
1.4 Methodology
The investigation will employ three modeling techniques: Linear Regression, Random Forest Regression, and KNN Regression, to identify the factors influencing car pricing in the US market and predict reasonable prices for the company’s products. Kaggle, Tableau, and Python will be the primary tools for this data analysis and interpretation.
2. Data Preprocessing
2.1 Importing Dataset and Handling Missing Values
To begin the data preprocessing, we first imported the dataset using pandas and checked for missing values; none were found in this dataset.
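A minimal sketch of this step (the filename is an assumption based on the usual name of this Kaggle dataset):

```python
import pandas as pd

# Load the raw dataset (filename assumed; adjust to the actual file)
df = pd.read_csv("CarPrice_Assignment.csv")

# Count missing values per column; every count was zero for this dataset
print(df.isnull().sum())
```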
2.2 Data Cleaning
In this step, we renamed the column "CarName" to "CompanyName" and corrected all typos in the company names. We then checked for duplicate entries; none were found in this dataset either.
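A sketch of the cleaning step; the typo map below follows common fixes for this dataset and may not match our notebook exactly:

```python
# Take the first word of "CarName" as the company name, then drop "CarName"
df["CompanyName"] = df["CarName"].str.split().str[0].str.lower()
df = df.drop(columns=["CarName"])

# Fix company-name typos (illustrative subset of the corrections)
typo_fixes = {"maxda": "mazda", "porcshce": "porsche", "toyouta": "toyota",
              "vokswagen": "volkswagen", "vw": "volkswagen"}
df["CompanyName"] = df["CompanyName"].replace(typo_fixes)

# Check for duplicate rows; none were found
print(df.duplicated().sum())
```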
2.3 Data Visualization
We created distribution and box plots for car prices to understand the data better; most car prices cluster around $10,000 (Appendix 1). Next, we plotted a correlation heatmap for 14 numerical variables, including "price," to identify the top 10 attributes with the highest correlation with price (Appendix 2). Variables like "enginesize" correlate positively with price (0.87), while "citympg" correlates negatively with price.
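A sketch of the plotting code, assuming matplotlib and seaborn:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution and box plot of price (Appendix 1)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(df["price"], kde=True, ax=axes[0])
sns.boxplot(x=df["price"], ax=axes[1])
plt.show()

# Correlation heatmap of the numerical variables (Appendix 2)
corr = df.select_dtypes("number").corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```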
2.4 Feature Selection
Based on the correlation heatmap, we selected six features for further analysis: four that correlate positively with price ("enginesize," "curbweight," "horsepower," "carwidth") and two that correlate negatively with price ("citympg," "highwaympg"). We then plotted the variation of car prices against these six selected features for better visualization (Appendix 3).
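One possible version of these plots, as a sketch:

```python
# Scatter plots of price against each of the six selected features (Appendix 3)
selected = ["enginesize", "curbweight", "horsepower", "carwidth",
            "citympg", "highwaympg"]
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes.ravel(), selected):
    ax.scatter(df[col], df["price"], s=10)
    ax.set_xlabel(col)
    ax.set_ylabel("price")
plt.tight_layout()
plt.show()
```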
2.5 Handling Categorical Variables
Further plots of the dataset revealed a large number of categorical variables (Appendix 4), which are inconvenient for modeling if left as raw text. To address this, we one-hot encoded the categorical columns into numerical dummy columns using pandas' get_dummies function.
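A minimal sketch of this encoding step; excluding "CompanyName" here is an assumption about the order of operations, since that column is grouped into price tiers in the next step:

```python
# One-hot encode categorical (object-dtype) columns into 0/1 dummy columns;
# "CompanyName" is kept aside for the company classification below
categorical_cols = df.select_dtypes(include="object").columns.drop("CompanyName")
df = pd.get_dummies(df, columns=list(categorical_cols))
```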
2.6 Creating New Variable and Classifying the Car Companies
We created a new variable, “fueleconomy,” to represent the fuel efficiency of each car. We also grouped the car companies based on the average prices of each company, categorizing them as “Budget,” “Medium,” or “Highend”.
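A sketch of both steps; the 0.55/0.45 weighting and the bin edges are assumptions, not taken from the report:

```python
# "fueleconomy": a weighted blend of city and highway mpg
# (the 0.55/0.45 weighting is an assumption, a common convention)
df["fueleconomy"] = 0.55 * df["citympg"] + 0.45 * df["highwaympg"]

# Tier each company by its average car price (bin edges are assumptions)
avg_price = df.groupby("CompanyName")["price"].transform("mean")
df["carsrange"] = pd.cut(avg_price, bins=[0, 10000, 20000, 50000],
                         labels=["Budget", "Medium", "Highend"])

# Encode the tiers as 0/1 columns named "Budget", "Medium", "Highend"
df = pd.get_dummies(df, columns=["carsrange"], prefix="", prefix_sep="")
```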
2.7 Finalizing the Dataset for Prediction
Finally, we created a dataset for prediction, which includes only the important variables identified in the previous steps. At the end of the preprocessing steps, we have a clean and structured dataset ready for later modeling and prediction tasks (Appendix 5).
3. Modeling
3.1 Evaluation Metrics
The metrics we used to evaluate the performance of our models are R-squared (R²), Adjusted R-squared, Mean Squared Error (MSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE).
The R-squared (R²) metric quantifies the amount of variation in the dependent variable that is accounted for by the independent variables. The R² value ranges from 0 to 1, with 1 indicating a perfect fit. Generally, a higher R² value is more favorable, although other factors such as model complexity and sample size should be taken into account as well.
The Adjusted R² is a modified form of R² that adjusts for the number of independent variables in the model. It ranges from -∞ to 1, with 1 indicating a perfect fit. In general, R² never decreases when a new variable is added to the model. The Adjusted R², however, increases only when a newly added independent variable improves the model fit by more than would be expected by chance, and it decreases when the new variable does not add sufficient improvement. Hence, Adjusted R² is a better metric than R² for comparing models with different numbers of independent variables.
The Mean Squared Error (MSE) measures the average squared difference between the observed and predicted car prices. It is expressed in the squared units of the dependent variable, and because the errors are squared, it penalizes large errors more heavily than small ones.
The Mean Absolute Error (MAE) is a metric that calculates the average absolute difference between the observed and predicted car prices. The MAE is measured in the same units as the dependent variable.
Mean Absolute Percentage Error (MAPE) measures the average percentage difference between the observed and predicted car prices. MAPE represents the magnitude of error towards the actual value, which is expressed in percentage.
For MSE, MAE, and MAPE, lower values indicate a better model, since they mean the model's car price predictions are closer to the actual car prices.
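A helper that computes all five metrics, as a sketch (n is the number of observations and p the number of predictors in the Adjusted R² formula):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred, p):
    """Compute the five evaluation metrics used in this report."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return {
        "R2": r2,
        "Adjusted R2": 1 - (1 - r2) * (n - 1) / (n - p - 1),
        "MSE": mean_squared_error(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "MAPE (%)": float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100),
    }
```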
3.2 Linear & Multiple Linear Regression
3.2.1 Linear Regression Model
Ordinary least squares (OLS) Linear Regression fits a linear model by choosing coefficients w = (w1, …, wp) that minimize the sum of squared residuals, i.e., the squared differences between the observed targets and the values predicted by the linear equation. We used sklearn.linear_model.LinearRegression, which estimates these coefficients and extends directly to building predictive multiple linear models.
3.2.2 Hypothesis Testing
In order to examine the relationship between the dependent variable (price) and each independent variable (factor), hypothesis tests were conducted. The significance level was set at 0.05, with the null hypothesis (H0) stating that there is no relationship between price (Y) and the factor (X), and the alternative hypothesis (H1) asserting that a relationship exists between Y and X. If the p-value is less than 0.05, the null hypothesis is rejected, leading to the conclusion that a relationship exists between Y and X. Additionally, the R-squared value is assessed to gauge the strength of the effect of each factor (X) on price (Y) within the context of this report.
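A sketch of one such test, assuming statsmodels is used for the p-values (the report does not name the library, and "enginesize" is just an example factor):

```python
import statsmodels.api as sm

# Regress price on a single factor and inspect its p-value and R-squared
X = sm.add_constant(df[["enginesize"]])
fit = sm.OLS(df["price"], X).fit()
print(fit.rsquared, fit.pvalues["enginesize"])
# p-value < 0.05  ->  reject H0 and conclude a relationship exists
```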
3.2.3 Results of Linear Regression
At the outset, we fitted linear regression models for each independent variable (X), such as "enginesize," "horsepower," and "curbweight," with price as the dependent variable (Y). The resulting graph provides an overview of the outcomes (Appendix 6).
To conduct the analysis, we first identified the factors with a p-value greater than 0.05, indicating no significant relationship with price. As illustrated in Appendix 7, variables from "carheight" to "compressionratio" show no association with price. Consequently, we discarded them and focused solely on the remaining factors with p-values less than 0.05. As depicted in Appendix 8, variables ranging from "enginesize" to rear-wheel drive (RWD) correlate with price. Appendix 9 shows that the "four" car doors variable and "fueleconomy" correlate negatively with price, while Appendix 10 shows that the remaining factors correlate positively with price. After comparing the R-squared values, the linear regression analysis identified the nine attributes with the highest absolute correlation with price (Appendix 11). The p-values for all nine variables are less than 0.05, indicating that all of them are statistically significant in predicting car prices and that including them in the model is appropriate, as each contributes to the model's predictive ability.
The variables with the highest R-squared values and lowest p-values are “enginesize,” “curbweight,” “horsepower,” and “highend.” These variables are likely the most important predictors of car price according to this model. Other variables such as “carwidth,” “four,” “fueleconomy,” “carlength,” and “RWD” have weaker relationships with car prices but are still statistically significant. Overall, this model appears to be effective in car price prediction by using a combination of variables that are statistically significant.
3.2.4 Results of Multiple Linear Regression
Multiple linear regression is a technique to analyze the linear relationship between two or more independent variables and one continuous dependent variable. The equation takes the form y = b0 + b1x1 + b2x2 + … + bnxn, where y is the dependent variable, x1, x2, …, xn are the independent variables, b0 is the intercept, and b1, b2, …, bn are the coefficients representing the effect of each independent variable. This model combines the nine significant variables found during the individual linear regressions into one equation to predict car prices (the response variable). Model fitting shows that "Highend" has the largest coefficient, indicating that it has the most significant impact in this car price prediction model.
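A sketch of the fit; the exact dummy-column names ("four", "rwd", "Highend") and the train/test split ratio are assumptions:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# The nine significant predictors identified above
features = ["enginesize", "curbweight", "horsepower", "carwidth",
            "carlength", "fueleconomy", "four", "rwd", "Highend"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["price"], test_size=0.3, random_state=42)

mlr = LinearRegression().fit(X_train, y_train)
print(dict(zip(features, mlr.coef_)))  # "Highend" showed the largest coefficient
```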
Appendix 13 illustrates the scatter plot of the actual and predicted prices by this model. Overall, the results in Appendix 12 indicate that the multiple linear regression model shows good predictive abilities, with some variations in performance across the training, cross-validation, and testing sets.
3.3 Random Forest Regression
A random forest is an ensemble predictor that combines multiple decision trees, each trained on a distinct subset of the data, to boost prediction accuracy and combat overfitting. When bootstrap is set to True (the default), the max_samples parameter dictates the size of each subset; otherwise, the model uses the entire dataset to build each tree.
Before obtaining the results, the model hyperparameters were optimized using GridSearchCV with 5-fold cross-validation. The final model uses 61 estimators (n_estimators=61), i.e., 61 separate decision tree regressions, each fitted on a different bootstrapped sample, and requires a minimum of 6 samples to split an internal node (min_samples_split=6). All other hyperparameters remain at their defaults.
The model's final prediction is the average of the predictions from the individual trees. The study divided the dataset using 5-fold cross-validation, creating separate training and testing sets. We also employed an out-of-bag (OOB) approach, in which each data point is predicted only by the trees that did not use it during their bootstrapped training.
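A sketch of the tuning and OOB setup, reusing the X_train/y_train split from the multiple-regression sketch; the grid values searched are assumptions, and only the best parameters are taken from the report:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Tune the two hyperparameters with 5-fold cross-validation
grid = GridSearchCV(RandomForestRegressor(random_state=42),
                    {"n_estimators": range(40, 80),
                     "min_samples_split": [2, 4, 6, 8]},
                    cv=5, scoring="r2")
grid.fit(X_train, y_train)  # best found: n_estimators=61, min_samples_split=6

# Refit with oob_score=True so out-of-bag predictions are available
rf = RandomForestRegressor(n_estimators=61, min_samples_split=6,
                           oob_score=True, random_state=42)
rf.fit(X_train, y_train)
print(rf.oob_score_)        # OOB R², reported as 0.9365 in Appendix 14
```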
Meanwhile, the other general evaluation metrics of the Random Forest Regression model are presented in Appendix 14. The testing set shows a good result, with an adjusted R² value of 0.8861. Overall, these results indicate that the random forest model demonstrates strong predictive capabilities and generalizes to new data better than the Multiple Linear Regression model.
Appendix 15 compares the actual and predicted price values from the model. Most data points lie below $25,000; we believe the model would perform better with more samples at higher prices, especially above $30,000.
Appendix 16 presents the feature importance computed via Gini importance, which shows that the most impactful variable is "enginesize." This makes sense, as the engine is the most important part of a car: a car cannot run without one, and capabilities such as speed and lifespan largely depend on engine size.
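For reference, a sketch of how these scores can be read off the forest fitted in the earlier sketch (rf and features are reused from there):

```python
# Gini-based feature importances, sorted descending (Appendix 16)
importances = pd.Series(rf.feature_importances_, index=features)
print(importances.sort_values(ascending=False).head(3))
# enginesize, curbweight, and horsepower ranked highest in our results
```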
Lastly, Appendix 17 visualizes a tree plot, truncated at depth 2 for readability. In reality, the tree on the left has a max depth of 9 and 67 total nodes, and the tree on the right has a max depth of 10 and 69 total nodes. Appendix 18 shows histograms of the distribution of the number of nodes, max depth, and number of leaves across the individual regression trees in the random forest, which differ because each tree is fitted on a different bootstrapped sample. The results show that bootstrapping plays a major role in making the trees more diverse, which helps the ensemble's predictions.
3.4 K-Nearest Neighbor Regression
K-Nearest Neighbor Regression is a type of instance-based learning that predicts the numerical value of a new data point from its 'k' most similar neighbors; we used the KNeighborsRegressor() class from the sklearn library. A new point is thus assigned a value based on how closely it resembles the points in the training set. In this model, n_neighbors=4 is used, chosen by comparing R² across different values of k. Appendix 20 depicts the scatterplot of actual and predicted prices using KNN regression.
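A sketch of the model selection, reusing the earlier split; standardizing the features first is an assumption (KNN is distance-based, so unscaled features would be dominated by large-valued columns), and the range of k is illustrative:

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Standardize features so no single column dominates the distance metric
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Compare test R² across candidate k; k = 4 was the reported choice
for k in range(1, 11):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train_s, y_train)
    print(k, round(model.score(X_test_s, y_test), 4))

knn = KNeighborsRegressor(n_neighbors=4).fit(X_train_s, y_train)
```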
Appendix 19 describes the general evaluation metrics used in the KNN Regression model. Overall, these results indicate that this model can predict car prices effectively, with only slight variations in performance between the training, testing, and cross-validation sets. Additionally, the accuracy of KNN regression is higher than that of the multiple linear regression model. Appendix 21 visualizes the KNN contour plot for the two most significant attributes, engine size and curb weight, with warmer colors indicating higher prices.
4. Interpretation and Reflection
The analysis compared the performance of three different regression models: Multiple Linear Regression, Random Forest Regression, and K-Nearest Neighbor Regression. All models showed slight overfitting, as indicated by notably higher accuracy on the training set than on the testing and validation sets. This problem might be caused by the small sample size and lack of diversity in the data, leading to difficulties in generalization and adaptability to new data. Based on the Adjusted R² scores and the mean absolute error (MAE) values, the Random Forest Regression model emerged as the best-performing model, with an Adjusted R² of 0.8861 and an MAE of 1795.45. This indicates that the Random Forest model is better at explaining the variance in the data and has a lower average absolute difference between predicted and actual values compared to the other models.
Furthermore, the Random Forest Regression model provided feature importance scores, which revealed the top three most influential features as Engine Size (0.578), Curb Weight (0.192), and Horsepower (0.031). These scores demonstrate the relative contribution of each feature to the prediction across all decision trees in the forest. Consequently, it can be concluded that the Random Forest Regression model not only performs better in terms of prediction accuracy but also offers insights into the most significant features driving the predictions.
Throughout this project, we successfully predicted car prices using various regression models, achieving a high level of accuracy. This analysis provides valuable insights for car companies, dealerships, and buyers in determining the fair market value of current car prices. It also assists the Chinese car company in understanding the factors that influence car pricing, enabling them to become more competitive in the US market.
In conclusion, this study showcases the practical application of the statistical methods learned in the SDSC2102 course. By working with a real-world dataset, we were able to demonstrate the power of these techniques in solving complex problems and generating valuable insights for businesses and consumers alike.
Appendix
Python code used throughout this project – 2102 project.ipynb
Appendix 1 – Distribution and box plot for car price
Appendix 2 – Correlation heatmap for most of the variables
Appendix 3 – The variation of car price vs selected features
Appendix 4 – Other graphs made with categorical variables only
Appendix 5 – Encoded categorical columns with one hot-encoding method
Appendix 6 – Overall results of the regression model
Appendix 7 – Linear regression with no correlation factors
Appendix 8 – Linear regression of correlated factors with price
Appendix 9 – Linear regression of factors with negative correlation
Appendix 10 – Linear regression of factors with positive correlation
Appendix 11 – 9 Attributes with the Highest Absolute Correlation with Price
Appendix 12 – Multiple Linear Regression Evaluation Metrics
| Evaluation Metrics | Training Set | Testing Set | 5-Fold Cross-Validation Set |
|---|---|---|---|
| R² | 0.9303 | 0.8448 | 0.9096 |
| Adjusted R² | 0.9262 | 0.8358 | 0.9043 |
| Mean Square Error (MSE) | 4,181,836.2920 | 12,011,794.4727 | 5,252,029.2859 |
| Mean Absolute Error (MAE) | 1,475.8022 | 2,387.4288 | 1,642.2066 |
| Mean Absolute Percentage Error (MAPE) | 11.0414% | 16.6957% | 12.1913% |
Appendix 13 – Multiple Linear Regression Scatter Plot of Actual vs Predicted Prices
Appendix 14 – Random Forest Regression Evaluation Metrics
| Evaluation Metrics | Training Set | Out-of-Bag | 5-Fold Cross-Validation Set | Testing Set |
|---|---|---|---|---|
| R² | 0.9808 | 0.9365 | 0.9198 | 0.9117 |
| Adjusted R² | 0.9797 | 0.9328 | 0.9181 | 0.8861 |
| Mean Square Error (MSE) | 1,150,714.4524 | 3,756,219.2979 | 4,566,369.2124 | 6,835,687.8587 |
| Mean Absolute Error (MAE) | 765.6409 | 1,421.0788 | 1,509.2703 | 1,795.4514 |
| Mean Absolute Percentage Error (MAPE) | 5.7060% | 10.7757% | 11.3621% | 14.0521% |
Appendix 15 – Random Forest Scatter Plot of Actual vs Predicted Prices
Appendix 16 – Random Forest Feature Importance
Appendix 17 – Random Forest Individual Decision Tree Visualization
Appendix 18 – Random Forest Bootstrapping Statistics
Appendix 19 – K-Nearest Neighbor Regression Evaluation Metrics
| Evaluation Metrics | Training Set | Testing Set | 5-Fold Cross-Validation Set |
|---|---|---|---|
| R² | 0.9481 | 0.8898 | 0.8985 |
| Adjusted R² | 0.9451 | 0.8833 | 0.8925 |
| Mean Square Error (MSE) | 3,115,841.6219 | 8,528,717.5061 | 6,207,117.3859 |
| Mean Absolute Error (MAE) | 1,205.7317 | 1,991.5976 | 1,684.7972 |
| Mean Absolute Percentage Error (MAPE) | 9.2461% | 14.4534% | 12.4250% |
Appendix 20 – KNN Regression Scatter Plot of Actual vs Predicted Values
Appendix 21 – KNN Contour Plot