Machine Learning Project 1 - Big Mart Sales


Aim:

We are given sales data for 1559 products across 10 stores of the Big Mart chain in various cities. The task is to build a model that predicts the sales of each product in the different stores.

Dataset Features: 

  • Item_Identifier: Unique product ID
  • Item_Weight: Weight of the product
  • Item_Fat_Content: Whether the product is low fat or not
  • Item_Visibility: % of total display area in the store allocated to this product
  • Item_Type: Category to which the product belongs
  • Item_MRP: Maximum Retail Price (list price) of the product
  • Outlet_Identifier: Unique store ID
  • Outlet_Establishment_Year: Year in which the store was established
  • Outlet_Size: Size of the store
  • Outlet_Location_Type: Type of city in which the store is located
  • Outlet_Type: Grocery store or some sort of supermarket
  • Item_Outlet_Sales: Sales of the product in the particular store. This is the outcome variable to be predicted.


Handling Missing Values:

2 features have missing values:
  • Item_Weight - 1463 missing values.
        Missing values in "Item_Weight" are imputed with the mean Item_Weight of the corresponding Item_Type.
        e.g. if a "Dairy" item has a missing Item_Weight, it is filled with the mean weight of all Dairy items.
        This is checked and imputed for every unique Item_Type.
  • Outlet_Size - 2410 missing values.
        Missing values are imputed with the mode, i.e. 'Medium'.
        (A short sketch of this imputation follows the list.)
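The imputation above takes only a few lines of pandas. This is a minimal sketch, assuming the training data is loaded into a DataFrame called `df` with the column names listed earlier (the file name is hypothetical).

```python
import pandas as pd

df = pd.read_csv('train.csv')  # hypothetical file name for the training data

# Item_Weight: fill missing weights with the mean weight of that Item_Type.
df['Item_Weight'] = (df.groupby('Item_Type')['Item_Weight']
                       .transform(lambda s: s.fillna(s.mean())))

# Outlet_Size: fill missing sizes with the overall mode, i.e. 'Medium'.
df['Outlet_Size'] = df['Outlet_Size'].fillna(df['Outlet_Size'].mode()[0])
```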

Feature Engineering:

  1. Item_Fat_Content: 2 unique categories; mislabelled entries were corrected.
  2. Outlet_Establishment_Year: converted into the number of years the outlet has been in operation.
  3. Item_type_new: "Item_Type" has many unique categories, but "Item_Identifier" divides all items into 3 broad groups, i.e. Food, Drinks and Non-Consumable. So "Item_type_new" contains these 3 groups (a rough sketch follows this list).
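A rough sketch of these three steps, continuing from the DataFrame `df` above. The exact raw label spellings and the reference year 2013 are assumptions, not taken from the original post.

```python
# 1. Item_Fat_Content: collapse duplicate/misspelled labels into 2 categories
#    (the exact raw spellings here are assumptions).
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(
    {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})

# 2. Outlet_Establishment_Year -> number of years the outlet has operated
#    (2013 is an assumed reference year).
df['Outlet_Establishment_Year'] = 2013 - df['Outlet_Establishment_Year']

# 3. Item_type_new: the first two characters of Item_Identifier give the
#    broad group each item belongs to.
df['Item_type_new'] = df['Item_Identifier'].str[:2].map(
    {'FD': 'Food', 'DR': 'Drinks', 'NC': 'Non-Consumable'})
```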

Feature Transformations:

All the feature transformations are based on the exploratory data analysis and hypothesis testing done on this dataset (covered in a separate post). A rough sketch of the binning follows the list.
  1. Item_Weight: divided into 2 groups, i.e. Weights 1 and Weights 2.
  2. Item_Visibility: divided into 2 groups, i.e. High and Low.
  3. Item_MRP: kept as it is.
  4. Item_Fat_Content: 2 unique categories, kept as it is.
  5. Outlet_Size: divided into 2 groups, i.e. High and Small.
  6. Outlet_Location_Type: divided into 2 groups, i.e. Tier 1 and Tier 2.
  7. Outlet_Type: 4 unique categories, kept as it is.
  8. Item_type_new: divided into 2 groups, i.e. Drinks and Others.
  9. Outlet_Establishment_Year: divided into 3 groups, i.e. low, medium and high.
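Because the actual cut points live in the EDA post and are not repeated here, the snippet below is only an illustrative sketch with placeholder rules; it continues from the DataFrame `df` built above.

```python
import numpy as np
import pandas as pd

# Item_Visibility: two groups around an assumed cut point.
df['Item_Visibility'] = np.where(df['Item_Visibility'] > 0.065, 'High', 'Low')

# Item_Weight: two groups, here split at the median (placeholder rule).
df['Item_Weight'] = np.where(df['Item_Weight'] > df['Item_Weight'].median(),
                             'Weights 1', 'Weights 2')

# Item_type_new: two groups, Drinks vs. Others.
df['Item_type_new'] = np.where(df['Item_type_new'] == 'Drinks', 'Drinks', 'Others')

# Outlet_Establishment_Year (years in operation): three equal-width bins.
df['Outlet_Establishment_Year'] = pd.cut(df['Outlet_Establishment_Year'],
                                         bins=3, labels=['low', 'medium', 'high'])
```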

NOTE: LabelEncoder() was applied to all the categorical features, and dummy variables were created as well.
One dummy column was dropped per feature so that the dummy-variable trap was avoided.
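A minimal sketch of this encoding step; the list of categorical columns is an assumption based on the transformations above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Columns treated as categorical (assumed list).
cat_cols = ['Item_Fat_Content', 'Item_Weight', 'Item_Visibility', 'Outlet_Size',
            'Outlet_Location_Type', 'Outlet_Type', 'Item_type_new',
            'Outlet_Establishment_Year']

# Label-encode every categorical column...
le = LabelEncoder()
for col in cat_cols:
    df[col] = le.fit_transform(df[col].astype(str))

# ...then create dummy variables, dropping one level per feature to avoid
# the dummy-variable trap.
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)
```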

Model Selection:


GradientBoostingRegressor clearly outperformed every other model.
There is no evidence of "high bias" or "high variance": the model generalizes well to new data.
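The comparison itself is not shown in the post; the sketch below illustrates one way to run it with cross-validated RMSE. `X` and `y` stand for the prepared feature matrix and the Item_Outlet_Sales target, and the candidate models and dropped ID columns are assumptions.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Feature matrix and target (the dropped columns are an assumption).
X = df.drop(columns=['Item_Outlet_Sales', 'Item_Identifier',
                     'Item_Type', 'Outlet_Identifier'])
y = df['Item_Outlet_Sales']

models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(),
    'RandomForest': RandomForestRegressor(random_state=42),
    'GradientBoosting': GradientBoostingRegressor(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5,
                             scoring='neg_root_mean_squared_error')
    print(f'{name}: RMSE = {-scores.mean():.0f} (+/- {scores.std():.0f})')
```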


Best Features:
1. From the hypothesis-testing post, we found that Item_MRP has a correlation of about 0.57 with the response variable, i.e. outlet sales tend to increase substantially as Item_MRP increases.
2. Among all Outlet_Types, types 1 and 3 have the highest outlet sales.
3. Outlet_Establishment_Year_1, i.e. the "medium number of years" group, has the lowest outlet sales compared to the others.

Feature Selection:

L2-based feature selection kept only 4 features and gave a better model than the other feature selection techniques.
Since the original number of features was 11, the complexity of the model was greatly reduced; still, the baseline model performed better on the test set by a small margin.
As we saw earlier, GradientBoosting with L2-based feature selection has nearly the same CV and RMSE scores as the baseline model, and again the learning curves show no evidence of "high bias" or "high variance".
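The selection code is not shown in the post, so the sketch below is one plausible reading of "L2-based feature selection": rank features by the magnitude of the coefficients of an L2-penalised (Ridge) model via SelectFromModel and keep the top 4.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge

# Keep the 4 features with the largest Ridge (L2-penalised) coefficients.
selector = SelectFromModel(Ridge(alpha=1.0), max_features=4, threshold=-np.inf)
X_selected = selector.fit_transform(X, y)

# Refit the gradient boosting model on the reduced feature set.
gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(X_selected, y)
```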

Best Features:
Again, Item_MRP has the highest feature-importance score, followed by the Outlet_Type features.
It is very interesting to see that using only Item_MRP and the Outlet_Type features, we can predict Outlet_Sales.
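As a quick check, the importances can be read straight off the fitted model; this assumes `X`, `gbr` and `selector` from the sketch above.

```python
import pandas as pd

# Feature importances of the model fitted on the selected features.
selected_cols = X.columns[selector.get_support()]
importances = pd.Series(gbr.feature_importances_, index=selected_cols)
print(importances.sort_values(ascending=False))
```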


Model Tuning:

In this case, I will tune 2 models:
Model 1: Gradient Boosting Model -- with all features
Model 2: Gradient Boosting Model -- with L2 based feature selection
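The tuning procedure itself is not shown in the post; the sketch below uses GridSearchCV with an illustrative (assumed) parameter grid. The same search is run once on all features (Model 1) and once on the L2-selected features (Model 2).

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative parameter grid -- not the grid actually used in the post.
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [2, 3, 4],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 1.0],
}

search = GridSearchCV(GradientBoostingRegressor(random_state=42),
                      param_grid, cv=5,
                      scoring='neg_root_mean_squared_error')

search.fit(X, y)             # Model 1: all features
# search.fit(X_selected, y)  # Model 2: L2-selected features
print(search.best_params_, -search.best_score_)
```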


Model 1:

More importance is given to the "Item_MRP" feature: its importance score increased from ~58% in the baseline model to around 71%, and the importance of the other features decreased accordingly.
Also, comparing the CV and RMSE scores against the baseline model, the tuned model is slightly overfitting.


Model 2:


We get a much better distribution of feature importances in this tuned model, as all the predictors contribute substantially.
Also, the learning curves show that the model generalizes very well to new data.
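The learning curves can be reproduced along these lines; a minimal sketch assuming the grid search above was re-run on the L2-selected features (plotting omitted).

```python
import numpy as np
from sklearn.model_selection import learning_curve

# Training vs. cross-validation RMSE as the training set grows.
train_sizes, train_scores, val_scores = learning_curve(
    search.best_estimator_, X_selected, y, cv=5,
    scoring='neg_root_mean_squared_error',
    train_sizes=np.linspace(0.1, 1.0, 5))

# Train and CV RMSE converging to similar values indicates neither
# high bias nor high variance.
print('train RMSE:', -train_scores.mean(axis=1))
print('CV RMSE:   ', -val_scores.mean(axis=1))
```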


Final Model:

"Tuned GradientBoostingRegressor - with L2-based Feature selection"

Reasons:

  1. Fewer features, hence lower model complexity.
  2. No sign of overfitting.
  3. The model generalizes very well to new data.
  4. Although the public LB score is slightly worse than the baseline model's, note that this score is based on only 25% of the predicted test data. On the full test set, the final model should perform about the same as, or even better than, the baseline model.
