Machine Learning Project 1 - Big Mart Sales


Aim:

The dataset contains sales data for 1559 products across 10 stores of the Big Mart chain in various cities. The task is to build a model that predicts the sales of each product in each store.

Dataset Features: 

Variable                    Description
Item_Identifier             Unique product ID
Item_Weight                 Weight of product
Item_Fat_Content            Whether the product is low fat or not
Item_Visibility             % of total display area in store allocated to this product
Item_Type                   Category to which product belongs
Item_MRP                    Maximum Retail Price (list price) of product
Outlet_Identifier           Unique store ID
Outlet_Establishment_Year   Year in which store was established
Outlet_Size                 Size of the store
Outlet_Location_Type        Type of city in which store is located
Outlet_Type                 Grocery store or some sort of supermarket
Item_Outlet_Sales           Sales of product in particular store. This is the outcome variable to be predicted.


Handling Missing Values:

2 features have missing values:
  • Item_Weight - 1463 missing values
        Missing values in the "Item_Weight" column are imputed with the mean Item_Weight of the corresponding Item_Type.
        e.g. if a "Dairy" item has a missing Item_Weight, it is filled with the mean Item_Weight of all Dairy items.
        This is checked and imputed for every unique Item_Type.
  • Outlet_Size - 2410 missing values
        Missing values are imputed with the mode, i.e. "Medium".
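
The two imputation steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the original code of this project: the helper name impute_missing and the tiny toy frame are made up for the example, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill Item_Weight with the mean of its Item_Type, Outlet_Size with the mode."""
    out = df.copy()
    # Mean Item_Weight per Item_Type, aligned row by row (NaNs are skipped by mean).
    type_mean = out.groupby("Item_Type")["Item_Weight"].transform("mean")
    out["Item_Weight"] = out["Item_Weight"].fillna(type_mean)
    # Most frequent Outlet_Size over the whole column.
    out["Outlet_Size"] = out["Outlet_Size"].fillna(out["Outlet_Size"].mode()[0])
    return out


# Tiny made-up frame for illustration only (not the real Big Mart data).
toy = pd.DataFrame({
    "Item_Type": ["Dairy", "Dairy", "Dairy", "Snacks", "Snacks"],
    "Item_Weight": [10.0, 14.0, np.nan, 5.0, np.nan],
    "Outlet_Size": ["Medium", None, "Medium", "Small", None],
})
filled = impute_missing(toy)
```

Using transform("mean") keeps the result aligned with the original rows, so no explicit loop over the unique Item_Type values is needed.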

Feature Engineering:

  1. Item_Fat_Content: 2 unique values, after the inconsistently written labels were corrected.
  2. Outlet_Establishment_Year: converted to the number of years since the store was established.
  3. Item_type_new: "Item_Type" has many unique values, while "Item_Identifier" divides all items into 3 groups, i.e. Food, Drinks and Non-Consumable. "Item_type_new" holds these 3 groups.
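
A rough sketch of these three steps is shown below. Several details are assumptions made for illustration: the label variants ("LF", "low fat", "reg"), the reference year 2013, the two-letter identifier prefixes (FD, DR, NC) and the new column name Outlet_Years are not stated in this post, and the function engineer_features is hypothetical.

```python
import pandas as pd

REFERENCE_YEAR = 2013  # assumption: the year the sales data refers to


def engineer_features(df: pd.DataFrame, reference_year: int = REFERENCE_YEAR) -> pd.DataFrame:
    out = df.copy()
    # 1. Collapse spelling variants into the 2 intended labels (variants are assumed).
    out["Item_Fat_Content"] = out["Item_Fat_Content"].replace(
        {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
    )
    # 2. Number of years since the outlet was established (hypothetical column name).
    out["Outlet_Years"] = reference_year - out["Outlet_Establishment_Year"]
    # 3. Broad item group taken from the first two letters of Item_Identifier (assumed prefixes).
    prefix_to_group = {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"}
    out["Item_type_new"] = out["Item_Identifier"].str[:2].map(prefix_to_group)
    return out


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "NCD19"],
    "Item_Fat_Content": ["LF", "reg", "Low Fat"],
    "Outlet_Establishment_Year": [1999, 2009, 1987],
})
engineered = engineer_features(toy)
```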

Feature Transformations:

All the feature transformations are based on the exploratory data analysis and hypothesis testing done on this dataset in a separate post.
  1. Item_Weight: divided into 2 groups, i.e. Weights 1 and Weights 2.
  2. Item_Visibility: divided into 2 groups, i.e. High and Low.
  3. Item_MRP: kept as it is.
  4. Item_Fat_Content: 2 unique values, kept as it is.
  5. Outlet_Size: divided into 2 groups, i.e. High and Small.
  6. Outlet_Location_Type: divided into 2 groups, i.e. Tier 1 and Tier 2.
  7. Outlet_Type: 4 unique values, kept as it is.
  8. Item_type_new: divided into 2 groups, i.e. Drinks and Others.
  9. Outlet_Establishment_Year: divided into 3 groups, i.e. Low, Medium and High.
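
The groupings above are of two kinds: splitting a numeric feature at a cut-off, and collapsing a categorical feature into fewer levels. A minimal sketch of both follows. The actual cut-offs used in this project are not given in this post, so the median split here is purely a placeholder, and the helper names split_in_two and drinks_vs_others are hypothetical.

```python
import numpy as np
import pandas as pd


def split_in_two(values: pd.Series, threshold: float,
                 low_label: str = "Low", high_label: str = "High") -> pd.Series:
    """Label values above the threshold as high_label and all others as low_label."""
    labels = np.where(values > threshold, high_label, low_label)
    return pd.Series(labels, index=values.index)


def drinks_vs_others(item_type_new: pd.Series) -> pd.Series:
    """Keep 'Drinks' as it is and merge every other group into 'Others'."""
    return item_type_new.where(item_type_new == "Drinks", "Others")


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Item_Visibility": [0.01, 0.20, 0.05, 0.12],
    "Item_type_new": ["Food", "Drinks", "Non-Consumable", "Food"],
})
# Placeholder cut-off: the median of the column (an assumption, not the post's value).
cutoff = toy["Item_Visibility"].median()
toy["Item_Visibility_grp"] = split_in_two(toy["Item_Visibility"], cutoff)
toy["Item_type_grp"] = drinks_vs_others(toy["Item_type_new"])
```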

NOTE: All the categorical features were encoded with LabelEncoder(), and dummy variables were then created from them.
The dummy variable trap was avoided, i.e. one dummy column per feature was left out.
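
As an illustration of the encoding described in the note, the sketch below label-encodes two categorical columns and then creates dummy variables with one level dropped. The toy rows are made up, and whether the original code used pd.get_dummies with drop_first=True is an assumption; it is simply one common way to avoid the dummy variable trap.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat"],
    "Outlet_Type": ["Grocery Store", "Supermarket Type1", "Supermarket Type3"],
})

encoded = toy.copy()
for col in encoded.columns:
    # LabelEncoder maps the sorted unique labels to 0, 1, 2, ...
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

# One dummy column per level of Outlet_Type; drop_first removes the first level
# so the remaining dummy columns are not perfectly collinear with the intercept.
dummies = pd.get_dummies(encoded, columns=["Outlet_Type"], drop_first=True)
```

With this approach the dummy columns are named after the encoded level, e.g. Outlet_Type_1 and Outlet_Type_2, and the dropped level 0 acts as the reference category.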

Model Selection:


GradientBoostingRegressor clearly outperformed every other model.
There is no evidence of "high bias" or "high variance". The model generalizes well to new data.
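
A comparison like this can be sketched with cross-validated RMSE, as below. The list of candidate models and the synthetic data are assumptions for illustration (the post does not list the other models that were tried), so the ranking printed by this snippet says nothing about the ranking on the Big Mart data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared feature matrix (11 features, as in the post).
X, y = make_regression(n_samples=200, n_features=11, n_informative=4,
                       noise=20.0, random_state=0)

# Hypothetical candidate list; swap in whichever models are being compared.
candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

cv_rmse = {}
for name, model in candidates.items():
    # scikit-learn returns negated RMSE so that larger is better; flip the sign back.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()

for name, rmse in sorted(cv_rmse.items(), key=lambda kv: kv[1]):
    print(f"{name}: CV RMSE = {rmse:.1f}")
```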


Best Features:
1. From the hypothesis testing post, we found that Item_MRP has a correlation of about 0.57 with the response variable, i.e. a moderately strong positive linear relationship: items with a higher Item_MRP tend to have higher outlet sales.
2. Among all Outlet_Types, 1 and 3 have the highest outlet sales. 
3. Outlet_Establishment_Year_1, i.e. "Medium number of years" has the lowest outlet sales compared to others.
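
For reference, a correlation coefficient and the change in sales per unit of Item_MRP (the regression slope) are two different quantities, linked by slope = r * std(y) / std(x). The sketch below computes both on synthetic data; the numbers used to generate the data are arbitrary assumptions and are not taken from the Big Mart dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for Item_MRP and Item_Outlet_Sales (arbitrary values).
mrp = rng.uniform(30, 270, 500)
sales = 15 * mrp + rng.normal(0, 1500, 500)

# Pearson correlation: strength of the linear relationship, always in [-1, 1].
r = np.corrcoef(mrp, sales)[0, 1]

# Least-squares line: the slope is the expected change in sales per unit of MRP.
slope, intercept = np.polyfit(mrp, sales, 1)

# The two are related through the spread of each variable.
slope_from_r = r * sales.std() / mrp.std()
```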

Feature Selection:

L2-based feature selection kept only 4 features and gave a better model than the other feature selection techniques.
Since the original number of features was 11, the complexity of the model was greatly reduced. Still, the baseline model performed better on the test set by a small margin.
As we saw earlier, GradientBoosting with L2-based feature selection has CV and RMSE scores nearly identical to those of the baseline model. So, again, the learning curves show no evidence of "high bias" or "high variance".
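
One way to do L2-based feature selection in scikit-learn is to wrap an L2-penalized linear model (Ridge) in SelectFromModel and keep the features whose absolute coefficients exceed a threshold. The sketch below shows that approach on synthetic data; the choice of Ridge, the alpha value and the "mean" threshold are assumptions, since the post does not state which estimator or settings were used.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 11 features of which 4 carry signal.
X, y = make_regression(n_samples=300, n_features=11, n_informative=4,
                       noise=10.0, random_state=0)

# Scale first so that coefficient sizes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

# Keep features whose |coefficient| is at least the mean |coefficient|.
selector = SelectFromModel(Ridge(alpha=1.0), threshold="mean").fit(X_scaled, y)
X_selected = selector.transform(X_scaled)
kept = selector.get_support(indices=True)  # column indices of the kept features
```

The reduced matrix X_selected can then be passed to GradientBoostingRegressor in place of the full feature matrix.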

Best Features:
Again, Item_MRP has the best feature importance score, followed by the Outlet_Types.
It is very interesting to see that Outlet_Sales can be predicted using only Item_MRP and the Outlet_Types.
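
Feature importance scores of this kind can be read from the fitted model's feature_importances_ attribute. The sketch below uses synthetic data in which the target depends on two of three columns by construction; the column names are borrowed from the dataset, but the values and the relationship are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 400
# Made-up feature values; Item_Weight is deliberately unrelated to the target.
X = pd.DataFrame({
    "Item_MRP": rng.uniform(30, 270, n),
    "Outlet_Type": rng.integers(0, 4, n),
    "Item_Weight": rng.uniform(5, 20, n),
})
y = 10 * X["Item_MRP"] + 500 * X["Outlet_Type"] + rng.normal(0, 100, n)

model = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```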


Model Tuning:

In this case, I will tune 2 models:
Model 1: Gradient Boosting Model -- with all features
Model 2: Gradient Boosting Model -- with L2 based feature selection
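
As a rough idea of what tuning a Gradient Boosting model can look like, the sketch below runs a small grid search with cross-validated RMSE. The search method, the parameter grid and the synthetic data are all assumptions for illustration; the post does not state which hyperparameters or values were searched.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data.
X, y = make_regression(n_samples=200, n_features=11, n_informative=4,
                       noise=20.0, random_state=0)

# Hypothetical grid, kept small so the example runs quickly.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",  # negated RMSE: larger is better
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_rmse = -search.best_score_
print(best_params, round(best_cv_rmse, 1))
```

For Model 2, the same search would be run on the reduced feature matrix instead of the full one.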


Model 1:

More importance is given to the "Item_MRP" feature: its importance score increased from ~58% (baseline model) to around 71%. Consequently, the importance of the other features decreased.
Also, comparing its CV and RMSE scores with those of the baseline model, the tuned model is slightly overfitting.


Model 2:


We get a much better distribution of feature importances in this tuned model, as all the predictors contribute substantially.
Also, the learning curves show that the model generalizes very well to new data.
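
Learning curves of the kind referred to here compare training and validation error as the training set grows: a large, persistent gap suggests high variance, while two high, converged errors suggest high bias. A minimal sketch on synthetic data follows; the model settings and data are assumptions, not the tuned model from this post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import learning_curve

# Synthetic stand-in with 4 features, mirroring the size of the reduced feature set.
X, y = make_regression(n_samples=300, n_features=4, noise=20.0, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 4),
    cv=3,
    scoring="neg_root_mean_squared_error",
)

# Average over the CV folds and flip the sign to get RMSE per training size.
train_rmse = -train_scores.mean(axis=1)
val_rmse = -val_scores.mean(axis=1)
gap = val_rmse - train_rmse  # a gap that stays large hints at overfitting

for n_train, tr, va in zip(sizes, train_rmse, val_rmse):
    print(f"n={n_train}: train RMSE={tr:.1f}, validation RMSE={va:.1f}")
```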


Final Model:

"Tuned GradientBoostingRegressor - with L2-based Feature selection"

Reasons:

  1. Fewer features, less complexity.
  2. No sign of overfitting.
  3. The model generalizes very well to new data.
  4. Although the Public LB score is higher than that of the baseline model, please note that this score is based on only 25% of the predicted test data. On the full test set, the final model is expected to perform about as well as, or even better than, the baseline model.
