Cluster Analysis - Project 1



Problem Statement:
In this project, we concentrate on clustering customers from the YPedia homepage search data, uncover insights and patterns in customer behavior, and answer a few business-related questions.
Section 1 : Handling Missing Data
Section 2 : Exploratory Data Analysis and Feature Engineering
Section 3 : 3.1 Find out Under-performing and Out-performing Marketing Channels.
            3.2 Perform A/B Test on the Out-performing Channels.
Section 4 : K-means Clustering:
            4.1 What is the optimal number of Clusters for this data?
            4.2 Descriptive Analysis of all the Clusters in terms of booking rate.
            4.3 What are the important features that best describe 95% of the variance for each cluster?
Section 5 : What leads to a higher chance of booking for individuals in each Cluster?


About the Dataset:
The YPedia homepage search data consists of 100,000 observations and 25 features. Features that were not used in this project are not listed below.

The feature set for this project is as follows:
Features                  : Description
Userid                    : user id of each customer
Date_time                 : date and time of the search
Srch_ci                   : check-in date searched
Srch_co                   : check-out date searched
Channel                   : marketing channel used [total 11 channels available]
Srch_adults_cnt           : number of adults
Srch_children_cnt         : number of children
Srch_room_cnt             : number of rooms
Orig_destination_distance : distance from home to destination
Is_mobile                 : searched from mobile phone? [No: 0, Yes: 1]
Is_package                : any package used? [No: 0, Yes: 1]
Is_booking                : was booking done successfully? [No: 0, Yes: 1]

SECTION 1:
Handling missing data:
A few features have missing data:
Features                  : Number of missing values
Orig_destination_distance : 36085
Srch_ci                   : 122
Srch_co                   : 122


Orig_destination_distance: A large share of the values is missing for this feature (~36%). With a mean of 1960.662 and a median of 1131.835, the distribution is positively skewed.
The missing values are imputed with the mean, which also closes the gap between the mean and the median.
After imputation, the mean and the median are the same, i.e. 1960.662.

Srch_ci and Srch_co: These two features represent dates, so we simply remove the rows where they are missing.

NOTE: After handling the missing data, our dataset contains 99,878 rows and 12 features.
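
The two steps above can be sketched with pandas as follows. This is only an illustration on a made-up five-row frame: the lowercase column names and all values are assumptions, not the project's actual code or data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the search data (values invented for illustration).
df = pd.DataFrame({
    "orig_destination_distance": [100.0, np.nan, 300.0, np.nan, 2600.0],
    "srch_ci": ["2014-08-01", "2014-08-05", None, "2014-09-10", "2014-09-12"],
    "srch_co": ["2014-08-03", "2014-08-06", None, "2014-09-15", "2014-09-13"],
})

# Count the missing values per feature.
missing_counts = df.isna().sum()

# Impute the missing distances with the mean of the observed distances.
mean_distance = df["orig_destination_distance"].mean()
df["orig_destination_distance"] = df["orig_destination_distance"].fillna(mean_distance)

# Drop the rows whose check-in or check-out date is missing.
df = df.dropna(subset=["srch_ci", "srch_co"]).reset_index(drop=True)

print(missing_counts)
print(df)
```

Because a large block of rows ends up sharing the mean value, the median of the imputed column is pulled towards the mean, which is the effect described above.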

SECTION 2:
Exploratory Data Analysis:
1. Categorical features:

Is_booking: only 8% of all searches were converted into bookings.
Is_mobile: only 13.3% of the searches were made from a mobile phone.
Is_package: a package was used in only 24.8% of the searches.

Channel: marketing channel 9 was used most often (55.4% of the searches), followed by channels 0 and 1 with only 12.4% and 10.2% respectively.
Srch_adults_cnt: most searches were for no more than 2 adults, with 65.5% for 2 adults and 21.5% for 1 adult, followed by 3 adults at only about 5.4%.

Srch_children_cnt: no children = 78.7%, 1 child = 11.2%. This is consistent with the adults count: most searches likely come from young couples with 0 or 1 children.
Srch_room_cnt: 91.6% of the searches were for just 1 room, followed by 2 rooms at just 6.6%.

2. Numerical features:
Apart from Orig_destination_distance, two more features were created.
Duration : duration of stay [from the features “srch_ci” and “srch_co”]
Days_in_advance : number of days between the search date and the searched check-in date [from the features “srch_ci” and “Date_time”]
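
As a rough sketch, the two engineered features can be derived with pandas date arithmetic. The lowercase column names, the sample dates, and the choice to ignore the time of day of the search are assumptions made for this example.

```python
import pandas as pd

# Two made-up searches (dates invented for illustration).
df = pd.DataFrame({
    "date_time": ["2014-07-20 10:15:00", "2014-08-01 23:05:00"],
    "srch_ci": ["2014-08-01", "2014-08-04"],
    "srch_co": ["2014-08-03", "2014-08-05"],
})
for col in ["date_time", "srch_ci", "srch_co"]:
    df[col] = pd.to_datetime(df[col])

# Duration: whole days between check-in and check-out.
df["duration"] = (df["srch_co"] - df["srch_ci"]).dt.days

# Days_in_advance: whole days between the search date and the check-in date.
# normalize() drops the time of day so only calendar dates are compared.
df["days_in_advance"] = (df["srch_ci"] - df["date_time"].dt.normalize()).dt.days

print(df[["duration", "days_in_advance"]])
```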

Feature Transformations:   The numerical features have a lot of extreme/outlier values, as can be seen from the boxplots. Since these outliers appear only at the upper end, we cap the values that exceed the 95th percentile: values between the 0th and 95th percentile are kept as they are, and any larger value is replaced by the value at the 95th percentile.
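
The capping step described above can be sketched with NumPy. The helper name cap_at_percentile and the sample values are hypothetical and only serve the illustration.

```python
import numpy as np

def cap_at_percentile(values, upper=95):
    """Replace every value above the given percentile with that percentile."""
    values = np.asarray(values, dtype=float)
    cap = np.percentile(values, upper)
    return np.minimum(values, cap), cap

# Made-up feature with values 1..100; the largest ones play the outliers.
durations = np.arange(1, 101)
capped, cap = cap_at_percentile(durations)

print("cap value:", cap)
print("max before:", durations.max(), "max after:", capped.max())
```

Values below the cap are left untouched, so only the upper tail of the distribution changes.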

SECTION 3:
3.1  “Find out Under-performing and Out-performing Marketing Channels.”
Description:
Sub_average  : booking rate of that particular channel.
Rest_average : booking rate of all other channels combined.
Ttest        : two-sample t-test value
Prob         : probability obtained from the cumulative distribution function
Significant  : If the probability > 0.9, we can conclude that this particular channel outperforms the other channels combined, i.e. we are more than 90% confident.
               If the probability < 0.1, we can conclude that this channel underperforms the other channels combined, i.e. again we are more than 90% confident.
               If 0.1 <= probability <= 0.9, we cannot conclude anything statistically.
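
A minimal sketch of this decision rule, using only the Python standard library. Two things here are assumptions rather than the project's method: the t-value is computed with unequal variances (Welch's form), and the probability comes from the standard normal CDF, which is a large-sample approximation of the t distribution. The function name and the toy data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def channel_vs_rest(bookings, channels, channel):
    """Compare one channel's booking rate with all other channels combined."""
    sub = [b for b, c in zip(bookings, channels) if c == channel]
    rest = [b for b, c in zip(bookings, channels) if c != channel]
    sub_average, rest_average = mean(sub), mean(rest)
    # Standard error of the difference in means (unequal variances).
    # Assumes both groups contain a mix of bookings and non-bookings.
    std_err = sqrt(variance(sub) / len(sub) + variance(rest) / len(rest))
    ttest = (sub_average - rest_average) / std_err
    # Normal approximation to the t distribution for large samples.
    prob = NormalDist().cdf(ttest)
    if prob > 0.9:
        significant = "out-performs"
    elif prob < 0.1:
        significant = "under-performs"
    else:
        significant = "inconclusive"
    return {"sub_average": sub_average, "rest_average": rest_average,
            "ttest": ttest, "prob": prob, "significant": significant}

# Made-up data: 100 searches per channel with 40, 10 and 25 bookings.
bookings = [1] * 40 + [0] * 60 + [1] * 10 + [0] * 90 + [1] * 25 + [0] * 75
channels = ["A"] * 100 + ["B"] * 100 + ["C"] * 100

for name in ["A", "B", "C"]:
    print(name, channel_vs_rest(bookings, channels, name))
```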

        3.2   “Perform A/B Test on the Out-performing Channels”



SECTION 4:
4.1  “What is the optimal number of Clusters for this data?”
The WCSS (Within Cluster Sum of Squares) drops sharply over the first 3 clusters and decreases only slowly as more clusters are added. So for this data, the optimal number of clusters is set to 3.
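
The WCSS comparison can be sketched with scikit-learn's KMeans, whose inertia_ attribute holds the within-cluster sum of squares. The synthetic data below (three well-separated blobs) is an assumption chosen so that the elbow is easy to see; it is not the YPedia data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs of 100 points each (invented for illustration).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(100, 2)) for c in centers])

# Fit K-means for k = 1..7 and record the WCSS of each fit.
wcss = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = model.inertia_

# How much the WCSS improves when moving from k-1 to k clusters.
drops = {k: wcss[k - 1] - wcss[k] for k in range(2, 8)}

for k in range(1, 8):
    print(k, round(wcss[k], 1))
```

The "elbow" is the value of k after which the drops become small compared with the earlier ones.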

4.2  “Descriptive Analysis of all the Clusters.”







4.3  “What are the important features that best describe 95% of the variance in booking for each cluster?”
NOTE: The feature importance scores were computed using a Random Forest. The black vertical line in the plot marks 95% of the variance in scores.
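
One way to read the 95% line is as a cutoff on the cumulative importance scores. The sketch below follows that interpretation, which is an assumption; the feature names and the synthetic data are also made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends on the first two features only.
rng = np.random.default_rng(0)
feature_names = ["duration", "days_in_advance", "is_package", "is_mobile"]
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Sort the features from most to least important and accumulate the scores.
order = np.argsort(forest.feature_importances_)[::-1]
cumulative = np.cumsum(forest.feature_importances_[order])

# Keep the smallest set of features whose cumulative score reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
selected = [feature_names[i] for i in order[:n_keep]]

print(selected)
```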



SECTION 5:
“What leads to a higher chance of booking for individuals in each Cluster?”


From Section 4.2, we saw that Individual customers (i.e. Adults = 1) make up about 30% of the searches, whereas their probability of booking is very low.
Since Cluster 0 mainly consists of families who travel in groups, as can be seen from the number of adults, children and room counts, the tree for this cluster gives us no information about individuals.






























Conclusion:
In Section 1, we handled missing values in our dataset.
In Section 2, we did some Exploratory Data Analysis and Feature Engineering on our Categorical and Numerical features. We saw how each feature is distributed among customers.
In Section 3, we performed a statistical test to find out the Under-performing and Out-performing Marketing Channels in our data.
Later, we did an A/B test with Hierarchical Modelling on our Out-performing Marketing Channels. We sampled each channel from our distribution to find out the probabilities of booking.
In Section 4, we used the Within Cluster Sum of Squares (WCSS) to find the optimal number of clusters for this data and used KMeans clustering from the scikit-learn module to cluster the data.
Then, we described each cluster in terms of booking rate and found out how the features of each cluster affect the booking rate.
Lastly, using RandomForestClassifier() from the scikit-learn module, we calculated the feature importances for each cluster. Then, using LogisticRegression, we calculated the coefficients and odds ratios of each feature.
In Section 5, using DecisionTreeClassifier() from the scikit-learn module, we generated trees for each cluster and examined the factors that lead to a higher booking rate for Individual customers.

References:
1. A/B Test : https://blog.dominodatalab.com/ab-testing-with-hierarchical-models-in-python/


To view the full code for this project, Click Here
