Cluster Analysis - Project 1
In this project, we concentrate on clustering customers from the YPedia homepage search data, uncover insights and patterns in customer behavior, and answer a few business-related questions.
Section 1 : Handling Missing Data
Section 2 : Exploratory Data Analysis and Feature Engineering
Section 3 :
3.1 Find out Under-performing and Out-performing Marketing Channels.
3.2 Perform A/B Test on the Out-performing Channels.
Section 4 : K-means Clustering:
4.1 What is the optimal number of Clusters for this data?
4.2 Descriptive Analysis of all the Clusters in terms of booking rate.
4.3 What are the important features that best describe 95% of the variance for each cluster?
Section 5 : What leads to a higher chance of booking for individuals in each Cluster?
About the Dataset:
The YPedia homepage search data consists of 100,000 observations and 25 features. Only the features used in this project are listed below.
The feature set for this project is as follows:
Feature | Description
Userid | user id of each customer
Date_time | date and time of the search
Srch_ci | check-in date searched
Srch_co | check-out date searched
Channel | marketing channel used [11 channels in total]
Srch_adults_cnt | number of adults
Srch_children_cnt | number of children
Srch_room_cnt | number of rooms
Orig_destination_distance | distance from home to destination
Is_mobile | searched from mobile phone? [No: 0, Yes: 1]
Is_package | any package used? [No: 0, Yes: 1]
Is_booking | was booking done successfully? [No: 0, Yes: 1]
SECTION 1:
Handling missing data:
A few features have missing data:
Feature | Number of missing values
Orig_destination_distance | 36085
Srch_ci | 122
Srch_co | 122
Orig_destination_distance: About 36% of the values are missing for this feature. With a mean of 1960.662 and a median of 1131.835, the distribution is positively skewed.
We impute the missing values with the mean of the observed data, which leaves the mean unchanged and pulls the median up towards it, reducing the gap between the two.
After imputation, the mean and the median are the same, i.e. 1960.662.
Srch_ci and Srch_co: These two features represent dates and only 122 values are missing in each, so we simply remove the rows where they are missing.
NOTE: After handling the missing data, our dataset contains 99,878 rows and 12 features.
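The two steps of this section can be sketched with pandas as follows. This is only a minimal illustration on a made-up toy frame; the lowercase column names and the values are assumptions, not the real YPedia data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the search data (values are made up for illustration).
df = pd.DataFrame({
    "orig_destination_distance": [100.0, np.nan, 250.0, 4000.0, np.nan, 900.0],
    "srch_ci": ["2014-08-01", "2014-09-10", None, "2014-12-24", "2015-01-02", "2014-10-05"],
    "srch_co": ["2014-08-03", "2014-09-12", None, "2014-12-27", "2015-01-05", "2014-10-06"],
})

# Fill missing distances with the mean of the observed distances.
mean_distance = df["orig_destination_distance"].mean()
df["orig_destination_distance"] = df["orig_destination_distance"].fillna(mean_distance)

# Drop the rows where either search date is missing.
df = df.dropna(subset=["srch_ci", "srch_co"]).reset_index(drop=True)

print(df.isna().sum())
print(len(df), "rows remain")
```

Note that the mean is computed before any rows are dropped, so the order of the two steps can change the imputed value slightly.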
SECTION 2:
Exploratory Data Analysis:
1. Categorical Values:
Is_booking: only 8% of all searches were converted into a booking.
Is_mobile: only 13.3% of the searches were made from mobile.
Is_package: a package was used in only 24.8% of the searches.
Channel: Marketing channel 9 was used most often, in 55.4% of the searches, followed by channels 0 and 1 with only 12.4% and 10.2% respectively.
Srch_adults_cnt: most searches were for no more than 2 adults, with 65.5% for 2 adults and 21.5% for 1 adult, followed by 3 adults at only around 5.4%.
Srch_children_cnt: no children = 78.7%, 1 child = 11.2%. This is consistent with the adult counts: the searchers are mostly young couples with 0 or 1 child.
Srch_room_cnt: 91.6% of the searches were for just 1 room, followed by 2 rooms at just 6.6%.
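Percentages like the ones above can be obtained with a normalized value count. A small sketch, assuming lowercase column names and using made-up rows rather than the real data:

```python
import pandas as pd

# Ten made-up searches; the real dataset has 99,878 rows.
df = pd.DataFrame({
    "is_booking": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "channel":    [9, 9, 9, 0, 1, 9, 9, 0, 9, 5],
})

# Share of searches per marketing channel, in percent.
channel_share = df["channel"].value_counts(normalize=True).mul(100).round(1)

# For a 0/1 column the mean is the rate, so this is the booking rate in percent.
booking_rate = df["is_booking"].mean() * 100

print(channel_share)
print(f"booking rate: {booking_rate:.1f}%")
```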
2. Numerical features:
Apart from Orig_destination_distance, two more features were created.
Duration : length of stay in days [from the features “srch_ci” and “srch_co”].
Days_in_advance : number of days between the search and the check-in date searched [from the features “srch_ci” and “Date_time”].
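A minimal sketch of how these two features could be derived with pandas. The lowercase column names, the sample rows, and the choice to ignore the time of day of the search are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date_time": ["2014-07-01 10:15:00", "2014-11-20 22:40:00"],
    "srch_ci":   ["2014-08-01", "2014-12-24"],
    "srch_co":   ["2014-08-03", "2014-12-27"],
})

# Parse the text columns into proper datetimes.
for col in ["date_time", "srch_ci", "srch_co"]:
    df[col] = pd.to_datetime(df[col])

# Duration: nights between check-in and check-out.
df["duration"] = (df["srch_co"] - df["srch_ci"]).dt.days

# Days in advance: whole days from the search date to the check-in date.
# normalize() drops the time of day so that only calendar days are counted.
df["days_in_advance"] = (df["srch_ci"] - df["date_time"].dt.normalize()).dt.days

print(df[["duration", "days_in_advance"]])
```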
Feature Transformations: The numerical features have many extreme values/outliers, as can be seen from the boxplots. Since these outliers appear only at the upper end, we cap the values that exceed the 95th percentile, i.e. values between the 0th and 95th percentile are kept as they are and larger values are replaced with the 95th-percentile value.
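The capping step can be sketched as below. The helper name cap_at_percentile is hypothetical, and the series is a toy example rather than one of the real features.

```python
import numpy as np
import pandas as pd

def cap_at_percentile(series, q=0.95):
    """Replace values above the q-th quantile with the quantile itself."""
    upper = series.quantile(q)
    return series.clip(upper=upper)

# Toy feature: the numbers 1..100.
values = pd.Series(np.arange(1, 101, dtype=float))
capped = cap_at_percentile(values)

print("95th percentile:", values.quantile(0.95))
print("max before:", values.max(), "max after:", capped.max())
```

Values at or below the threshold are left untouched, so only the upper tail of the distribution changes.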
SECTION 3:
3.1 “Find out Under-performing and Out-performing Marketing Channels.”
Description:
Sub_average : booking rate of that particular channel.
Rest_average : booking rate of all other channels combined.
Ttest : two-sample t-test statistic.
Prob : probability obtained from the Cumulative Distribution Function evaluated at the t-test statistic.
Significant :
If the probability > 0.9, we conclude that this channel outperforms the other channels combined, i.e. we are more than 90% confident.
If the probability < 0.1, we conclude that this channel underperforms the other channels combined, i.e. again we are more than 90% confident.
If 0.1 <= probability <= 0.9, we cannot conclude anything statistically.
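The columns described above can be reproduced roughly as follows. This is a sketch under assumptions: the helper name channel_vs_rest, the use of Welch's unequal-variance t-test, and the synthetic data are ours, and the exact test variant used in the project may differ.

```python
import pandas as pd
from scipy import stats

def channel_vs_rest(df, channel):
    """Compare one channel's booking rate against all other channels combined."""
    sub = df.loc[df["channel"] == channel, "is_booking"]
    rest = df.loc[df["channel"] != channel, "is_booking"]

    # Welch's two-sample t-test (does not assume equal variances).
    t_stat, _ = stats.ttest_ind(sub, rest, equal_var=False)

    # Welch-Satterthwaite degrees of freedom for the t distribution.
    v1 = sub.var(ddof=1) / len(sub)
    v2 = rest.var(ddof=1) / len(rest)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (len(sub) - 1) + v2 ** 2 / (len(rest) - 1))

    # Probability from the CDF of the t distribution at the observed statistic.
    prob = stats.t.cdf(t_stat, dof)

    if prob > 0.9:
        verdict = "outperforms"
    elif prob < 0.1:
        verdict = "underperforms"
    else:
        verdict = "inconclusive"
    return {"channel": channel, "sub_average": sub.mean(),
            "rest_average": rest.mean(), "ttest": t_stat,
            "prob": prob, "significant": verdict}

# Synthetic data: three channels with 200 searches each and 40, 10 and 25 bookings.
rows = []
for channel, n_booked in {0: 40, 1: 10, 2: 25}.items():
    rows += [{"channel": channel, "is_booking": int(i < n_booked)} for i in range(200)]
df = pd.DataFrame(rows)

results = pd.DataFrame([channel_vs_rest(df, c) for c in sorted(df["channel"].unique())])
print(results)
```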
3.2 “Perform
A/B Test on the Out-performing Channels”
SECTION 4:
4.1 “What is the optimal number of Clusters for this data?”
From the WCSS (Within Cluster Sum of Squares) plot, the WCSS drops sharply over the first 3 clusters and decreases only gradually as more clusters are added. So for this data, the optimal number of clusters is set to 3.
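A short sketch of the elbow check with scikit-learn, where the inertia_ attribute of a fitted KMeans model is the WCSS. The data here is synthetic (three well-separated blobs at centers we chose), so it only illustrates the shape of the curve, not the project's actual result.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three obvious groups.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.6, random_state=0)

# Fit K-means for k = 1..8 and record the within-cluster sum of squares.
wcss = []
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)

for k, value in enumerate(wcss, start=1):
    print(k, round(value, 1))
```

Plotting wcss against k shows the "elbow": the point after which adding clusters barely reduces the WCSS. On real data, the features are usually standardized first so that no single feature dominates the distances.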
4.2 “Descriptive Analysis of all the Clusters.”
4.3 “What are the important features that best describe 95% variance in booking for each cluster?”
NOTE: The feature importance scores were computed using Random Forest. The black vertical line represents 95% variance in scores.
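One way to read the 95% line is as a cumulative cut-off on the sorted importance scores. The sketch below follows that reading, which is our assumption, and uses synthetic data with placeholder feature names.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the rows of one cluster.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances sum to 1; sort them and accumulate from the most important down.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
cumulative = importances.cumsum()

# Smallest set of top features whose importances add up to at least 95%.
n_keep = int(np.searchsorted(cumulative.values, 0.95) + 1)
top_features = importances.index[:n_keep].tolist()

print(importances.round(3))
print("features covering 95% of the importance:", top_features)
```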
SECTION 5:
“What leads to a higher chance of booking for individuals in each Cluster?”
Since Cluster 0 mainly consists of families who travel in groups, as can be seen from the number of adults, children and room counts, the tree gives us no information about individual travellers for this cluster.
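A per-cluster tree of this kind can be sketched with DecisionTreeClassifier and printed as text rules. The frame below is a tiny made-up cluster with only two features, and the depth limit is an arbitrary choice for readability.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows for a single cluster: bookings happen mostly without a package.
cluster_df = pd.DataFrame({
    "is_package": [0, 0, 0, 0, 1, 1, 1, 1] * 10,
    "is_mobile":  [0, 1, 0, 1, 0, 1, 0, 1] * 10,
    "is_booking": [1, 1, 1, 0, 0, 0, 0, 0] * 10,
})

features = ["is_package", "is_mobile"]
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(cluster_df[features], cluster_df["is_booking"])

# Text view of the splits; each leaf shows the predicted class.
rules = export_text(tree, feature_names=features)
print(rules)
```

Reading the branches from the root down to a leaf that predicts a booking gives the combination of conditions associated with a higher booking rate in that cluster.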
Conclusion:
In Section 1, we handled missing values in our dataset.
In Section 2, we did some Exploratory Data Analysis and Feature Engineering on our categorical and numerical features. We saw how each feature is distributed among customers.
In Section 3, we performed a statistical test to find the under-performing and out-performing marketing channels in our data. Later, we ran an A/B test with hierarchical modelling on the out-performing marketing channels. We sampled each channel from our distribution to find the probabilities of booking.
In Section 4, we used the Within Cluster Sum of Squares (WCSS) to find the optimal number of clusters for this data and used KMeans clustering from the scikit-learn module to cluster the data. Then, we described each cluster in terms of booking rate and found out how the features of each cluster affect the booking rate. Lastly, using RandomForestClassifier() from scikit-learn, we calculated the feature importances for each cluster, and then, using LogisticRegression, we calculated the coefficients and odds ratios of each feature.
In Section 5, using DecisionTreeClassifier() from scikit-learn, we generated trees for each cluster and examined the factors that lead to a higher booking rate for individual customers.
References:
1. A/B Test: https://blog.dominodatalab.com/ab-testing-with-hierarchical-models-in-python/
To view the full code for this project, Click Here