Data Analysis - Adult Income Dataset



This project is divided into 2 parts:
1. Data Analysis:
In this blog post I will show you how to analyze the Adult Income dataset, which contains income data for about 32,500 people, from a statistical point of view.

2. Model building:
In this part I will build two machine learning models, a linear model (LogisticRegression) and a nonlinear model (RandomForest), which try to predict whether a person makes more or less than $50K a year.
Follow the link to Part 2: Click Here


Data Analysis:
Response Variable:
"Target": whether the person earns less or more than $50K a year.

Predictor Variables:
1. "WorkClass" : 
Insights:
1. The incorporated self-employed group has a higher chance of earning >50K --> ~56%
2. The never-worked and without-pay groups have no chance of earning >50K --> 0%
3. The private sector employs the largest share of people, ~70%, but the probability of earning >50K there --> ~22%
4. If working in federal-gov, prob of earning >50k --> ~39%, whereas if in state or local-gov -->
Feature Transformation: 'without-pay', 'never-worked' and '?' are combined into one group because they have little or no chance of earning >50K.
Since the p-value < 0.05, the difference in group means is statistically significant.
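The percentages and the grouping above can be reproduced with a few lines of pandas. This is only a minimal sketch on made-up toy rows: the column names "WorkClass" and "Target" follow this post, while the label spellings and the merged level name "Other" are assumptions for illustration.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "WorkClass": ["Private", "Private", "Private", "Self-emp-inc", "Self-emp-inc",
                  "Without-pay", "Never-worked", "?"],
    "Target": [">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K"],
})

# Binary indicator: 1 if the person earns more than 50K, else 0.
df["high_income"] = (df["Target"] == ">50K").astype(int)

# Share of high earners within each work class.
rate_by_group = df.groupby("WorkClass")["high_income"].mean()

# Merge the levels with little or no chance of >50K into a single level.
# "Other" is a hypothetical name for the merged level.
low_groups = ["Without-pay", "Never-worked", "?"]
df["WorkClass_grouped"] = df["WorkClass"].replace({g: "Other" for g in low_groups})
rate_by_grouped = df.groupby("WorkClass_grouped")["high_income"].mean()

print(rate_by_group)
print(rate_by_grouped)
```

The same pattern applies to the other categorical predictors: compute the share of >50K per level, then merge the levels whose share is close to zero.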


2. "Education" :
Insights:
1. Doctorate, Prof-school, Masters and Bachelors have higher probabilities of earning >50K, i.e. 74%, 73%, 56% and 41% respectively.
2. For Assoc-acdm & Assoc-voc, the chances are about 25%.
3. The other groups have very low chances.
Feature Transformation: groups that have little or no chance of earning >50K are combined together.
Since the p-value < 0.05, the difference in group means is statistically significant.
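This post does not name the test used for the categorical predictors. One common choice for a categorical predictor against a binary target is the chi-square test of independence, sketched below with scipy. The counts are made up for illustration and the choice of test is my assumption, not necessarily what produced the p-values reported here.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: 40 people with a Bachelors degree, 60 HS graduates (made-up counts).
df = pd.DataFrame({
    "Education": ["Bachelors"] * 40 + ["HS-grad"] * 60,
    "Target": [">50K"] * 25 + ["<=50K"] * 15 + [">50K"] * 10 + ["<=50K"] * 50,
})

# Contingency table: rows are education levels, columns are income classes.
table = pd.crosstab(df["Education"], df["Target"])

# Null hypothesis: education level and income class are independent.
chi2, p_value, dof, expected = chi2_contingency(table)

# Reject the null hypothesis at the 5% level if p_value < 0.05.
significant = p_value < 0.05
print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p-value={p_value:.2g}, significant={significant}")
```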


3. "Marital Status" :
Insights:
1. Married people (where the couple lives together) have a higher probability of earning >50K.
Feature Transformation: groups that have little or no chance of earning >50K are combined together.
Since the p-value < 0.05, the difference in group means is statistically significant.


4. "Occupation" :
Insights:
1. Exec-managerial & Prof-specialty have better chances of earning >50K, i.e. 48% and 45%.
2. Protective-serv & Tech-support --> 33% and 30%.
3. Sales, Craft-repair and Transport --> 26%, 22% and 20% respectively.
4. The other groups have very low chances.
Feature Transformation: groups that have little or no chance of earning >50K are combined together.
Since the p-value < 0.05, the difference in group means is statistically significant.


5. "Relationship" :
Insights:
1. Married people have better chances of earning >50K.
Feature Transformation: groups that have little or no chance of earning >50K are combined together.
Since the p-value < 0.05, the difference in group means is statistically significant.


6. "Race" :
Insights:
1. For Asian-Pac-Islander and White, the chance of earning >50K is about 26%.
2. The other groups --> low.
Feature Transformation: groups that have little or no chance of earning >50K are combined together.
Since the p-value < 0.05, the difference in group means is statistically significant.


7. "Sex" :
Insights:
1. Males have a 30% chance of earning >50K, whereas for females it is just 11%.
This points to a gender gap in income within this dataset.
Feature Transformation: groups that have little or no chance of earning >50K are combined together.
Since the p-value < 0.05, the difference in group means is statistically significant.


8. "Country" :
Insights:
1. For people from the US, the chance of earning >50K is 25%; for the other countries it is 20%.
Feature Transformation: groups that have little or no chance of earning >50K are combined together.
Since the p-value < 0.05, the difference in group means is statistically significant.


9. "Age" :
Mean = 39, Standard Deviation = 13.6 
The distribution of the Age column is right-skewed, i.e., the number of people decreases as age increases.

From the plot, people who earn more than 50K are, on average, older than people who earn less than 50K.
We take this as our hypothesis.
The t-test gives a p-value < 0.05, so the difference in mean age between the two groups is statistically significant, which supports the hypothesis.
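A two-sample t-test like the one described for Age can be sketched with scipy as follows. The ages are synthetic draws with made-up means and spreads, and using Welch's variant (equal_var=False) is my own choice, since the post does not say which variant was used.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic ages for the two income groups (parameters are made up).
rng = np.random.default_rng(0)
age_high = rng.normal(loc=44, scale=10, size=500)   # stands in for the >50K group
age_low = rng.normal(loc=37, scale=14, size=1500)   # stands in for the <=50K group

# Null hypothesis: both groups have the same mean age.
# equal_var=False gives Welch's t-test, which does not assume equal variances.
t_stat, p_value = ttest_ind(age_high, age_low, equal_var=False)

print(f"mean age >50K:  {age_high.mean():.1f}")
print(f"mean age <=50K: {age_low.mean():.1f}")
print(f"t={t_stat:.2f}, p-value={p_value:.2g}")
```

A positive t statistic together with p-value < 0.05 means the first group is older on average and that the difference is unlikely to be due to chance alone.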





10. "Capital Gain" :
Mean = 1078, Standard Deviation = 7385
Most people have no capital gains at all, while a few people have very high capital gains.
There are no outliers visible from the boxplot.
There is a huge difference in capital gains between people who earn >50K and people who earn <=50K.
Since the p-value < 0.05, the difference in group means is statistically significant.






11. "Capital Loss" :

Mean = 87, Standard Deviation = 403
Most people have no capital losses at all, while a few people have very high capital losses.
There are no outliers visible from the boxplot.


Again, since the p-value < 0.05, the difference in group means is statistically significant.





12. "Hours per week" :
Mean = 40.5, Standard Deviation = 12.3
On average, people work about 40 hours a week.
A few people work very few hours and some work a lot of hours, which produces some extreme outliers.




Again, since the p-value < 0.05, the difference in group means is statistically significant.




Outliers Detection:

1. "Age" :
As we saw earlier, the histogram showed that the data is right-skewed and contains a few outliers.


After applying a log transformation, we were able to reduce both the skewness and the outliers.
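The effect of the log transformation on skewness can be checked numerically. The sketch below uses a synthetic right-skewed age column, so its distribution parameters are assumptions and not the real data, and compares the sample skewness before and after the transformation.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed ages between 17 and 90 (parameters are made up).
rng = np.random.default_rng(0)
age = np.minimum(17 + rng.gamma(shape=2.0, scale=11.0, size=5000), 90)

# The natural log compresses large values more than small ones,
# which pulls in the long right tail.
log_age = np.log(age)

skew_before = skew(age)
skew_after = skew(log_age)
print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```

A skewness closer to 0 after the transformation indicates a more symmetric distribution.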

2. "Hours per week" :
There were some extreme outliers in the original data.
So we take two standard deviations from the mean as the threshold: any value beyond the threshold is capped to the threshold value.
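Capping at a standard-deviation threshold can be sketched with NumPy as below. I am reading the threshold as the mean plus or minus two standard deviations, which is my interpretation of the description above, and the hours are made-up values.

```python
import numpy as np

# Toy hours-per-week values with one very low and one very high entry (made up).
hours = np.array([40, 40, 38, 45, 40, 35, 50, 40, 40, 42, 1, 99], dtype=float)

# Thresholds: mean +/- 2 standard deviations (an assumed reading of the rule).
mean, std = hours.mean(), hours.std()
lower, upper = mean - 2 * std, mean + 2 * std

# Values outside [lower, upper] are replaced by the nearest threshold;
# values inside the interval are left unchanged.
capped = np.clip(hours, lower, upper)

print(f"thresholds: [{lower:.2f}, {upper:.2f}]")
print(capped)
```

Unlike dropping rows, capping keeps every observation and only limits how far the extreme values can pull on the model.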

Conclusion:
Here, we analyzed the Adult Income dataset and saw how the features are distributed.
Then, from a statistical point of view, we ran hypothesis tests against the response variable, and lastly we took care of the outliers in the numerical features.


To view the full code for Part 1 on GitHub, Click Here
