Machine Learning Project 2 - Adult Income Data



This blog post covers the 2nd part of this project, i.e. Model Building.
Two models are built here: a linear model (Logistic Regression) and a non-linear model (Random Forest).
For the 1st part, i.e. Data Analysis --> Click Here

Model Building:
1. Linear Model

Logistic Regression:
Looking at the overall F1-score, this linear model achieves 0.85, which is good.

However, the gap between the per-class recall values is large. LogisticRegression does far better at predicting class 0 (i.e. <=50K), with a recall of 0.93, than class 1 (i.e. >50K), where recall is only 0.59.
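
To show how a decent overall score can sit alongside weak recall on one class, here is a minimal sketch of computing per-class recall for a logistic regression. It uses synthetic, imbalanced data from `make_classification` as a stand-in for the Adult data, so the numbers it prints will not match the ones quoted above. The split sizes and parameters are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the Adult data (roughly 75% class 0).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# average=None returns one recall value per class instead of a single average.
recall_per_class = recall_score(y_test, y_pred, average=None)

print(classification_report(y_test, y_pred, target_names=["<=50K", ">50K"]))
print("Recall per class:", recall_per_class)
```

When classes are imbalanced, an averaged score weighted by class size is dominated by the majority class, so it is worth reading the per-class rows of the report as well.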

Let's look at the coefficients:
The top 3 features that contribute positively are:
1. Capital Gain
2. Education
3. Age

The top 3 features that contribute negatively are:
1. Marital Status - Others
2. Occupation - Others
3. Sex - Female
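
A ranking like the one above can be read off a fitted model's `coef_` attribute. The sketch below is only an illustration on synthetic data: the feature names are hypothetical placeholders, and the standardization step is an assumption (coefficients are only comparable in size when the features share a scale), not something confirmed by the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data; the feature names below are hypothetical placeholders.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Assumption: scale features so coefficient magnitudes are comparable.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

coefs = model.coef_[0]      # one coefficient per feature (binary problem)
order = np.argsort(coefs)   # indices sorted from most negative to most positive

top_negative = [(feature_names[i], coefs[i]) for i in order[:3]]
top_positive = [(feature_names[i], coefs[i]) for i in order[::-1][:3]]

print("Top positive:", top_positive)
print("Top negative:", top_negative)
```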

Quick recap from the Data Analysis post:
Positive contributions:
1. Capital Gain - If capital gains are present, the chances of earning >50K are higher.
2. Education - The higher the education, the higher the chances of earning >50K.
3. Age - The higher the age, the higher the chances of earning >50K.

Negative contributions:
1. Marital Status - Others: This group has little to no chance of earning >50K.
2. Occupation - Others: This group has little to no chance of earning >50K.
3. Sex - Female: Lower chance of earning >50K compared to males. Hence Sex - Male has a positive coefficient.

2. Non-Linear Model

Random Forest:

This non-linear model achieves an F1-score of 0.96, which is excellent.
The recall values show that it not only predicts class 0 well (0.98), but also predicts class 1 (0.89) far better than the linear model.
Let's look at the feature importances:

Features with the highest importance scores:
1. Age
2. Education
3. Capital Gain
4. Hours per week

NOTE: As we saw in the data analysis part, the higher the values of these features, the higher the chances of earning >50K.
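
For reference, a ranking like this can be obtained from the `feature_importances_` attribute of a fitted `RandomForestClassifier`. The following is a minimal sketch on synthetic data with hypothetical feature names, so the ordering it prints has no connection to the Adult results. Note that these scores are non-negative and sum to 1: they say how much a feature is used, not in which direction it pushes the prediction, which is why the direction has to come from the data analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the feature names below are hypothetical placeholders.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances: one non-negative score per feature, summing to 1.
importances = forest.feature_importances_

# Sort features from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```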

Comparing Classifiers:

Conclusion:
The non-linear model (i.e. RandomForest) performs far better than the linear model (i.e. LogisticRegression) at predicting both classes of the target variable, as the recall values show.
The ROC curve for RandomForest is also nearly perfect.
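
A comparison of this kind could be sketched as follows: fit both classifiers on the same training split, then compute ROC points and the area under the curve on a held-out split. The data here is synthetic and the hyperparameters are assumptions, so the AUC values it prints are illustrative only and say nothing about the Adult results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the Adult data.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class, scored on held-out data only.
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    results[name] = {"fpr": fpr, "tpr": tpr, "auc": roc_auc_score(y_test, scores)}
    print(f"{name}: ROC AUC = {results[name]['auc']:.3f}")
```

The `fpr` and `tpr` arrays stored in `results` can be passed to any plotting library to draw the two curves on one chart.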


To view the full code on Github --> Click Here
