Posts

Showing posts from November, 2017

Machine Learning Project 3 - Credit Card Fraud Detection

Aim: The goal of this project is to automatically identify fraudulent credit card transactions using machine learning. This is a binary classification problem. My approach is explained below.

Workflow:
1. Check the distribution of the classes in the response variable to see whether the dataset is balanced or imbalanced.
2. Create a baseline model (LogisticRegression) and check the recall value for the fraudulent transaction class:
    a. If the recall is low, address the class imbalance:
        i. Under-sampling
        ii. Over-sampling methods
    b. If the recall is high, move forward.
3. Model Selection: train other models and select the best one.
4. Feature Selection: apply feature selection techniques to select the best features.
5. Final model

Model Evaluation methods:
1. Recall values
2. Precision values
3. Area under curve

1. Class Distribution:
This is an imbalanced dataset. There are total 2,84,807 number of transaction
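As a rough illustration of workflow steps 1 and 2, here is a minimal sketch in plain Python. It is not the code used in the project: the labels are made-up toy data, and `recall_precision` and `undersample` are hypothetical helper names. It shows why recall on the fraud class is the number to watch on imbalanced data, and what random under-sampling does to the class counts.

```python
import random
from collections import Counter


def recall_precision(y_true, y_pred, positive=1):
    """Return (recall, precision) for the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


def undersample(rows, labels, positive=1, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == positive]
    neg = [i for i, y in enumerate(labels) if y != positive]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy, made-up labels (assumption): 5 fraud cases among 1,000 transactions.
labels = [1] * 5 + [0] * 995
rows = list(range(len(labels)))  # stand-in for real feature rows

# Step 1: class distribution.
print(Counter(labels))

# Step 2: a "model" that always predicts non-fraud is right 99.5% of the
# time here, yet its recall on the fraud class is zero.
always_legit = [0] * len(labels)
print(recall_precision(labels, always_legit))

# Step 2a(i): random under-sampling balances the two classes.
balanced_rows, balanced_labels = undersample(rows, labels)
print(Counter(balanced_labels))
```

Under-sampling discards most of the majority class, so in practice it is usually applied to the training split only and the model is still evaluated on data with the original class distribution.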

Machine Learning Project 2 - Adult Income Data

This blog post contains the 2nd part of this project, i.e. Model Building. Here 2 models will be built: a linear model (Logistic Regression) and a non-linear model (Random Forest). For the 1st part, i.e. Data Analysis --> Click Here

Model Building:
1. Linear Model
Logistic Regression: Looking at the overall F1-score, this linear model achieves 0.85, which is good. However, there is a large gap between the recall values: LogisticRegression predicts class 0 (i.e. <=50K) far better, with a recall of 0.93, than class 1 (i.e. >50K), where the recall is poor (only 0.59).

Let's look at the coefficients:
The top 3 features that contribute positively are:
1. Capital Gain
2. Education
3. Age
The top 3 features that contribute negatively are:
1. Marital Status - Others
2. Occupation - Others
3. Sex - Female

Quick recap from the Data Analysis post:
Positive contributions:
1. Capital Gain - If capital gains present, then hig
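To see how a good overall F1-score can coexist with a weak recall on one class, here is a small sketch in plain Python. The confusion-matrix counts are hypothetical, chosen only so that the per-class recalls come out at 0.93 and 0.59; they are not the project's actual results. It also assumes the "overall" F1 is a support-weighted average of the per-class F1-scores, which may differ from how the post computed it. `per_class_report` and `weighted_f1` are illustrative helper names.

```python
def per_class_report(confusion):
    """confusion[t][p] = number of samples with true class t predicted as p."""
    classes = sorted(confusion)
    report = {}
    for c in classes:
        tp = confusion[c].get(c, 0)
        support = sum(confusion[c].values())                      # true members of c
        predicted = sum(confusion[t].get(c, 0) for t in classes)  # predicted as c
        recall = tp / support if support else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": support}
    return report


def weighted_f1(report):
    """Average of per-class F1-scores, weighted by class support."""
    total = sum(r["support"] for r in report.values())
    return sum(r["f1"] * r["support"] for r in report.values()) / total


# Hypothetical counts (assumption): class 0 is <=50K, class 1 is >50K.
confusion = {
    0: {0: 930, 1: 70},   # 1,000 true <=50K samples
    1: {0: 123, 1: 177},  # 300 true >50K samples
}

report = per_class_report(confusion)
for cls, row in report.items():
    print(cls, {k: round(v, 2) for k, v in row.items()})
print("weighted F1:", round(weighted_f1(report), 2))
```

Because the majority class dominates the weighted average, the overall score hides most of the weakness on the minority class, which is why the per-class recall values are worth reporting separately.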

Data Analysis - Adult Income Dataset

This project is divided into 2 parts:
1. Data Analysis: In this blog post I will show how to analyze the Adult Income DataSet, which contains income data for about 32500 people, from a statistical point of view.
2. Model Building: In this part I will build machine learning models, a linear model (LogisticRegression) and a non-linear model (RandomForest), that try to predict whether a person makes more or less than $50K a year. Follow the link to Part 2: Click Here

Data Analysis:
Response Variable: "Target": whether the person earns less or more than $50K a year.
Predictor Variables:
1. "WorkClass":
Insights:
1. The incorporated self-employed group has a higher chance of earning >50K ---> ~56%
2. The never-worked and without-pay groups have no chance of earning >50K ---> 0%
3. The private sector employs the highest percentage of people, i.e. ~70%, but the probability of earning >50K ---> ~22%
4. If working in federal-gov, prob
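Insights like the ones for "WorkClass" come from two numbers per category: the share of people in that category, and the fraction of them earning >50K. A minimal sketch of that calculation in plain Python follows. The rows are made-up toy records, not the Adult Income data, and `high_income_rate` is a hypothetical helper name.

```python
from collections import defaultdict


def high_income_rate(rows, column, target="Target", positive=">50K"):
    """For each category of `column`, return its share of all rows and the
    fraction of its rows whose target equals `positive`."""
    counts = defaultdict(lambda: [0, 0])  # category -> [n_positive, n_total]
    for row in rows:
        entry = counts[row[column]]
        entry[0] += int(row[target] == positive)
        entry[1] += 1
    total = len(rows)
    return {
        category: {"share": n / total, "rate": n_pos / n}
        for category, (n_pos, n) in counts.items()
    }


# Toy, made-up records (assumption); the real dataset has about 32500 rows.
rows = [
    {"WorkClass": "Private", "Target": "<=50K"},
    {"WorkClass": "Private", "Target": "<=50K"},
    {"WorkClass": "Private", "Target": "<=50K"},
    {"WorkClass": "Private", "Target": ">50K"},
    {"WorkClass": "Self-emp-inc", "Target": ">50K"},
    {"WorkClass": "Self-emp-inc", "Target": "<=50K"},
    {"WorkClass": "Never-worked", "Target": "<=50K"},
]

stats = high_income_rate(rows, "WorkClass")
for category, s in stats.items():
    print(f"{category}: share={s['share']:.0%}, P(>50K)={s['rate']:.0%}")
```

The same function can be pointed at any other categorical predictor by changing the `column` argument.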