Fraud Detection in Financial Data
AIM:
The aim of this project is to detect fraudulent transactions in financial data.
About the Dataset:
PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company that provides the mobile financial service, which currently runs in more than 14 countries around the world.
This synthetic dataset is scaled down to 1/4 of the original dataset and is available on Kaggle.
Features:
1. step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps: 744 (described as a 30-day simulation, although 744 hours span 31 days).
2. type - type of transaction: CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER.
3. amount - amount of the transaction in local currency.
4. nameOrig - customer who started the transaction
5. oldbalanceOrg - initial balance before the transaction
6. newbalanceOrig - new balance after the transaction
7. nameDest - customer who is the recipient of the transaction
8. oldbalanceDest - recipient's initial balance before the transaction.
9. newbalanceDest - recipient's new balance after the transaction.
10. isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.
11. isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
Outline:
1. Exploratory Data Analysis
a. type
b. amount
c. Origination and Destination accounts
d. IsFraud
2. Current Business Model
a. Is the current business model reliable?
b. What are the problems of this model?
3. OTHER insights
4. Pre-processing
5. Detecting fraudulent transactions ---> using Machine Learning predictive models
6. Conclusion
SECTION 1: Exploratory Data Analysis
a. type
CASH_OUT and PAYMENT are the two most common types, each accounting for roughly 34-35% of all transactions.
b. amount
The amount distribution has a large standard deviation, even though the mean (179,862) is low.
75% of the transactions have an amount below 208,722, but the maximum amount is 92,445,517, which is very large, as can also be seen in the graph.
There are a lot of extreme outliers present in this data.
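The effect of extreme outliers on summary statistics like these can be illustrated with a small sketch. The amounts below are made-up values, not taken from the dataset, and the 1.5 * IQR fence is a common rule of thumb assumed here for illustration, not something used in this post.

```python
import statistics

# Hypothetical transaction amounts: mostly small, with one extreme outlier.
amounts = [120.0, 450.0, 980.0, 1500.0, 2300.0, 5200.0, 9800.0, 2_000_000.0]

mean = statistics.mean(amounts)
median = statistics.median(amounts)
stdev = statistics.stdev(amounts)

# quantiles(n=4) returns the three quartile cut points (25%, 50%, 75%).
q1, q2, q3 = statistics.quantiles(amounts, n=4)

print(f"mean={mean:.1f} median={median:.1f} stdev={stdev:.1f}")
print(f"75th percentile={q3:.1f} max={max(amounts):.1f}")

# Assumed rule of thumb: values above Q3 + 1.5 * IQR count as outliers.
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = [a for a in amounts if a > upper_fence]
print("outliers:", outliers)
```

A single very large amount pulls the mean and the standard deviation far away from the median, which is the pattern described above.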
d. IsFraud
The data is highly skewed: only about 0.1% of the transactions are fraudulent.
SECTION 2: Current Business Model
a. Is the current business model reliable?
Ok, so there are about 8200 (i.e. 0.12%) fraudulent transactions in the dataset, but the current model has flagged only 16 (i.e. 0.0003%) as fraudulent.
Let's look at the confusion matrix:
As you can see, all the non-fraud transactions are classified correctly by the model, but there is a large number of false negatives, i.e. Type II errors.
Although the current model has an accuracy of 99.9%, it has failed to flag almost all of the fraudulent transactions.
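Why high accuracy can coexist with a failure to catch fraud is easier to see with a small worked example. The sketch below uses made-up labels (not the dataset): `actual` stands in for the isFraud column and `flagged` for isFlaggedFraud.

```python
from collections import Counter

# Hypothetical labels: 1 = fraud, 0 = not fraud.
actual = [0] * 990 + [1] * 10
flagged = [0] * 990 + [1] * 1 + [0] * 9  # only one fraud is flagged

# Count each (actual, flagged) pair to build the confusion matrix cells.
counts = Counter(zip(actual, flagged))
tn = counts[(0, 0)]  # non-fraud, not flagged
fp = counts[(0, 1)]  # non-fraud, flagged
fn = counts[(1, 0)]  # fraud, not flagged (Type II error)
tp = counts[(1, 1)]  # fraud, flagged

accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

Applied to the figures quoted in this post (16 correct flags out of 8213 frauds), the same recall formula gives 16 / 8213, or about 0.2%.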
b. What are the problems of this model?
i. This model failed to flag the fraudulent transactions correctly.
ii. Amount of transaction:
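As we can see, the amounts in fraudulent transactions vary from 0 to 10^7, with a mean of around 1,467,968. But the current model only flagged transactions with very high amounts, with a minimum flagged amount of 353,874.
iii. Type of transaction:
Only two types contain fraudulent transactions: CASH_OUT and TRANSFER, with 4116 and 4097 frauds respectively. But only 16 were flagged correctly, all of them of the TRANSFER type.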
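Fraud and flag counts per transaction type can be tabulated with a sketch like the following. It assumes pandas is available and uses a tiny made-up sample with the dataset's column names; the values are purely illustrative.

```python
import pandas as pd

# Tiny made-up sample using the dataset's column names.
df = pd.DataFrame({
    "type": ["CASH_OUT", "TRANSFER", "PAYMENT", "TRANSFER", "CASH_IN", "CASH_OUT"],
    "isFraud": [1, 1, 0, 0, 0, 0],
    "isFlaggedFraud": [0, 1, 0, 0, 0, 0],
})

# Number of frauds and of flagged transactions for each type.
summary = df.groupby("type")[["isFraud", "isFlaggedFraud"]].sum()

# Types that contain at least one fraudulent transaction.
fraud_types = summary.index[summary["isFraud"] > 0].tolist()

print(summary)
print("types with fraud:", fraud_types)
```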
SECTION 3: OTHER insights:
1. Days: the time/step was given on an hourly basis for the entire month, so it was converted to days.
Since only the CASH_OUT and TRANSFER types contain frauds, the following plots include transactions and frauds from these 2 types only.
As we can see, the number of CASH_OUT and TRANSFER transactions varies a lot over the month: low in the first week, high in the middle and fairly low again towards the end.
Regardless, the number of frauds per day stays fairly constant, with a mean of 265 and a standard deviation of 23.
So, on average, 7.4% of these transactions are fraudulent per day. This figure is a little high because of the very high fraud percentages on the 3rd and the 31st; these 2 days are clear outliers for this month.
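The conversion from hourly steps to days, and the daily fraud percentage, could be computed roughly as below. This is a sketch on made-up rows: the formula assumes steps are 1-based hours (so steps 1 to 24 fall on day 1), and the `day` column name is a placeholder; the post does not show its exact conversion.

```python
import pandas as pd

# Made-up rows; `step` is the hour index, as in the dataset.
df = pd.DataFrame({
    "step": [1, 5, 24, 25, 30, 48, 49, 744],
    "isFraud": [0, 1, 0, 0, 1, 1, 0, 1],
})

# Assumption: steps 1..24 map to day 1, 25..48 to day 2, and so on.
df["day"] = (df["step"] - 1) // 24 + 1

# Transactions and frauds per day, then the daily fraud percentage.
daily = df.groupby("day")["isFraud"].agg(transactions="count", frauds="sum")
daily["fraud_pct"] = 100 * daily["frauds"] / daily["transactions"]

print(daily)
print("mean frauds per day:", daily["frauds"].mean())
```

A day with very few transactions can show an extreme fraud percentage even when its fraud count is ordinary, which is one way outlier days can arise.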
2. Irregularities in New balance:
How is the new balance calculated for customers and merchants?
a. for customers: new balance = old balance - amount
b. for merchants: new balance = old balance + amount
But this is not the case for most of the transactions in this dataset. There are a lot of irregularities.
NOTE: The above formulas were applied to calculate the new balance for both customers and merchants, and the result was compared to the given new balance.
So, "True" means irregular, "False" means not irregular.
Out of 8213 total fraud transactions, only 127 show an irregularity in the customer's new balance, i.e. only 1.54% of the fraudulent transactions have an irregular customer balance.
For irregularities in the merchant's new balance, the figure is ~65%.
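The comparison described in the note could look roughly like the sketch below. The rows are made up, the small tolerance for floating-point comparison is an assumption, and the flag names follow the origbal_diff and destbal_diff features used in the pre-processing section.

```python
import pandas as pd

# Made-up rows using the dataset's column names.
df = pd.DataFrame({
    "amount":         [100.0, 50.0, 200.0, 75.0],
    "oldbalanceOrg":  [500.0, 50.0, 200.0, 0.0],
    "newbalanceOrig": [400.0, 0.0, 0.0, 0.0],
    "oldbalanceDest": [0.0, 10.0, 0.0, 300.0],
    "newbalanceDest": [100.0, 60.0, 0.0, 300.0],
    "isFraud":        [0, 0, 1, 0],
})

# Expected balances according to the two formulas above.
expected_orig = df["oldbalanceOrg"] - df["amount"]
expected_dest = df["oldbalanceDest"] + df["amount"]

# Assumed tolerance to avoid false alarms from floating-point rounding.
tol = 1e-6
df["origbal_diff"] = (expected_orig - df["newbalanceOrig"]).abs() > tol
df["destbal_diff"] = (expected_dest - df["newbalanceDest"]).abs() > tol

# Share of fraudulent transactions that show each kind of irregularity.
fraud_share = df.loc[df["isFraud"] == 1, ["origbal_diff", "destbal_diff"]].mean()

print(df[["origbal_diff", "destbal_diff", "isFraud"]])
print(fraud_share)
```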
SECTION 4: Pre-processing
1. Time/Step: converted to days (from Section 3.1)
2. origbal_diff: binary flag for origination/customers [True or False] (from Section 3.2)
3. destbal_diff: binary flag for destination/merchants [True or False] (from Section 3.2)
SECTION 5: Detecting fraudulent transactions
Okay, we now know that this dataset is highly skewed, i.e. only 0.1% of the transactions are fraudulent.
I will be using RandomForestClassifier to predict those fraudulent transactions.
The data was split into training and test sets in an 80:20 ratio.
Let's see some results:
Wow, this is surprising.
Even though the data is highly skewed, the RandomForestClassifier achieves perfect accuracy on the training data as well as on the test data.
Let's look at the recall values for both classes on the test data.
Recall is perfect for both classes, which means the model predicts the unseen data very well.
It is good to see that most of the predictors contribute a fair amount to the overall performance of the model.
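The workflow in this section (80:20 split, random forest, per-class recall, feature contributions) could be sketched as follows. This is a minimal illustration on synthetic, imbalanced data, not the project's code: the stratified split, the hyperparameters and the feature names are assumptions, and it presumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, imbalanced data: only a small share of rows is positive.
n = 2000
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 2.2).astype(int)
feature_names = ["f0", "f1", "f2"]  # hypothetical names

# 80:20 split; stratify=y (an assumption) keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# average=None returns one recall value per class: [class 0, class 1].
recalls = recall_score(y_test, y_pred, average=None)
print("recall per class:", recalls)

# Impurity-based importances; they sum to 1 across features.
importances = model.feature_importances_
for name, value in zip(feature_names, importances):
    print(f"{name}: {value:.3f}")
```

On heavily skewed data, the recall of the minority class is usually more informative than overall accuracy, which is why both classes are reported separately.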
Bias/Variance Tradeoff:
The learning curves show a little high variance in the middle and towards the end. But overall, this model performs really well.
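Learning curves of this kind can be produced with scikit-learn's `learning_curve`. The sketch below runs on synthetic data with assumed settings (3-fold cross-validation, recall as the score, a small forest); it is not the configuration used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)

# Synthetic data for illustration only.
X = rng.normal(size=(600, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Absolute training-set sizes; with 3 folds, at most 400 rows are available.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=30, random_state=0),
    X,
    y,
    train_sizes=[80, 160, 240, 320, 400],
    cv=3,
    scoring="recall",
)

# A persistent gap between training and validation scores suggests high variance.
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
for size, g in zip(train_sizes, gap):
    print(f"train size {size}: train-validation gap {g:.3f}")
```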
SECTION 6: Conclusion
This financial fraud dataset was analyzed, and various insights were discovered into how these frauds take place. We then built a classification model to predict whether each transaction is fraudulent (isFraud) or not.
Even though this data is highly skewed, our ensemble algorithm performed really well in predicting the unseen test data.
The full code for this project is available on GitHub.