
Bank Fraud

1. Import the dataset and explore the data

I am running this script in Google Colab, so you might want to change the location of the dataset and remove the first cell.

Addressing the imbalance in data

From the data description, we know that there are about 600,000 rows of data and only 7,200 fraud transactions, which is roughly 1.2% of the data. This means there is a massive class imbalance in the data.

Assuming that in a real-world situation the number of fraud transactions is very low, I will at first not oversample the data to remove the imbalance. First I will try classification algorithms on the original dataset and evaluate the results; after that I will repeat the process after removing the imbalance in the data.
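The imbalance check described above can be sketched as follows. This is a minimal illustration: the file name and the synthetic stand-in DataFrame are assumptions, but the "fraud" column is part of the BankSim schema.

```python
import pandas as pd

# Illustrative stand-in for something like:
#   df = pd.read_csv("bs140513_032310.csv")
# Here 12 of 1000 rows are fraud, mimicking the ~1.2% fraud rate.
df = pd.DataFrame({"fraud": [0] * 988 + [1] * 12})

counts = df["fraud"].value_counts()
fraud_ratio = counts[1] / len(df)
print(f"fraud transactions: {counts[1]} ({fraud_ratio:.1%} of {len(df)} rows)")
```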


According to the Kaggle description of the data, the "step" parameter records during which step of the data-collection process a particular transaction was added to the dataset. Since it does not add any important information to this dataset, we can safely drop the "step" column.

According to the BankSim paper [BankSim], the "age" column has categorical values, with 0 meaning age <= 18, 6 meaning age > 65, and "U" meaning unknown. We will process the "age" column as categorical.

The "gender" column is also categorical, with the values "Male", "Female", "Enterprise" and "Unknown".

There are 15 unique values in the "category" column so it will also be treated as categorical data.

The fraud column is the output variable where 0 means that the transaction was not fraud and 1 means that the transaction was fraud.

We see that there is only one unique value in each of the "zipMerchant" and "zipcodeOri" columns, so we can drop these columns.

There are also extra ' characters as prefixes and suffixes in the string values, so we need to strip those as well.
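Stripping those quote characters could be done as in the sketch below. The sample values are illustrative; the assumption is that every string (object-dtype) column is wrapped in single quotes, as in the raw BankSim CSV.

```python
import pandas as pd

# Illustrative rows mimicking the raw BankSim format, where string
# fields arrive wrapped in extra quotes, e.g. "'C1093826151'".
df = pd.DataFrame({
    "customer": ["'C1093826151'", "'C352968107'"],
    "gender":   ["'M'", "'F'"],
    "amount":   [4.55, 39.68],
})

# Strip the surrounding quote characters from every string column.
str_cols = df.select_dtypes(include="object").columns
df[str_cols] = df[str_cols].apply(lambda s: s.str.strip("'"))
print(df["customer"].tolist())
```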

2. Data cleaning and preprocessing

First we drop the NaN values in the dataset.

Then I drop the columns which will not be used.

Now I will label-encode the columns that have string values, that is, the customer, age, gender, merchant and category columns.
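A minimal sketch of that label-encoding step, using scikit-learn's LabelEncoder. The column names follow the BankSim schema, but the rows here are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative rows with the string-valued BankSim columns.
df = pd.DataFrame({
    "customer": ["C1", "C2", "C1"],
    "age":      ["2", "4", "U"],
    "gender":   ["M", "F", "M"],
    "merchant": ["M1", "M2", "M1"],
    "category": ["es_transportation", "es_food", "es_food"],
})

# Replace each string column with integer codes (sorted-label order).
for col in ["customer", "age", "gender", "merchant", "category"]:
    df[col] = LabelEncoder().fit_transform(df[col])
print(df)
```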

And then I will scale the data using StandardScaler.

Now that label-encoding is done, I can one-hot-encode the categorical columns, which are age, gender and category.
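One way to do the one-hot encoding is with pandas' get_dummies, sketched below on illustrative data; the column names are from the BankSim schema, already label-encoded to integers as described above.

```python
import pandas as pd

# Illustrative label-encoded data; only age, gender and category
# are expanded into indicator columns, amount is left untouched.
df = pd.DataFrame({
    "age":      [2, 4, 2],
    "gender":   [1, 0, 1],
    "category": [3, 1, 3],
    "amount":   [4.55, 39.68, 26.89],
})

df = pd.get_dummies(df, columns=["age", "gender", "category"])
print(sorted(df.columns))
```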

Now that we have a dataset we can perform operations on, we can proceed with splitting the data into x and y, and then into training and testing sets.
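The split described above can be sketched with scikit-learn's train_test_split. Random data stands in for the preprocessed fraud dataset, and the 70/30 split ratio here is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in for the preprocessed feature matrix and fraud labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# Stratify on y so the fraud ratio is preserved in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)
```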

Now that we have the data split up into different parts, we can apply machine learning algorithms to the data.

i.) Applying Classification Algorithms

a.) SVC

SVC takes a lot of time to train, so we will skip it in the cross-validation stage.

b.) Logistic Regression

c.) Random Forest Classifier

d.) Decision Tree Classifier

e.) Ridge Classifier

f.) AdaBoost
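Fitting the six classifiers above follows the same fit/score pattern; here is a sketch on a small synthetic imbalanced dataset (the real run trains on the BankSim train/test split, and the hyperparameters here are scikit-learn defaults, not necessarily the ones used in the notebook).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in: ~90% of samples in the majority class.
X, y = make_classification(n_samples=300, weights=[0.9], random_state=0)

models = {
    "SVC": SVC(),
    "LR": LogisticRegression(max_iter=1000),
    "RFC": RandomForestClassifier(random_state=0),
    "DTC": DecisionTreeClassifier(random_state=0),
    "Ridge": RidgeClassifier(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Fit each model and record its training accuracy.
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Note that with this level of imbalance, plain accuracy is dominated by the majority class, which is why the tables below also report true positives and true negatives separately.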

ii.) Cross-Validation

Now that we have tested some classification models, we can use cross-validation on the same algorithms that we used above. I will use 5-fold cross-validation.

a.) Logistic Regression

b.) RFC

c.) DTC

d.) Ridge Classifier

e.) AdaBoost
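The 5-fold cross-validation for each of these models is a single call to scikit-learn's cross_val_score; the sketch below shows it for logistic regression on synthetic stand-in data, and the same call is repeated with the other estimators.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced stand-in for the preprocessed fraud data.
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

# 5-fold cross-validation: one accuracy score per fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```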

Now that I have also tried the algorithms using cross-validation, and knowing that there are 30 columns in the x data, I can try using PCA (Principal Component Analysis) to reduce the number of columns, and then I will train all the above algorithms again on the PCA-transformed data.

After that I will compare all the metrics using a table and then I will decide on the best model.

PCA shows that a single component captures about 90% of the total variance. I will now use PCA to reduce the data to 1 column.
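Checking the explained-variance ratio and then projecting onto the first component could look like the sketch below; the data here is synthetic, and the real run fits PCA on the 30-column x matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic 10-column stand-in for the 30-column feature matrix.
X, _ = make_classification(n_samples=200, n_features=10, random_state=0)

# Fit PCA on all components to inspect the cumulative variance captured.
pca = PCA().fit(X)
print(np.cumsum(pca.explained_variance_ratio_)[:3])

# Project the data onto the first principal component only.
X_1d = PCA(n_components=1).fit_transform(X)
print(X_1d.shape)
```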


Algorithm   Accuracy   True Positive (2840)   True Negative (235018)
SVC         1.00       1858 (65.42%)          234882 (99.94%)
LR          0.99       1786 (62.88%)          234828 (99.91%)
RFC         1.00       2193 (76.46%)          234729 (99.87%)
DTC         0.99       2141 (75.38%)          234264 (99.67%)
Ridge       0.99       2141 (75.38%)          234264 (99.67%)
AdaBoost    0.99       1925 (67.78%)          234728 (99.87%)

Using Cross-Validation

Algorithm   Accuracy   True Positive (4360)   True Negative (352425)
LR          0.99       2779 (63.73%)          352121 (99.91%)
RFC         1.00       3288 (75.41%)          351956 (99.86%)
SVC         0.99       2902 (66.55%)          352078 (99.90%)
DTC         0.99       3333 (76.44%)          351271 (99.67%)
Ridge       0.99       1544 (35.41%)          352357 (99.98%)
AdaBoost    0.99       2992 (68.62%)          351968 (99.87%)

From this table, we can see that the highest true-positive rate comes from the DTC algorithm with cross-validation, with 76.44% accuracy in predicting fraud transactions and 99.67% accuracy in predicting non-fraud transactions.

Addressing the data Imbalance

I will address the imbalance in the data by oversampling the fraud values using SMOTE.

Here we can see that after oversampling the data, RFC gives correct prediction of 99% of both the fraud and not fraud transactions.


If we use oversampling to remove the imbalance, then we can easily predict the fraud and non-fraud transactions with an accuracy of 99% using RFC.

However, if we do not use oversampling, then RFC gives the best results, with a prediction accuracy of 99% for non-fraud transactions and 76.44% for fraud transactions.