Credit Card Fraud Detection Analysis: A Machine Learning Approach


This is an individual project for SDSC2001 – Python for Data Science, which I completed in Semester A of my second year (2021/22).

Course Instructor: Professor LI Xinyue

Context

Credit card companies aim to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

Content

The dataset contains transactions made by credit cards in September 2013 by European cardholders. The transactions occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains numerical input variables V1-V28, which are the result of a Principal Component Analysis (PCA) transformation; the original features are not provided due to confidentiality issues. The only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset, and ‘Amount’ is the transaction amount. ‘Class’ is the response variable (the labelled outcome), and it takes the value 1 in case of fraud and 0 otherwise.
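As a quick sanity check on the figures above, the quoted fraud rate can be reproduced from the counts alone (a minimal sketch; the numbers come straight from the description):

# reproduce the quoted class imbalance from the counts in the description
n_fraud = 492        # positive class (fraud)
n_total = 284807     # all transactions
print(f"Fraud rate: {n_fraud / n_total:.4%}")  # ~0.1727%, i.e. the 0.172% quoted above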

Module 1: Data Exploration

In [1]: 
# import all the libraries I may need to use 
import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd 
import seaborn as sns 
import warnings 
warnings.simplefilter(action='ignore', category=FutureWarning) 
pd.options.mode.chained_assignment = None  # suppress the SettingWithCopy warning 
from collections import Counter
In [2]:
# load the CSV file into a DataFrame and show the first 5 rows to get a quick look at the data
df = pd.read_csv('creditcard_train.csv')
df.head()


Out[2]:

[DataFrame preview of df.head(): columns Time, V1-V9, ..., V21-V28, Amount, Class; the middle PCA columns are truncated in the pandas display.]
5 rows × 31 columns
In [3]:
# find out the total number of rows and columns in the file
print(df.shape)
# there seem to be some missing data, so let's check and deal with it!


Out[3]: (284657, 31)

In [4]:
# First, we have to find out where the missing values are and how many there are
df.isnull().sum()

Out[4]:Time 0 V1 0 V2 0 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 278 V23 520 V24 0 V25 0 V26 0 V27 0 V28 0 Amount 0 Class 0 dtype: int64

In [5]:
missing_col = ['V22','V23']
# impute the missing values in each column with that column's mean
for i in missing_col:
    df.loc[df.loc[:,i].isnull(),i] = df.loc[:,i].mean()
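The same imputation can also be written without the explicit loop; a minimal equivalent sketch (same column means, same result):

# loop-free alternative: fill each affected column with its own mean
df[missing_col] = df[missing_col].fillna(df[missing_col].mean())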
In [6]:
# After filling the missing data, we detect and remove outliers with an IQR rule
# (the 'Class' label column is excluded from the outlier check)

Q1 = df.iloc[:,:-1].quantile(0.25)
Q3 = df.iloc[:,:-1].quantile(0.75)
IQR = Q3 - Q1

data = df[~((df < (Q1 - 2.5 * IQR)) | (df > (Q3 + 2.5 * IQR))).any(axis=1)]

print(data.shape)

(213174, 31)
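Because aggressive outlier removal can also discard genuine fraud rows, it is worth checking how many positive samples survive the filter; a small sanity-check sketch (the counts are not part of the original output):

# how many fraud (Class == 1) rows remain after the IQR filter?
print("Frauds before filtering:", int(df['Class'].sum()))
print("Frauds after filtering: ", int(data['Class'].sum()))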

In [7]:
# look at some basic summary statistics of the data for later data visualization
# (e.g. the mean of Time), to see whether there is any insight
data.describe().T


Out[7]:

Column   count     mean          std           min        25%           50%           75%            max
Time     213174.0  95141.575281  47383.880288  0.000000   54623.500000  84469.000000  139655.000000  172792.000000
V1       213174.0  0.496082      1.276026      -5.536010  -0.618926     0.999862      1.718921       2.454930
V2       213174.0  0.079463      0.827586      -4.098951  -0.463757     0.068991      0.703425       3.808020
V3       213174.0  0.164240      1.260900      -4.846779  -0.679702     0.286612      1.088969       4.079168
V4       213174.0  0.001365      1.283098      -4.826127  -0.784687     0.031907      0.703920       4.720074
V5       213174.0  -0.024155     0.910768      -3.252559  -0.621899     -0.070961     0.492316       3.870056
V6       213174.0  -0.205544     0.897459      -3.666150  -0.786134     -0.351098     0.189226       3.315460
V7       213174.0  0.011357      0.712423      -3.180199  -0.494948     0.044931      0.513259       3.008217
V8       213174.0  0.076681      0.381320      -1.548042  -0.170941     0.022833      0.266917       1.666501
V9       213174.0  -0.041198     0.975211      -3.672923  -0.621874     -0.062938     0.535013       3.695086
V10      213174.0  -0.047962     0.738765      -2.999375  -0.502612     -0.104048     0.342025       2.926167
V11      213174.0  0.002681      0.991865      -3.241392  -0.772031     -0.006187     0.757923       3.531399
V12      213174.0  0.044023      0.856469      -2.964042  -0.369406     0.161963      0.625702       2.601809
V13      213174.0  -0.008020     1.012035      -3.888606  -0.686500     -0.010238     0.683467       3.904562
V14      213174.0  0.018659      0.743759      -2.720881  -0.397002     0.048117      0.453591       2.788031
V15      213174.0  -0.012136     0.889419      -3.657525  -0.576125     0.036738      0.622986       3.601890
V16      213174.0  0.017735      0.779686      -2.944460  -0.430518     0.076966      0.500815       2.686354
V17      213174.0  -0.037677     0.629111      -2.311921  -0.487001     -0.093053     0.345882       2.606673
V18      213174.0  -0.021883     0.791908      -2.997391  -0.505811     -0.029341     0.460010       2.997719
V19      213174.0  0.002176      0.733705      -2.743833  -0.409821     0.016457      0.437332       2.744196
V20      213174.0  -0.068025     0.229977      -1.073139  -0.206210     -0.084503     0.064950       0.993129
V21      213174.0  -0.025548     0.259462      -1.251701  -0.221372     -0.038575     0.155363       1.222562
V22      213174.0  0.003638      0.674379      -2.659080  -0.542021     0.009103      0.513472       2.471164
V23      213174.0  -0.000756     0.215954      -0.933712  -0.133954     -0.006191     0.129913       0.921028
V24      213174.0  -0.026047     0.575898      -2.337548  -0.364409     0.028063      0.401640       1.307137
V25      213174.0  0.003608      0.462133      -1.986743  -0.305288     0.019155      0.341251       1.966419
V26      213174.0  -0.001525     0.461604      -1.641329  -0.316247     -0.038658     0.222019       1.660394
V27      213174.0  0.021914      0.145059      -0.475451  -0.055080     0.004869      0.078575       0.495576
V28      213174.0  0.012159      0.102029      -0.380841  -0.045706     0.009546      0.056236       0.405955
Amount   213174.0  41.932054     53.786545     0.000000   4.990000      18.910000     57.200000      256.000000
Class    213174.0  0.000145      0.012058      0.000000   0.000000      0.000000      0.000000       1.000000

Module 2: Data Visualization

In [8]:
# First graph: look at the distributions of all variables
fig, axes = plt.subplots(5, 6, figsize=[16, 9])
columns = data.columns
for idx, ax in enumerate(axes.flat):
    sns.kdeplot(data.loc[:, columns[idx]][:500], ax=ax)
plt.tight_layout()
# Even after removing the outliers, the plots show that the data is neither normalized nor standardized,
# so we may consider applying normalization to the data
In [9]:
# apply normalization using MinMaxScaler from the sklearn library
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(data.iloc[:,:-1])
data.iloc[:,:-1]=scaler.transform(data.iloc[:,:-1])
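Min-max scaling is what is used here; if standardization (zero mean, unit variance) were preferred instead, a drop-in alternative would look like the sketch below (not run in this notebook):

# alternative: standardize the features instead of min-max scaling them
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler().fit(data.iloc[:, :-1])
data_std = data.copy()
data_std.iloc[:, :-1] = std_scaler.transform(data_std.iloc[:, :-1])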
In [10]:
# Second graph: check whether any features (variables) are correlated with our target ('Class')
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(data.corr()[['Class']].sort_values(by='Class', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features correlating with Class', fontdict={'fontsize':18}, pad=12)
plt.show()
# After plotting the heatmap, we can conclude that all the correlations are very low.
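To back up the visual impression from the heatmap with numbers, the strongest absolute correlations with 'Class' can also be listed directly (a small optional sketch):

# numerically rank the features by absolute correlation with the target
corr_with_class = data.corr()['Class'].drop('Class').abs().sort_values(ascending=False)
print(corr_with_class.head(10))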
In [11]:
# As the information we have so far is not very useful, we still cannot make any assumptions,
# so we can only visualize the two variables most correlated with Class to see whether there is any useful information
cmap = sns.cubehelix_palette(rot=-.2, as_cmap=True) 
g=sns.relplot(
    data=data,
    x="V4", y="V17",
    hue="Class", size="V4",
)
g.set(xscale="log", yscale="log")
g.ax.xaxis.grid(True, "minor", linewidth=.25)
g.ax.yaxis.grid(True, "minor", linewidth=.25)
g.despine(left=True, bottom=True)
plt.show()
# After plotting the scatterplot,
# it seems there is a serious class-imbalance problem that greatly affects my assumptions.
# We can also conclude that V4 and V17 alone cannot separate the classes (0 and 1).
In [12]:
# Investigate the class imbalance further by plotting the counts of normal and fraudulent cases
fig,ax=plt.subplots(figsize=[10,6])
bar=data.Class.value_counts()
sns.barplot(x=bar.index,y=bar.values/len(data),ax=ax)
plt.title('The number of normal and fraudulent cases')
plt.xlabel('Class')
plt.ylabel('Percentage')
for y,x  in enumerate(bar.values/len(data)):
    plt.text(y,x,s=bar[y],va='bottom',ha='center')
plt.show()
# We can now conclude that the classes are highly unbalanced.
# I have done a similar analysis on P2P debit and credit risk, where the ratio was 1:49,
# but this dataset has only about 0.014% positive (fraudulent) samples,
# far less than the stated 0.172%.
# It may be my fault that some important data were removed during the data-cleaning process.
# At this point, I am pessimistic about this model.
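The post-cleaning positive rate mentioned above can be confirmed directly from the class proportions (a minimal check; the Class mean of about 0.000145 in the describe() output says the same thing):

# proportion of each class after data cleaning
print(data['Class'].value_counts(normalize=True))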

Module 3: Dimension Reduction

In [13]:
# We will use principal component analysis from the sklearn library for dimension reduction
from sklearn.decomposition import PCA

pca=PCA(n_components=2).fit(data.iloc[:,:-1]) # drop the 'Class' column and compress the data into 2D
features_pca=pca.transform(data.iloc[:,:-1])
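Before reading too much into a 2-D PCA plot, it is worth checking how much of the total variance the two components actually retain; a quick check (not reported in the original output):

# fraction of total variance captured by each of the two principal components
print(pca.explained_variance_ratio_)
print("Total:", pca.explained_variance_ratio_.sum())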

In [14]:
# take a quick look at the principal components
features_pca


Out[14]:array([[ 6.65811147e-01, 1.89550269e-01], [ 4.93164830e-01, -3.89688202e-04], [ 5.96507876e-01, 1.64345532e-01], …, [-3.12209498e-01, -2.19958181e-01], [-5.22941908e-01, 6.27902788e-02], [-4.04661011e-01, 2.29348807e-01]])

In [15]:
sns.relplot(x=features_pca[:,0],y=features_pca[:,1],hue=data.Class,col=data.Class)
plt.show()
# Because the positive and negative sample sizes are so unbalanced, it is difficult to show both classes on a single graph,
# so we split them into two panels.
# We can now conclude that this is a linearly inseparable problem:
# it is very difficult to find a decision boundary that separates the 0 and 1 categories.

Module 4: Classification

In [16]:
### Pick 3 classification methods; methods not in the list below can also be used
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
In [17]:
# Before doing classification and modelling,
# we have to prepare the data to deal with the class imbalance, here by undersampling the majority class.
# First, we filter out all the positive samples.
pos=data.loc[data.Class==1]
pos.head()


Out[17]:

[Preview of the positive (fraud) samples after scaling: Time, V1-V9, ..., V21-V28, Amount, Class, all with Class = 1.]
5 rows × 31 columns
In [18]:
# Generate a random sample of negative cases for later classification use
np.random.seed(1234)

data=data.sample(frac=1) # shuffle the data (a random sample of the whole frame)
neg=data.loc[data.Class==0][:len(pos)] # select as many negative samples as there are positive samples
neg.head()


Out[18]:

[Preview of the randomly sampled negative (normal) transactions after scaling, all with Class = 0.]
5 rows × 31 columns
In [19]:
#Concatenate the positive and negative data and read the test set
train=pd.concat([pos,neg])
test=pd.read_csv('creditcard_test.csv')
test.head()


Out[19]:

[Preview of the raw test set creditcard_test.csv: the same 31 columns, not yet scaled.]
5 rows × 31 columns
In [20]:
#Perform the same data transformation on the test set
test.iloc[:,:-1]=scaler.transform(test.iloc[:,:-1])
test.head()


Out[20]:

[Preview of the test set after applying the same MinMaxScaler transformation.]
5 rows × 31 columns
In [21]:
# import grid-search cross-validation and evaluation metrics from the sklearn library
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import plot_confusion_matrix,classification_report
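One caveat for anyone re-running this notebook on a newer environment: plot_confusion_matrix was removed in scikit-learn 1.2, and the replacement is ConfusionMatrixDisplay, roughly as sketched below:

# equivalent on scikit-learn >= 1.2 (plot_confusion_matrix no longer exists):
# from sklearn.metrics import ConfusionMatrixDisplay
# ConfusionMatrixDisplay.from_estimator(fitted_model, test_x, test_y)  # fitted_model is any trained estimator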
In [22]:
# split the features and labels of the training and test sets
train_y=train.pop('Class')
train_x=train
test_y=test.pop('Class')
test_x=test
In [23]:
# build a helper function for training and evaluating different models, using a 5-fold cross-validated grid search
def train_and_evaluate(model,params):
    gs=GridSearchCV(model(random_state=1234),
                 param_grid=params, cv=5).fit(train_x,train_y)
    print('Train score :',gs.best_score_)
    print('Test score :',gs.score(test_x,test_y))
    plot_confusion_matrix(gs, test_x, test_y)
    plt.show()
    print(classification_report(test_y,gs.predict(test_x)))
    return gs
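Since the discussion later focuses on how well fraud (positive) cases are caught, a small add-on helper like the hedged sketch below could report the positive-class recall and precision for any fitted search returned by train_and_evaluate (the name report_fraud_metrics is just illustrative):

# explicit positive-class metrics for a fitted model or grid search
from sklearn.metrics import recall_score, precision_score
def report_fraud_metrics(model):
    pred = model.predict(test_x)
    print("Recall (fraud):   ", recall_score(test_y, pred))
    print("Precision (fraud):", precision_score(test_y, pred))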

4.1 Training with RandomForestClassifier

In [24]:
m1=train_and_evaluate(RandomForestClassifier,
                   params={'n_estimators':np.arange(10,40,5),
                          'max_features':np.arange(10,20,1)})

Train score : 0.8397435897435898
Test score : 0.78

4.2 Training with LogisticRegression

In [25]:
m2=train_and_evaluate(LogisticRegression,
                   params={
                          'C':np.linspace(0.1,1,10)}
                           )

Train score : 0.7371794871794872
Test score : 0.7333333333333333

4.3 Training with Support Vector Machine

In [26]:
m3=train_and_evaluate(SVC,
                   params={
                          'C':np.arange(1,10,1),
                          'gamma':np.linspace(0.01,0.1,10)
                   }
                           )

Train score : 0.7717948717948718
Test score : 0.66

In [27]:
# Having seen that the best of these three models is the random forest,
# we rank the features by importance
pd.DataFrame({'Feature':train_x.columns,
             'Value':m1.best_estimator_.feature_importances_}).sort_values(by='Value',ascending=False)


Out[27]:

    Feature  Value
14  V14      0.218701
17  V17      0.125888
4   V4       0.111668
19  V19      0.084166
16  V16      0.065130
29  Amount   0.052899
15  V15      0.038870
13  V13      0.038309
27  V27      0.034972
2   V2       0.027462
28  V28      0.022460
8   V8       0.022411
23  V23      0.018893
20  V20      0.018022
6   V6       0.016508
5   V5       0.012932
0   Time     0.012430
18  V18      0.011545
26  V26      0.009688
9   V9       0.009620
10  V10      0.009097
22  V22      0.007824
11  V11      0.007375
12  V12      0.006643
7   V7       0.004063
1   V1       0.003455
25  V25      0.003063
21  V21      0.002532
24  V24      0.001935
3   V3       0.001440
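The same ranking is easier to read as a bar chart; an optional sketch:

# optional: visualize the random-forest feature importances
importances = pd.Series(m1.best_estimator_.feature_importances_, index=train_x.columns)
importances.sort_values().plot.barh(figsize=(8, 10))
plt.xlabel('Importance')
plt.tight_layout()
plt.show()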

Module 5: Summary

I am going to summarize my findings and draw conclusions in a Q&A format, so that this experience is better recorded.

Q: What have you done and what have you learned?
A: To summarize what I have done: first I had to understand the properties of the given data so that I could build a plan for analyzing it, which is everything you see in Module 1. Then I went deeper into the data, for example by visualizing different variables against the target value with plots. Although that did not really pay off this time, it still gave me some insight, such as the fact that the problem is linearly inseparable, so that I could try to solve it with a different method. Throughout this process you can see that I kept adjusting the dataset so that it fit the requirements of each task; for example, without removing outliers the accuracy of the model would be affected, so all the steps I took were useful for the entire project. In the dimension reduction and classification parts, I applied the skills I learnt from lectures and tutorials. Apart from that, I spent quite a lot of time reading the documentation of some functions and libraries. Besides the official docs, I found that Stack Overflow and GeeksforGeeks provided many useful resources as well. This experience has certainly consolidated my coding skills and logical flow.

Q: What was the biggest difficulty in this project and how did you solve it?
A: Undoubtedly, I would say Module 2 was the most challenging part for me. There weren't many instructions for this module, which gave me more freedom to interpret it, and the dataset itself did not give me any hints for finding relations between the features (variables) and the target (Class). I was stuck on this part for quite a long time after plotting many different graphs, but I eventually came to understand that there is not always a strong relationship between a single variable and the target. Sometimes finding that the variables have little relation to the target is itself part of exploration and a valid analysis result. There is no fixed solution for a problem; the data only gives you insight into how you can use it to support your view.

Q: What do you think of annotation, and how does it help?
A: I have annotated each part step by step so that readers can understand what each part is doing. It also helps me to debug, because I can locate problems in the code quickly with those annotations.

Q: How did you go about solving this project?
A: My basic flow for solving a question and doing the analysis is to write all the steps down on paper first. During the process, I keep brainstorming about what could go wrong or which important factors I might be missing, and if there are any hints that can be followed, I stick strictly to the given procedure so that I don't digress from the topic or lose track of what I am doing.

Q: What are the main results?
A: As I stated above, random forest is currently the best-performing model among those I tested, but its rate of correctly classifying positive samples on the test set is about the same as logistic regression's, only around 65%.

Q: From the result, what have you discovered?
A: In my opinion, for this kind of fraud detection it is clearly more important to correctly classify a positive sample than a negative one, because if a fraudulent (positive) transaction is judged to be normal (negative), it may cause economic losses for the client and the company.

Q: What advice can you give to the credit card company?
A: If you want to avoid this situation, you need to increase the weight of the positive samples. But after doing so, more negative samples will be classified as positive, which means rejecting more customers' credit card applications and therefore losing some profit. To find a balance, it really depends on how your company chooses between risk and profit.
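As a concrete illustration of the weighting idea (a hedged sketch only, not something run in this notebook, and the weight of 5 is purely illustrative), scikit-learn exposes this trade-off through the class_weight parameter:

# up-weight the positive (fraud) class so that missed frauds are penalized more heavily
weighted_rf = RandomForestClassifier(class_weight={0: 1, 1: 5}, random_state=1234)  # weight 5 is illustrative
weighted_rf.fit(train_x, train_y)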
