Credit Card Fraud Detection Analysis: A Machine Learning Approach


This is an individual project for SDSC2001 – Python for Data Science, which I completed in Semester A of my second year (2021/22).

Course Instructor: Professor LI Xinyue

Context

Credit card companies aim to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

Content

The dataset contains transactions made by credit cards in September 2013 by European cardholders. The transactions occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains numerical input variables V1-V28, which are the result of a Principal Component Analysis (PCA) transformation; the original features are not provided due to confidentiality issues. The only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset, and ‘Amount’ is the transaction amount. ‘Class’ is the response variable (the labelled outcome), and it takes the value 1 in case of fraud and 0 otherwise.
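As a quick sanity check on the figures above, the quoted fraud rate can be reproduced from the counts alone (a minimal sketch; the numbers come straight from the description):

# reproduce the quoted class imbalance from the counts in the description
n_fraud = 492        # positive class (fraud)
n_total = 284807     # all transactions
print(f"Fraud rate: {n_fraud / n_total:.4%}")  # ~0.1727%, i.e. the 0.172% quoted above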

Module 1: Data Exploration

In [1]: 
# import all the libraries I may need to use 
import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd 
import seaborn as sns 
import warnings 
warnings.simplefilter(action='ignore', category=FutureWarning) 
pd.options.mode.chained_assignment = None  # suppress the SettingWithCopy warning 
from collections import Counter
In [2]:
# load the CSV file into a DataFrame and show the first 5 rows to get a quick look at the data
df = pd.read_csv('creditcard_train.csv')
df.head()


Out[2]:

[DataFrame preview of df.head(): columns Time, V1-V9, ..., V21-V28, Amount, Class; the middle PCA columns are truncated in the pandas display.]
5 rows × 31 columns
In [3]:
# find out the total number of rows and columns in the file
print(df.shape)
# there seem to be some missing data, so let's check and deal with it!


Out[3]: (284657, 31)

In [4]:
# First, we have to find out where the missing values are and how many there are
df.isnull().sum()

Out[4]:Time 0 V1 0 V2 0 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 278 V23 520 V24 0 V25 0 V26 0 V27 0 V28 0 Amount 0 Class 0 dtype: int64

In [5]:
missing_col = ['V22','V23']
# impute the missing values in each column with that column's mean
for i in missing_col:
    df.loc[df.loc[:,i].isnull(),i] = df.loc[:,i].mean()
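The same imputation can also be written without the explicit loop; a minimal equivalent sketch (same column means, same result):

# loop-free alternative: fill each affected column with its own mean
df[missing_col] = df[missing_col].fillna(df[missing_col].mean())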
In [6]:
# After filling the missing data, we detect and remove outliers with an IQR rule
# (the 'Class' label column is excluded from the outlier check)

Q1 = df.iloc[:,:-1].quantile(0.25)
Q3 = df.iloc[:,:-1].quantile(0.75)
IQR = Q3 - Q1

data = df[~((df < (Q1 - 2.5 * IQR)) | (df > (Q3 + 2.5 * IQR))).any(axis=1)]

print(data.shape)

(213174, 31)
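Because aggressive outlier removal can also discard genuine fraud rows, it is worth checking how many positive samples survive the filter; a small sanity-check sketch (the counts are not part of the original output):

# how many fraud (Class == 1) rows remain after the IQR filter?
print("Frauds before filtering:", int(df['Class'].sum()))
print("Frauds after filtering: ", int(data['Class'].sum()))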

In [7]:
# look at some basic summary statistics of the data for later data visualization
# (e.g. the mean of Time), to see whether there is any insight
data.describe().T


Out[7]:

Column   count     mean          std           min        25%           50%           75%            max
Time     213174.0  95141.575281  47383.880288  0.000000   54623.500000  84469.000000  139655.000000  172792.000000
V1       213174.0  0.496082      1.276026      -5.536010  -0.618926     0.999862      1.718921       2.454930
V2       213174.0  0.079463      0.827586      -4.098951  -0.463757     0.068991      0.703425       3.808020
V3       213174.0  0.164240      1.260900      -4.846779  -0.679702     0.286612      1.088969       4.079168
V4       213174.0  0.001365      1.283098      -4.826127  -0.784687     0.031907      0.703920       4.720074
V5       213174.0  -0.024155     0.910768      -3.252559  -0.621899     -0.070961     0.492316       3.870056
V6       213174.0  -0.205544     0.897459      -3.666150  -0.786134     -0.351098     0.189226       3.315460
V7       213174.0  0.011357      0.712423      -3.180199  -0.494948     0.044931      0.513259       3.008217
V8       213174.0  0.076681      0.381320      -1.548042  -0.170941     0.022833      0.266917       1.666501
V9       213174.0  -0.041198     0.975211      -3.672923  -0.621874     -0.062938     0.535013       3.695086
V10      213174.0  -0.047962     0.738765      -2.999375  -0.502612     -0.104048     0.342025       2.926167
V11      213174.0  0.002681      0.991865      -3.241392  -0.772031     -0.006187     0.757923       3.531399
V12      213174.0  0.044023      0.856469      -2.964042  -0.369406     0.161963      0.625702       2.601809
V13      213174.0  -0.008020     1.012035      -3.888606  -0.686500     -0.010238     0.683467       3.904562
V14      213174.0  0.018659      0.743759      -2.720881  -0.397002     0.048117      0.453591       2.788031
V15      213174.0  -0.012136     0.889419      -3.657525  -0.576125     0.036738      0.622986       3.601890
V16      213174.0  0.017735      0.779686      -2.944460  -0.430518     0.076966      0.500815       2.686354
V17      213174.0  -0.037677     0.629111      -2.311921  -0.487001     -0.093053     0.345882       2.606673
V18      213174.0  -0.021883     0.791908      -2.997391  -0.505811     -0.029341     0.460010       2.997719
V19      213174.0  0.002176      0.733705      -2.743833  -0.409821     0.016457      0.437332       2.744196
V20      213174.0  -0.068025     0.229977      -1.073139  -0.206210     -0.084503     0.064950       0.993129
V21      213174.0  -0.025548     0.259462      -1.251701  -0.221372     -0.038575     0.155363       1.222562
V22      213174.0  0.003638      0.674379      -2.659080  -0.542021     0.009103      0.513472       2.471164
V23      213174.0  -0.000756     0.215954      -0.933712  -0.133954     -0.006191     0.129913       0.921028
V24      213174.0  -0.026047     0.575898      -2.337548  -0.364409     0.028063      0.401640       1.307137
V25      213174.0  0.003608      0.462133      -1.986743  -0.305288     0.019155      0.341251       1.966419
V26      213174.0  -0.001525     0.461604      -1.641329  -0.316247     -0.038658     0.222019       1.660394
V27      213174.0  0.021914      0.145059      -0.475451  -0.055080     0.004869      0.078575       0.495576
V28      213174.0  0.012159      0.102029      -0.380841  -0.045706     0.009546      0.056236       0.405955
Amount   213174.0  41.932054     53.786545     0.000000   4.990000      18.910000     57.200000      256.000000
Class    213174.0  0.000145      0.012058      0.000000   0.000000      0.000000      0.000000       1.000000

Module 2: Data Visualization

In [8]:
# First graph: look at the distributions of all variables
fig, axes = plt.subplots(5, 6, figsize=[16, 9])
columns = data.columns
for idx, ax in enumerate(axes.flat):
    sns.kdeplot(data.loc[:, columns[idx]][:500], ax=ax)
plt.tight_layout()
# Even after removing the outliers, the plots show that the data is neither normalized nor standardized,
# so we may consider applying normalization to the data
In [9]:
# apply normalization using MinMaxScaler from the sklearn library
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(data.iloc[:,:-1])
data.iloc[:,:-1]=scaler.transform(data.iloc[:,:-1])
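Min-max scaling is what is used here; if standardization (zero mean, unit variance) were preferred instead, a drop-in alternative would look like the sketch below (not run in this notebook):

# alternative: standardize the features instead of min-max scaling them
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler().fit(data.iloc[:, :-1])
data_std = data.copy()
data_std.iloc[:, :-1] = std_scaler.transform(data_std.iloc[:, :-1])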
In [10]:
# Second graph: check whether any features (variables) are correlated with our target ('Class')
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(data.corr()[['Class']].sort_values(by='Class', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features correlating with Class', fontdict={'fontsize':18}, pad=12)
plt.show()
# After plotting the heatmap, we can conclude that all the correlations are very low.
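To back up the visual impression from the heatmap with numbers, the strongest absolute correlations with 'Class' can also be listed directly (a small optional sketch):

# numerically rank the features by absolute correlation with the target
corr_with_class = data.corr()['Class'].drop('Class').abs().sort_values(ascending=False)
print(corr_with_class.head(10))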
In [11]:
# As the information we have so far is not very useful, we still cannot make any assumptions,
# so we can only visualize the two variables most correlated with Class to see whether there is any useful information
cmap = sns.cubehelix_palette(rot=-.2, as_cmap=True) 
g=sns.relplot(
    data=data,
    x="V4", y="V17",
    hue="Class", size="V4",
)
g.set(xscale="log", yscale="log")
g.ax.xaxis.grid(True, "minor", linewidth=.25)
g.ax.yaxis.grid(True, "minor", linewidth=.25)
g.despine(left=True, bottom=True)
plt.show()
# After plotting the scatterplot,
# it seems there is a serious class-imbalance problem that greatly affects my assumptions.
# We can also conclude that V4 and V17 alone cannot separate the classes (0 and 1).
In [12]:
# Investigate the class imbalance further by plotting the counts of normal and fraudulent cases
fig,ax=plt.subplots(figsize=[10,6])
bar=data.Class.value_counts()
sns.barplot(x=bar.index,y=bar.values/len(data),ax=ax)
plt.title('The number of normal and fraudulent cases')
plt.xlabel('Class')
plt.ylabel('Percentage')
for y,x  in enumerate(bar.values/len(data)):
    plt.text(y,x,s=bar[y],va='bottom',ha='center')
plt.show()
# We can now conclude that the classes are highly unbalanced.
# I have done a similar analysis on P2P debit and credit risk, where the ratio was 1:49,
# but this dataset has only about 0.014% positive (fraudulent) samples,
# far less than the stated 0.172%.
# It may be my fault that some important data were removed during the data-cleaning process.
# At this point, I am pessimistic about this model.
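The post-cleaning positive rate mentioned above can be confirmed directly from the class proportions (a minimal check; the Class mean of about 0.000145 in the describe() output says the same thing):

# proportion of each class after data cleaning
print(data['Class'].value_counts(normalize=True))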

Module 3: Dimension Reduction

In [13]:
# We will use principal component analysis from the sklearn library for dimension reduction
from sklearn.decomposition import PCA

pca=PCA(n_components=2).fit(data.iloc[:,:-1]) # drop the 'Class' column and compress the data into 2D
features_pca=pca.transform(data.iloc[:,:-1])
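Before reading too much into a 2-D PCA plot, it is worth checking how much of the total variance the two components actually retain; a quick check (not reported in the original output):

# fraction of total variance captured by each of the two principal components
print(pca.explained_variance_ratio_)
print("Total:", pca.explained_variance_ratio_.sum())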

In [14]:
# take a quick look at the principal components
features_pca


Out[14]:array([[ 6.65811147e-01, 1.89550269e-01], [ 4.93164830e-01, -3.89688202e-04], [ 5.96507876e-01, 1.64345532e-01], …, [-3.12209498e-01, -2.19958181e-01], [-5.22941908e-01, 6.27902788e-02], [-4.04661011e-01, 2.29348807e-01]])

In [15]:
sns.relplot(x=features_pca[:,0],y=features_pca[:,1],hue=data.Class,col=data.Class)
plt.show()
# Because the positive and negative sample sizes are so unbalanced, it is difficult to show both classes on a single graph,
# so we split them into two panels.
# We can now conclude that this is a linearly inseparable problem:
# it is very difficult to find a decision boundary that separates the 0 and 1 categories.

Module 4: Classification

In [16]:
### Pick 3 classification methods; methods not in the list below can also be used
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
In [17]:
# Before doing classification and modelling,
# we have to prepare the data to deal with the class imbalance, here by undersampling the majority class.
# First, we filter out all the positive samples.
pos=data.loc[data.Class==1]
pos.head()


Out[17]:

[Preview of the positive (fraud) samples after scaling: Time, V1-V9, ..., V21-V28, Amount, Class, all with Class = 1.]
5 rows × 31 columns
In [18]:
# Generate a random sample of negative cases for later classification use
np.random.seed(1234)

data=data.sample(frac=1) # shuffle the data (a random sample of the whole frame)
neg=data.loc[data.Class==0][:len(pos)] # select as many negative samples as there are positive samples
neg.head()


Out[18]:

[Preview of the randomly sampled negative (normal) transactions after scaling, all with Class = 0.]
5 rows × 31 columns
In [19]:
#Concatenate the positive and negative data and read the test set
train=pd.concat([pos,neg])
test=pd.read_csv('creditcard_test.csv')
test.head()


Out[19]:

[Preview of the raw test set creditcard_test.csv: the same 31 columns, not yet scaled.]
5 rows × 31 columns
In [20]:
#Perform the same data transformation on the test set
test.iloc[:,:-1]=scaler.transform(test.iloc[:,:-1])
test.head()


Out[20]:

[Preview of the test set after applying the same MinMaxScaler transformation.]
5 rows × 31 columns
In [21]:
# import grid-search cross-validation and evaluation metrics from the sklearn library
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import plot_confusion_matrix,classification_report
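One caveat for anyone re-running this notebook on a newer environment: plot_confusion_matrix was removed in scikit-learn 1.2, and the replacement is ConfusionMatrixDisplay, roughly as sketched below:

# equivalent on scikit-learn >= 1.2 (plot_confusion_matrix no longer exists):
# from sklearn.metrics import ConfusionMatrixDisplay
# ConfusionMatrixDisplay.from_estimator(fitted_model, test_x, test_y)  # fitted_model is any trained estimator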
In [22]:
# split the features and labels of the training and test sets
train_y=train.pop('Class')
train_x=train
test_y=test.pop('Class')
test_x=test
In [23]:
# build a helper function for training and evaluating different models, using a 5-fold cross-validated grid search
def train_and_evaluate(model,params):
    gs=GridSearchCV(model(random_state=1234),
                 param_grid=params, cv=5).fit(train_x,train_y)
    print('Train score :',gs.best_score_)
    print('Test score :',gs.score(test_x,test_y))
    plot_confusion_matrix(gs, test_x, test_y)
    plt.show()
    print(classification_report(test_y,gs.predict(test_x)))
    return gs
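Since the discussion later focuses on how well fraud (positive) cases are caught, a small add-on helper like the hedged sketch below could report the positive-class recall and precision for any fitted search returned by train_and_evaluate (the name report_fraud_metrics is just illustrative):

# explicit positive-class metrics for a fitted model or grid search
from sklearn.metrics import recall_score, precision_score
def report_fraud_metrics(model):
    pred = model.predict(test_x)
    print("Recall (fraud):   ", recall_score(test_y, pred))
    print("Precision (fraud):", precision_score(test_y, pred))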

4.1 Training with RandomForestClassifier

In [24]:
m1=train_and_evaluate(RandomForestClassifier,
                   params={'n_estimators':np.arange(10,40,5),
                          'max_features':np.arange(10,20,1)})

Train score : 0.8397435897435898
Test score : 0.78

4.2 Training with LogisticRegression

In [25]:
m2=train_and_evaluate(LogisticRegression,
                   params={
                          'C':np.linspace(0.1,1,10)}
                           )

Train score : 0.7371794871794872
Test score : 0.7333333333333333

4.3 Training with Support Vector Machine

In [26]:
m3=train_and_evaluate(SVC,
                   params={
                          'C':np.arange(1,10,1),
                          'gamma':np.linspace(0.01,0.1,10)
                   }
                           )

Train score : 0.7717948717948718
Test score : 0.66

In [27]:
# Having seen that the best of these three models is the random forest,
# we rank the features by importance
pd.DataFrame({'Feature':train_x.columns,
             'Value':m1.best_estimator_.feature_importances_}).sort_values(by='Value',ascending=False)


Out[27]:

    Feature  Value
14  V14      0.218701
17  V17      0.125888
4   V4       0.111668
19  V19      0.084166
16  V16      0.065130
29  Amount   0.052899
15  V15      0.038870
13  V13      0.038309
27  V27      0.034972
2   V2       0.027462
28  V28      0.022460
8   V8       0.022411
23  V23      0.018893
20  V20      0.018022
6   V6       0.016508
5   V5       0.012932
0   Time     0.012430
18  V18      0.011545
26  V26      0.009688
9   V9       0.009620
10  V10      0.009097
22  V22      0.007824
11  V11      0.007375
12  V12      0.006643
7   V7       0.004063
1   V1       0.003455
25  V25      0.003063
21  V21      0.002532
24  V24      0.001935
3   V3       0.001440
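The same ranking is easier to read as a bar chart; an optional sketch:

# optional: visualize the random-forest feature importances
importances = pd.Series(m1.best_estimator_.feature_importances_, index=train_x.columns)
importances.sort_values().plot.barh(figsize=(8, 10))
plt.xlabel('Importance')
plt.tight_layout()
plt.show()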

Module 5: Summary

I am going to summarize my findings and draw conclusions in a Q&A format, so that this experience is better recorded.

Q: What have you done and what have you learned?
A: To summarize what I have done: first I had to understand the properties of the given data so that I could build a plan for analyzing it, which is everything you see in Module 1. Then I went deeper into the data, for example by visualizing different variables against the target value with plots. Although that did not really pay off this time, it still gave me some insight, such as the fact that the problem is linearly inseparable, so that I could try to solve it with a different method. Throughout this process you can see that I kept adjusting the dataset so that it fit the requirements of each task; for example, without removing outliers the accuracy of the model would be affected, so all the steps I took were useful for the entire project. In the dimension reduction and classification parts, I applied the skills I learnt from lectures and tutorials. Apart from that, I spent quite a lot of time reading the documentation of some functions and libraries. Besides the official docs, I found that Stack Overflow and GeeksforGeeks provided many useful resources as well. This experience has certainly consolidated my coding skills and logical flow.

Q: What was the biggest difficulty in this project and how did you solve it?
A: Undoubtedly, I would say Module 2 was the most challenging part for me. There weren't many instructions for this module, which gave me more freedom to interpret it, and the dataset itself did not give me any hints for finding relations between the features (variables) and the target (Class). I was stuck on this part for quite a long time after plotting many different graphs, but I eventually came to understand that there is not always a strong relationship between a single variable and the target. Sometimes finding that the variables have little relation to the target is itself part of exploration and a valid analysis result. There is no fixed solution for a problem; the data only gives you insight into how you can use it to support your view.

Q: What do you think of annotation, and how does it help?
A: I have annotated each part step by step so that readers can understand what each part is doing. It also helps me to debug, because I can locate problems in the code quickly with those annotations.

Q: How did you go about solving this project?
A: My basic flow for solving a question and doing the analysis is to write all the steps down on paper first. During the process, I keep brainstorming about what could go wrong or which important factors I might be missing, and if there are any hints that can be followed, I stick strictly to the given procedure so that I don't digress from the topic or lose track of what I am doing.

Q: What are the main results?
A: As I stated above, random forest is currently the best-performing model among those I tested, but its rate of correctly classifying positive samples on the test set is about the same as logistic regression's, only around 65%.

Q: From the result, what have you discovered?
A: In my opinion, for this kind of fraud detection it is clearly more important to correctly classify a positive sample than a negative one, because if a fraudulent (positive) transaction is judged to be normal (negative), it may cause economic losses for the client and the company.

Q: What advice can you give to the credit card company?
A: If you want to avoid this situation, you need to increase the weight of the positive samples. But after doing so, more negative samples will be classified as positive, which means rejecting more customers' credit card applications and therefore losing some profit. To find a balance, it really depends on how your company chooses between risk and profit.
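As a concrete illustration of the weighting idea (a hedged sketch only, not something run in this notebook, and the weight of 5 is purely illustrative), scikit-learn exposes this trade-off through the class_weight parameter:

# up-weight the positive (fraud) class so that missed frauds are penalized more heavily
weighted_rf = RandomForestClassifier(class_weight={0: 1, 1: 5}, random_state=1234)  # weight 5 is illustrative
weighted_rf.fit(train_x, train_y)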
