topher nguyen project zillow

Using Python, I train a Random Forest classifier on labeled credit card transactions to detect fraudulent purchases. This matters in real-world applications because companies can operate more efficiently by catching fraud early and by investigating the model's false positives and false negatives.

Importing Libraries:

Imports necessary libraries for data manipulation, visualization, file handling, and machine learning.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import glob
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score 
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.metrics import confusion_matrix

Main Function:

Orchestrates the workflow by calling other functions.

def main():
	files = configure()
	data = eda(files)
	xTrain, xTest, yTrain, yTest = split(data)
	rcc(xTrain, xTest, yTrain, yTest)

Configure Function:

Retrieves the current working directory and lists all CSV files in it.

def configure():
	cwd = os.getcwd()
	files = glob.glob('*.csv')

	print(f'Opening the following directory {cwd}')
	print(f'The files found in this directory are: {files}')

	return(files)

EDA Function:

Performs exploratory data analysis on the CSV data: records the data shape and descriptive statistics, separates fraudulent from valid transactions, and saves a correlation-matrix heatmap. As written, the function returns inside the loop, so only the first CSV file found is analyzed.

def eda(files):
	for file in files:
		data = pd.read_csv(file)
		shape = data.shape
		description = data.describe()
		# Split rows by label: Class == 1 is fraud, Class == 0 is valid
		fraud = data[data['Class'] == 1]
		valid = data[data['Class'] == 0]
		FraudPercentage = len(fraud)/float(len(data))
		fraud_description = fraud.Amount.describe()
		valid_description = valid.Amount.describe()

		# Append the summary to a text report; the handle is named 'out'
		# so it does not shadow the loop variable 'file'
		with open('01_cc_eda.txt', 'a') as out:
			out.write(f'Data Shape = {shape}\n')
			out.write(f'Data Description = {description}\n\n')
			out.write(f'{FraudPercentage = :.2%}\n')
			out.write(f'Fraud Transactions = {len(fraud)}\n')
			out.write(f'Valid Transactions = {len(valid)}\n\n')
			out.write('Amount details of the fraudulent transaction = \n'
				f'{fraud_description}\n\n')
			out.write('Amount details of the valid transaction = \n'
				f'{valid_description}\n\n')

		# Save a heatmap of the pairwise feature correlations
		corrmat = data.corr()
		fig = plt.figure(figsize = (12, 9))
		sns.heatmap(corrmat, vmax = .8, square = True)
		plt.savefig('02_cc_eda_corr_map.png', bbox_inches='tight', pad_inches=0.0)
		plt.close(fig)

		# Returning inside the loop means only the first CSV file is analyzed
		return(data)
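
The class-balance step can be tried on its own with a small sketch like the one below. The toy DataFrame and the names `toy`, `fraud_share`, and `amount_by_class` are made up for illustration and are not part of the project; only the `Class` and `Amount` column names come from the code above.

```python
import pandas as pd

# Hypothetical toy data: five transactions, one labeled as fraud (Class == 1)
toy = pd.DataFrame({
    "Amount": [12.5, 80.0, 3.2, 250.0, 45.0],
    "Class": [0, 0, 0, 1, 0],
})

# Share of rows labeled as fraud (the mean of a boolean column)
fraud_share = (toy["Class"] == 1).mean()

# Descriptive statistics of Amount for each class in a single call
amount_by_class = toy.groupby("Class")["Amount"].describe()

print(f"{fraud_share = :.2%}")
print(amount_by_class)
```

Grouping by `Class` gives the same per-class `Amount` summaries as filtering the two subsets separately, just in one table.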

Split Function:

Splits the data into training and testing sets.

def split(data):
	X = data.drop(['Class'], axis = 1)
	Y = data["Class"]
	xData = X.values
	yData = Y.values
	xTrain, xTest, yTrain, yTest = train_test_split(xData, yData, test_size = 0.2, 
		random_state = 78)
	return(xTrain, xTest, yTrain, yTest)
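
When one class is rare, a plain random split can leave the test set with proportionally more or fewer fraud rows than the training set. One possible variation, not what the function above does, is to pass `stratify` to `train_test_split` so both sets keep the same class ratio. The toy arrays below are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 3 features, 2% positive ("fraud") labels
rng = np.random.default_rng(78)
x_data = rng.normal(size=(1000, 3))
y_data = np.array([1] * 20 + [0] * 980)

# stratify=y_data keeps the 2% positive rate in both the train and test sets
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=78, stratify=y_data
)

print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())
```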

Random Forest Classifier Function:

Trains the Random Forest classifier and evaluates its performance, calculating various metrics and generating a confusion matrix heatmap.

def rcc(xTrain, xTest, yTrain, yTest):
	rfc = RandomForestClassifier()
	rfc.fit(xTrain, yTrain)
	yPred = rfc.predict(xTest)
	acc = accuracy_score(yTest, yPred)
	prec = precision_score(yTest, yPred)
	rec = recall_score(yTest, yPred)
	f1 = f1_score(yTest, yPred)
	MCC = matthews_corrcoef(yTest, yPred)
	cr = classification_report(yTest, yPred)
	with open('03_cc_output.txt', 'a') as file:
		file.write(f'The model used is Random Forest classifier\n')
		file.write(f'The accuracy is {acc} \n')
		file.write(f'The precision is {prec}\n')
		file.write(f'The recall is {rec}\n')
		file.write(f'The F1-Score is {f1}\n')
		file.write(f'The Matthews correlation coefficient is {MCC}\n\n')
		file.write(f'The Classification Report is {cr}\n')

	LABELS = ['Normal', 'Fraud']
	conf_matrix = confusion_matrix(yTest, yPred)
	plt.figure(figsize =(12, 12))
	sns.heatmap(conf_matrix, xticklabels = LABELS,
				yticklabels = LABELS, annot = True, fmt ="d")
	plt.title("Confusion matrix")
	plt.ylabel('True class')
	plt.xlabel('Predicted class')
	plt.savefig('04_cc_conf_matrix.png', bbox_inches='tight', pad_inches=0.0)

Explanation:

The model used is a Random Forest classifier. The accuracy is 99.96%, meaning the model classifies 99.96% of the test transactions correctly. While impressive, accuracy alone can mislead on imbalanced datasets: when fraud is rare, a model that labels every transaction as valid also scores a very high accuracy, so it should be read alongside the other metrics below.
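
To see why accuracy needs context on imbalanced data, here is a minimal sketch with made-up labels (998 valid, 2 fraud; these are not the project's data). A "model" that always predicts valid reaches a high accuracy while catching no fraud at all.

```python
# Hypothetical labels: 0 = valid, 1 = fraud
y_true = [0] * 998 + [1] * 2

# A trivial baseline that predicts "valid" for every transaction
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of actual fraud cases that were flagged
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = caught / sum(y_true)

print(f"accuracy = {accuracy:.1%}")  # accuracy = 99.8%
print(f"recall = {recall:.1%}")      # recall = 0.0%
```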

The precision is 97.73%. A precision of 97.73% indicates that when the model predicts a transaction as fraudulent, it is correct 97.73% of the time. High precision means fewer false positives.

The recall is 78.90%. A recall of 78.90% indicates that the model correctly identifies 78.90% of the actual fraudulent transactions, so roughly one in five fraud cases is missed (false negatives). Recall being lower than precision means the model is more likely to miss a fraud than to raise a false alarm.
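
Precision and recall both come straight from confusion-matrix counts. The counts below are hypothetical: they were chosen only because they reproduce the percentages quoted above, and the actual confusion matrix is not given in this write-up.

```python
# Hypothetical counts, picked to match the reported percentages
tp = 86  # fraud correctly flagged (true positives)
fp = 2   # valid transactions wrongly flagged (false positives)
fn = 23  # fraud the model missed (false negatives)

# Precision: of everything flagged as fraud, how much really was fraud
precision = tp / (tp + fp)

# Recall: of all actual fraud, how much was flagged
recall = tp / (tp + fn)

print(f"precision = {precision:.2%}")  # precision = 97.73%
print(f"recall = {recall:.2%}")        # recall = 78.90%
```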

The F1-Score is 87.31%. The F1-Score is the harmonic mean of precision and recall, so it is pulled toward the lower of the two (here, recall). A value of 87.31% indicates that the model identifies fraud reasonably well while keeping false positives low, with missed fraud (false negatives) as its main weakness.
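
As a quick consistency check, the reported F1-Score can be recomputed from the reported precision and recall using the harmonic-mean formula F1 = 2PR / (P + R). The variable names are illustrative.

```python
# Reported values from the text, as fractions
precision = 0.9773
recall = 0.7890

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# The arithmetic mean is shown for comparison: F1 sits below it
# whenever precision and recall differ
arithmetic_mean = (precision + recall) / 2

print(f"F1 = {f1:.2%}")
print(f"arithmetic mean = {arithmetic_mean:.2%}")
```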

The Matthews correlation coefficient is 0.878. An MCC of 0.878 indicates a strong correlation between the observed and predicted classifications, demonstrating the model's effectiveness. The value ranges from -1 to 1, with 1 being a perfect prediction, 0 no better than random, and -1 indicating total disagreement between prediction and observation.
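
The three reference points of the MCC scale can be checked with a small sketch that computes the coefficient from confusion-matrix counts. The helper `mcc` and all counts are hypothetical and for illustration only; the project itself uses `matthews_corrcoef` from scikit-learn.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # A common convention: report 0 when the denominator is 0
    return numerator / denominator if denominator else 0.0

perfect = mcc(tp=50, tn=50, fp=0, fn=0)       # every prediction correct
chance = mcc(tp=25, tn=25, fp=25, fn=25)      # no better than random
inverted = mcc(tp=0, tn=0, fp=50, fn=50)      # every prediction wrong

print(perfect, chance, inverted)  # 1.0 0.0 -1.0
```

Because the formula uses all four cells of the confusion matrix, it stays informative even when one class is much larger than the other.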

topher nguyen data scientist