Exploring fraud detection through hands-on modeling and smart tradeoffs.
Download the data
TL;DR
I trained a model to detect fraudulent credit card transactions, even though fraud made up less than 1% of the data. By carefully adjusting decision thresholds and evaluating metrics like precision and recall, I built a classifier that caught 80% of fraud cases while keeping false alarms low, demonstrating how machine learning can support real-world risk detection and help balance the business costs of fraud against customer experience.
Key Skills
- Ensemble Learning with Random Forests: trained and tuned a Random Forest classifier to detect rare fraud cases in heavily imbalanced data
- Imbalanced Classification: used precision-recall curves and threshold tuning to optimize detection of minority-class events
- Classifier Evaluation: interpreted ROC AUC, F1, and confusion matrices to balance recall and false positive rate in a high-stakes context
- Model Transparency & Risk Tradeoffs: explored threshold-setting as a policy lever, connecting model outputs to real-world operational goals and business costs
What I Learned
Balancing performance metrics like precision and recall with real-world business costs is crucial for building effective, practical fraud detection models.
What I did
Have you ever had that sinking feeling? Wait, I didn't buy that, what's going on? Oh no...
Fraud sucks. It's one of the few things in the tech world that's unequivocally bad: people stealing money. No ethical grey areas; no "but it improves engagement." Just theft. The world would be better without fraud. Period.

But as far as data problems go? Fraud is awesome. It's tidy. It's well-structured. And, at least if you ask the person who owns the credit card, it's pretty easy to label: either a transaction was fraudulent, or it wasn't. Because of this binary nature (0 = legit, 1 = fraud), fraud detection is beautifully suited for modeling and machine learning.
I happen to love binary classification problems. There's something so... crisp about them. A light switch is either on or off. A coin lands on heads or tails. The Dark Knight is either an extremely awesome movie, or the best movie of all time.
That simplicity contrasts with continuous prediction problems, like guessing the exact temperature tomorrow. There will always be some error, and my fellow probability nerds know: the chance that a continuous variable hits one exact value is literally zero. Thanks to floating-point precision, the odds that it'll be exactly 67.000000°F at 4:00 PM? Basically nil. It might be 67.1203948326490°F instead.
Anyway, I took on this fraud prediction challenge for two reasons:
- I love classification, and
- I wanted to sharpen up my understanding of classifier performance metrics.
We'll get into what those metrics are soon, but in short: they're how we know whether the model's actually doing its job. Spoiler: raw accuracy doesn't cut it, especially when you're trying to predict something rare, like fraud.
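To see why, here's a quick toy sketch (my own illustration, not part of the project code): a "model" that never flags fraud gets near-perfect accuracy on data this imbalanced while catching exactly zero fraud.
# Toy example: why raw accuracy misleads with rare events (not part of the project code)
import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.0017).astype(int)  # ~0.17% fraud, like the real dataset
y_naive = np.zeros_like(y_true)                      # always predict "legit"

accuracy = (y_naive == y_true).mean()                # ~0.998, looks fantastic
recall = y_naive[y_true == 1].mean()                 # 0.0, catches no fraud at all
print(accuracy, recall)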
Let's get into it.
How I did it
I grabbed a credit card fraud dataset off Kaggle, which you can find here.
I'll show snippets of the Python code I used to work with the data, but here's the first thing to know: this dataset comes fully preprocessed. That's a big deal, because in most real-world projects, cleaning and preparing the data takes up something like 80% of the work.
Since I wanted to focus on modeling and evaluation, skipping the data wrangling made sense here.
For the full Python script, see here.
First, we load the relevant libraries and the data:
# Import libraries
import numpy as np
import pandas as pd
import joblib
import os
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
# Import data
d = pd.read_csv('post_data/creditcard.csv')
Next, I'm taking a peek at a subset of the rows and columns.
In reality, we have columns V1 through V28. There's also Time, which represents the seconds elapsed since the first transaction in the dataset, Amount, which is the dollar amount of the transaction, and Class, which is our target label: 0 for legit, 1 for fraud.
Now, you might look at the table preview and think, "Huh, looks like fraud is about as common as normal transactions."
Not even close.
Only 0.17% of the 284,807 transactions in this dataset are actually fraudulent. For demonstration purposes, I've just selected an equal number of legit and fraud rows to show side-by-side. That balance makes it easier to eyeball patterns, but it's not how the full dataset looks.
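The post doesn't show the preview code itself, but a balanced peek like that could be pulled together along these lines (a sketch; the column selection is arbitrary):
# Sketch: build a balanced preview for display only (the post's exact preview code isn't shown)
preview = pd.concat([
    d[d['Class'] == 0].sample(5, random_state=42),  # a handful of legit transactions
    d[d['Class'] == 1].sample(5, random_state=42),  # an equal handful of fraud cases
])
print(preview[['V1', 'V2', 'V3', 'Amount', 'Class']])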
You'll also notice all those V columns, V1 through V28. These are the result of PCA transformations on the original transaction data.
Explaining PCA fully would take more than a sentence, but here's the gist: PCA is like Marie Kondo for your dataset. It takes a messy pile of potentially correlated variables and distills them into a few compact, uncorrelated components that still capture most of the action, so you can model smarter with less clutter. And because these components are combinations of the original inputs, they don't reveal sensitive details directly, which makes PCA a handy privacy-preserving step too.
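PCA isn't part of this project's pipeline, since the Kaggle data already arrives transformed, but here's a minimal self-contained sketch of the idea: several correlated columns collapsing into a couple of components that keep nearly all the variance.
# Standalone PCA illustration (not part of this project's pipeline)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 1))
noisy_copies = [base + rng.normal(scale=0.1, size=(1000, 1)) for _ in range(5)]
raw = np.hstack(noisy_copies)            # five highly correlated columns

pca = PCA(n_components=2)
components = pca.fit_transform(raw)      # two uncorrelated components
print(pca.explained_variance_ratio_)     # first component captures nearly all the variance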
I'm simplifying things a bit below by only including the V variables (dropping Time), and separating Class into its own vector that I'll use for modeling.
At this point I'm also stratifying the data into training and testing sets.
As mentioned earlier, we're dealing with heavily imbalanced classes: the vast majority of transactions are legit, and only a tiny fraction are fraud. This imbalance can really trip up many classifiers, so we need to make a few adjustments to help the model out.
First, I use stratify=y when splitting the data. This ensures that both the training and testing sets contain the same proportion of fraud cases, which is crucial when those cases are rare. Without this, your test set could end up with almost no fraud at all, making it useless for evaluation.
# --- FORMAT DATA --- #
# Leave out time
features = [x for x in d.columns if x not in ['Time', 'Class']]
X = d[features].to_numpy()
y = d['Class'].to_numpy()
# Stratify y to ensure that the very few positive observations are balanced
# across train and test
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
Next comes the tricky part: actually balancing the data for training.
There are several strategies here, but I went with undersampling the majority class. Here's a quick analogy: imagine you're filling a bag with red and blue marbles. You've got way more blue than red. To get a balanced bag, you can keep all the red marbles and randomly select just as many blue ones. Boom, balance achieved.
That's exactly what I did. I kept all the fraud cases (the rare red marbles), and randomly sampled an equal number of legit transactions (the common blue marbles). This gives me a balanced training set with a 50/50 class split.
# --- RESAMPLE --- #
# Separate legit and fraud classes
idx_legit = np.where(y_train_full==0)[0]
idx_fraud = np.where(y_train_full==1)[0]
# Keep only as many legit observations as there are fraud cases
idx_legit_sampled = np.random.choice(idx_legit, size=len(idx_fraud),
                                     replace=False)
# Put the indices back together and shuffle
idx = np.concatenate([idx_legit_sampled, idx_fraud])
np.random.shuffle(idx)
# Subsample to make balanced training data
X_train = X_train_full[idx]
y_train = y_train_full[idx]
Now it's time to train the model. I'm going with a random forest, which is a fancy way of saying "a whole bunch of decision trees working together as a team." This ensemble method is a classic for fraud detection, and for good reason: it's fast, handles messy data well, and can capture complex patterns without too much tuning.
To make sure my trees aren't just memorizing the training data (aka overfitting), I'm tuning a couple of important knobs using k-fold cross-validation, basically a systematic way of testing how well the model might generalize to unseen data.
The first knob is max_depth, which controls how deep each tree is allowed to grow. Think of a super deep tree like a conspiracy theorist: it connects everything with high confidence, but often gets it wrong in the real world. We want trees that are smart, but not paranoid.
The second knob is min_samples_leaf, which says "Hey, don't end a decision path unless you've seen at least this many examples." A higher value here helps smooth things out and prevents the model from latching onto quirks in tiny subgroups.
I'm building a forest with 300 trees, plenty to capture stable patterns without pushing my laptop into meltdown. In general, more trees = better performance, but also more compute time. So I'm balancing performance and pragmatism here.
# --- FIT RANDOM FOREST WITH GRID SEARCH CV --- #
param_grid = {
    'max_depth': [5, 10, 20, None],
    'min_samples_leaf': [1, 5, 10]
}
grid = GridSearchCV(RandomForestClassifier(n_estimators=300),
                    param_grid,
                    scoring='roc_auc',
                    cv=5)
# Load from file if it exists to reduce processing time
if not os.path.exists('post_data/grid.pkl'):
    grid.fit(X_train, y_train)
    joblib.dump(grid, 'post_data/grid.pkl')
else:
    grid = joblib.load('post_data/grid.pkl')
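The original snippet doesn't show it, but once the grid search has run (or been loaded from disk), the winning hyperparameters and their cross-validated score are a quick attribute lookup away:
# Inspect the grid search results (not shown in the original snippet)
print(grid.best_params_)   # the chosen max_depth and min_samples_leaf
print(grid.best_score_)    # mean cross-validated ROC AUC for that combination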
Next up, I take the trained model and let it loose on the test data. Unlike the balanced training set, this test data reflects the real-world class imbalance: way more legit transactions than fraud. That's important! It means we're now evaluating the model under the same conditions it would face in the wild. If it can spot fraud here, it's doing something right.
What I'm getting back from the model are two things:
- y_pred: the final yes-or-no predictions about whether a transaction is fraud, and
- y_proba: the model's estimated probabilities that each transaction is fraud.
Now, y_pred is what we ultimately care about: it's the model making a decision. But having access to y_proba gives me a lot more flexibility. Instead of being locked into a single cutoff (like "anything over 0.5 = fraud"), I can explore different thresholds and see how the model performs across the board. That's super helpful for tuning model evaluation metrics, which we'll see next.
# --- PREDICT ON TEST DATA --- #
# Both classification and probabilities
y_pred = grid.predict(X_test)
y_proba = grid.predict_proba(X_test)[:, 1]
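As a quick sketch of that flexibility (reusing y_proba from above; the 0.82 cutoff anticipates the cost-optimal threshold found later), turning probabilities into decisions at a custom threshold is a one-liner:
# Sketch: hard predictions at a custom threshold instead of the default 0.5 cutoff
custom_threshold = 0.82                                 # e.g., the cost-optimal value found below
y_pred_custom = (y_proba >= custom_threshold).astype(int)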
What I found
Now it's time to see how well the model did at spotting fraud in the test set. Since one of my goals with this project was to brush up on classifier evaluation metrics, I decided to write my own instead of relying on the built-in ones from sklearn. I find that rolling my own forces me to really understand what's going on under the hood.
Performance metrics for binary classification can be deceptively tricky, so it might be worth spelling some of them out and explaining why plain old accuracy doesn't cut it.
Understanding classifier performance metrics
When we're trying to detect a binary signal (e.g., is the light switch on or off, is the transaction fraudulent or legit?), there are four possible outcomes.
| | Signal positive | Signal negative |
|---|---|---|
| Detect positive | True positive | False positive |
| Detect negative | False negative | True negative |
These four outcomes are the basic building blocks we use to evaluate model performance. From them, we can answer questions like:
- "How much fraud is the model actually catching?"
- "How often does it cry wolf and flag legit transactions as fraud?"
Those questions correspond to two critical metrics, recall and precision:
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
As we'll see below, these metrics often trade off against each other: higher recall can mean lower precision, and vice versa. That's why we often visualize them as curves across different decision thresholds. It makes the tradeoffs tangible.
At this point, I'm switching over to R, because when it comes to visualization, I'm a big ggplot2 head. matplotlib and seaborn are cool, but ggplot2 just hits different.
I'm showing just a snippet of my evaluation functions below. For the full R script, see here.
get_metrics <- function(y_test, y_pred) {
  # Obtain a variety of performance metrics
  # Performance on positive cases
  tp <- sum(y_test == 1 & y_pred == 1)
  fn <- sum(y_test == 1 & y_pred == 0)
  # Performance on negative cases
  tn <- sum(y_test == 0 & y_pred == 0)
  fp <- sum(y_test == 0 & y_pred == 1)
  # COMPUTE PRECISION, RECALL, FALSE POSITIVE RATE
  # Of all categorized positive, how many were correct?
  precision <- tp / (tp + fp)
  # Of all actual positives, how many were correctly classified?
  recall <- tp / (tp + fn)
  # Of all actual negatives, how many were incorrectly flagged as positive?
  fpr <- fp / (fp + tn)
  f1 <- get_f1(precision, recall)
  out <- list(precision=precision, recall=recall, fpr=fpr, f1=f1,
              fn=fn, fp=fp)
  return(out)
}

get_curve <- function(threshold, preds, score='roc') {
  # Computes either ROC curve or precision-recall curve
  y_test <- preds$y_test
  y_proba <- preds$y_proba
  y_pred <- ifelse(y_proba > threshold, 1, 0)
  # Compute and extract metrics
  metrics <- get_metrics(y_test, y_pred)
  precision <- metrics$precision
  recall <- metrics$recall
  fpr <- metrics$fpr
  if (score == 'roc') {
    out <- data.frame(threshold=threshold, recall=recall, fpr=fpr)
  } else {
    out <- data.frame(threshold=threshold, precision=precision, recall=recall)
  }
  return(out)
}
Visualizing classifier performance metrics
I'm first plotting the distribution of the model's predicted probabilities. Recall that the vast majority of transactions in the dataset are legitimate, so we'd expect most probabilities to be around zero. And that's what we see here. Good sanity check.
preds <- py$preds
p <- preds %>%
ggplot(aes(x = y_proba)) +
geom_histogram(fill = 'steelblue', color = 'black') +
labs(
x = 'Probability of fraud',
y = 'Frequency'
) +
theme_bw() +
theme(panel.grid = element_blank(),
axis.ticks = element_blank(),
text = element_text(size = text_size))
ggplotly(p)
Next, I'm looking at how the model's predicted probabilities map onto actual fraud transactions. The size of each dot indicates how many transactions it represents and, as expected, most transactions are legitimate.
bins <- cut(preds$y_proba, breaks = seq(0, 1, length.out=50), include.lowest=TRUE)
preds$bins <- bins
p <- preds %>%
group_by(bins) %>%
summarize(y_proba = mean(y_proba), y_prop = mean(y_test), count = n()) %>%
ggplot(aes(x = y_proba, y = y_prop)) +
geom_point(aes(text = paste0('Observations: ', count), size = count, color = count)) +
labs(
x = 'Mean predicted value',
y = 'Proportion of frauds',
color = 'Number of\nobservations'
) +
theme_bw() +
theme(panel.grid = element_blank(),
axis.ticks = element_blank(),
text = element_text(size = text_size))
ggplotly(p, tooltip = 'text')
We see that the model assigns low-to-medium probabilities to non-fraud transactions. As the incidence of fraud increases, the predicted probability quickly spikes.
Next I'm plotting what's called the ROC curve, which is a fancy name for visualizing the true positive rate (i.e., recall) over the false positive rate. Each point on this plot represents a different level of "thresholding" the model's predicted probabilities.
Thresholding is just the process of deciding how high a predicted probability needs to be before we say, "yep, this one's fraud." The model gives us probabilities between 0 and 1 for each transaction; thresholding is where we draw the line. If we set the threshold low, we'll catch more fraud (high recall), but we might also flag more legit transactions by mistake (higher false positives). A higher threshold does the opposite. By adjusting this threshold, we control how sensitive the model is.
Good thresholds are the ones that push the model's performance up toward the top-left corner of the plot: that means it's catching lots of fraud (true positives) while rarely crying wolf (false positives). You can hover over each point (and click and drag to zoom) to explore the threshold values behind the scenes.
thresholds <- seq(0, 1, .01)
roc <- do.call(rbind, lapply(thresholds, get_curve, preds))
green <- qual[4]
auc <- trapz_auc(roc$fpr, roc$recall)
p <- roc %>%
ggplot(aes(x = fpr, y = recall)) +
geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = 'lightgrey') +
geom_point(color = green, aes(text = paste0('FPR: ', round(fpr,3), '\nTPR: ', round(recall, 2),
'\nThreshold: ', threshold, '\nAUC: ', round(auc, 3)))) +
labs(
x = 'False positive rate',
y = 'True positive rate (Recall)'
) +
annotate('text', x = .5, y = .7, label = paste0('AUC: ', round(auc, 3)), size = 6) +
theme_bw() +
theme(axis.ticks = element_blank(),
text = element_text(size = text_size))
ggplotly(p, tooltip = 'text')
The dashed diagonal line shows what you'd get if the model were just guessing randomly: no better than flipping a coin. The AUC, or Area Under the Curve, sums up how much better we're doing than that. More area under the curve = better model. Simple as that.
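As a cross-check back on the Python side, the same area can be computed directly with scikit-learn (a sketch; the post computes it in R with a trapz_auc helper that isn't shown here):
# Sketch: AUC computed directly from the test-set probabilities
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_proba))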
Next up is the precision-recall curve. This one's especially important in fraud detection because of the steep class imbalance; ROC curves can give an overly rosy picture when most cases are legitimate, but precision-recall plots keep the focus where it matters: how well we're catching the rare frauds.
The precision (y-axis) tells us the proportion of our fraud predictions that are actually correct. Lower precision means we're bugging more legit customers unnecessarily. The recall (x-axis) tells us the proportion of true frauds our model is catching. Lower recall means more fraud cases are slipping through.
pr <- do.call(rbind, lapply(thresholds, get_curve, preds, score='pr'))
labels <- seq(0, 1, .1)
labels[seq(2, length(labels), 2)] <- ''
p <- pr %>%
ggplot(aes(x = recall, y = precision)) +
geom_point(color = green, aes(text = paste0('Precision: ', precision,
'\nRecall: ', recall,
'\nThreshold: ', threshold))) +
scale_x_continuous(breaks = seq(0, 1, .1)) +
scale_y_continuous(breaks = seq(0, 1, .1), limits = c(0, 1)) +
labs(
x = 'Proportion of true fraud detected (Recall)',
y = 'Proportion of correct\nfraud detections (Precision)'
) +
theme_bw() +
theme(axis.ticks = element_blank(),
text = element_text(size = text_size))
ggplotly(p, tooltip = 'text')
We can see that as we lower the decision threshold, meaning we're more willing to label transactions as fraud, recall goes up: we catch more true fraud. But precision drops, because we also start flagging more legit transactions by mistake. There's a tradeoff. A pretty high threshold (~0.95) gives us a balance of around 0.75 precision and 0.75 recall. Not bad.
The next plot brings in a business perspective. Suppose we talked to the fraud and customer care teams, and they gave us rough estimates for the cost of each kind of error. For example, let's say it costs $10 to investigate a legit transaction that was mistakenly flagged, and $500 for every missed fraud case. Given these estimates, we can compute the total cost of mistakes for every decision threshold.
That's what I'm plotting here. Each point represents a threshold and its corresponding total cost. We can hover to see the numbers, and identify the threshold that minimizes cost overall. In this case, that optimal point lands at a decision threshold of 0.82.
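The get_cost helper used below isn't shown in the R snippet; conceptually it boils down to something like this Python sketch, using the same $10 false-alarm and $500 missed-fraud estimates from above:
# Conceptual sketch of the cost calculation (Python analogue of the unshown R get_cost helper)
def total_cost(threshold, y_true, y_proba, fp_cost=10, fn_cost=500):
    y_hat = (y_proba >= threshold).astype(int)
    fp = int(((y_hat == 1) & (y_true == 0)).sum())   # legit transactions flagged by mistake
    fn = int(((y_hat == 0) & (y_true == 1)).sum())   # fraud cases the model missed
    return fp * fp_cost + fn * fn_cost

thresholds_py = np.arange(0, 1.01, 0.01)
costs_py = [total_cost(t, y_test, y_proba) for t in thresholds_py]
print(thresholds_py[int(np.argmin(costs_py))])       # threshold with the lowest total cost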
thresholds <- seq(0, 1, .01)
type1_cost <- 10
type2_cost <- 500
costs <- do.call(rbind, lapply(thresholds, FUN=get_cost, preds, type1_cost, type2_cost))
optimum <- costs[costs$cost == min(costs$cost),]$threshold
mark <- data.frame(threshold = .82, cost = 1.5e+05,
label = paste0('Optimal decision\nthreshold: ', optimum))
p <- costs %>%
mutate(optimal = ifelse(cost == min(cost), 'yes', 'no')) %>%
ggplot(aes(x = threshold, y = cost)) +
geom_point(aes(size = optimal, color = optimal, shape = optimal,
text = paste0('Decision threshold: ', threshold, '\nCost: ', cost))) +
geom_text(data = mark, aes(label = label)) +
labs(
x = 'Decision threshold',
y = 'Cost ($)'
) +
scale_size_manual(values = c(`no` = 1, `yes` = 7)) +
scale_shape_manual(values = c(`no` = 16, `yes` = 8)) +
scale_color_manual(values = c(`no` = 'black', `yes` = 'green')) +
theme_bw() +
theme(legend.position = 'none',
panel.grid = element_blank(),
axis.ticks = element_blank(),
text = element_text(size = text_size))
ggplotly(p, tooltip = 'text')
Using that 0.82 threshold, we can calculate how the model performs in terms of raw outcomes. Specifically, we can count how many legit transactions were correctly ignored, how many were falsely flagged, and how many fraud cases were correctly or incorrectly classified. This summary is shown in a table known as a confusion matrix:
| | Detected legitimate | Detected fraud |
|---|---|---|
| Truly legitimate | 70946 | 133 |
| Truly fraud | 18 | 105 |
Out of around 71,000 transactions, we missed 18 frauds and unnecessarily flagged 133 legit ones. That's a pretty decent balance, especially considering how rare fraud is in the dataset.
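For reference, the same kind of table can be produced straight from the Python objects (a sketch reusing y_test and y_proba; exact counts will depend on the random seeds and split):
# Sketch: confusion matrix at the chosen 0.82 threshold
from sklearn.metrics import confusion_matrix
y_pred_082 = (y_proba >= 0.82).astype(int)
print(confusion_matrix(y_test, y_pred_082))          # rows: true class, columns: predicted class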
What it means
This model shows promising potential for real-world fraud detection, especially given how challenging the problem is with less than 1% of transactions actually fraudulent. By carefully tuning the decision threshold, we can balance catching most fraud cases while minimizing the number of false alarms that inconvenience legitimate customers.
The precision-recall curve highlights the inherent tradeoff: if we want to catch nearly all fraud (high recall), we'll inevitably flag more legit transactions by mistake (lower precision). But by incorporating real business costs for these errors, we can choose a threshold that minimizes overall loss, not just errors in isolation.
At the optimal threshold we identified, the model would catch roughly 80% of fraud cases while only bothering a small fraction of legitimate customers. Missing 18 fraud cases across 71,000 transactions is far from perfect, but it's a solid starting point that significantly reduces risk compared to ignoring fraud altogether. And only 133 false alarms means the customer experience stays mostly smooth.
In practice, this threshold could be adjusted dynamically based on operational priorities, say, during peak seasons when customer friction is especially costly, or when fraud activity spikes. The model's outputs provide a flexible lever for risk management teams.
Overall, this project illustrates how machine learning combined with thoughtful evaluation metrics and business context can produce tools that meaningfully support fraud prevention efforts. It's a reminder that performance numbers alone don't tell the full story: understanding costs and consequences is key to deploying models that truly add value.