Exploring fraud detection through hands-on modeling and smart tradeoffs.

💾 Download the data

TL;DR

I trained a model to detect fraudulent credit card transactions, even though fraud made up less than 1% of the data. By carefully adjusting decision thresholds and evaluating metrics like precision and recall, I built a classifier that caught roughly 85% of fraud cases while keeping false alarms low, demonstrating how machine learning can support real-world risk detection and balance the business costs of fraud against customer experience.

Key Skills

  • 🌲 Ensemble Learning with Random Forests – Trained and tuned a Random Forest classifier to detect rare fraud cases in heavily imbalanced data

  • ⚖️ Imbalanced Classification – Used precision–recall curves and threshold tuning to optimize detection of minority-class events

  • 📉 Classifier Evaluation – Interpreted ROC AUC, F1, and confusion matrices to balance recall and false positive rate in a high-stakes context

  • 🧪 Model Transparency & Risk Tradeoffs – Explored threshold-setting as a policy lever, connecting model outputs to real-world operational goals and business costs

What I Learned

Balancing performance metrics like precision and recall with real-world business costs is crucial for building effective, practical fraud detection models.

What I did

Have you ever had that sinking feeling? Wait, I didn’t buy that, what’s going on? Oh no…

Fraud sucks. It’s one of the few things in the tech world that’s unequivocally bad: people stealing money. No ethical grey areas; no “but it improves engagement.” Just theft. The world would be better without fraud. Period.

But as far as data problems go? Fraud is awesome. It’s tidy. It’s well-structured. And — at least if you ask the person who owns the credit card — it’s pretty easy to label: either a transaction was fraudulent, or it wasn’t. Because of this binary nature (0 = legit, 1 = fraud), fraud detection is beautifully suited for modeling and machine learning.

I happen to love binary classification problems. There’s something so… crisp about them. A light switch is either on or off. A coin lands on heads or tails. The Dark Knight is either an extremely awesome movie — or the best movie of all time.

That simplicity contrasts with continuous prediction problems, like guessing the exact temperature tomorrow. There will always be some error — and my fellow probability nerds know: the chance that a continuous variable hits one exact value is literally zero. The odds that it’ll be exactly 67.000000°F at 4:00 PM? Basically nil. It’ll be something like 67.1203948326490°F instead.

Anyway, I took on this fraud prediction challenge for two reasons:

  1. I love classification, and

  2. I wanted to sharpen up my understanding of classifier performance metrics.

We’ll get into what those metrics are soon, but in short: they’re how we know whether the model’s actually doing its job. Spoiler: raw accuracy doesn’t cut it — especially when you’re trying to predict something rare, like fraud.

Let’s get into it.

How I did it

I grabbed a credit card fraud dataset off Kaggle; you can find it here.

I’ll show snippets of the Python code I used to work with the data, but here’s the first thing to know: this dataset comes fully preprocessed. That’s a big deal, because in most real-world projects, cleaning and preparing the data takes up like 80% of the work.

Since I wanted to focus on modeling and evaluation, skipping the data wrangling made sense here.

For the full Python script, see here.

First, we load the relevant libraries and the data:

# Import libraries
import numpy as np
import pandas as pd
import joblib
import os
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Import data
d = pd.read_csv('post_data/creditcard.csv')

Next, I’m taking a peek at a subset of the rows and columns.

In reality, we have columns V1 through V28. There’s also Time, which records the seconds elapsed between each transaction and the first transaction in the dataset, Amount, which is the dollar amount of the transaction, and Class, which is our target label: 0 for legit, 1 for fraud.

Now — you might look at the table preview and think, “Huh, looks like fraud is about as common as normal transactions.”

Not even close.

Only 0.17% of the 284,807 transactions in this dataset are actually fraudulent. For demonstration purposes, I’ve just selected an equal number of legit and fraud rows to show side-by-side. That balance makes it easier to eyeball patterns — but it’s not how the full dataset looks.

You’ll also notice all those V columns — V1 through V28. These are the result of PCA transformations on the original transaction data.

Explaining PCA fully would take more than a sentence, but here’s the gist: PCA is like Marie Kondo for your dataset. It takes a messy pile of potentially correlated variables and distills them into a few compact, uncorrelated components that still capture most of the action — so you can model smarter with less clutter. And because these components are combinations of the original inputs, they don’t reveal sensitive details directly, which makes PCA a handy privacy-preserving step too.

I’m simplifying things a bit below by dropping Time and keeping the remaining features (the V components plus Amount), and separating Class into its own vector that I’ll use for modeling. At this point I’m also stratifying the data into training and testing sets.

As mentioned earlier, we’re dealing with heavily imbalanced classes — the vast majority of transactions are legit, and only a tiny fraction are fraud. This imbalance can really trip up many classifiers, so we need to make a few adjustments to help the model out.

First, I use stratify = y when splitting the data. This ensures that both the training and testing sets contain the same proportion of fraud cases — crucial when those cases are rare. Without this, your test set could end up with almost no fraud at all, making it useless for evaluation.


# --- FORMAT DATA --- #

# Leave out time
features = [x for x in d.columns if x not in ['Time', 'Class']]

X = d[features].to_numpy()
y = d['Class'].to_numpy()

# Stratify y to ensure that the very few positive observations are balanced 
# across train and test
X_train_full, X_test, y_train_full, y_test = train_test_split(
        X, y, stratify=y, random_state=42
)

Next comes the tricky part: actually balancing the data for training.

There are several strategies here, but I went with undersampling the majority class. Here’s a quick analogy: imagine you’re filling a bag with red and blue marbles. You’ve got way more blue than red. To get a balanced bag, you can keep all the red marbles and randomly select just as many blue ones. Boom — balance achieved.

That’s exactly what I did. I kept all the fraud cases (the rare red marbles), and randomly sampled an equal number of legit transactions (the common blue marbles). This gives me a balanced training set with a 50/50 class split.

# --- RESAMPLE --- #

# Separate legit and fraud classes
idx_legit = np.where(y_train_full==0)[0]
idx_fraud = np.where(y_train_full==1)[0]

# Sample as many legit observations as there are fraud cases
idx_legit_sampled = np.random.choice(idx_legit, size=len(idx_fraud),
                                     replace=False)

# Put the indices back together and shuffle
idx = np.concatenate([idx_legit_sampled, idx_fraud])
np.random.shuffle(idx)

# Subsample to make balanced training data
X_train = X_train_full[idx]
y_train = y_train_full[idx]

Now it’s time to train the model. I’m going with a random forest, which is a fancy way of saying “a whole bunch of decision trees working together as a team.” This ensemble method is a classic for fraud detection, and for good reason — it’s fast, handles messy data well, and can capture complex patterns without too much tuning.

To make sure my trees aren’t just memorizing the training data (aka overfitting), I’m tuning a couple of important knobs using k-fold cross-validation — basically a systematic way of testing how well the model might generalize to unseen data.

The first knob is max_depth, which controls how deep each tree is allowed to grow. Think of a super deep tree like a conspiracy theorist — it connects everything with high confidence, but often gets it wrong in the real world. We want trees that are smart, but not paranoid.

The second knob is min_samples_leaf, which says “Hey, don’t end a decision path unless you’ve seen at least this many examples.” A higher value here helps smooth things out and prevents the model from latching onto quirks in tiny subgroups.

I’m building a forest with 300 trees — plenty to capture stable patterns without pushing my laptop into meltdown. In general, more trees = better performance, but also more compute time. So I’m balancing performance and pragmatism here.


# --- FIT RANDOM FOREST WITH GRID SEARCH CV --- #

param_grid = {
    'max_depth': [5, 10, 20, None],
    'min_samples_leaf': [1, 5, 10]
}

grid = GridSearchCV(RandomForestClassifier(n_estimators=300),
                    param_grid,
                    scoring='roc_auc',
                    cv=5)


# Load from file if it exists to reduce processing time
if not os.path.exists('post_data/grid.pkl'):
    grid.fit(X_train, y_train)
    joblib.dump(grid, 'post_data/grid.pkl')
else:
    grid = joblib.load('post_data/grid.pkl')

Next up, I take the trained model and let it loose on the test data. Unlike the balanced training set, this test data reflects the real-world class imbalance — way more legit transactions than fraud. That’s important! It means we’re now evaluating the model under the same conditions it would face in the wild. If it can spot fraud here, it’s doing something right.

What I’m getting back from the model are two things:

  • y_pred: the final yes-or-no predictions about whether each transaction is fraud, and

  • y_proba: the model’s estimated probability that each transaction is fraud.

Now, y_pred is what we ultimately care about—it’s the model making a decision. But having access to y_proba gives me a lot more flexibility. Instead of being locked into a single cutoff (like “anything over 0.5 = fraud”), I can explore different thresholds and see how the model performs across the board. That’s super helpful for tuning model evaluation metrics, which we’ll see next.

# --- PREDICT ON TEST DATA --- #

# Both classification and probabilities
y_pred = grid.predict(X_test)
y_proba = grid.predict_proba(X_test)[:, 1]

What I found

Now it’s time to see how well the model did at spotting fraud in the test set. Since one of my goals with this project was to brush up on classifier evaluation metrics, I decided to write my own instead of relying on the built-in ones from sklearn. I find that rolling my own forces me to really understand what’s going on under the hood.

Performance metrics for binary classification can be deceptively tricky, so it might be worth spelling some of them out and explaining why plain old accuracy doesn’t cut it.

Understanding classifier performance metrics

When we’re trying to detect a binary signal—e.g., is the light switch on or off, is the transaction fraudulent or legit—there are four possible outcomes.

|                 | Signal positive | Signal negative |
|-----------------|-----------------|-----------------|
| Detect positive | True positive   | False positive  |
| Detect negative | False negative  | True negative   |


These four outcomes are the basic building blocks we use to evaluate model performance. From them, we can answer questions like:

  • “How much fraud is the model actually catching?”

  • “How often does it cry wolf and flag legit transactions as fraud?”

Those questions correspond to two critical metrics:

  • Recall = TP / (TP + FN): of all the true fraud cases, what fraction does the model catch?

  • Precision = TP / (TP + FP): of all the transactions the model flags as fraud, what fraction really are fraud?

Also important is the *false positive rate*:

  • FPR = FP / (FP + TN): of all the legit transactions, what fraction get incorrectly flagged as fraud?

As we’ll see below, these metrics often trade off against each other—higher recall can mean lower precision, and vice versa. That’s why we often visualize them as curves across different decision thresholds. It makes the tradeoffs tangible.

At this point, I’m switching over to R, because when it comes to visualization, I’m a big ggplot2 head. matplotlib and seaborn are cool, but ggplot2 just hits different.

I’m showing just a snippet of my evaluation functions below. For the full R script, see here.

get_metrics <- function(y_test, y_pred) {
    # Obtain a variety of performance metrics
    
    # Performance on positive cases
    tp <- sum(y_test == 1 & y_pred == 1)
    fn <- sum(y_test == 1 & y_pred == 0)
    
    # Performance on negative cases
    tn <- sum(y_test == 0 & y_pred == 0)
    fp <- sum(y_test == 0 & y_pred == 1)
    
    # COMPUTE PRECISION, RECALL, FALSE POSITIVE RATE
    # Of all categorized positive, how many were correct?
    precision <- tp / (tp + fp)
    # Of all actual positives, how many were correctly classified?
    recall <- tp / (tp + fn)
    # Of all actual negatives, how many were correctly classified?
    fpr <- fp / (fp + tn)
    
    f1 <- get_f1(precision, recall)
    
    out <- list(precision=precision, recall=recall, fpr=fpr, f1=f1,
                fn=fn, fp=fp)
    return(out)
}

get_curve <- function(threshold, preds, score='roc') {
    # Computes either ROC curve or precision-recall curve
    y_test <- preds$y_test
    y_proba <- preds$y_proba
    y_pred <- ifelse(y_proba > threshold, 1, 0)
    
    # Compute and extract metrics
    metrics <- get_metrics(y_test, y_pred)
    precision <- metrics$precision
    recall <- metrics$recall
    fpr <- metrics$fpr
    
    
    if (score == 'roc') {
        out <- data.frame(threshold=threshold, recall=recall, fpr=fpr)
    } else {
        out <- data.frame(threshold=threshold, precision=precision, recall=recall)
    }
    
    return(out)
}
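
One thing the snippet doesn’t show is the get_f1 helper that get_metrics calls. For completeness, here’s a minimal version of what it computes: the F1 score, the harmonic mean of precision and recall (the full script linked above has the version actually used).

# Minimal F1 helper: the harmonic mean of precision and recall
# (sketch for illustration; the full script has the real one)
get_f1 <- function(precision, recall) {
    2 * precision * recall / (precision + recall)
}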

Visualizing classifier performance metrics

I’m first plotting the distribution of the model’s predicted probabilities. Recall that the vast majority of transactions in the dataset are legitimate, so we’d expect most probabilities to be around zero. And that’s what we see here. Good sanity check.

preds <- py$preds
p <- preds %>% 
    ggplot(aes(x = y_proba)) +
    geom_histogram(fill = 'steelblue', color = 'black') + 
    labs(
        x = 'Probability of fraud',
        y = 'Frequency'
    ) + 
    theme_bw() + 
    theme(panel.grid = element_blank(),
          axis.ticks = element_blank(),
          text = element_text(size = text_size))
    
ggplotly(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Next, I’m looking at how the model’s predicted probabilities map onto actual fraud. The size of each dot indicates how many transactions fall in that bin and, as expected, most transactions are legitimate.

bins <- cut(preds$y_proba, breaks = seq(0, 1, length.out=50), include.lowest=TRUE)

preds$bins <- bins

p <- preds %>% 
    group_by(bins) %>% 
    summarize(y_proba = mean(y_proba), y_prop = mean(y_test), count = n()) %>% 
    ggplot(aes(x = y_proba, y = y_prop)) + 
    geom_point(aes(text = paste0('Observations: ', count), size = count, color = count)) + 
    labs(
        x = 'Mean predicted value',
        y = 'Proportion of frauds',
        color = 'Number of\nobservations'
    ) +
    theme_bw() + 
    theme(panel.grid = element_blank(),
          axis.ticks = element_blank(),
          text = element_text(size = text_size))
## Warning in geom_point(aes(text = paste0("Observations: ", count), size = count,
## : Ignoring unknown aesthetics: text

ggplotly(p, tooltip = 'text')

We see that the bins with low-to-medium predicted probabilities contain almost exclusively legit transactions. As the predicted probability climbs, the proportion of actual fraud quickly spikes.

Next I’m plotting what’s called the ROC curve, which is a fancy name for a plot of the true positive rate (i.e., recall) against the false positive rate. Each point on this plot represents a different level of “thresholding” the model’s predicted probabilities.

Thresholding is just the process of deciding how high a predicted probability needs to be before we say, “yep, this one’s fraud.” The model gives us probabilities between 0 and 1 for each transaction—thresholding is where we draw the line. If we set the threshold low, we’ll catch more fraud (high recall), but we might also flag more legit transactions by mistake (higher false positives). A higher threshold does the opposite. By adjusting this threshold, we control how sensitive the model is.
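
To make that concrete, here’s a quick sketch (using the preds data frame loaded above) of how the number of flagged transactions shifts as the cutoff moves; the exact counts depend on the fitted model, so treat it as illustrative.

# How many transactions get flagged as fraud at a few example cutoffs?
# (Illustrative only; the counts depend on the fitted model.)
sapply(c(0.2, 0.5, 0.9), function(t) sum(preds$y_proba > t))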

Good thresholds are the ones that push the model’s performance up toward the top-left corner of the plot—that means it’s catching lots of fraud (true positives) while rarely crying wolf (false positives). You can hover over each point (and click and drag to zoom) to explore the threshold values behind the scenes.

thresholds <- seq(0, 1, .01)

roc <- do.call(rbind, lapply(thresholds, get_curve, preds))

green <- qual[4]

auc <- trapz_auc(roc$fpr, roc$recall)

p <- roc %>% 
    ggplot(aes(x = fpr, y = recall)) + 
    geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = 'lightgrey') + 
    geom_point(color = green, aes(text = paste0('FPR: ', round(fpr,3), '\nTPR: ', round(recall, 2),
                '\nThreshold: ', threshold, '\nAUC: ', round(auc, 3)))) + 
    labs(
        x = 'False positive rate',
        y = 'True positive rate (Recall)'
    ) + 
    annotate('text', x = .5, y = .7, label = paste0('AUC: ', round(auc, 3)), size = 6) + 
    theme_bw() + 
    theme(axis.ticks = element_blank(),
          text = element_text(size = text_size))
## Warning in geom_point(color = green, aes(text = paste0("FPR: ", round(fpr, :
## Ignoring unknown aesthetics: text

ggplotly(p, tooltip = 'text')

The dashed diagonal line shows what you’d get if the model were just guessing randomly—no better than flipping a coin. The AUC, or Area Under the Curve, sums up how much better we’re doing than that. More area under the curve = better model. Simple as that.
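
The trapz_auc function I call above isn’t shown in the snippet either; it just integrates the ROC curve with the trapezoidal rule. Here’s a minimal sketch of how such a helper could look (an assumption on my part; the real one lives in the full script).

# Trapezoidal-rule AUC: integrate y (recall) over x (false positive rate),
# sorting by x first so the direction of the threshold sweep doesn't matter
trapz_auc <- function(x, y) {
    ord <- order(x)
    x <- x[ord]
    y <- y[ord]
    sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}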

Next up is the precision–recall curve. This one’s especially important in fraud detection because of the steep class imbalance: ROC curves can give an overly rosy picture when most cases are legitimate. Precision–recall plots keep the focus where it matters: how well we’re catching the rare frauds.

The precision (y-axis) tells us the proportion of our fraud predictions that are actually correct. Lower precision means we’re bugging more legit customers unnecessarily. The recall (x-axis) tells us the proportion of true frauds our model is catching. Lower recall means more fraud cases are slipping through.

pr <- do.call(rbind, lapply(thresholds, get_curve, preds, score='pr'))

labels <- seq(0, 1, .1)
labels[seq(2, length(labels), 2)] <- ''

p <- pr %>% 
    ggplot(aes(x = recall, y = precision)) + 
    geom_point(color = green, aes(text = paste0('Precision: ', precision,
                                                '\nRecall: ', recall,
                                                '\nThreshold: ', threshold))) + 
    scale_x_continuous(breaks = seq(0, 1, .1)) +
    scale_y_continuous(breaks = seq(0, 1, .1), limits = c(0, 1)) +
    labs(
        x = 'Proportion of true fraud detected (Recall)',
        y = 'Proportion of correct\nfraud detections (Precision)'
    ) +
    theme_bw() + 
    theme(axis.ticks = element_blank(),
          text = element_text(size = text_size))
## Warning in geom_point(color = green, aes(text = paste0("Precision: ",
## precision, : Ignoring unknown aesthetics: text
    
ggplotly(p, tooltip = 'text')

We can see that as we lower the decision threshold—meaning we’re more willing to label transactions as fraud—recall goes up: we catch more true fraud. But precision drops, because we also start flagging more legit transactions by mistake. There’s a tradeoff. A pretty high threshold (~0.95) gives us a balance of around 0.75 precision and 0.75 recall—not bad.

The next plot brings in a business perspective. Suppose we talked to the fraud and customer care teams, and they gave us rough estimates for the cost of each kind of error. For example, let’s say it costs $10 to investigate a legit transaction that was mistakenly flagged, and $500 for every missed fraud case. Given these estimates, we can compute the total cost of mistakes for every decision threshold.
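
The get_cost helper I use below isn’t in the snippet above either. Here’s a minimal sketch of how it could work, assuming it reuses get_metrics to count the two error types at a given threshold and then prices them.

# Sketch of a cost helper: flag transactions at the given threshold,
# count false positives and false negatives, and weight them by dollar cost
get_cost <- function(threshold, preds, type1_cost, type2_cost) {
    y_pred <- ifelse(preds$y_proba > threshold, 1, 0)
    m <- get_metrics(preds$y_test, y_pred)
    data.frame(threshold = threshold,
               cost = m$fp * type1_cost + m$fn * type2_cost)
}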

That’s what I’m plotting here. Each point represents a threshold and its corresponding total cost. We can hover to see the numbers, and identify the threshold that minimizes cost overall. In this case, that optimal point lands at a decision threshold of 0.82.

thresholds <- seq(0, 1, .01)
type1_cost <- 10
type2_cost <- 500

costs <- do.call(rbind, lapply(thresholds, FUN=get_cost, preds, type1_cost, type2_cost))
optimum <- costs[costs$cost == min(costs$cost),]$threshold
mark <- data.frame(threshold = .82, cost = 1.5e+05, 
                   label = paste0('Optimal decision\nthreshold: ', optimum))

p <- costs %>% 
    mutate(optimal = ifelse(cost == min(cost), 'yes', 'no')) %>% 
    ggplot(aes(x = threshold, y = cost)) + 
    geom_point(aes(size = optimal, color = optimal, shape = optimal,
                   text = paste0('Decision threshold: ', threshold, '\nCost: ', cost))) + 
    geom_text(data = mark, aes(label = label)) + 
    labs(
        x = 'Decision threshold',
        y = 'Cost ($)'
    ) + 
    scale_size_manual(values = c(`no` = 1, `yes` = 7)) +
    scale_shape_manual(values = c(`no` = 16, `yes` = 8)) +
    scale_color_manual(values = c(`no` = 'black', `yes` = 'green')) +
    theme_bw() + 
    theme(legend.position = 'none',
          panel.grid = element_blank(),
          axis.ticks = element_blank(),
          text = element_text(size = text_size))
## Warning in geom_point(aes(size = optimal, color = optimal, shape = optimal, :
## Ignoring unknown aesthetics: text

ggplotly(p, tooltip = 'text')

Using that 0.82 threshold, we can calculate how the model performs in terms of raw outcomes. Specifically, we can count how many legit transactions were correctly ignored, how many were falsely flagged, and how many fraud cases were correctly or incorrectly classified. This summary is shown in a table known as a confusion matrix:

|                  | Detected legitimate | Detected fraud |
|------------------|---------------------|----------------|
| Truly legitimate | 70,946              | 133            |
| Truly fraud      | 18                  | 105            |

Out of around 71,000 transactions, we missed 18 frauds and unnecessarily flagged 133 legit ones. That’s a pretty decent balance—especially considering how rare fraud is in the dataset.
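
For reference, that matrix is easy to reproduce from the preds data frame; here’s a quick sketch using the ~0.82 threshold.

# Confusion matrix at the (roughly) cost-optimal threshold
y_hat <- ifelse(preds$y_proba > 0.82, 1, 0)
table(truth = preds$y_test, detected = y_hat)

Plugging those error counts into the cost assumptions from earlier gives a rough total of 133 × $10 + 18 × $500 ≈ $10,330 on the test set, which is the quantity the threshold sweep above was minimizing.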

What it means

This model shows promising potential for real-world fraud detection, especially given how challenging the problem is with less than 1% of transactions actually fraudulent. By carefully tuning the decision threshold, we can balance catching most fraud cases while minimizing the number of false alarms that inconvenience legitimate customers.

The precision–recall curve highlights the inherent tradeoff: if we want to catch nearly all fraud (high recall), we’ll inevitably flag more legit transactions by mistake (lower precision). But by incorporating real business costs for these errors, we can choose a threshold that minimizes overall loss—not just errors in isolation.

At the optimal threshold we identified, the model would catch roughly 85% of fraud cases (105 of the 123 frauds in the test set) while only bothering a small fraction of legitimate customers. Missing 18 fraud cases is far from perfect, but it’s a solid starting point that significantly reduces risk compared to ignoring fraud altogether. And only 133 false alarms out of roughly 71,000 transactions means the customer experience stays mostly smooth.

In practice, this threshold could be adjusted dynamically based on operational priorities—say, during peak seasons when customer friction is especially costly, or when fraud activity spikes. The model’s outputs provide a flexible lever for risk management teams.

Overall, this project illustrates how machine learning combined with thoughtful evaluation metrics and business context can produce tools that meaningfully support fraud prevention efforts. It’s a reminder that performance numbers alone don’t tell the full story—understanding costs and consequences is key to deploying models that truly add value.

Return to homepage