A guide to understanding different autoencoder architectures for credit card anomaly detection
According to a 2024 Federal Trade Commission report, consumers in the US lost more than $12.5 billion to financial fraud [1]. Similarly, the UK Finance trade association reported that £1.17 billion was lost to fraud in 2024, while £1.45 billion of unauthorized fraud was prevented by the industry [2].
The banking industry has undertaken sophisticated efforts, including the use of machine learning, to detect fraud. However, credit card fraud detection comes with a unique set of challenges [3].
Anomalies are defined as observations that deviate from the normal patterns in a dataset. Traditional supervised learning algorithms optimize for global accuracy, so they develop a bias towards the majority class (normal transactions) and frequently misclassify fraud as benign.
Autoencoders are commonly explored for fraud detection because they learn compressed representations of high-dimensional transaction data. They reframe fraud detection as an anomaly detection problem: by training on normal transactions only, they learn a compressed, low-dimensional representation that captures the nonlinear relationships in legitimate purchases.
This makes the model highly proficient at reconstructing normal transactions, while it fails to fully reconstruct fraudulent ones, leading to a high reconstruction error for fraud.
Originally developed for data compression, autoencoders are built on a dimensionality reduction process in which the input data is mapped by an encoder through hidden layers. The encoder transforms the input x into a latent representation z. The decoder then expands this code to produce an output matching the original input dimension. The primary architecture comprises an encoder, a bottleneck (latent) layer, and a decoder.
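As a rough illustration, the encode-decode round trip can be sketched in plain NumPy, with random matrices standing in for learned weights. The dimensions d=30 and k=4 are illustrative choices matching the 30-feature dataset and a 4-dimensional bottleneck:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 30, 4                      # input dim (28 PCs + Time + Amount), bottleneck dim

# Hypothetical random weights stand in for trained parameters.
W_enc = rng.normal(size=(d, k))
W_dec = rng.normal(size=(k, d))

def encode(x):
    # Map input x to a low-dimensional latent representation z.
    return np.tanh(x @ W_enc)

def decode(z):
    # Expand the latent code back to the original input dimension d.
    return z @ W_dec

x = rng.normal(size=(1, d))
x_hat = decode(encode(x))
recon_error = np.mean((x - x_hat) ** 2)   # per-sample reconstruction MSE
```

Training minimizes this reconstruction error on normal transactions; at inference time, the same error serves as the anomaly score.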
The dataset is a highly imbalanced set of transactions containing 492 frauds out of 284,807 transactions [4]; positive fraud cases make up just 0.172% of all transactions. The target variable, Class, indicates whether a transaction is legitimate (0) or fraudulent (1). Due to the confidential nature of the data, the features consist of 28 standardized principal components along with Time and Amount.

Fig 1: Distribution of Normal vs Fraud classes
The scales of the engineered numerical features can vary dramatically, with some clustering around zero. Since reconstruction error is scale-dependent, the autoencoder can fixate on large-magnitude features, which then dominate the resulting error value.

Fig 2: Feature distribution snapshot
Looking at the feature distribution snapshot, most features are tightly distributed and the dataset is already highly compressed. Nevertheless, since reconstruction error is sensitive to scale, a StandardScaler was applied to all features.

To evaluate the models fairly, the dataset is split into training (70%), validation (15%), and test (15%) subsets. Only normal transactions are used for training, which ensures the autoencoder learns normal behavior.
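A minimal sketch of this preprocessing and split, using synthetic stand-in data and NumPy only (sklearn's StandardScaler performs the same per-feature standardization):

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy stand-in: 1000 transactions x 30 features, with a binary fraud label.
X = rng.normal(size=(1000, 30))
y = (rng.random(1000) < 0.02).astype(int)   # ~2% "fraud" for illustration

# Standardize every feature (equivalent to sklearn's StandardScaler).
X = (X - X.mean(axis=0)) / X.std(axis=0)

# 70/15/15 split over a shuffled index.
idx = rng.permutation(len(X))
n_train, n_val = int(0.70 * len(X)), int(0.15 * len(X))
train, val, test = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# The autoencoder is fit on normal transactions only; val/test keep both classes.
X_train = X[train][y[train] == 0]
X_val, y_val = X[val], y[val]
X_test, y_test = X[test], y[test]
```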
Below, we explain the basic architecture, the underlying assumptions, and the results for each type of autoencoder.
This is the simplest reference point in our comparison and uses a basic model: a shallow encoder-decoder structure with a single hidden layer on each side and a 4-dimensional bottleneck. The baseline model assumes that normal transactions can be reconstructed with minimal loss while anomalous transactions deviate and produce high reconstruction error. The architecture also assumes that all input features contribute equally and that the relationships between variables can be captured through dense feedforward layers [4].
The baseline architecture serves as our control model, and it tests the foundational premise that a shallow bottleneck is sufficient to capture the behavior of normal transactions. By restricting the dimensional space, the hypothesis is the model would successfully compress and reconstruct the pattern of legitimate transactions while simultaneously failing to reconstruct anomalous fraud signatures.
The following are the results of the basic autoencoder model -

Table 1 – Results of base autoencoder
The AUPRC of 0.1441 for the baseline autoencoder indicates that the model cannot distinguish between the two classes well. The model yields a precision of 0.15 and a recall of 0.46, resulting in an F1 score of 0.23. This means that the model identifies 46% of fraudulent transactions, but only 15% of the transactions it flags are actually fraud, leading to a high number of false positives.

The precision-recall curve declines quickly as recall increases, signaling that the model struggles to maintain precision.
The reconstruction error plot also shows significant overlap between both classes. Most transactions are concentrated in a narrow band near zero, with a small number of observations extending into higher-error regions. The threshold lies far into the tail of the distribution, indicating that only extreme cases are flagged as anomalies. The threshold vs. precision-recall curve highlights another limitation, threshold instability: small changes in the threshold lead to large fluctuations in precision and recall.
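The thresholding step can be sketched as follows. The exponential error distributions and the top-1% cutoff are both hypothetical choices for illustration, not the article's exact settings:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical per-transaction reconstruction errors on a validation set:
# normals cluster near zero, frauds have a heavier tail.
err_normal = rng.exponential(scale=0.5, size=950)
err_fraud  = rng.exponential(scale=2.0, size=50)

errors = np.concatenate([err_normal, err_fraud])
labels = np.concatenate([np.zeros(950), np.ones(50)])

# Flag the top 1% of errors as anomalies (one common threshold choice).
threshold = np.percentile(errors, 99)
flagged = errors > threshold

tp = np.sum(flagged & (labels == 1))
precision = tp / max(flagged.sum(), 1)   # of flagged, how many are fraud
recall = tp / labels.sum()               # of fraud, how many were flagged
```

Sweeping the percentile instead of fixing it at 99 traces out the threshold vs. precision-recall behavior discussed above.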
The cumulative gains curve illustrates how effectively the model ranks transactions by anomaly score. The model shows a sharp initial rise, meaning that a large fraction of fraud is identified within the first 10%-20% of transactions, and then flattens, indicating that the model struggles to surface the remaining, less extreme anomalies.
The baseline autoencoder learns general patterns and identifies extreme anomalies, but fails to distinguish moderately anomalous transactions, resulting in poor and unstable decision boundaries.
The deep autoencoder increases the depth of the model by adding layers to both the encoder and decoder, enhancing its capacity to learn complex data patterns. In this architecture, the input passes through successive encoder layers, with each layer capturing a different level of detail.
At the deepest layer, only the most vital characteristics of the data are preserved. This representation is then steadily augmented as details are added back through the decoder layers. The benefit of more layers is a greater ability to capture complex patterns: initial layers capture basic details while deeper layers identify the nuances that differentiate fraud. Deeper autoencoders can model more complex nonlinear structure, which is particularly useful in domains such as image representation and feature compression [5].
Datasets that have undergone PCA transformations may have highly complex nonlinear relationships that a shallow network may underfit [5]. The deep autoencoder addresses this by adding multiple dense layers so the model can learn a nuanced, multi-layered representation of legitimate behavior.
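The deep dense architecture can be sketched as a stack of layers that narrows to the bottleneck and widens back out. The widths below (30 → 16 → 8 → 4 → 8 → 16 → 30) are an illustrative choice, and random weights stand in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(1)
# Symmetric encoder/decoder widths around a 4-dimensional bottleneck.
dims = [30, 16, 8, 4, 8, 16, 30]
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]

def forward(x):
    # Pass x through every layer; ReLU on hidden layers, linear output.
    h = x
    for i, W in enumerate(weights):
        h = h @ W
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)
    return h

x = rng.normal(size=(5, 30))
x_hat = forward(x)                      # reconstruction with the input's shape
```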
The following are the results -

Table 2 – Results of Deep Dense autoencoder
The deep autoencoder shows improvement over the baseline model with a higher AUPRC score, meaning it is better at assigning high anomaly scores to fraudulent transactions.
The model detects approximately 71% of fraud cases (recall 0.71), a strong recall for an imbalanced dataset. However, precision remains low, meaning that a substantial number of normal transactions were flagged as anomalies.

The above is confirmed by the error distribution plot, where an overlap still exists between normal and fraud transactions. The precision-recall curve confirms this model's superiority over the baseline, with precision higher at low recall levels and declining more gradually. The threshold vs. precision-recall curve shows that the model maintains high recall at lower thresholds but begins to decline as the threshold increases. The lift curve suggests that the model ranks fraudulent transactions well in the initial samples, with a good number of fraud cases identified within a small percentage of transactions.
Overall, the deep autoencoder substantially improves anomaly detection, providing better separability and ranking performance, as evidenced by the increased AUPRC and recall.
Sparse autoencoders (SAEs) aim to bring more explainability to the autoencoder architecture. A standard autoencoder activates hidden units that learn patterns in the data, but these units might be learning non-essential features. An SAE is designed to learn a compressed but more selective representation of the input by forcing only a small number of neurons in the hidden layer to activate for any given sample.
In an SAE, an L1 penalty (sparsity penalty) is applied so that unnecessary hidden units are not activated and the model learns the most important underlying patterns rather than memorizing all input details. In the context of anomaly detection, this is useful because normal transactions are expected to activate a consistent set of latent features and therefore reconstruct well, while fraudulent transactions are less likely to align with these sparse learned patterns and should produce higher reconstruction error.
In financial datasets, a fraudulent transaction may be anomalous across only one or two dimensions (such as time of day or amount) while the remaining features imitate normal behavior. By applying the L1 penalty, the hypothesis is that feature isolation forces the network to rely on a minimal set of activations, so the model cannot fall back on an overly general representation of the data [6].
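A minimal sketch of the sparse objective, assuming an L1 activity penalty on the latent code; the coefficient `l1_lambda` and the toy values are hypothetical:

```python
import numpy as np

def sparse_ae_loss(x, x_hat, z, l1_lambda=1e-3):
    """Reconstruction MSE plus an L1 activity penalty on the latent code z.

    The L1 term pushes most latent activations toward zero, so only a
    small set of units stays active for any given transaction.
    """
    mse = np.mean((x - x_hat) ** 2)
    sparsity = l1_lambda * np.mean(np.abs(z))
    return mse + sparsity

# Two codes with identical reconstructions: the denser one pays a larger penalty.
x = np.array([[1.0, 2.0]])
x_hat = np.array([[0.9, 2.1]])
z_dense  = np.array([[0.8, -0.7, 0.9, -0.6]])
z_sparse = np.array([[0.8,  0.0, 0.0,  0.0]])
```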
The following are the results:

Table 3 – Results of Sparse autoencoder
The SAE results show a decline in performance, with a low AUPRC score. The added sparsity constraint over-constrained the model, reducing its ability to reconstruct normal patterns effectively. Precision, recall, and the F1 score are all worse than in the earlier models. The reconstruction error plot shows significant overlap between the classes and a high anomaly threshold. The precision-recall curve confirms the poor performance, with precision dropping quickly as recall increases. The gains curve shows limited improvement over random selection.

Input data often contains noise or corrupted information, and a standard autoencoder would learn the noise along with the pattern. In a denoising autoencoder, random noise is added to the input data during training; the aim is for the model to subtract the noise and learn the meaningful features of the data. In our model, we introduce Gaussian noise, which helps the model ignore small fluctuations and focus on the underlying pattern rather than exact values.
A Gaussian noise layer, parameterized by its standard deviation, adds random Gaussian (normal) noise to the input during training only.

So each training input becomes original_value + random_noise.
Most financial datasets are noisy; with credit card transactions in particular, a large transaction may exhibit natural variance. A standard autoencoder may mimic this noise, leading to false positives. Intentionally adding Gaussian noise to the input during training forces the model to learn the foundational structure of normal transactions and become robust to small input perturbations [7].
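The corruption step can be sketched as follows. The noise standard deviation of 0.1 is an assumed value, and the commented `fit` call marks where a real model would train on (noisy input, clean target) pairs:

```python
import numpy as np

rng = np.random.default_rng(3)
X_clean = rng.normal(size=(256, 30))   # stand-in batch of normal transactions

noise_std = 0.1                        # assumed noise level, a tunable knob
X_noisy = X_clean + rng.normal(scale=noise_std, size=X_clean.shape)

# Training pairs: corrupted input, clean target. The model must reconstruct
# X_clean from X_noisy, i.e. learn to strip the perturbation.
# model.fit(X_noisy, X_clean, ...)     # noise is applied during training only
```

At inference time the input is fed in uncorrupted, so the reconstruction error reflects how far a transaction sits from the learned normal structure.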
The following are the results:

Table 4 – Results of Denoising autoencoder
With an AUPRC of 0.117, the denoising autoencoder has the lowest performance among the variants so far. Low precision and recall, with an F1 score of 0.20, indicate poor ranking and stability. The precision-recall curve shows an immediate decline, and the lift curve shows only modest prioritization of fraud among the top-ranked transactions. This implies that introducing noise during training may have reduced the model's sensitivity to important anomaly patterns in the data.

Variational autoencoders (VAEs) not only learn a representation of the data during encoding and decoding but also model the latent variables as a probability distribution, which the model uses to reconstruct the input. During encoding, the input is mapped to a distribution over the latent space parameterized by a mean and variance, both functions of the input learned by the model. The latent variable is then sampled from this distribution.
The VAE is trained on two losses –
Loss = Reconstruction Loss + KL Divergence
In anomaly detection, VAEs work well because they learn the behavior of normal transactions as a distribution, making unusual transactions stand out through their reconstruction errors.
The previous models are deterministic: they compress inputs into fixed coordinates in the latent space, which can lead to erratic thresholds [8]. The VAE resolves this by introducing probabilistic regularization in the latent space, encouraging the learned representation to follow a smooth distribution rather than a purely deterministic encoding. The hypothesis is that this model should overcome the representational gaps of the earlier models.
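The two-part loss above can be written out directly. For a Gaussian posterior with mean `mu` and log-variance `log_var`, the KL term against a standard normal prior has the closed form shown in the docstring (a standard result, not specific to this article's implementation):

```python
import numpy as np

def kl_divergence(mu, log_var):
    """Per-sample KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims.

    Closed form: -0.5 * sum(1 + log_var - mu^2 - exp(log_var)).
    """
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)

def vae_loss(x, x_hat, mu, log_var):
    # Total loss = reconstruction error + KL regularizer, averaged over the batch.
    recon = np.sum((x - x_hat) ** 2, axis=1)
    return np.mean(recon + kl_divergence(mu, log_var))

# A latent posterior equal to the prior N(0, 1) incurs zero KL penalty.
mu, log_var = np.zeros((4, 2)), np.zeros((4, 2))
kl_at_prior = kl_divergence(mu, log_var)
```

The KL term is what smooths the latent space; as the results below suggest, on heavily PCA-compressed data this regularization can also blur the very deviations that signal fraud.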
The following are the results:

Table 5 – Results of Variational autoencoder
The variational autoencoder turns out to be the weakest of all the models, with an AUPRC of 0.099. Precision and recall are also low, with the lowest F1 score among all models (0.17). The same is reflected in the precision-recall curve, which drops quickly and remains low across most recall levels. The probabilistic regularization imposed on the latent space smooths the learned representation too aggressively and reduces the model's sensitivity to subtle fraud-specific deviations. In this setting, the structured latent space of the VAE appears to come at the cost of anomaly discrimination rather than improving it.

The empirical evaluation of the five autoencoder models yields a highly instructive conclusion for highly compressed, imbalanced, and mathematically transformed data. The deep dense autoencoder emerged as the superior architecture, with a recall of 0.71 and an AUPRC of 0.399. The precision-recall curve also demonstrates its superiority over the other autoencoders.

Fig 23– Precision Recall Curve of all autoencoder models

Table 6 – Performance Results for all autoencoder models
This result exposes a key vulnerability of running heavily regularized models such as the sparse (AUPRC 0.122), denoising (AUPRC 0.117), and variational (AUPRC 0.099) autoencoders on PCA-transformed data. While these models are highly effective in raw, noisy domains such as image and NLP processing, they reduce performance here.
Two major reasons for this are –
The landscape of financial anomaly detection is constantly evolving. As fraudsters adapt their tactics, modern architectures must adapt as well. It would be interesting to benchmark these results against sequence-based models or graph neural networks (GNNs) that can model the relational structure between transactions. Building robust ML models for highly imbalanced datasets is key to identifying true anomalies and safeguarding financial ecosystems.
The code for this article can be found in my github and the original dataset here
Karan Gupta is a seasoned AI & Data Engineer with over 15 years of experience across AI & Data Engineering. He has a proven record of delivering impactful technology solutions and actively contributes to the tech community through IEEE engagements, peer reviews and hackathons.
Disclaimer: The authors are completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE’s position nor that of the Computer Society nor its Leadership.