A guide to understanding different autoencoder architectures for credit card anomaly detection
According to a 2024 Federal Trade Commission report, consumers in the US lost more than $12.5 billion to financial fraud [1]. Similarly, the UK Finance trade association reported that £1.17 billion was lost to fraud in 2024, while £1.45 billion of unauthorized fraud was prevented by the industry [2].
The banking industry has undertaken sophisticated efforts, including the use of machine learning, to detect fraud. However, credit card fraud detection comes with a unique set of challenges [3].
Anomalies are defined as observations that deviate from the normal patterns in a dataset. Traditional supervised learning algorithms optimize for global accuracy, so they develop a bias towards the majority class (normal transactions) and frequently misclassify fraud as benign.
Autoencoders are commonly explored for fraud detection because they learn compressed representations of high-dimensional transaction data. They reframe fraud detection as an anomaly detection problem: by training on normal transactions only, they learn a compressed, low-dimensional representation that captures the nonlinear relationships in legitimate purchases.
This makes the model highly proficient at reconstructing normal transactions, while it fails to fully reconstruct fraudulent ones, leading to a high reconstruction error for fraud.
Originally developed for data compression, autoencoders are built on a dimensionality reduction process in which the input data is mapped by an encoder through hidden layers. The encoder transforms the input x into a latent representation z. The decoder then expands this code to produce an output matching the original input dimension. The primary architecture comprises an encoder, a bottleneck (latent) layer, and a decoder.
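As a rough illustration, the encode-decode round trip can be sketched in plain NumPy, with random matrices standing in for learned weights. The dimensions d=30 and k=4 are illustrative choices matching the 30-feature dataset and a 4-dimensional bottleneck:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 30, 4                      # input dim (28 PCs + Time + Amount), bottleneck dim

# Hypothetical random weights stand in for trained parameters.
W_enc = rng.normal(size=(d, k))
W_dec = rng.normal(size=(k, d))

def encode(x):
    # Map input x to a low-dimensional latent representation z.
    return np.tanh(x @ W_enc)

def decode(z):
    # Expand the latent code back to the original input dimension d.
    return z @ W_dec

x = rng.normal(size=(1, d))
x_hat = decode(encode(x))
recon_error = np.mean((x - x_hat) ** 2)   # per-sample reconstruction MSE
```

Training minimizes this reconstruction error on normal transactions; at inference time, the same error serves as the anomaly score.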
The dataset is a highly imbalanced set of transactions containing 492 frauds out of 284,807 transactions [4]; positive fraud cases make up just 0.172% of all transactions. The target variable, Class, indicates whether a transaction is legitimate (0) or fraudulent (1). Due to the confidential nature of the data, the features consist of 28 standardized principal components along with Time and Amount.

Fig 1: Distribution of Normal vs Fraud classes
The scales of the engineered numerical features can vary dramatically, with some clustering around zero. Since reconstruction error is scale-dependent, the autoencoder can fixate on large-magnitude features, which then dominate the resulting error value.

Fig 2: Feature distribution snapshot
Looking at the feature distribution snapshot, most features are tightly distributed and the dataset is already highly compressed. Nevertheless, since reconstruction error is sensitive to scale, a StandardScaler was applied to all features.

To evaluate the models fairly, the dataset is split into training (70%), validation (15%), and test (15%) subsets. Only normal transactions are used for training, which ensures the autoencoder learns normal behavior.
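A minimal sketch of this preprocessing and split, using synthetic stand-in data and NumPy only (sklearn's StandardScaler performs the same per-feature standardization):

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy stand-in: 1000 transactions x 30 features, with a binary fraud label.
X = rng.normal(size=(1000, 30))
y = (rng.random(1000) < 0.02).astype(int)   # ~2% "fraud" for illustration

# Standardize every feature (equivalent to sklearn's StandardScaler).
X = (X - X.mean(axis=0)) / X.std(axis=0)

# 70/15/15 split over a shuffled index.
idx = rng.permutation(len(X))
n_train, n_val = int(0.70 * len(X)), int(0.15 * len(X))
train, val, test = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# The autoencoder is fit on normal transactions only; val/test keep both classes.
X_train = X[train][y[train] == 0]
X_val, y_val = X[val], y[val]
X_test, y_test = X[test], y[test]
```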
Below, we explain the basic architecture, the underlying assumptions, and the results for each type of autoencoder.
This is the simplest reference point in our comparison and uses a basic model: a shallow encoder-decoder structure with a single hidden layer on each side and a 4-dimensional bottleneck. The baseline model assumes that normal transactions can be reconstructed with minimal loss while anomalous transactions deviate and produce high reconstruction error. The architecture also assumes that all input features contribute equally and that the relationships between variables can be captured through dense feedforward layers [4].
The baseline architecture serves as our control model, and it tests the foundational premise that a shallow bottleneck is sufficient to capture the behavior of normal transactions. By restricting the dimensional space, the hypothesis is the model would successfully compress and reconstruct the pattern of legitimate transactions while simultaneously failing to reconstruct anomalous fraud signatures.
The following are the results of the basic autoencoder model -

Table 1 – Results of base autoencoder
The AUPRC of 0.1441 for the baseline autoencoder indicates that the model cannot distinguish between the two classes well. The model yields a precision of 0.15 and a recall of 0.46, resulting in an F1 score of 0.23. This means that the model identifies 46% of fraudulent transactions, but only 15% of the transactions it flags are actually fraud, leading to a high number of false positives.

The precision-recall curve declines quickly as recall increases, signaling that the model struggles to maintain precision.
The reconstruction error plot also shows significant overlap between both classes. Most transactions are concentrated in a narrow band near zero, with a small number of observations extending into higher-error regions. The threshold lies far into the tail of the distribution, indicating that only extreme cases are flagged as anomalies. The threshold vs. precision-recall curve highlights another limitation, threshold instability: small changes in the threshold lead to large fluctuations in precision and recall.
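The thresholding step can be sketched as follows. The exponential error distributions and the top-1% cutoff are both hypothetical choices for illustration, not the article's exact settings:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical per-transaction reconstruction errors on a validation set:
# normals cluster near zero, frauds have a heavier tail.
err_normal = rng.exponential(scale=0.5, size=950)
err_fraud  = rng.exponential(scale=2.0, size=50)

errors = np.concatenate([err_normal, err_fraud])
labels = np.concatenate([np.zeros(950), np.ones(50)])

# Flag the top 1% of errors as anomalies (one common threshold choice).
threshold = np.percentile(errors, 99)
flagged = errors > threshold

tp = np.sum(flagged & (labels == 1))
precision = tp / max(flagged.sum(), 1)   # of flagged, how many are fraud
recall = tp / labels.sum()               # of fraud, how many were flagged
```

Sweeping the percentile instead of fixing it at 99 traces out the threshold vs. precision-recall behavior discussed above.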
The cumulative gains curve illustrates how effectively the model ranks transactions by anomaly score. The model shows a sharp initial rise, meaning that a large fraction of fraud is identified within the first 10%-20% of transactions, and then flattens, indicating that the model struggles to surface the remaining, less extreme anomalies.
The baseline autoencoder learns general patterns and identifies extreme anomalies, but fails to distinguish moderately anomalous transactions, resulting in poor and unstable decision boundaries.
The deep autoencoder increases the depth of the model by adding layers to both the encoder and decoder, enhancing its capacity to learn complex data patterns. In this architecture, the input passes through successive encoder layers, with each layer capturing a different level of detail.
At the deepest layer, only the most vital characteristics of the data are preserved. This representation is then steadily augmented as details are added back through the decoder layers. The benefit of more layers is a greater ability to capture complex patterns: initial layers capture basic details while deeper layers identify the nuances that differentiate fraud. Deeper autoencoders can model more complex nonlinear structure, which is particularly useful in domains such as image representation and feature compression [5].
Datasets that have undergone PCA transformations may have highly complex nonlinear relationships that a shallow network may underfit [5]. The deep autoencoder addresses this by adding multiple dense layers so the model can learn a nuanced, multi-layered representation of legitimate behavior.
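The deep dense architecture can be sketched as a stack of layers that narrows to the bottleneck and widens back out. The widths below (30 → 16 → 8 → 4 → 8 → 16 → 30) are an illustrative choice, and random weights stand in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(1)
# Symmetric encoder/decoder widths around a 4-dimensional bottleneck.
dims = [30, 16, 8, 4, 8, 16, 30]
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]

def forward(x):
    # Pass x through every layer; ReLU on hidden layers, linear output.
    h = x
    for i, W in enumerate(weights):
        h = h @ W
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)
    return h

x = rng.normal(size=(5, 30))
x_hat = forward(x)                      # reconstruction with the input's shape
```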
The following are the results -

Table 2 – Results of Deep Dense autoencoder
The deep autoencoder shows improvement over the baseline model with a higher AUPRC score, meaning it is better at assigning high anomaly scores to fraudulent transactions.
The model detects approximately 71% of fraud cases (recall 0.71), a strong recall for an imbalanced dataset. However, precision remains low, meaning that a substantial number of normal transactions were flagged as anomalies.

The above is confirmed by the error distribution plot, where an overlap still exists between normal and fraud transactions. The precision-recall curve confirms this model's superiority over the baseline, with precision higher at low recall levels and declining more gradually. The threshold vs. precision-recall curve shows that the model maintains high recall at lower thresholds but begins to decline as the threshold increases. The lift curve suggests that the model ranks fraudulent transactions well in the initial samples, with a good number of fraud cases identified within a small percentage of transactions.
Overall, the deep autoencoder substantially improves anomaly detection, providing better separability and ranking performance, as evidenced by the increased AUPRC and recall.
Sparse autoencoders (SAEs) aim to bring more explainability to the autoencoder architecture. A standard autoencoder activates hidden units that learn patterns in the data, but these units might be learning non-essential features. An SAE is designed to learn a compressed but more selective representation of the input by forcing only a small number of neurons in the hidden layer to activate for any given sample.
In an SAE, an L1 penalty (sparsity penalty) is applied so that unnecessary hidden units are not activated and the model learns the most important underlying patterns rather than memorizing all input details. In the context of anomaly detection, this is useful because normal transactions are expected to activate a consistent set of latent features and therefore reconstruct well, while fraudulent transactions are less likely to align with these sparse learned patterns and should produce higher reconstruction error.
In financial datasets, a fraudulent transaction may be anomalous across only one or two dimensions (such as time of day or amount) while the remaining features imitate normal behavior. By applying the L1 penalty, the hypothesis is that feature isolation forces the network to rely on a minimal set of activations, so the model cannot fall back on an overly general representation of the data [6].
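A minimal sketch of the sparse objective, assuming an L1 activity penalty on the latent code; the coefficient `l1_lambda` and the toy values are hypothetical:

```python
import numpy as np

def sparse_ae_loss(x, x_hat, z, l1_lambda=1e-3):
    """Reconstruction MSE plus an L1 activity penalty on the latent code z.

    The L1 term pushes most latent activations toward zero, so only a
    small set of units stays active for any given transaction.
    """
    mse = np.mean((x - x_hat) ** 2)
    sparsity = l1_lambda * np.mean(np.abs(z))
    return mse + sparsity

# Two codes with identical reconstructions: the denser one pays a larger penalty.
x = np.array([[1.0, 2.0]])
x_hat = np.array([[0.9, 2.1]])
z_dense  = np.array([[0.8, -0.7, 0.9, -0.6]])
z_sparse = np.array([[0.8,  0.0, 0.0,  0.0]])
```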
The following are the results:

Table 3 – Results of Sparse autoencoder
The SAE results show a decline in performance, with a low AUPRC score. The added sparsity constraint over-constrained the model, reducing its ability to reconstruct normal patterns effectively. Precision, recall, and the F1 score are all worse than in the earlier models. The reconstruction error plot shows significant overlap between the classes and a high anomaly threshold. The precision-recall curve confirms the poor performance, with precision dropping quickly as recall increases. The gains curve shows limited improvement over random selection.

Input data often contains noise or corrupted information, and a standard autoencoder would learn the noise along with the pattern. In a denoising autoencoder, random noise is added to the input data during training; the aim is for the model to subtract the noise and learn the meaningful features of the data. In our model, we introduce Gaussian noise, which helps the model ignore small fluctuations and focus on the underlying pattern rather than exact values.
A Gaussian noise layer, parameterized by its standard deviation, adds random Gaussian (normal) noise to the input during training only.

So each training input becomes original_value + random_noise.
Most financial datasets are noisy; with credit card transactions in particular, a large transaction may exhibit natural variance. A standard autoencoder may mimic this noise, leading to false positives. Intentionally adding Gaussian noise to the input during training forces the model to learn the foundational structure of normal transactions and become robust to small input perturbations [7].
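The corruption step can be sketched as follows. The noise standard deviation of 0.1 is an assumed value, and the commented `fit` call marks where a real model would train on (noisy input, clean target) pairs:

```python
import numpy as np

rng = np.random.default_rng(3)
X_clean = rng.normal(size=(256, 30))   # stand-in batch of normal transactions

noise_std = 0.1                        # assumed noise level, a tunable knob
X_noisy = X_clean + rng.normal(scale=noise_std, size=X_clean.shape)

# Training pairs: corrupted input, clean target. The model must reconstruct
# X_clean from X_noisy, i.e. learn to strip the perturbation.
# model.fit(X_noisy, X_clean, ...)     # noise is applied during training only
```

At inference time the input is fed in uncorrupted, so the reconstruction error reflects how far a transaction sits from the learned normal structure.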
The following are the results:

Table 4 – Results of Denoising autoencoder
With an AUPRC of 0.117, the denoising autoencoder has the lowest performance among the variants so far. Low precision and recall, with an F1 score of 0.20, indicate poor ranking and stability. The precision-recall curve shows an immediate decline, and the lift curve shows only modest prioritization of fraud among the top-ranked transactions. This implies that introducing noise during training may have reduced the model's sensitivity to important anomaly patterns in the data.

Variational autoencoders (VAEs) not only learn a representation of the data during encoding and decoding but also model the latent variables as a probability distribution, which the model uses to reconstruct the input. During encoding, the input is mapped to a distribution over the latent space parameterized by a mean and variance, both functions of the input learned by the model. The latent variable is then sampled from this distribution.
The VAE is trained on two losses –
Loss = Reconstruction Loss + KL Divergence
In anomaly detection, VAEs work well because they learn the behavior of normal transactions as a distribution, making unusual transactions stand out through their reconstruction errors.
The previous models are deterministic: they compress inputs into fixed coordinates in the latent space, which can lead to erratic thresholds [8]. The VAE resolves this by introducing probabilistic regularization in the latent space, encouraging the learned representation to follow a smooth distribution rather than a purely deterministic encoding. The hypothesis is that this model should overcome the representational gaps of the earlier models.
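The two-part loss above can be written out directly. For a Gaussian posterior with mean `mu` and log-variance `log_var`, the KL term against a standard normal prior has the closed form shown in the docstring (a standard result, not specific to this article's implementation):

```python
import numpy as np

def kl_divergence(mu, log_var):
    """Per-sample KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims.

    Closed form: -0.5 * sum(1 + log_var - mu^2 - exp(log_var)).
    """
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)

def vae_loss(x, x_hat, mu, log_var):
    # Total loss = reconstruction error + KL regularizer, averaged over the batch.
    recon = np.sum((x - x_hat) ** 2, axis=1)
    return np.mean(recon + kl_divergence(mu, log_var))

# A latent posterior equal to the prior N(0, 1) incurs zero KL penalty.
mu, log_var = np.zeros((4, 2)), np.zeros((4, 2))
kl_at_prior = kl_divergence(mu, log_var)
```

The KL term is what smooths the latent space; as the results below suggest, on heavily PCA-compressed data this regularization can also blur the very deviations that signal fraud.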
The following are the results:

Table 5 – Results of Variational autoencoder
The variational autoencoder turns out to be the weakest of all the models, with an AUPRC of 0.099. Precision and recall are also low, with the lowest F1 score among all models (0.17). The same is reflected in the precision-recall curve, which drops quickly and remains low across most recall levels. The probabilistic regularization imposed on the latent space smooths the learned representation too aggressively and reduces the model's sensitivity to subtle fraud-specific deviations. In this setting, the structured latent space of the VAE appears to come at the cost of anomaly discrimination rather than improving it.

The empirical evaluation of the five autoencoder models yields a highly instructive conclusion for highly compressed, imbalanced, and mathematically transformed data. The deep dense autoencoder emerged as the superior architecture, with a recall of 0.71 and an AUPRC of 0.399. The precision-recall curve also demonstrates its superiority over the other autoencoders.

Fig 23– Precision Recall Curve of all autoencoder models

Table 6 – Performance Results for all autoencoder models
This result exposes a key vulnerability of running heavily regularized models such as the sparse (AUPRC 0.122), denoising (AUPRC 0.117), and variational (AUPRC 0.099) autoencoders on PCA-transformed data. While these models are highly effective in raw, noisy domains such as image and NLP processing, they reduce performance here.
Two major reasons for this are –
The landscape of financial anomaly detection is constantly evolving. As fraudsters adapt their tactics, modern architectures must adapt as well. It would be interesting to benchmark these results against sequence-based models or graph neural networks (GNNs) that can model the relational structure between transactions. Building robust ML models for highly imbalanced datasets is key to identifying true anomalies and safeguarding financial ecosystems.
The code for this article can be found in my github and the original dataset here
Karan Gupta is a seasoned AI & Data Engineer with over 15 years of experience across AI & Data Engineering. He has a proven record of delivering impactful technology solutions and actively contributes to the tech community through IEEE engagements, peer reviews and hackathons.
Disclaimer: The authors are completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE’s position nor that of the Computer Society nor its Leadership.