Monitoring LLM Safety with BERTopic: Clustering Failure Modes for Actionable Insights

By Jofia Jose Prakash

Large language models (LLMs) now power customer support chatbots, internal developer assistants, and code-generation tools. These systems generate mountains of unstructured telemetry (prompts, responses, tool calls, routing metadata) in which safety failures often hide in plain sight. The core challenge is not a lack of data, but a lack of structure: safety signals are embedded in text, not clean metrics.

In this article, I’ll show how a topic-modeling workflow of embeddings + UMAP + HDBSCAN + c-TF-IDF can turn raw LLM interactions into an interpretable “map” of failure modes you can monitor over time. The goal is practical: help safety, risk, and engineering teams triage faster, detect emerging patterns, and build targeted mitigations.

The Safety Monitoring Problem

LLMs can fail in diverse ways: hallucinations that sound confident but are wrong or nonsensical; toxic, harassing, or policy-violating language or instructions; prompt-injection attempts; privacy leakage; and cost blowups (often from overly verbose prompts or runaway generation). Failures can have cascading effects in a system [1], especially if the model has access to tools, retrieval systems, or user data, creating “silent” reliability and safety problems that are hard to identify in latency/error-rate dashboards.

Keyword filters, heuristic rules, and manual review can help, but those approaches do not scale and they lack nuance [2] (e.g., a privacy leak that does not contain any obvious keywords). If your logs are high-volume, you need a way to summarize what is happening semantically, not just count events.

Clustered failure modes can help: group similar incidents into coherent clusters, label them, and track cluster-level trends instead of reviewing isolated log lines.

A Topic-Modeling-Driven Pipeline for Safety Signals

Topic modeling, previously the domain of academic text mining, can be used for this task. The general workflow involves representing each log line as a vector, compressing high-dimensional vectors for clustering, clustering similar logs together, and labeling the clusters with human-readable strings. BERTopic combines embeddings, dimensionality reduction, density-based clustering, and class-TF-IDF to derive interpretable topics from large corpora. Here is a step-by-step pipeline for safety monitoring.

Fig 1. BERTopic Pipeline for Safety Monitoring

Data Inputs

The input data comes from LLM interaction logs. A simple record may consist of the user prompt, the model response, timestamps, model version, route (e.g., “customer support”), user segment (internal vs external), and current safety flags. All PII should be redacted or anonymized before analysis. Basic normalization (lower-casing, removing HTML, dropping boilerplate) and splitting of prompts and responses lets the models focus on content rather than formatting. If multiple turns are bundled in a single conversation, whether to analyze each turn separately or concatenate turns to preserve context is a design choice.

Here’s a snippet of code that loads the first 1000 conversation pairs from the publicly available ToxicChat[3] dataset, and stitches together a single text field containing both the user prompt and assistant response.
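A minimal sketch of that loading step, assuming the Hugging Face datasets library and the lmsys/toxic-chat dataset (the toxicchat0124 configuration, whose user_input, model_output, toxicity, and jailbreaking fields are assumed here):

```python
from datasets import load_dataset

# Load the ToxicChat training split (toxicchat0124 config) from Hugging Face.
dataset = load_dataset("lmsys/toxic-chat", "toxicchat0124", split="train")

# Keep the first 1000 conversation pairs and stitch prompt + response
# into a single text field for embedding.
records = []
for row in dataset.select(range(1000)):
    text = f"User: {row['user_input']}\nAssistant: {row['model_output']}"
    records.append(
        {
            "text": text,
            "toxicity": row["toxicity"],          # 0/1 toxicity label
            "jailbreaking": row["jailbreaking"],  # 0/1 jailbreak label
        }
    )

texts = [r["text"] for r in records]
```

Keeping the safety flags alongside the stitched text is useful later, when we score clusters against existing signals.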

Embeddings

To cluster the text, first transform each log into a vector, a numerical representation of the text. State-of-the-art sentence-embedding models like Sentence-BERT or MiniLM translate a sequence of words into a 384- to 768-dimensional vector whose inner-product geometry encodes semantic similarity. For safety monitoring, we can embed the response alone, the prompt with response, or the full conversation. Embedding only responses is simpler but loses contextual information. Concatenating prompt and response captures the user’s intent as well as the model’s behavior, but long turns must be truncated to fit the embedding model’s context window. It is usually best to create the embeddings at the turn level and aggregate later at the session level if desired.

The following snippet shows how to generate embeddings with a lightweight model and run BERTopic. Here, we can provide the precomputed embeddings to BERTopic to prevent repeated computation when experimenting with different cluster parameters.
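A sketch of that step, assuming sentence-transformers’ all-MiniLM-L6-v2 model and the texts list from the loading snippet above:

```python
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Encode each prompt+response pair into a 384-dimensional vector.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(texts, batch_size=64, show_progress_bar=True)

# Pass the precomputed embeddings so BERTopic does not re-encode the
# documents every time we change clustering parameters.
topic_model = BERTopic(embedding_model=embedding_model, min_topic_size=10)
topics, probs = topic_model.fit_transform(texts, embeddings)

# Topic sizes and top keywords per topic (topic -1 is the outlier bucket).
print(topic_model.get_topic_info().head(10))
```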

Dimensionality Reduction with UMAP

Embeddings cannot be clustered directly because distance metrics become uninformative in high dimensions. Uniform Manifold Approximation and Projection (UMAP) embeds data into a low-dimensional manifold that preserves local neighborhoods. In other words, UMAP attempts to position your documents on a 2- or 5-dimensional map such that close points remain close to each other. BERTopic internally uses UMAP to reduce your embeddings prior to clustering, both to speed up clustering and to allow for visualizations. You can also provide your own UMAP model if you wish, as shown here:
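A minimal sketch of passing a custom UMAP model to BERTopic; the parameter values below are illustrative starting points, not tuned recommendations:

```python
from umap import UMAP
from bertopic import BERTopic

# Reduce the embeddings to 5 dimensions before clustering.
umap_model = UMAP(
    n_neighbors=15,   # size of the local neighborhood to preserve
    n_components=5,   # target dimensionality for clustering
    min_dist=0.0,     # tighter packing helps density-based clustering
    metric="cosine",
    random_state=42,  # fixed seed for reproducible topic maps
)

topic_model = BERTopic(umap_model=umap_model)
topics, probs = topic_model.fit_transform(texts, embeddings)
```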

Clustering Safety Behaviours with HDBSCAN and BERTopic

Density-based clustering algorithms, such as HDBSCAN, detect clusters where the data are dense and identify outliers as noise. These algorithms do not require you to specify the number of clusters in advance, unlike k-means, and are thus useful when the number of failure modes is not known. BERTopic builds on this: after performing UMAP reduction, it runs HDBSCAN to find clusters and then computes class-TF-IDF to extract the most characteristic words for each cluster. These keywords render the clusters interpretable: rather than an opaque numeric label, each cluster receives a human-readable description in the form of a list of keywords that define the topic.

You can also skip BERTopic and use HDBSCAN directly to cluster the UMAP embeddings, then summarize each cluster with an LLM or via manual inspection. When dealing with a large number of items, BERTopic's pipeline is convenient because its vectorizer handles stop-word removal and phrase extraction during keyword extraction.
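If you want that lighter-weight alternative, here is a sketch of clustering the UMAP output directly with HDBSCAN, assuming the embeddings array and texts list from earlier:

```python
import numpy as np
from umap import UMAP
from hdbscan import HDBSCAN

# Reduce, then cluster; label -1 marks noise points that fit no cluster.
reduced = UMAP(n_components=5, metric="cosine", random_state=42).fit_transform(embeddings)
clusterer = HDBSCAN(min_cluster_size=10, metric="euclidean")
labels = clusterer.fit_predict(reduced)

# Pull a few example documents per cluster for manual or LLM-assisted summaries.
for cluster_id in sorted(set(labels) - {-1}):
    members = np.where(labels == cluster_id)[0]
    print(f"Cluster {cluster_id}: {len(members)} docs")
    for idx in members[:3]:
        print("  ", texts[idx][:120].replace("\n", " "))
```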

Topic Naming and De-duplication

Clusters are only useful if humans can understand them. Review the top keywords (from BERTopic) or a sample of log lines in each cluster, and give it a name that human annotators can recognize (such as “PII leakage,” “harassment and slurs,” or “token overuse and numeric hallucinations”). Merge clusters that seem to represent the same failure mode, and mark overly small or uninformative clusters as noise. (Some teams automate this step using an LLM to predict a label given the top keywords, but keep a human in the loop so that you don't accept hallucinated labels.)
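BERTopic exposes helpers for this review step; a short sketch, where the topic IDs passed to merge_topics are placeholders you would choose only after inspection:

```python
# Inspect keywords and representative documents before naming anything.
print(topic_model.get_topic(0))                    # top c-TF-IDF keywords for topic 0
print(topic_model.get_representative_docs(0)[:2])  # sample documents from topic 0

# After review, merge topics that describe the same failure mode
# (IDs 3 and 7 here are placeholders) and attach human-readable labels.
topic_model.merge_topics(texts, topics_to_merge=[[3, 7]])
topic_model.set_topic_labels({0: "PII leakage", 1: "harassment and slurs"})
```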

Example: Clustering Failure Modes in LLM Logs

For a concrete example, 500 examples from ToxicChat [3] are clustered, with sampling stratified to keep an equal number of benign, toxic, and jailbreak examples. The full prompt+response is embedded, UMAP reduces the dimensionality, and BERTopic runs HDBSCAN to cluster. The number of clusters found will vary with the structure of the data and the HDBSCAN parameters (here, min_cluster_size=10); in practice, this produces several coherent topic clusters. Each cluster is a bucket of related failure modes that can be triaged together as a single unit.

Fig 2. Interactive topic explorer showing the discovered clusters

Fig 3. Document scatter plot showing topic clusters discovered by BERTopic
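Views like Fig 2 and Fig 3 can be produced with BERTopic's built-in Plotly visualizations; a minimal sketch, assuming the fitted topic_model, texts, and embeddings from earlier:

```python
# Interactive inter-topic distance map (Fig 2) and per-document scatter (Fig 3).
fig_topics = topic_model.visualize_topics()
fig_docs = topic_model.visualize_documents(texts, embeddings=embeddings)

# Save as standalone HTML for sharing with reviewers.
fig_topics.write_html("topic_explorer.html")
fig_docs.write_html("document_map.html")
```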

From Clusters to Safety Monitoring

Once clusters are named, they become actionable safety signals. Map each cluster to a risk category (for example, "Privacy & PII Exposure", "Harassment & Slurs", "Self-Harm & Crisis Content", or "Prompt Injection & Jailbreaks"), as sketched in code after the list:

  • Clusters that discuss or disclose PII map to "Privacy & PII Exposure";
  • Clusters that contain slurs or similar harmful language map to "Harassment & Slurs";
  • Clusters that contain requests for self‑harm or share crisis content map to "Self-Harm & Crisis Content";
  • Clusters that attempt prompt‑injection map to "Prompt Injection & Jailbreaks".
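A minimal sketch of such a mapping, matching each topic's c-TF-IDF keywords against illustrative (not exhaustive) trigger lists:

```python
# Illustrative keyword triggers per risk category; tune these to your domain.
RISK_CATEGORIES = {
    "Privacy & PII Exposure": {"address", "email", "phone", "ssn", "password"},
    "Harassment & Slurs": {"hate", "slur", "insult", "harass"},
    "Self-Harm & Crisis Content": {"suicide", "self-harm", "crisis"},
    "Prompt Injection & Jailbreaks": {"ignore", "jailbreak", "dan", "system prompt"},
}

def categorize(topic_id, topic_model):
    """Assign a risk category based on a topic's top c-TF-IDF keywords."""
    keywords = {word for word, _ in topic_model.get_topic(topic_id)}
    for category, triggers in RISK_CATEGORIES.items():
        if keywords & triggers:
            return category
    return "Uncategorized / needs review"
```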

Tracking the volume of each cluster over time can help you identify emerging risks: a spike in the "Privacy & PII Exposure" cluster tells you privacy leaks are rising. You can also monitor the fraction of logs that fall into each cluster against total traffic to spot early warning signs of model drift or attack campaigns. The heatmap in Fig 4 displays the actual cluster labels discovered in the example run.

Fig 4. Safety risk heatmap showing toxicity and jailbreak rates across discovered failure mode clusters. Darker colors indicate higher-risk clusters requiring immediate attention.

One simple way to prioritise clusters is to compute a risk score that aggregates existing safety signals (such as toxicity and jailbreak flags). The snippet below shows how to aggregate cluster statistics and compute a weighted risk score. You can customize the weights according to your priorities.
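A sketch of that aggregation with pandas, assuming the per-document topics from BERTopic and the ToxicChat toxicity/jailbreaking flags loaded earlier; the 0.6/0.4 weights are placeholders:

```python
import pandas as pd

df = pd.DataFrame(records)
df["topic"] = topics  # cluster assignment per document (-1 = noise/outlier)

# Per-cluster statistics from existing safety flags.
cluster_stats = (
    df[df["topic"] != -1]
    .groupby("topic")
    .agg(
        n_docs=("text", "size"),
        toxicity_rate=("toxicity", "mean"),
        jailbreak_rate=("jailbreaking", "mean"),
    )
)

# Weighted risk score; adjust the weights to reflect your own priorities.
cluster_stats["risk_score"] = (
    0.6 * cluster_stats["toxicity_rate"] + 0.4 * cluster_stats["jailbreak_rate"]
)
print(cluster_stats.sort_values("risk_score", ascending=False).head(10))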

Clusters with the highest risk scores should be reviewed first. In production, you can use additional signals such as PII detectors, refusal‑rate metrics, or domain‑specific flags. The key is consistency: pick a small set of interpretable signals, compute them per cluster, and track their trendlines.

Evaluation: Does This Help Catch Real Safety Issues?

A safety-monitoring pipeline must be validated. Two complementary approaches work well.

Silver Labels at Scale

Moderation classifiers or rule-based filters that are already in place can be used to generate silver labels. For instance, moderation APIs can flag sexual, violent, hateful, or self-harm content, and content-moderation datasets such as ToxicChat come with their own labels for toxicity, harassment, and so on. Apply these labels to your logs and see if the clusters map to coherent safety categories. A strong clustering will have high purity: most of the logs in the “privacy and harassment” cluster should be flagged by the privacy or toxicity classifier. It should also have good recall for novel patterns that classifiers don’t cover: for example, you may discover new failure modes (cost abuse) that don’t match any existing rules.
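A sketch of a purity check against the existing ToxicChat flags, reusing the df frame built for the risk score:

```python
# Silver-label purity: what fraction of each cluster is flagged by any
# existing signal (here, ToxicChat's toxicity or jailbreak labels)?
df["flagged"] = (df["toxicity"] == 1) | (df["jailbreaking"] == 1)

purity = (
    df[df["topic"] != -1]
    .groupby("topic")["flagged"]
    .mean()
    .sort_values(ascending=False)
)

# High-purity clusters confirm known risk areas; low-purity but coherent
# clusters are candidates for novel failure modes your rules don't cover.
print(purity.head(10))
```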

Gold Labels on a Curated Sample

For a more fine-grained evaluation, take a stratified random sample of logs from each cluster and have safety experts annotate the failure mode and severity. Compute inter-annotator agreement (e.g., Cohen’s kappa) to gauge reliability. On the annotated sample, metrics such as Precision@K (the fraction of the top-K clusters that represent true safety issues), Normalized Discounted Cumulative Gain (NDCG, which measures whether the most severe clusters are ranked at the top), cluster stability across runs, and reduction in manual review effort (scanning 20 topics instead of 10,000 isolated logs) provide quantitative evidence that the pipeline is surfacing meaningful patterns.
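As one example, a minimal Precision@K computation over expert annotations, assuming a placeholder dict that maps cluster IDs to a boolean "true safety issue" verdict and the risk-score ranking from earlier:

```python
def precision_at_k(ranked_cluster_ids, expert_verdicts, k=10):
    """Fraction of the top-K ranked clusters that experts confirmed as
    genuine safety issues (missing annotations count as negatives)."""
    top_k = ranked_cluster_ids[:k]
    confirmed = sum(1 for cid in top_k if expert_verdicts.get(cid, False))
    return confirmed / max(len(top_k), 1)

# Example usage with the risk-score ranking computed above.
ranked = cluster_stats.sort_values("risk_score", ascending=False).index.tolist()
expert_verdicts = {0: True, 1: True, 2: False}  # placeholder annotations
print(f"Precision@5 = {precision_at_k(ranked, expert_verdicts, k=5):.2f}")
```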

Practical Considerations and Pitfalls

Costs and Batching

Executing BERTopic on millions of logs requires non-trivial compute resources, and generating LLM-based labels can be expensive as well. Batch your embedding requests, and cache the results. Use lightweight embedding models for the daily monitoring use-case, and save the heavy-duty, larger models for periodic deep dives. To streamline labeling of clusters, ask a lightweight model to generate a list of candidate names, then have humans vet and approve.
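One simple way to implement that caching, sketched with NumPy; the cache path and batch size are arbitrary choices:

```python
import os
import numpy as np

CACHE_PATH = "embeddings_cache.npy"  # arbitrary local cache location

def get_embeddings(texts, embedding_model, cache_path=CACHE_PATH):
    """Encode texts in batches, reusing a cached array when available."""
    if os.path.exists(cache_path):
        cached = np.load(cache_path)
        if len(cached) == len(texts):
            return cached
    embeddings = embedding_model.encode(texts, batch_size=64, show_progress_bar=True)
    np.save(cache_path, embeddings)
    return embeddings
```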

Hallucination Risk and Mitigation

If you're using LLMs to label clusters, watch out for hallucinations: the model can generate labels that are incorrect or misleading. Always ground labels back to concrete examples; a best practice is to maintain provenance by storing representative log lines for each cluster. Safety reviewers should check whether the generated labels match the actual content.

Data Governance and Privacy

Safety monitoring touches many of the same sensitive-data handling concerns as conventional observability. There's a risk of exposing personal data, private business information, or regulated content. Follow your organization's data-governance policies: mask or scrub PII, limit access to preauthorized people, and make sure that your monitoring systems are in regulatory compliance (GDPR, CCPA, etc.). Some generative AI systems memorize their training data and can leak it; continuous monitoring can detect such leaks and provide audit trails [4].

Identity Ambiguity and Aggregation

Aggregating at the user level means that identity-matching may be imprecise (noisy). For example, a person may have multiple accounts. When labeling clusters, try not to single out individual identities for action. Aggregate signals at the cohort or system level. After all, the whole point of monitoring is to flag problematic patterns, not to punish particular individuals. In some cases, such as research or analysis use-cases, you may want to work with hashed or pseudonymized identifiers instead.

Integration with Existing Observability Tools

A topic-modeling pipeline is complementary to, not a replacement for, your conventional observability stack. You still want to keep a watchful eye on latency, throughput, error rates, and token-usage quotas for your LLM service. Many organizations combine dashboards that show these technical metrics alongside cluster counts and topic labels. Monitoring tools can alert you when the volume of a specific cluster crosses a threshold, and periodically refitting the topic model keeps the clusters relevant as language patterns shift.

Conclusion

LLMs can power new customer interfaces and workflows across organizations, but they also create new vectors of safety failure. These risks do not present themselves as obvious alerts; they propagate through logs and unstructured text, so we need tools that go beyond dashboards and basic classifiers. By combining embeddings, UMAP, HDBSCAN, and class-TF-IDF, BERTopic produces an interpretable map of LLM failure modes that safety, risk, and engineering teams can use to understand where models are drifting, what harms are emerging, and how to direct mitigation resources. End-to-end observability is necessary for generative AI pipelines to be secure; topic modeling provides an additional layer of semantic observability designed specifically for unstructured logs. As telemetry grows, teams cannot manually monitor logs [2]; we must automate. With the techniques demonstrated in this article, organizations can detect hallucinations, privacy leaks, and even previously unknown system-level failure modes before they cause harm.

Checklist: If You Do This, Don’t Forget…

  • Collect and preprocess responsibly: normalize and redact PII before analysis.
  • Choose the right embedding model: it's a balance between accuracy and cost; embed both prompts and responses if context is important.
  • Tune UMAP and HDBSCAN: Small changes to the number of neighbors or the minimum cluster size can dramatically affect clusters; experiment on a sample before scaling.
  • Keep humans in the loop: LLMs can draft cluster names, but experts should verify and merge topics.
  • Monitor over time: Cluster volumes and severity should be watched over time, and thresholds should be revisited as traffic or models change.
  • Maintain provenance: store representative logs for each cluster to support root-cause analysis.
  • Respect data governance: Ensure monitoring is compliant with privacy laws and internal policies.

About the Author

Jofia Jose Prakash (Senior Member, IEEE Computer Society) is an AI Researcher and Enterprise Strategist who builds enterprise-grade AI systems that deliver measurable business and societal impact. With end-to-end AI/ML expertise from data and model development to deployment and governance, she designs ethical, scalable solutions that align with business goals, regulations, and responsible-AI standards. She also contributes to the AI community through thought leadership, mentoring, and initiatives that advance transparency and accountability.

References:

[1] V. Vinay, “A System-Level Taxonomy of Failure Modes in Large Language Model Applications,” arXiv, arXiv:2511.19933, Nov. 2025. [Online]. Available: https://arxiv.org/abs/2511.19933. Accessed: Dec. 18, 2025.

[2] H. A. Work, “Autonomous Observability: AI Agents That Debug AI,” IEEE Computer Society Tech News (Community Voices). [Online]. Available: https://www.computer.org/publications/tech-news/community-voices/autonomous-observability-ai-agents. Accessed: Dec. 18, 2025.

[3] Z. Lin, Z. Wang, Y. Tong, Y. Wang, Y. Guo, Y. Wang, and J. Shang, “ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions,” LMSYS Blog, Oct. 30, 2023. [Online]. Available: https://lmsys.org/blog/2023-10-30-toxicchat/. Accessed: Dec. 18, 2025.

[4] nexos.ai experts, “LLM monitoring: Definition, metrics, and best practices,” nexos.ai, Dec. 3, 2025. [Online]. Available: https://nexos.ai/blog/llm-monitoring/. Accessed: Dec. 18, 2025.

Additional Resources:

Full code implementation is available in this GitHub repository: https://github.com/jofiajoseprakash/llm-safety-topic-modeling

Disclaimer: The authors are completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE’s position nor that of the Computer Society nor its Leadership.