Ensuring high code quality is critical for developing maintainable, scalable, and secure software such as Maxwell (a wellsite data acquisition software platform). Traditional static analysis tools often fall short in detecting context-sensitive issues, particularly those involving dynamic references or loosely typed languages. Moreover, these tools typically rely on manually crafted rules tailored to specific error types, limiting their adaptability to novel or unforeseen problems [1][2].
To address these challenges, a rule-based error checking system has been implemented (Figure 1). This solution uses a set of predefined rules written in YAML files, which are parsed and executed sequentially by a Python script. These rules target specific patterns and known error types, offering a structured and deterministic approach to error detection. While effective for well-understood issues, this method requires continuous manual updates and struggles to generalize across diverse or evolving codebases.


Figure 1 – Rule-based MW Error Detection Workflow
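To make this workflow concrete, the sketch below shows what a predefined rule and the Python runner that applies it might look like. It is a minimal illustration only: the rule fields (name, file_glob, pattern, message) are an assumed schema, not the actual Maxwell rule format.

# rule_runner.py - minimal sketch of a YAML-driven rule checker.
# Example rules.yaml entry (illustrative schema, not the actual Maxwell format):
#   - name: tool-code-literal
#     file_glob: "src/**/*.cpp"
#     pattern: 'ModImpl\("([A-Z0-9_]+)"'
#     message: "Tool code literal found; verify it against the platform definition."
import glob
import re

import yaml  # PyYAML


def load_rules(rule_file: str) -> list[dict]:
    """Load the list of predefined rules from a YAML file."""
    with open(rule_file, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


def run_rules(rules: list[dict]) -> list[str]:
    """Apply each rule's regex to its target files and collect findings."""
    findings = []
    for rule in rules:
        pattern = re.compile(rule["pattern"])
        for path in glob.glob(rule["file_glob"], recursive=True):
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                for lineno, line in enumerate(f, start=1):
                    if pattern.search(line):
                        findings.append(f"{path}:{lineno}: {rule['name']}: {rule['message']}")
    return findings


if __name__ == "__main__":
    for finding in run_rules(load_rules("rules.yaml")):
        print(finding)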
Recent advancements in Large Language Models (LLMs) have opened new possibilities for more automated and intuitive error detection, thanks to their contextual understanding and reasoning capabilities [3][4]. However, most LLMs are designed to analyze individual scripts and struggle with large-scale codebases. They face challenges in navigating across multiple files, integrating context from disparate parts of the codebase, and performing reliable, comprehensive analysis [5].
To overcome these limitations, multi-agent LLM frameworks have gained attention for their collaborative problem-solving capabilities. By coordinating multiple LLM agents, these frameworks show promise in tackling complex tasks such as automated code checking across extensive codebases [6][7].
This case study explores the use of LLMs to enhance software quality, with a focus on detecting reference errors in Maxwell source code. These errors often appear in plain text or string literals, such as configuration keys, API endpoints, file paths, and function names, and are not typically validated by compilers, making them prone to runtime failures and broken integrations. Examples include Tool Code, Domain Code, and Measure Point.
By evaluating the effectiveness of LLMs in identifying such issues, the study aims to assess their potential and limitations in automating error detection, particularly in comparison to existing rule-based approaches.
When selecting a suitable large language model (LLM) for a project, several key factors should be considered. Local models offer benefits such as no cloud-related costs, full data privacy, and fast deployment, making them ideal for environments with strict data control requirements. However, they are often constrained by limited model size and depth. On the other hand, cloud-based models typically deliver higher accuracy and more advanced capabilities due to their access to greater computational resources. These advantages come with trade-offs, including increased latency, higher operational costs, a slower approval process, and potential privacy concerns. Balancing these factors is essential to choosing the right model for your specific use case.
Larger LLMs generally require more hardware resources, such as system RAM, GPU VRAM, and processing power. Choose a model size that balances performance needs with available infrastructure.


Table 1 – Rule of Thumb
Driven by constraints in privacy, time, and budget, this experiment was conducted using compact LLMs deployed on a self-contained Dell PowerEdge XR12 workstation. The following section outlines the basic hardware configuration used to support local deployment.
This setup enabled controlled testing of lightweight models within a constrained timeframe. Given the limited resources, Mistral was initially selected for its balance of performance and efficiency. To further benchmark error detection capabilities across different LLMs, Mistral 7B, CodeLlama 13B, and Llama 3.1 8B were deployed locally using the Ollama runtime.
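As a minimal illustration of this local deployment, the sketch below sends a prompt to one of the pulled models through the Ollama Python client. It assumes the Ollama server is running locally and that the models have been pulled under their standard tags; the prompt and code fragment are illustrative.

# query_local_model.py - sketch of calling a locally deployed model via Ollama
# (assumes `ollama serve` is running and models were pulled, e.g.
#  `ollama pull mistral`, `ollama pull codellama:13b`, `ollama pull llama3.1:8b`).
import ollama  # pip install ollama


def ask_model(model: str, prompt: str) -> str:
    """Send a single prompt to a local model and return its text response."""
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]


if __name__ == "__main__":
    snippet = 'ModImpl("LAPD_675", ...)'  # illustrative code fragment
    print(ask_model("mistral", "Extract the tool code from this code:\n" + snippet))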
Why Mistral?
Why CodeLlama?
Why Llama 3.1?
Why Ollama?
During the case study, we defined three pilot tasks to evaluate code quality from distinct dimensions: Tool Code, Measure Point, and Domain Code. The initial approach used a single LLM to manage all code quality tasks, including Tool Code checking, Domain Code checking, and Measure Point validation. For each error checking category, the program takes user-provided code scripts as input and leverages 1) rule-based methods for trivial extraction tasks, and 2) an LLM for more complex and non-deterministic extraction and code analysis.
Take the tool code check on the LA3D platform as an example. Verifying the consistency of the tool code involves the following steps:
To verify tool code consistency, the program requires the user to provide the file path for each step. The tool codes extracted in steps 2 and 3 must match; the program compares the two sets of extracted codes and flags any mismatch to confirm that the correct tool code is used.
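A minimal sketch of that final comparison is shown below; the function name and the PASS/FAIL wording are illustrative, not the production implementation.

# tool_code_check.py - sketch of the final consistency check between the tool
# codes extracted in steps 2 and 3 (function and output format are illustrative).
def check_tool_code_consistency(codes_step2: set[str], codes_step3: set[str]) -> str:
    """Return PASS when both extraction steps yield the same tool codes."""
    mismatches = codes_step2 ^ codes_step3  # symmetric difference = disagreements
    if mismatches:
        return "FAIL: mismatched tool codes " + ", ".join(sorted(mismatches))
    return "PASS"


# Example: both steps found the same tool code, so the check passes.
print(check_tool_code_consistency({"LAPD_675"}, {"LAPD_675"}))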
We quickly recognized that, due to inherent limitations in LLMs—such as constrained context windows and inconsistent task generalization—it was impractical to rely on a single model to handle all error-checking tasks effectively. To optimize resource utilization and reduce computational overhead, we instantiated three dedicated LLM instances, each assigned to a specific task: Tool Code validation, Measure Point verification, and Domain Code analysis (Figure 2). This task-specific deployment enabled parallel execution and allowed each model to operate within a well-defined scope, improving both processing efficiency and analytical accuracy.


Figure 2 – High-level Multi-Instance MW Error Detection Workflow
While this solution is easy to implement and yields relatively accurate extraction results, it requires substantial time and human expertise for each specific error-checking task (e.g., programming the extraction rules and preparing dedicated code scripts for error checking). As a result, it is difficult for this method to generalize automatically to new or different tasks.
To address the limitations of the single-LLM setup, we adopted a multi-agent LLM architecture to make this system more generalizable and automate the entire error-checking pipeline. An LLM agent is an autonomous, instruction-following entity powered by a language model. It can take tasks, make decisions, and pass information to other agents (also backed by LLMs).
We defined a three-agent pipeline for error checking: a planner agent, an extractor agent, and an analyzer agent.
As shown in Figure 3, with this setup, the user only needs to write a list of checking instructions in natural language, including the task description, checking instructions, validation criteria, and file locations. The system will automatically plan the error checking steps and navigate through the code base to conduct the checking. When it comes to a new error checking task, the user only needs to draft a new instruction in plain text.


Figure 3 – Detailed Workflow of the Multi-Agent Collaboration Pipeline
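For illustration, a checking instruction for the tool code check might look like the hypothetical example below; the wording and file locations are placeholders rather than actual Maxwell instructions.

Task description: verify tool code consistency for the LA3D platform.
Checking instructions: extract the tool code defined in the platform configuration file and the first argument of each ModImpl(...) call in the acquisition code.
Validation criteria: the check passes only if both extractions yield the same tool codes; report any mismatched values.
File locations: <path to configuration file>, <path to acquisition code>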
This modular approach improved task specialization, accuracy, system scalability, and robustness.
Planner Agent
The planner agent is responsible for generating a detailed error detection plan based on user-provided instructions. In particular, the generated step-by-step plan is required to have:
Extractor Agent
The extractor agent is mainly responsible for extracting data for non-deterministic extraction types. For instance, if an extraction step requires the LLM to identify function calls whose names follow a format similar to 'ModImpl' and extract their first argument, the extractor agent is prompted to read through the code snippet, identify the qualifying function calls, extract the first argument of each, and return the results as a list.
For a long code script, extraction is performed by splitting the script into multiple chunks to preserve extraction quality. To limit the impact of hallucination, a separate validation function checks after each extraction whether the extracted content actually exists in the original code snippet; any item that does not is discarded. Once all chunks have been processed, the agent merges the extracted content into a consolidated result for further analysis.
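The sketch below outlines this chunk-extract-validate-merge loop. It assumes a one-argument ask_llm callable wrapping the local model (e.g., functools.partial(ask_model, "llama3.1:8b") using the earlier helper); the chunk size and prompt wording are illustrative.

# chunked_extraction.py - sketch of the extractor agent's chunked extraction
# with post-hoc validation (chunk size and prompt are illustrative).
def chunk_script(script: str, max_chars: int = 4000) -> list[str]:
    """Split a long code script into chunks that fit the model's context window."""
    return [script[i:i + max_chars] for i in range(0, len(script), max_chars)]


def extract_tool_codes(script: str, ask_llm) -> list[str]:
    """Extract chunk by chunk, keeping only strings that really occur in the code."""
    results = []
    for chunk in chunk_script(script):
        prompt = ("Extract the first argument of every ModImpl(...) call in the "
                  "following code and return one value per line:\n" + chunk)
        candidates = ask_llm(prompt).splitlines()
        # Validation step: discard anything the model invented (hallucination)
        # by requiring each extracted string to appear verbatim in the chunk.
        results.extend(c.strip() for c in candidates if c.strip() and c.strip() in chunk)
    return sorted(set(results))  # merge chunk results into one consolidated list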
Analyzer Agent
The analyzer agent is responsible for performing the final check for the targeted error type based on the data extracted in the previous steps. The agent is provided with the error checking instructions generated by the planner agent, along with the extracted data produced by the extractor agent.
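A minimal sketch of this final analysis step is shown below; it again assumes a one-argument ask_llm callable wrapping the local model, and the prompt wording is illustrative.

# analysis_step.py - sketch of the analyzer agent's final PASS/FAIL judgment
# (prompt wording is illustrative; ask_llm wraps the local model call).
def analyze(checking_instructions: str, extracted_data: dict, ask_llm) -> str:
    """Ask the model to judge the extracted data against the checking instructions."""
    prompt = ("Checking instructions:\n" + checking_instructions + "\n\n"
              "Extracted data:\n" + str(extracted_data) + "\n\n"
              "Answer PASS if the extracted data satisfies the validation criteria, "
              "otherwise answer FAIL with a one-sentence reason.")
    return ask_llm(prompt)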
The architecture is designed to be flexible and user-friendly—allowing developers to describe tasks in plain language while the system handles the complexity behind the scenes. It’s a step toward making intelligent code analysis more accessible and less manual.
The experiment utilized a combination of dummy codebases and real-world Maxwell codebases written in C++ and XML. These included a range of Maxwell tool applications from small to medium scale, specifically ADN (a density and porosity downhole tool with approximately 72,000 lines), and LA3D (a downhole resistivity tool with over 260,000 lines). This selection was designed to provide a diverse representation of code complexity, structure, and size. Preprocessing steps involved anonymizing sensitive data, enforcing consistent formatting standards, and segmenting large files to facilitate analysis and maintain data integrity. This setup ensured a realistic and scalable testbed for evaluating the proposed methods.
While applying small-scale LLMs to improve code quality, several limitations became apparent during experimentation. These challenges highlight the constraints of current model capabilities and underscore the importance of careful model selection and prompt design. The following subsections detail the most prominent issues observed: hallucination, catastrophic forgetting, and limited context comprehension.
Hallucination
The first challenge when applying these small-scale LLMs is 'hallucination'. Hallucination refers to output that is fabricated or unrelated to the provided context or prompt. As shown in the left part of Figure 4, when Mistral is used to extract a tool code, the extracted result follows the format of an OSDD code argument, but the content of the string is 'invented' by the LLM. Similarly, in the right part of Figure 4, CodeLlama outputs Python functions that neither exist in the original code script nor follow the requested instructions.
Hallucination primarily stems from the pre-training data. Because models rely heavily on the data they were trained on, they tend to produce content that resembles that data rather than demonstrating full comprehension of the task [8]. This reliance on training data, combined with the probabilistic nature of language models, produces the hallucinations seen in LLM outputs.


Figure 4 – Examples of hallucination in LLMs' outputs
Catastrophic Forgetting
Catastrophic Forgetting refers to the phenomenon where an LLM loses previously acquired knowledge when it learns new information. In the scenario of error detection, when the input code script is too long, the LLM would forget the code content that appears earlier. Therefore, even if qualified code content (e.g., the tool code string, such as ‘LAPD_675’) exists in the input code script, the LLM would focus more on the latter part of the code script and fail to extract the qualified code content. As shown in Figure 5, the LLM is instructed to extract a tool code string from a .xml file. Since the .xml file is relatively long and repetitive, the model’s output mainly consists of unqualified strings appearing in the latter part of the file.


Figure 5 – Example of LLMs’ catastrophic forgetting
Typically, catastrophic forgetting occurs because LLMs encode knowledge across vast matrices of weights and biases, and incorporating new information can disrupt the delicate balance between retaining existing knowledge and learning new material [9]. In addition, larger models often show higher resistance to catastrophic forgetting. Since our case study mainly leverages small LLMs, it is understandable that some models could not keep all relevant information in 'mind'.
Limited Context Comprehension Capabilities
Another primary challenge facing smaller LLMs is their limited ability to comprehend context. These models often fail to fully comprehend the instructions during the analysis stage, producing inconsistent results. As shown in Figure 6, the extracted results (left) of step 2 and step 3 are consistent, so the analysis result should be 'PASS' in this case. However, the model outputs 'FAIL' despite the correct extraction results.


Figure 6 – Example of LLMs’ failure in conducting correct analysis
LLMs' limited contextual understanding is mainly caused by their restricted context windows [10]. A well-sized context window improves an LLM's ability to generate coherent text and understand complex relationships in the input. However, larger context windows require more computational resources, in terms of both memory and processing time.
Model Performance Analysis
Based on the deployment and evaluation of three distinct LLMs, here are the key findings derived from their performance across targeted code quality tasks.
Common Pros Across All Models
Common Cons


Table 2 – Model Performance Analysis
The Mistral 7B model is an example of how efficiency in parameter utilization can lead to high performance, making it a lightweight yet effective language model. With only 7 billion parameters, it demonstrates remarkable achievements in tasks requiring natural language understanding and generation, offering a smaller and faster alternative to larger models while retaining competitive performance. However, its limited role-playing capabilities restrict its applicability in the multi-agent architecture that requires nuanced interactions and persona-based responses. Furthermore, Mistral 7B struggles with code comprehension, making it less suitable for programming-related use cases such as debugging or code generation. A major drawback of this model is its high tendency to hallucinate, where it generates inaccurate or fabricated information. This issue undermines the reliability of its outputs, especially in high-stakes applications like legal or scientific writing, where factual correctness is critical.
CodeLlama 13B, on the other hand, has been explicitly pretrained on large-scale code datasets, which significantly enhances its ability to understand and generate programming code across multiple languages. This specialization makes it a powerful tool for software developers, particularly for tasks such as code generation, refactoring, and debugging. However, its primary limitation lies in its inability to consistently follow specified output formats, which can be frustrating when precise formatting is crucial, such as when adhering to coding standards or generating structured data. Additionally, like many generative models, CodeLlama 13B is prone to hallucination, especially when generating non-existent code or misinterpreting requirements. This behavior poses risks in software development, where incorrect or misleading code can lead to inefficiencies, bugs, or even security vulnerabilities. Overall, while CodeLlama 13B excels in code comprehension and generation, its limitations highlight the importance of human oversight in its application.
Llama 3.1 8B introduces notable improvements, such as an expanded context length, which allows it to process and generate longer sequences of text or code scripts. This capability is particularly advantageous for analyzing or generating large-scale projects where maintaining continuity across multiple sections is essential. The model also demonstrates enhanced instruction-following capabilities, making it more adept at adhering to user directives and performing tasks with greater precision. However, Llama 3.1 8B struggles with generating reliable analytical insights, which may limit its effectiveness in tasks requiring critical reasoning or domain-specific evaluations. For instance, while it can process extensive input data, its analysis may lack depth or factual accuracy, potentially leading to suboptimal outcomes in research or data-driven decision-making. These limitations suggest that while Llama 3.1 8B offers significant advancements in context handling and usability, it requires complementary tools or human intervention for more complex analytical tasks.
We utilized two Dell XR12 workstations, each equipped with a different grade of NVIDIA GPU, to evaluate model performance under varying hardware capabilities.
Workstation 1, equipped with an NVIDIA T1000 GPU (8 GB VRAM), was used to evaluate compact models such as Mistral 7B and Llama 3.1 8B. These models were selected due to their lower VRAM requirements, making them suitable for environments with limited GPU resources. While both models demonstrated good responsiveness, inference times were relatively long, indicating that the T1000 provided only marginal acceleration over CPU-only execution. This setup effectively explored the performance boundaries of lightweight models in code extraction and analysis tasks.


Table 3 – Model Performance Statistics on Workstation 1
Workstation 2, upgraded with an NVIDIA T4000 GPU (16 GB VRAM), enabled more extensive GPU offloading and supported larger models such as CodeLlama 13B. All models tested on this workstation—Mistral, CodeLlama, and Llama 3.1—achieved significantly faster inference times and maintained very good responsiveness. The enhanced GPU capacity and system resources contributed to reduced CPU load and improved overall efficiency, especially for models with larger parameter sizes.


Table 4 – Model Performance Statistics on Workstation 2
This comparison highlights the impact of GPU capability on model performance, with Workstation 2 offering a clear advantage in speed and scalability for more demanding inference tasks.
The multi-agent architecture was evaluated on both ADN and LA3D applications, focusing on error detection in tool code, domain type, and measure point identification. To assess its effectiveness, three key metrics were used: plan reliability, extraction accuracy, and evaluation accuracy.
Since the models produce identical outputs across multiple rounds, we evaluated the results based on a single test run for each error type. The number of extraction and analysis results depends on the specific code-checking method: for error types with multiple data points (e.g., Tool Code), the scores are calculated as the average across all data points; for error types that rely on a single file containing all relevant code snippets or strings (e.g., Measure Point), the scores are based on a single extraction or analysis result.
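As a simple illustration of the scoring, the sketch below averages per-data-point correctness into the reported percentage; the function name and example values are illustrative.

# Sketch of averaging extraction accuracy over multiple data points
# (e.g., the individual tool codes checked for the Tool Code error type).
def extraction_accuracy(per_point_correct: list[bool]) -> float:
    """Percentage of data points whose extracted value matched the expected one."""
    return 100.0 * sum(per_point_correct) / len(per_point_correct)


# e.g., 3 of 4 data points extracted correctly -> 75.0
print(extraction_accuracy([True, True, False, True]))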
The following table shows the results of the tool code, domain type, and measure point checks on both the LA3D and ADN applications:


Table 5 – Experiment Results
Table 5 presents the results of all three models across each evaluation stage. Red numbers marked with double asterisks (**) indicate invalid results, while bold blue numbers highlight valid and best-performing outcomes. Both extraction and evaluation accuracy are expressed as percentages, with higher values indicating greater accuracy or reliability.
It is important to note that CodeLlama's evaluation accuracy is considered invalid in cases where a 'PASS' result was generated based on an empty extraction. This highlights a critical flaw in its evaluation logic.
As shown in Table 5, Llama 3.1 consistently outperforms both Mistral and CodeLlama across all stages. Mistral frequently generates plans that deviate from the provided instructions, resulting in unreliable extraction and analysis. Although CodeLlama occasionally reports higher evaluation accuracy, further inspection reveals that it incorrectly treats empty extractions as correct, exposing its limited instruction-following capabilities.
Regarding the three error types analyzed, all models exhibit weak performance, underscoring the limited analytical and comprehension abilities of small-scale LLMs.
Notably, all three models produce consistent and identical outputs for the same prompt, suggesting a relatively low degree of randomness in their generation processes.
Building on the initial success of leveraging LLMs to improve code quality, the following enhancements are being considered to further expand capabilities and effectiveness:
This evaluation helped us better understand how compact, locally hosted LLMs perform in real-world code quality tasks. While the multi-agent architecture provided a strong foundation for automation and flexibility, the models themselves revealed both strengths and limitations.
We saw encouraging results in areas like code extraction and responsiveness—especially when supported by more capable GPUs. At the same time, challenges such as hallucination, limited context retention, and inconsistent analysis highlighted the need for careful model selection and thoughtful system design.
Overall, this case study reinforces the idea that LLMs can play a valuable role in improving software quality, particularly when paired with a modular, instruction-driven framework. As we look ahead, enhancements like more powerful hardware and access to advanced cloud-based models such as GPT-4o offer exciting opportunities to further strengthen this approach.
The authors express their sincere gratitude to the HFE Innovation Committee for their support and for providing the opportunity to present and share this work. We also extend our thanks to Dong Wang, HFE DnM Maxwell Manager, and his team for their valuable guidance and insights throughout the project.
Weijia Yang is a Principal Engineer and Software Architect at SLB, based in Sugar Land, Texas. He has been with SLB for over 20 years, contributing to a wide range of engineering and software initiatives. His current projects focus on downhole data acquisition, wellsite operation optimization, and automation.
Jiaju Chen is a master’s student in Computer Science at Rice University. He joined SLB as a Software Engineer Intern in the summer of 2025. His research centers on human-centered AI design for NLP applications in domain-specific scenarios, including personalized LLM-powered systems for children’s education and datasets for evaluating LLMs’ ability to generate educationally appropriate content.
Carlos Estevez is a Senior IIoT Hardware Engineer leading the global adoption of high-performance Edge computers for SLB’s Well Construction, Reservoir Performance, Digital & Integration, and advanced gateways for Production Services. With 20+ years at SLB, including 6 years in field operations, Carlos has led roles spanning IT, connectivity, cybersecurity, regulatory compliance, and product integration for safety, reliability, and service optimization.
[1] Alam, I., Li, T., Brock, S., & Gupta, P. (2022). DRDebug: Automated Design Rule Debugging. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42(2), 606-615.
[2] Lee, J., Kang, B., & Im, E. G. (2013). Rule-based anti-anti-debugging system. In Proceedings of the 2013 Research in Adaptive and Convergent Systems (pp. 353-354).
[3] Tian, R., Ye, Y., Qin, Y., Cong, X., Lin, Y., Pan, Y., ... & Sun, M. (2024). Debugbench: Evaluating debugging capability of large language models. arXiv preprint arXiv:2401.04621.
[4] Pădurean, V. A., Denny, P., & Singla, A. (2025, February). BugSpotter: Automated Generation of Code Debugging Exercises. In Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1 (pp. 896-902).
[5] Cordeiro, J., Noei, S., & Zou, Y. (2025, May). LLM-Driven Code Refactoring: Opportunities and Limitations. In 2025 IEEE/ACM Second IDE Workshop (IDE) (pp. 32-36). IEEE.
[6] Epperson, W., Bansal, G., Dibia, V. C., Fourney, A., Gerrits, J., Zhu, E., & Amershi, S. (2025, April). Interactive debugging and steering of multi-agent ai systems. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (pp. 1-15).
[7] Islam, M. A., Ali, M. E., & Parvez, M. R. (2025). Codesim: Multi-agent code generation and problem solving through simulation-driven planning and debugging. arXiv preprint arXiv:2502.05664.
[8] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., ... & Liu, T. (2025). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2), 1-55.
[9] Luo, Y., Yang, Z., Meng, F., Li, Y., Zhou, J., & Zhang, Y. (2023). An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747.
[10] Zhao, Z., Monti, E., Lehmann, J., & Assem, H. (2024). Enhancing contextual understanding in large language models through contrastive decoding. arXiv preprint arXiv:2405.02750.
Disclaimer: The authors are completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE’s position nor that of the Computer Society nor its Leadership.