Ensuring high code quality is critical for developing maintainable, scalable, and secure software such as Maxwell (a wellsite data acquisition software platform). Traditional static analysis tools often fall short in detecting context-sensitive issues, particularly those involving dynamic references or loosely typed languages. Moreover, these tools typically rely on manually crafted rules tailored to specific error types, limiting their adaptability to novel or unforeseen problems [1][2].
To address these challenges, a rule-based error checking system has been implemented (Figure 1). This solution uses a set of predefined rules written in YAML files, which are parsed and executed sequentially by a Python script. These rules target specific patterns and known error types, offering a structured and deterministic approach to error detection. While effective for well-understood issues, this method requires continuous manual updates and struggles to generalize across diverse or evolving codebases.


Figure 1 – Rule-based MW Error Detection Workflow
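To make this workflow concrete, the sketch below shows what a predefined rule and the Python runner that applies it might look like. It is a minimal illustration only: the rule fields (name, file_glob, pattern, message) are an assumed schema, not the actual Maxwell rule format.

# rule_runner.py - minimal sketch of a YAML-driven rule checker.
# Example rules.yaml entry (illustrative schema, not the actual Maxwell format):
#   - name: tool-code-literal
#     file_glob: "src/**/*.cpp"
#     pattern: 'ModImpl\("([A-Z0-9_]+)"'
#     message: "Tool code literal found; verify it against the platform definition."
import glob
import re

import yaml  # PyYAML


def load_rules(rule_file: str) -> list[dict]:
    """Load the list of predefined rules from a YAML file."""
    with open(rule_file, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


def run_rules(rules: list[dict]) -> list[str]:
    """Apply each rule's regex to its target files and collect findings."""
    findings = []
    for rule in rules:
        pattern = re.compile(rule["pattern"])
        for path in glob.glob(rule["file_glob"], recursive=True):
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                for lineno, line in enumerate(f, start=1):
                    if pattern.search(line):
                        findings.append(f"{path}:{lineno}: {rule['name']}: {rule['message']}")
    return findings


if __name__ == "__main__":
    for finding in run_rules(load_rules("rules.yaml")):
        print(finding)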
Recent advancements in Large Language Models (LLMs) have opened new possibilities for more automated and intuitive error detection, thanks to their contextual understanding and reasoning capabilities [3][4]. However, most LLMs are designed to analyze individual scripts and struggle with large-scale codebases. They face challenges in navigating across multiple files, integrating context from disparate parts of the codebase, and performing reliable, comprehensive analysis [5].
To overcome these limitations, multi-agent LLM frameworks have gained attention for their collaborative problem-solving capabilities. By coordinating multiple LLM agents, these frameworks show promise in tackling complex tasks such as automated code checking across extensive codebases [6][7].
This case study explores the use of LLMs to enhance software quality, with a focus on detecting reference errors in Maxwell source code. These errors often appear in plain text or string literals, such as configuration keys, API endpoints, file paths, and function names, and are not typically validated by compilers, making them prone to runtime failures and broken integrations. Examples include Tool Code, Domain Code, and Measure Point.
By evaluating the effectiveness of LLMs in identifying such issues, the study aims to assess their potential and limitations in automating error detection, particularly in comparison to existing rule-based approaches.
When selecting a suitable large language model (LLM) for a project, several key factors should be considered. Local models offer benefits such as no cloud-related costs, full data privacy, and fast deployment, making them ideal for environments with strict data control requirements. However, they are often constrained by limited model size and depth. On the other hand, cloud-based models typically deliver higher accuracy and more advanced capabilities due to their access to greater computational resources. These advantages come with trade-offs, including increased latency, higher operational costs, a slower approval process, and potential privacy concerns. Balancing these factors is essential to choosing the right model for your specific use case.
Larger LLMs generally require more hardware resources, such as system RAM, GPU VRAM, and processing power. Choose a model size that balances performance needs with available infrastructure.


Table 1 – Rule of Thumb
Driven by constraints in privacy, time, and budget, this experiment was conducted using compact LLMs deployed on a self-contained Dell PowerEdge XR12 workstation. The following section outlines the basic hardware configuration used to support local deployment.
This setup enabled controlled testing of lightweight models within a constrained timeframe. Given the limited resources, Mistral was initially selected for its balance of performance and efficiency. To further benchmark error detection capabilities across different LLMs, Mistral 7B, CodeLlama 13B, and Llama 3.1 8B were deployed locally using the Ollama runtime.
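As a minimal illustration of this local deployment, the sketch below sends a prompt to one of the pulled models through the Ollama Python client. It assumes the Ollama server is running locally and that the models have been pulled under their standard tags; the prompt and code fragment are illustrative.

# query_local_model.py - sketch of calling a locally deployed model via Ollama
# (assumes `ollama serve` is running and models were pulled, e.g.
#  `ollama pull mistral`, `ollama pull codellama:13b`, `ollama pull llama3.1:8b`).
import ollama  # pip install ollama


def ask_model(model: str, prompt: str) -> str:
    """Send a single prompt to a local model and return its text response."""
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]


if __name__ == "__main__":
    snippet = 'ModImpl("LAPD_675", ...)'  # illustrative code fragment
    print(ask_model("mistral", "Extract the tool code from this code:\n" + snippet))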
Why Mistral?
Why CodeLlama?
Why Llama 3.1?
Why Ollama?
During the case study, we defined three pilot tasks to evaluate code quality from distinct dimensions: Tool Code, Measure Point, and Domain Code. The initial approach used a single LLM to manage all code quality tasks, including Tool Code checking, Domain Code checking, and Measure Point validation. For each error checking category, the program takes user-provided code scripts as input and leverages 1) rule-based methods for trivial extraction tasks, and 2) an LLM for more complex and non-deterministic extraction and code analysis.
Take the tool code check on the LA3D platform as an example. Verifying the consistency of the tool code involves the following steps:
To verify tool code consistency, the program requires the user to provide the file path for each step. The tool codes extracted in steps 2 and 3 must match; the program compares the two sets of extracted codes and flags any mismatch to confirm that the correct tool code is used.
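A minimal sketch of that final comparison is shown below; the function name and the PASS/FAIL wording are illustrative, not the production implementation.

# tool_code_check.py - sketch of the final consistency check between the tool
# codes extracted in steps 2 and 3 (function and output format are illustrative).
def check_tool_code_consistency(codes_step2: set[str], codes_step3: set[str]) -> str:
    """Return PASS when both extraction steps yield the same tool codes."""
    mismatches = codes_step2 ^ codes_step3  # symmetric difference = disagreements
    if mismatches:
        return "FAIL: mismatched tool codes " + ", ".join(sorted(mismatches))
    return "PASS"


# Example: both steps found the same tool code, so the check passes.
print(check_tool_code_consistency({"LAPD_675"}, {"LAPD_675"}))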
We quickly recognized that, due to inherent limitations in LLMs—such as constrained context windows and inconsistent task generalization—it was impractical to rely on a single model to handle all error-checking tasks effectively. To optimize resource utilization and reduce computational overhead, we instantiated three dedicated LLM instances, each assigned to a specific task: Tool Code validation, Measure Point verification, and Domain Code analysis (Figure 2). This task-specific deployment enabled parallel execution and allowed each model to operate within a well-defined scope, improving both processing efficiency and analytical accuracy.


Figure 2 – High-level Multi-Instance MW Error Detection Workflow
While this solution is easy to implement and yields relatively accurate extraction results, it requires substantial time and human expertise for each specific error-checking task (e.g., programming the extraction rules and preparing dedicated code scripts for error checking). As a result, it is difficult for this method to generalize automatically to new or different tasks.
To address the limitations of the single-LLM setup, we adopted a multi-agent LLM architecture to make this system more generalizable and automate the entire error-checking pipeline. An LLM agent is an autonomous, instruction-following entity powered by a language model. It can take tasks, make decisions, and pass information to other agents (also backed by LLMs).
We defined a three-agent pipeline for error checking: a planner agent, an extractor agent, and an analyzer agent.
As shown in Figure 3, with this setup, the user only needs to write a list of checking instructions in natural language, including the task description, checking instructions, validation criteria, and file locations. The system will automatically plan the error checking steps and navigate through the code base to conduct the checking. When it comes to a new error checking task, the user only needs to draft a new instruction in plain text.


Figure 3 – Detailed Workflow of the Multi-Agent Collaboration Pipeline
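For illustration, a checking instruction for the tool code check might look like the hypothetical example below; the wording and file locations are placeholders rather than actual Maxwell instructions.

Task description: verify tool code consistency for the LA3D platform.
Checking instructions: extract the tool code defined in the platform configuration file and the first argument of each ModImpl(...) call in the acquisition code.
Validation criteria: the check passes only if both extractions yield the same tool codes; report any mismatched values.
File locations: <path to configuration file>, <path to acquisition code>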
This modular approach improved task specialization, accuracy, system scalability, and robustness.
Planner Agent
The planner agent is responsible for generating a detailed error detection plan based on user-provided instructions. In particular, the generated step-by-step plan is required to have:
Extractor Agent
The extractor agent is mainly responsible for extracting data for non-deterministic extraction types. For instance, if an extraction step requires the LLM to identify function calls whose names follow a format similar to 'ModImpl' and extract their first argument, the extractor agent is prompted to read through the code snippet, identify the qualifying function calls, extract the first argument of each, and return the results as a list.
For a long code script, extraction is performed by splitting the script into multiple chunks to preserve extraction quality. To limit the impact of hallucination, a separate validation function checks after each extraction whether the extracted content actually exists in the original code snippet; any item that does not is discarded. Once all chunks have been processed, the agent merges the extracted content into a consolidated result for further analysis.
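The sketch below outlines this chunk-extract-validate-merge loop. It assumes a one-argument ask_llm callable wrapping the local model (e.g., functools.partial(ask_model, "llama3.1:8b") using the earlier helper); the chunk size and prompt wording are illustrative.

# chunked_extraction.py - sketch of the extractor agent's chunked extraction
# with post-hoc validation (chunk size and prompt are illustrative).
def chunk_script(script: str, max_chars: int = 4000) -> list[str]:
    """Split a long code script into chunks that fit the model's context window."""
    return [script[i:i + max_chars] for i in range(0, len(script), max_chars)]


def extract_tool_codes(script: str, ask_llm) -> list[str]:
    """Extract chunk by chunk, keeping only strings that really occur in the code."""
    results = []
    for chunk in chunk_script(script):
        prompt = ("Extract the first argument of every ModImpl(...) call in the "
                  "following code and return one value per line:\n" + chunk)
        candidates = ask_llm(prompt).splitlines()
        # Validation step: discard anything the model invented (hallucination)
        # by requiring each extracted string to appear verbatim in the chunk.
        results.extend(c.strip() for c in candidates if c.strip() and c.strip() in chunk)
    return sorted(set(results))  # merge chunk results into one consolidated list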
Analyzer Agent
The analyzer agent is responsible for performing the final check for the targeted error type based on the data extracted in the previous steps. The agent is provided with the error checking instructions generated by the planner agent, along with the extracted data produced by the extractor agent.
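A minimal sketch of this final analysis step is shown below; it again assumes a one-argument ask_llm callable wrapping the local model, and the prompt wording is illustrative.

# analysis_step.py - sketch of the analyzer agent's final PASS/FAIL judgment
# (prompt wording is illustrative; ask_llm wraps the local model call).
def analyze(checking_instructions: str, extracted_data: dict, ask_llm) -> str:
    """Ask the model to judge the extracted data against the checking instructions."""
    prompt = ("Checking instructions:\n" + checking_instructions + "\n\n"
              "Extracted data:\n" + str(extracted_data) + "\n\n"
              "Answer PASS if the extracted data satisfies the validation criteria, "
              "otherwise answer FAIL with a one-sentence reason.")
    return ask_llm(prompt)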
The architecture is designed to be flexible and user-friendly—allowing developers to describe tasks in plain language while the system handles the complexity behind the scenes. It’s a step toward making intelligent code analysis more accessible and less manual.
The experiment utilized a combination of dummy codebases and real-world Maxwell codebases written in C++ and XML. These included a range of Maxwell tool applications from small to medium scale, specifically ADN (a density and porosity downhole tool with approximately 72,000 lines), and LA3D (a downhole resistivity tool with over 260,000 lines). This selection was designed to provide a diverse representation of code complexity, structure, and size. Preprocessing steps involved anonymizing sensitive data, enforcing consistent formatting standards, and segmenting large files to facilitate analysis and maintain data integrity. This setup ensured a realistic and scalable testbed for evaluating the proposed methods.
While applying small-scale LLMs to improve code quality, several limitations became apparent during experimentation. These challenges highlight the constraints of current model capabilities and underscore the importance of careful model selection and prompt design. The following subsections detail the most prominent issues observed: hallucination, catastrophic forgetting, and limited context comprehension.
Hallucination
The first challenge when applying these small-scale LLMs is 'hallucination'. Hallucination refers to output that is fabricated or unrelated to the provided context or prompt. As shown in the left part of Figure 4, when Mistral is used to extract a tool code, the extracted result follows the format of an OSDD code argument, but the content of the string is 'invented' by the LLM. Similarly, in the right part of Figure 4, CodeLlama outputs Python functions that neither exist in the original code script nor follow the requested instructions.
Hallucination primarily stems from the pre-training data. Because models rely heavily on the data they were trained on, they tend to produce content that resembles that data rather than demonstrating full comprehension of the task [8]. This reliance on training data, combined with the probabilistic nature of language models, produces the hallucinations seen in LLM outputs.


Figure 4 – Examples of hallucination in LLMs' outputs
Catastrophic Forgetting
Catastrophic Forgetting refers to the phenomenon where an LLM loses previously acquired knowledge when it learns new information. In the scenario of error detection, when the input code script is too long, the LLM would forget the code content that appears earlier. Therefore, even if qualified code content (e.g., the tool code string, such as ‘LAPD_675’) exists in the input code script, the LLM would focus more on the latter part of the code script and fail to extract the qualified code content. As shown in Figure 5, the LLM is instructed to extract a tool code string from a .xml file. Since the .xml file is relatively long and repetitive, the model’s output mainly consists of unqualified strings appearing in the latter part of the file.


Figure 5 – Example of LLMs’ catastrophic forgetting
Typically, catastrophic forgetting occurs because LLMs encode knowledge across vast matrices of weights and biases, and incorporating new information can disrupt the delicate balance between retaining existing knowledge and learning new material [9]. In addition, larger models often show higher resistance to catastrophic forgetting. Since our case study mainly leverages small LLMs, it is understandable that some models could not keep all relevant information in 'mind'.
Limited Context Comprehension Capabilities
Another primary challenge facing smaller LLMs is their limited ability to comprehend context. These models often fail to fully comprehend the instructions during the analysis stage, producing inconsistent results. As shown in Figure 6, the extracted results (left) of step 2 and step 3 are consistent, so the analysis result should be 'PASS' in this case. However, the model outputs 'FAIL' despite the correct extraction results.


Figure 6 – Example of LLMs’ failure in conducting correct analysis
LLMs' limited contextual understanding is mainly caused by their restricted context windows [10]. A well-sized context window improves an LLM's ability to generate coherent text and understand complex relationships in the input. However, larger context windows require more computational resources, in terms of both memory and processing time.
Model Performance Analysis
Based on the deployment and evaluation of three distinct LLMs, here are the key findings derived from their performance across targeted code quality tasks.
Common Pros Across All Models
Common Cons


Table 2 – Model Performance Analysis
The Mistral 7B model is an example of how efficiency in parameter utilization can lead to high performance, making it a lightweight yet effective language model. With only 7 billion parameters, it demonstrates remarkable achievements in tasks requiring natural language understanding and generation, offering a smaller and faster alternative to larger models while retaining competitive performance. However, its limited role-playing capabilities restrict its applicability in the multi-agent architecture that requires nuanced interactions and persona-based responses. Furthermore, Mistral 7B struggles with code comprehension, making it less suitable for programming-related use cases such as debugging or code generation. A major drawback of this model is its high tendency to hallucinate, where it generates inaccurate or fabricated information. This issue undermines the reliability of its outputs, especially in high-stakes applications like legal or scientific writing, where factual correctness is critical.
CodeLlama 13B, on the other hand, has been explicitly pretrained on large-scale code datasets, which significantly enhances its ability to understand and generate programming code across multiple languages. This specialization makes it a powerful tool for software developers, particularly for tasks such as code generation, refactoring, and debugging. However, its primary limitation lies in its inability to consistently follow specified output formats, which can be frustrating when precise formatting is crucial, such as when adhering to coding standards or generating structured data. Additionally, like many generative models, CodeLlama 13B is prone to hallucination, especially when generating non-existent code or misinterpreting requirements. This behavior poses risks in software development, where incorrect or misleading code can lead to inefficiencies, bugs, or even security vulnerabilities. Overall, while CodeLlama 13B excels in code comprehension and generation, its limitations highlight the importance of human oversight in its application.
Llama 3.1 8B introduces notable improvements, such as an expanded context length, which allows it to process and generate longer sequences of text or code scripts. This capability is particularly advantageous for analyzing or generating large-scale projects where maintaining continuity across multiple sections is essential. The model also demonstrates enhanced instruction-following capabilities, making it more adept at adhering to user directives and performing tasks with greater precision. However, Llama 3.1 8B struggles with generating reliable analytical insights, which may limit its effectiveness in tasks requiring critical reasoning or domain-specific evaluations. For instance, while it can process extensive input data, its analysis may lack depth or factual accuracy, potentially leading to suboptimal outcomes in research or data-driven decision-making. These limitations suggest that while Llama 3.1 8B offers significant advancements in context handling and usability, it requires complementary tools or human intervention for more complex analytical tasks.
We utilized two Dell XR12 workstations, each equipped with a different grade of NVIDIA GPU, to evaluate model performance under varying hardware capabilities.
Workstation 1, equipped with an NVIDIA T1000 GPU (8 GB VRAM), was used to evaluate compact models such as Mistral 7B and Llama 3.1 8B. These models were selected due to their lower VRAM requirements, making them suitable for environments with limited GPU resources. While both models demonstrated good responsiveness, inference times were relatively long, indicating that the T1000 provided only marginal acceleration over CPU-only execution. This setup effectively explored the performance boundaries of lightweight models in code extraction and analysis tasks.


Table 3 – Model Performance Statistics on Workstation 1
Workstation 2, upgraded with an NVIDIA T4000 GPU (16 GB VRAM), enabled more extensive GPU offloading and supported larger models such as CodeLlama 13B. All models tested on this workstation—Mistral, CodeLlama, and Llama 3.1—achieved significantly faster inference times and maintained very good responsiveness. The enhanced GPU capacity and system resources contributed to reduced CPU load and improved overall efficiency, especially for models with larger parameter sizes.


Table 4 – Model Performance Statistics on Workstation 2
This comparison highlights the impact of GPU capability on model performance, with Workstation 2 offering a clear advantage in speed and scalability for more demanding inference tasks.
The multi-agent architecture was evaluated on both ADN and LA3D applications, focusing on error detection in tool code, domain type, and measure point identification. To assess its effectiveness, three key metrics were used: plan reliability, extraction accuracy, and evaluation accuracy.
Since the models produce identical outputs across multiple rounds, we evaluated the results based on a single test run for each error type. The number of extraction and analysis results depends on the specific code-checking method: for error types with multiple data points (e.g., Tool Code), the scores are calculated as the average across all data points; for error types that rely on a single file containing all relevant code snippets or strings (e.g., Measure Point), the scores are based on a single extraction or analysis result.
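As a simple illustration of the scoring, the sketch below averages per-data-point correctness into the reported percentage; the function name and example values are illustrative.

# Sketch of averaging extraction accuracy over multiple data points
# (e.g., the individual tool codes checked for the Tool Code error type).
def extraction_accuracy(per_point_correct: list[bool]) -> float:
    """Percentage of data points whose extracted value matched the expected one."""
    return 100.0 * sum(per_point_correct) / len(per_point_correct)


# e.g., 3 of 4 data points extracted correctly -> 75.0
print(extraction_accuracy([True, True, False, True]))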
The following table shows the results of the tool code, domain type, and measure point checks on both the LA3D and ADN applications:


Table 5 – Experiment Results
Table 5 presents the results of all three models across each evaluation stage. Red numbers marked with double asterisks (**) indicate invalid results, while bold blue numbers highlight valid and best-performing outcomes. Both extraction and evaluation accuracy are expressed as percentages, with higher values indicating greater accuracy or reliability.
It is important to note that CodeLlama's evaluation accuracy is considered invalid in cases where a 'PASS' result was generated based on an empty extraction. This highlights a critical flaw in its evaluation logic.
As shown in Table 5, Llama 3.1 consistently outperforms both Mistral and CodeLlama across all stages. Mistral frequently generates plans that deviate from the provided instructions, resulting in unreliable extraction and analysis. Although CodeLlama occasionally reports higher evaluation accuracy, further inspection reveals that it incorrectly treats empty extractions as correct, exposing its limited instruction-following capabilities.
Regarding the three error types analyzed, all models exhibit weak performance, underscoring the limited analytical and comprehension abilities of small-scale LLMs.
Notably, all three models produce consistent and identical outputs for the same prompt, suggesting a relatively low degree of randomness in their generation processes.
Building on the initial success of leveraging LLMs to improve code quality, the following enhancements are being considered to further expand capabilities and effectiveness:
This evaluation helped us better understand how compact, locally hosted LLMs perform in real-world code quality tasks. While the multi-agent architecture provided a strong foundation for automation and flexibility, the models themselves revealed both strengths and limitations.
We saw encouraging results in areas like code extraction and responsiveness—especially when supported by more capable GPUs. At the same time, challenges such as hallucination, limited context retention, and inconsistent analysis highlighted the need for careful model selection and thoughtful system design.
Overall, this case study reinforces the idea that LLMs can play a valuable role in improving software quality, particularly when paired with a modular, instruction-driven framework. As we look ahead, enhancements like more powerful hardware and access to advanced cloud-based models such as GPT-4o offer exciting opportunities to further strengthen this approach.
The authors express their sincere gratitude to the HFE Innovation Committee for their support and for providing the opportunity to present and share this work. We also extend our thanks to Dong Wang, HFE DnM Maxwell Manager, and his team for their valuable guidance and insights throughout the project.
Weijia Yang is a Principal Engineer and Software Architect at SLB, based in Sugar Land, Texas. He has been with SLB for over 20 years, contributing to a wide range of engineering and software initiatives. His current projects focus on downhole data acquisition, wellsite operation optimization, and automation.
Jiaju Chen is a master’s student in Computer Science at Rice University. He joined SLB as a Software Engineer Intern in the summer of 2025. His research centers on human-centered AI design for NLP applications in domain-specific scenarios, including personalized LLM-powered systems for children’s education and datasets for evaluating LLMs’ ability to generate educationally appropriate content.
Carlos Estevez is a Senior IIoT Hardware Engineer leading the global adoption of high-performance Edge computers for SLB’s Well Construction, Reservoir Performance, Digital & Integration, and advanced gateways for Production Services. With 20+ years at SLB, including 6 years in field operations, Carlos has led roles spanning IT, connectivity, cybersecurity, regulatory compliance, and product integration for safety, reliability, and service optimization.
[1] Alam, I., Li, T., Brock, S., & Gupta, P. (2022). DRDebug: Automated Design Rule Debugging. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42(2), 606-615.
[2] Lee, J., Kang, B., & Im, E. G. (2013). Rule-based anti-anti-debugging system. In Proceedings of the 2013 Research in Adaptive and Convergent Systems (pp. 353-354).
[3] Tian, R., Ye, Y., Qin, Y., Cong, X., Lin, Y., Pan, Y., ... & Sun, M. (2024). Debugbench: Evaluating debugging capability of large language models. arXiv preprint arXiv:2401.04621.
[4] Pădurean, V. A., Denny, P., & Singla, A. (2025, February). BugSpotter: Automated Generation of Code Debugging Exercises. In Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1 (pp. 896-902).
[5] Cordeiro, J., Noei, S., & Zou, Y. (2025, May). LLM-Driven Code Refactoring: Opportunities and Limitations. In 2025 IEEE/ACM Second IDE Workshop (IDE) (pp. 32-36). IEEE.
[6] Epperson, W., Bansal, G., Dibia, V. C., Fourney, A., Gerrits, J., Zhu, E., & Amershi, S. (2025, April). Interactive debugging and steering of multi-agent ai systems. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (pp. 1-15).
[7] Islam, M. A., Ali, M. E., & Parvez, M. R. (2025). Codesim: Multi-agent code generation and problem solving through simulation-driven planning and debugging. arXiv preprint arXiv:2502.05664.
[8] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., ... & Liu, T. (2025). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2), 1-55.
[9] Luo, Y., Yang, Z., Meng, F., Li, Y., Zhou, J., & Zhang, Y. (2023). An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747.
[10] Zhao, Z., Monti, E., Lehmann, J., & Assem, H. (2024). Enhancing contextual understanding in large language models through contrastive decoding. arXiv preprint arXiv:2405.02750.
Disclaimer: The authors are completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE’s position nor that of the Computer Society nor its Leadership.