A step-by-step guide on how to use the LLM + GNN duo to bring structure to data lakes.
A data lake is a centralized storage repository used to store raw data from multiple sources in structured, semi-structured and unstructured formats.
Imagine a data lake full of CSVs, JSON files, logs and spreadsheets that nobody even remembers, or knows how they relate to one another. How would you even begin to query across it to get meaningful and useful data? This is where schema discovery comes to the rescue.
A schema is the description, or in other words the blueprint, of how data is structured and organized. Schema discovery is the process of identifying schemas and registering them.
A Large Language Model (LLM) is an AI model that can understand and process human language and generate human-like text. A Graph Neural Network (GNN) is a special type of neural network designed to work with data organized in the form of a graph.
This article presents the LLM + GNN duo as a solution and shows how to move from chaos to structure. We'll start by looking at the characteristics of a chaotic data lake and the problems it causes. Then we look into why schema discovery is needed, and finally we go into depth and discuss how to use LLMs and GNNs to get meaningful structure out of chaotic data lakes.
A chaotic data lake is one that is filled with disorganized, low-quality and unmanaged data. Chaotic data lakes make it difficult for data engineers and scientists to access, analyze and use the data effectively.

Any data analysis relies on a clean schema. For analysis to be done, a common language is required across the different files; queries cannot be run across files that have not been reconciled. Schema discovery is like a translator that makes the files 'speak' the same language.
If the data is to be used for machine learning, the models will not consume raw data; they need data that is structured, cleaned and processed. An inconsistent schema can lead to missing critical features or introducing errors into the data. In this case, schema discovery is the foundation on which feature engineering is built.
Schema discovery is not just a housekeeping step — it’s the foundation. Without it, analytics are fragmented and AI is blind.

LLMs have the capability to suggest schema alignments, normalize field names and discover potential relationships. LLMs do this by generating embeddings.
An embedding is a numerical representation of data in the form of a vector, or an array of numbers. Embeddings usually capture the semantic meaning of the data and show how similar one piece of data is to another.
The downside is that LLMs are prone to hallucinations, are inconsistent across files and don't enforce a global structure. This is where GNNs come into play: GNNs enforce structural consistency.
So the LLMs interpret the meaning in messy headers and GNNs organize the structure.
GNNs deal with graphs. Graphs consist of nodes and edges. Nodes are the individual data points or attributes.

In our case, the nodes can be column names, data types or even the entire schema elements. Edges represent the relationships and similarity between node embeddings.
GNNs take in a graph as input and learn the patterns and relations between the connected attributes in the graph.
So how do GNNs enforce structural consistency and complement the LLM?
Consider the most naive form of schema discovery: pairwise similarity, which answers the question 'Is column A similar to column B?'. With pairwise similarity alone, cust_id, customerID and CID may never be clustered together as one group; instead, two overlapping pairs can be formed, cust_id = customerID and customerID = CID, without cust_id ever being linked to CID. This makes the graph inconsistent.
GNNs fix this by propagating information over the graph: each node aggregates the features of its neighbors, so columns that are only indirectly linked (such as cust_id and CID through customerID) end up with similar embeddings and fall into one consistent cluster.


The LLM and the GNN complement each other: the LLM acts as the semantic proposer and the GNN as the structural validator.
1. Install the Dependencies
!pip install openai torch torch_geometric pandas scikit-learn numpy
An OpenAI API key is needed to access the LLM.
2. Import Libraries
We will start by importing the libraries that we will use.
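The original import cell isn't shown here; a minimal sketch covering everything used in the steps below (assuming the openai v1+ Python client) could look like this:

import os
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
import matplotlib.pyplot as plt
from openai import OpenAI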


3. The Data
We will use the following sample datasets as our input.
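The article's actual sample files aren't reproduced here, so the DataFrames below are hypothetical stand-ins built around the column names that appear later (CustomerID, CID, UserID, cust_id, c_num, product_id):

# Illustrative stand-ins for the original sample files.
# Each DataFrame plays the role of one messy file in the data lake.
crm = pd.DataFrame({
    "CustomerID": [101, 102, 103],
    "FullName": ["Alice K.", "Brian M.", "Carol W."],
})
billing = pd.DataFrame({
    "CID": [101, 102, 104],
    "amount": [45.0, 13.5, 99.9],
})
web_logs = pd.DataFrame({
    "UserID": [102, 103, 105],
    "page": ["/home", "/cart", "/checkout"],
})
orders = pd.DataFrame({
    "cust_id": [101, 104, 105],
    "product_id": ["P-10", "P-22", "P-31"],
})
support = pd.DataFrame({
    "c_num": [103, 104, 106],
    "ticket_status": ["open", "closed", "open"],
})

datasets = {"crm": crm, "billing": billing, "web_logs": web_logs,
            "orders": orders, "support": support}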


4. Setting up an OpenAI client
This creates a connection to the OpenAI service; the API key authorizes the requests.
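A minimal sketch of the client setup, assuming the key is stored in the OPENAI_API_KEY environment variable:

# Read the key from an environment variable rather than hard-coding it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])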


5. Extracting Schema Info
We will extract the information we need from the data: the column names, sample values, column type, unique count and total count.
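One way to collect this metadata is sketched below; the helper name extract_schema_info and the datasets dict from step 3 are illustrative, not the original code:

def extract_schema_info(dataset_name, df):
    """Collect per-column metadata: name, sample values, dtype, unique count, total count."""
    info = []
    for col in df.columns:
        info.append({
            "dataset": dataset_name,
            "column": col,
            "sample_values": df[col].dropna().astype(str).head(5).tolist(),
            "dtype": str(df[col].dtype),
            "unique_count": int(df[col].nunique()),
            "total_count": int(len(df[col])),
        })
    return info

schema_info = []
for name, df in datasets.items():
    schema_info.extend(extract_schema_info(name, df))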


6. Prepare the data for Embedding
We will now take the schema information we gathered and turn it into text strings, since the embedding model only takes text as input. We will also get the list of all the column names.
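A sketch of that preparation step, assuming the schema_info list from step 5:

schema_texts = []    # one descriptive string per column, fed to the embedding model
column_names = []    # node labels for the graph
for info in schema_info:
    text = (f"column name: {info['column']}; type: {info['dtype']}; "
            f"sample values: {', '.join(info['sample_values'])}; "
            f"unique: {info['unique_count']} of {info['total_count']}")
    schema_texts.append(text)
    column_names.append(info["column"])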


7. Extract the Embedding
To extract the embeddings we are going to use "text-embedding-3-small", an OpenAI model specialized for generating embeddings.
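A sketch of the embedding call using the client from step 4; text-embedding-3-small returns a 1536-dimensional vector per input string:

response = client.embeddings.create(model="text-embedding-3-small", input=schema_texts)
embeddings = np.array([item.embedding for item in response.data])
print(embeddings.shape)  # (number of columns, 1536)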


8. Create the graph edges
We now have the embeddings that we can use to create the graph edges. We will create the edges based on the cosine similarity of the embeddings. Cosine similarity is a way of measuring similarity between two non-zero vectors.
We will use columns as nodes.
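A sketch of the edge construction; the 0.5 similarity threshold is an assumed value to tune for your own data:

SIMILARITY_THRESHOLD = 0.5  # assumed cut-off; tune for your data

sim_matrix = cosine_similarity(embeddings)
edges = []
for i in range(len(column_names)):
    for j in range(i + 1, len(column_names)):
        if sim_matrix[i, j] >= SIMILARITY_THRESHOLD:
            # add both directions so the graph is undirected for torch_geometric
            edges.append((i, j))
            edges.append((j, i))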


9. Build the graph data
Now that we have the nodes and edges, we will build the graph. We will use Data() from torch_geometric to build the graph.
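A sketch of building the torch_geometric graph from the node features and edge list:

x = torch.tensor(embeddings, dtype=torch.float)                      # node features = LLM embeddings
edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()  # shape [2, num_edges]
data = Data(x=x, edge_index=edge_index)
print(data)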


10. Plot the graph
Then we plot the graph using networkx and matplotlib.
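A sketch of the plot; the layout and styling choices here are illustrative:

G = nx.Graph()
G.add_nodes_from(range(len(column_names)))
G.add_edges_from([(i, j) for i, j in edges if i < j])  # one direction is enough for drawing

plt.figure(figsize=(8, 6))
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True,
        labels={i: name for i, name in enumerate(column_names)},
        node_color="lightblue", node_size=1200, font_size=8)
plt.title("Column similarity graph from LLM embeddings")
plt.show()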


Result
From the result we can see that the LLM has found a number of connected columns. We will now pass the graph data through a GNN so that it can validate these connections.


11. Create GNN
We will then define a GCN (Graph Convolutional Network) with two layers. Each layer looks at a node's features plus its neighbors' features (based on the graph edges).
In the forward function, the first convolution layer takes the input and passes its output to the ReLU activation, which keeps positive values and adds non-linearity. The output of ReLU is passed to the second convolution layer, which produces the final embeddings.
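A sketch of such a two-layer GCN; the class name and layer sizes are illustrative choices, not the original code:

class SchemaGCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels=64, out_channels=16):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)   # mix each node's features with its neighbours'
        x = F.relu(x)                   # keep positive signals, add non-linearity
        x = self.conv2(x, edge_index)   # refine into the final node embeddings
        return x

model = SchemaGCN(in_channels=data.num_node_features)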


The first layer spreads some information, the ReLU decides what’s useful to keep, and the second layer refines the result into the final node embeddings.
12. Unsupervised Training (Reconstruction Loss)


The input to the model is the node feature vectors (data.x) from the LLM embeddings.
The output is the new embeddings (z) that the GNN generates.
Computing similarities
We compute predicted similarities and true similarities:
Predicted similarity = sigmoid(z · zᵀ) (dot products between node embeddings).
True similarity = cosine similarity of the original features.
Loss Function
For the loss function, we use Mean Squared Error (MSE) between the predicted and true similarities.
Optimization
Optimization is the process by which the model adjusts its parameters to minimize the loss function. For the optimization, we use the Adam optimizer over several epochs (training iterations).
Backward pass (loss.backward())
The error from the output layer is propagated backward to the input layer to calculate the gradient of the loss function. The optimizer uses the gradient to update the weights.
We now train the model.
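A sketch of the training loop described above; the learning rate and number of epochs are assumed values:

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# True similarity: cosine similarity of the original LLM features.
x_norm = F.normalize(data.x, dim=1)
true_sim = x_norm @ x_norm.t()

model.train()
for epoch in range(100):
    optimizer.zero_grad()
    z = model(data.x, data.edge_index)      # new GNN embeddings
    pred_sim = torch.sigmoid(z @ z.t())     # predicted similarity = sigmoid(z · zᵀ)
    loss = F.mse_loss(pred_sim, true_sim)   # reconstruction loss
    loss.backward()                         # backward pass: compute gradients
    optimizer.step()                        # Adam updates the weights
    if epoch % 10 == 0:
        print(f"epoch {epoch:3d}  loss {loss.item():.4f}")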


We can see that the loss becomes smaller and smaller as we iterate. This means that the model is learning well.
13. Plot GNN output
Finally we have the embeddings from the GNN. We will plot this and then compare the results with the LLM results.
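One way to plot it is to rebuild a similarity graph from the GNN embeddings, mirroring step 10; the 0.9 threshold is an assumed value, since the refined embeddings tend to be more tightly clustered:

model.eval()
with torch.no_grad():
    z = model(data.x, data.edge_index).numpy()

gnn_sim = cosine_similarity(z)
G2 = nx.Graph()
G2.add_nodes_from(range(len(column_names)))
for i in range(len(column_names)):
    for j in range(i + 1, len(column_names)):
        if gnn_sim[i, j] >= 0.9:   # assumed threshold
            G2.add_edge(i, j)

plt.figure(figsize=(8, 6))
pos = nx.spring_layout(G2, seed=42)
nx.draw(G2, pos, with_labels=True,
        labels={i: name for i, name in enumerate(column_names)},
        node_color="lightgreen", node_size=1200, font_size=8)
plt.title("Column clusters after GNN refinement")
plt.show()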


Results


We can see that the GNN has managed to find some connections missed by the LLM. For example, the LLM clustered CustomerID, CID, UserID and cust_id together but failed to place c_num in this cluster. The GNN managed to put all of these into one cluster despite the column names and values being different. However, both misclassified product_id, which should belong in its own cluster rather than being grouped with the customer identifiers.
Challenges & Open Questions
As the cluster results show, there are columns missed by both the LLM and the GNN that should have been clustered together, and some that were clustered together but should not have been. This means humans are still needed in the loop to validate and give feedback. Human experts can provide critical feedback, correcting mis-clustered attributes and confirming relationships that automated systems misinterpret.
Human-in-the-loop validation ensures that the discovered schema reflects real business meaning rather than just statistical similarity.
On the brighter side, there are ways the process can be optimized to perform better. For the LLM, we could try other embedding models. For the GNN, we could use a deeper network (more layers). For both, instead of relying only on embeddings, we can use the value type (numeric/string/date), column statistics (mean, variance, % nulls) and regex/pattern signals (e.g., looks like an email, phone number or ID) to get richer node features beyond semantics.
In this article we walked through the LLM + GNN pipeline. We started off with messy data, then got the LLM to extract candidate embeddings. We then built a graph (fields as nodes, similarity edges) and ran it through a GNN to infer clusters/alignments. This highlights how the LLM and GNN complement each other: the LLM as the semantic proposer and the GNN as the structural validator.
The combination of LLMs and GNNs has the potential to transform data engineering and analytics at scale. As organizations are faced with the challenge of ever-growing volumes of semi-structured and unstructured data, these techniques can automate schema discovery, reduce manual data cleaning, and enable more reliable integration across diverse sources.
Naveen Kolli is a technology leader with over 18 years of experience in AI/ML, cloud, and digital transformation. He has a proven record of delivering impactful enterprise solutions and actively contributes to the tech community through IEEE engagements, mentorship, and academic collaborations. Passionate about bridging academia and industry, he focuses on advancing AI-driven talent development, ethical innovation, and scalable community tech initiatives.
Disclaimer: The authors are completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE’s position nor that of the Computer Society nor its Leadership.