A step-by-step guide on how to use the LLM + GNN duo to bring structure to data lakes.
A data lake is a centralized storage repository used to store raw data from multiple sources in structured, semi-structured and unstructured formats.
Imagine a data lake full of CSVs, JSON files, logs and spreadsheets that nobody even remembers, or knows how they relate to one another. How would you even begin to query across it to get meaningful and useful data? This is where schema discovery comes to the rescue.
A schema is the description, or in other words the blueprint, of how data is structured and organized. Schema discovery is the process of identifying schemas and registering them.
A Large Language Model (LLM) is an AI model that can understand and process human language and generate human-like text. A Graph Neural Network (GNN) is a special type of neural network designed to work with data organized in the form of a graph.
This article presents the LLM + GNN duo as a solution and shows how to move from chaos to structure. We'll start by looking at the characteristics of a chaotic data lake and the problems it causes. Then we look into why schema discovery is needed, and finally we go into depth and discuss how to use LLMs and GNNs to get meaningful structure out of chaotic data lakes.
A chaotic data lake is one that is filled with disorganized, low-quality and unmanaged data. Chaotic data lakes make it difficult for data engineers and scientists to access, analyze and use the data effectively.

Any data analysis relies on a clean schema. For analysis to be done, a common language is required across the different files; queries cannot be run across files that have not been reconciled. Schema discovery is like a translator that makes the files 'speak' the same language.
If the data is to be used for machine learning, the models will not consume raw data; they need data that is structured, cleaned and processed. An inconsistent schema can lead to missing critical features or introducing errors into the data. In this case, schema discovery is the foundation on which feature engineering is built.
Schema discovery is not just a housekeeping step — it’s the foundation. Without it, analytics are fragmented and AI is blind.

LLMs have the capability to suggest schema alignments, normalize field names and discover potential relationships. LLMs do this by generating embeddings.
An embedding is a numerical representation of data in the form of a vector, or an array of numbers. Embeddings usually capture the semantic meaning of the data and show how similar one piece of data is to another.
The downside is that LLMs are prone to hallucinations, are inconsistent across files and don't enforce a global structure. This is where GNNs come into play: GNNs enforce structural consistency.
So the LLMs interpret the meaning in messy headers and GNNs organize the structure.
GNNs deal with graphs. Graphs consist of nodes and edges. Nodes are the individual data points or attributes.

In our case, the nodes can be column names, data types or even the entire schema elements. Edges represent the relationships and similarity between node embeddings.
GNNs take in a graph as input and learn the patterns and relations between the connected attributes in the graph.
So how do GNNs enforce structural consistency and complement the LLM?
Consider the most naive form of schema discovery: pairwise similarity, which answers the question 'Is column A similar to column B?'. With pairwise similarity alone, cust_id, customerID and CID may never be clustered together as one group; instead, two overlapping pairs can be formed, cust_id = customerID and customerID = CID, without cust_id ever being linked to CID. This makes the graph inconsistent.
GNNs fix this by propagating information over the graph: each node aggregates the features of its neighbors, so columns that are only indirectly linked (such as cust_id and CID through customerID) end up with similar embeddings and fall into one consistent cluster.


The LLM and the GNN complement each other: the LLM acts as the semantic proposer and the GNN as the structural validator.
1. Install the Dependencies
!pip install openai torch torch_geometric pandas scikit-learn numpy
An OpenAI API key is needed to access the LLM.
2. Import Libraries
We will start by importing the libraries that we will use.
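The original import cell isn't shown here; a minimal sketch covering everything used in the steps below (assuming the openai v1+ Python client) could look like this:

import os
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
import matplotlib.pyplot as plt
from openai import OpenAI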


3. The Data
We will use the following sample datasets as our input.
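The article's actual sample files aren't reproduced here, so the DataFrames below are hypothetical stand-ins built around the column names that appear later (CustomerID, CID, UserID, cust_id, c_num, product_id):

# Illustrative stand-ins for the original sample files.
# Each DataFrame plays the role of one messy file in the data lake.
crm = pd.DataFrame({
    "CustomerID": [101, 102, 103],
    "FullName": ["Alice K.", "Brian M.", "Carol W."],
})
billing = pd.DataFrame({
    "CID": [101, 102, 104],
    "amount": [45.0, 13.5, 99.9],
})
web_logs = pd.DataFrame({
    "UserID": [102, 103, 105],
    "page": ["/home", "/cart", "/checkout"],
})
orders = pd.DataFrame({
    "cust_id": [101, 104, 105],
    "product_id": ["P-10", "P-22", "P-31"],
})
support = pd.DataFrame({
    "c_num": [103, 104, 106],
    "ticket_status": ["open", "closed", "open"],
})

datasets = {"crm": crm, "billing": billing, "web_logs": web_logs,
            "orders": orders, "support": support}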


4. Setting up an OpenAI client
This creates a connection to the OpenAI service; the API key authorizes the requests.
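A minimal sketch of the client setup, assuming the key is stored in the OPENAI_API_KEY environment variable:

# Read the key from an environment variable rather than hard-coding it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])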


5. Extracting Schema Info
We will extract the information we need from the data: the column names, sample values, column type, unique count and total count.
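One way to collect this metadata is sketched below; the helper name extract_schema_info and the datasets dict from step 3 are illustrative, not the original code:

def extract_schema_info(dataset_name, df):
    """Collect per-column metadata: name, sample values, dtype, unique count, total count."""
    info = []
    for col in df.columns:
        info.append({
            "dataset": dataset_name,
            "column": col,
            "sample_values": df[col].dropna().astype(str).head(5).tolist(),
            "dtype": str(df[col].dtype),
            "unique_count": int(df[col].nunique()),
            "total_count": int(len(df[col])),
        })
    return info

schema_info = []
for name, df in datasets.items():
    schema_info.extend(extract_schema_info(name, df))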


6. Prepare the data for Embedding
We will now take the schema information we gathered and turn it into text strings, since the embedding model only takes text as input. We will also get the list of all the column names.
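A sketch of that preparation step, assuming the schema_info list from step 5:

schema_texts = []    # one descriptive string per column, fed to the embedding model
column_names = []    # node labels for the graph
for info in schema_info:
    text = (f"column name: {info['column']}; type: {info['dtype']}; "
            f"sample values: {', '.join(info['sample_values'])}; "
            f"unique: {info['unique_count']} of {info['total_count']}")
    schema_texts.append(text)
    column_names.append(info["column"])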


7. Extract the Embedding
To extract the embeddings we are going to use "text-embedding-3-small", an OpenAI model specialized for generating embeddings.
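A sketch of the embedding call using the client from step 4; text-embedding-3-small returns a 1536-dimensional vector per input string:

response = client.embeddings.create(model="text-embedding-3-small", input=schema_texts)
embeddings = np.array([item.embedding for item in response.data])
print(embeddings.shape)  # (number of columns, 1536)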


8. Create the graph edges
We now have the embeddings that we can use to create the graph edges. We will create the edges based on the cosine similarity of the embeddings. Cosine similarity is a way of measuring similarity between two non-zero vectors.
We will use columns as nodes.
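A sketch of the edge construction; the 0.5 similarity threshold is an assumed value to tune for your own data:

SIMILARITY_THRESHOLD = 0.5  # assumed cut-off; tune for your data

sim_matrix = cosine_similarity(embeddings)
edges = []
for i in range(len(column_names)):
    for j in range(i + 1, len(column_names)):
        if sim_matrix[i, j] >= SIMILARITY_THRESHOLD:
            # add both directions so the graph is undirected for torch_geometric
            edges.append((i, j))
            edges.append((j, i))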


9. Build the graph data
Now that we have the nodes and edges, we will build the graph. We will use Data() from torch_geometric to build the graph.
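A sketch of building the torch_geometric graph from the node features and edge list:

x = torch.tensor(embeddings, dtype=torch.float)                      # node features = LLM embeddings
edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()  # shape [2, num_edges]
data = Data(x=x, edge_index=edge_index)
print(data)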


10. Plot the graph
Then we plot the graph using networkx and matplotlib.
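A sketch of the plot; the layout and styling choices here are illustrative:

G = nx.Graph()
G.add_nodes_from(range(len(column_names)))
G.add_edges_from([(i, j) for i, j in edges if i < j])  # one direction is enough for drawing

plt.figure(figsize=(8, 6))
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True,
        labels={i: name for i, name in enumerate(column_names)},
        node_color="lightblue", node_size=1200, font_size=8)
plt.title("Column similarity graph from LLM embeddings")
plt.show()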


Result
From the result we can see that the LLM has found a number of connected columns. We will now pass the graph data through a GNN so that it can validate these connections.


11. Create GNN
We will then define a GCN (Graph Convolutional Network) with two layers. Each layer looks at a node's features plus its neighbors' features (based on the graph edges).
In the forward function, the first convolution layer takes the input and passes its output to the ReLU activation, which keeps positive values and adds non-linearity. The output of ReLU is passed to the second convolution layer, which produces the final embeddings.
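A sketch of such a two-layer GCN; the class name and layer sizes are illustrative choices, not the original code:

class SchemaGCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels=64, out_channels=16):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)   # mix each node's features with its neighbours'
        x = F.relu(x)                   # keep positive signals, add non-linearity
        x = self.conv2(x, edge_index)   # refine into the final node embeddings
        return x

model = SchemaGCN(in_channels=data.num_node_features)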


The first layer spreads some information, the ReLU decides what’s useful to keep, and the second layer refines the result into the final node embeddings.
12. Unsupervised Training (Reconstruction Loss)


The input to the model is the node feature vectors (data.x) from the LLM embeddings.
The output is the new embeddings (z) that the GNN generates.
Computing similarities
We compute predicted similarities and true similarities:
Predicted similarity = sigmoid(z · zᵀ) (dot products between node embeddings).
True similarity = cosine similarity of the original features.
Loss Function
For the loss function, we use Mean Squared Error (MSE) between the predicted and true similarities.
Optimization
Optimization is the process by which the model adjusts its parameters to minimize the loss function. For the optimization, we use the Adam optimizer over several epochs (training iterations).
Backward pass (loss.backward())
The error from the output layer is propagated backward to the input layer to calculate the gradient of the loss function. The optimizer uses the gradient to update the weights.
We now train the model.
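A sketch of the training loop described above; the learning rate and number of epochs are assumed values:

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# True similarity: cosine similarity of the original LLM features.
x_norm = F.normalize(data.x, dim=1)
true_sim = x_norm @ x_norm.t()

model.train()
for epoch in range(100):
    optimizer.zero_grad()
    z = model(data.x, data.edge_index)      # new GNN embeddings
    pred_sim = torch.sigmoid(z @ z.t())     # predicted similarity = sigmoid(z · zᵀ)
    loss = F.mse_loss(pred_sim, true_sim)   # reconstruction loss
    loss.backward()                         # backward pass: compute gradients
    optimizer.step()                        # Adam updates the weights
    if epoch % 10 == 0:
        print(f"epoch {epoch:3d}  loss {loss.item():.4f}")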


We can see that the loss becomes smaller and smaller as we iterate. This means that the model is learning well.
13. Plot GNN output
Finally we have the embeddings from the GNN. We will plot this and then compare the results with the LLM results.
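One way to plot it is to rebuild a similarity graph from the GNN embeddings, mirroring step 10; the 0.9 threshold is an assumed value, since the refined embeddings tend to be more tightly clustered:

model.eval()
with torch.no_grad():
    z = model(data.x, data.edge_index).numpy()

gnn_sim = cosine_similarity(z)
G2 = nx.Graph()
G2.add_nodes_from(range(len(column_names)))
for i in range(len(column_names)):
    for j in range(i + 1, len(column_names)):
        if gnn_sim[i, j] >= 0.9:   # assumed threshold
            G2.add_edge(i, j)

plt.figure(figsize=(8, 6))
pos = nx.spring_layout(G2, seed=42)
nx.draw(G2, pos, with_labels=True,
        labels={i: name for i, name in enumerate(column_names)},
        node_color="lightgreen", node_size=1200, font_size=8)
plt.title("Column clusters after GNN refinement")
plt.show()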


Results


We can see that the GNN has managed to find some connections missed by the LLM. For example, the LLM clustered CustomerID, CID, UserID and cust_id together but failed to place c_num in this cluster. The GNN managed to put all of these into one cluster despite the column names and values being different. However, both misclassified product_id, which should belong in its own cluster rather than being grouped with the customer identifiers.
Challenges & Open Questions
As the cluster results show, there are columns missed by both the LLM and the GNN that should have been clustered together, and some that were clustered together but should not have been. This means humans are still needed in the loop to validate and give feedback. Human experts can provide critical feedback, correcting mis-clustered attributes and confirming relationships that automated systems misinterpret.
Human-in-the-loop validation ensures that the discovered schema reflects real business meaning rather than just statistical similarity.
On the brighter side, there are ways the process can be optimized to perform better. For the LLM, we could try other embedding models. For the GNN, we could use a deeper network (more layers). For both, instead of relying only on embeddings, we can use the value type (numeric/string/date), column statistics (mean, variance, % nulls) and regex/pattern signals (e.g., looks like an email, phone number or ID) to get richer node features beyond semantics.
In this article we walked through the LLM + GNN pipeline. We started off with messy data, then got the LLM to extract candidate embeddings. We then built a graph (fields as nodes, similarity edges) and ran it through a GNN to infer clusters/alignments. This highlights how the LLM and GNN complement each other: the LLM as the semantic proposer and the GNN as the structural validator.
The combination of LLMs and GNNs has the potential to transform data engineering and analytics at scale. As organizations are faced with the challenge of ever-growing volumes of semi-structured and unstructured data, these techniques can automate schema discovery, reduce manual data cleaning, and enable more reliable integration across diverse sources.
Naveen Kolli is a technology leader with over 18 years of experience in AI/ML, cloud, and digital transformation. He has a proven record of delivering impactful enterprise solutions and actively contributes to the tech community through IEEE engagements, mentorship, and academic collaborations. Passionate about bridging academia and industry, he focuses on advancing AI-driven talent development, ethical innovation, and scalable community tech initiatives.
Disclaimer: The authors are completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE’s position nor that of the Computer Society nor its Leadership.