How to correctly perform link prediction inference on a new, unseen graph?

Question

I'm working on an industrial AI use case where I train a Graph Neural Network (a GCN) for link prediction: specifically, to predict successor tasks in project planning graphs (e.g., for construction or maintenance workflows).

For example, the input looks like this (the head of a file of 60 tasks):

ID activity	Name of activity	Equipment Type	Duration
J2M BALLON 001.C1.10	¤¤ TASKS TO BE PERFORMED BEFORE SHUTDOWN ¤¤	TANK	0
J2M BALLON 001.C1.20	Scaffolding setup	TANK	8
J2M BALLON 001.C1.30	Scaffolding inspection	TANK	2
J2M BALLON 001.C1.40	Complete insulation removal	TANK	0

And we want the AI to return this:

Source ID	Source Name	Predicted Successor
J2M BALLON 001.C1.10	¤¤ TASKS TO BE PERFORMED BEFORE SHUTDOWN ¤¤	Scaffolding setup
J2M BALLON 001.C1.20	Scaffolding setup	Scaffolding inspection
J2M BALLON 001.C1.30	Scaffolding inspection	Complete insulation removal;Measurement pit creation
J2M BALLON 001.C1.40	Complete insulation removal	¤¤ TASKS TO BE PERFORMED DURING SHUTDOWN ¤¤

The dataset has the following fields: activity ID, activity name, equipment type, duration, successor ID. I converted it to a GraphML file with nodes = tasks and edges = dependencies.

I have trained a model (GCN) on a large graph with thousands of task nodes and dependencies, using PyTorch Geometric. (I only trained it on one type of equipment to see whether it worked before training it on the full dataset.) Everything works fine on the training/validation/test splits (0.90 AUC and 0.81 accuracy).

Now I want to use this trained model to perform inference on a new graph (i.e., a new set of tasks from a different project, stored in an Excel file) that was not seen during training. I tried doing it with a simple edge_index (all the tasks connected in order of their ID), and the predicted successors were completely wrong and repetitive!

My key question is:

How can I use my trained GNN model to predict links (i → j) between tasks in a new, unseen graph that has its own nodes and no edges defined yet?

Here are a few sub-points I'm unsure about:

Should I generate a dummy edge_index to allow the GCN to work on the new data?
Is it better to train a GraphSAGE (which is inductive) or fall back to a pure MLP encoder?
How do I ensure consistent feature encoding (TF-IDF, OneHot, StandardScaler, etc.) between training and inference?
Should I re-encode task names using SentenceTransformer instead of TF-IDF to capture semantic similarity better?
Are there any other models that would be better for my project?

Any guidance or real-world experience on inference with GNNs on new data would be helpful.

Valentas · Answer 1 · 2025-04-26T08:15:22.250

Based on your question and comments, this looks like a very interesting and challenging problem. It may take quite a bit of time to understand what is possible, let alone to solve it.

It seems that some methods for generating graph node/edge embeddings (GraphSAGE, or newer attention- or transformer-based ones) can work on unseen graphs. But in my opinion the main challenge in your case is that the task is quite specific and requires a very powerful model: the model must not only parse the names of activities, but also have enough technical knowledge about these activities to suggest the dependencies correctly.
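Whatever encoder produces the embeddings, inference on a graph with no edges comes down to scoring candidate ordered pairs (i, j) and keeping the best ones per source task. Here is a minimal sketch of that step, assuming each task already has a "source role" vector and a "target role" vector, so that the score is not symmetric in i and j. The vectors, the `predict_successors` name, the threshold, and `top_k` are all made up for illustration; they do not come from your model.

```python
from math import exp


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def predict_successors(src_emb, dst_emb, threshold=0.5, top_k=2):
    """Score every ordered pair (i, j), i != j, and keep the best targets.

    src_emb[i] is the vector of task i acting as a predecessor,
    dst_emb[j] is the vector of task j acting as a successor.
    Using two roles keeps score(i, j) different from score(j, i).
    """
    predictions = {}
    for i, u in src_emb.items():
        scored = []
        for j, v in dst_emb.items():
            if i == j:
                continue  # a task is never its own successor
            p = sigmoid(sum(a * b for a, b in zip(u, v)))
            if p >= threshold:
                scored.append((p, j))
        scored.sort(reverse=True)  # highest probability first
        predictions[i] = [j for _, j in scored[:top_k]]
    return predictions


# Toy, hand-made vectors (hypothetical, not the output of a trained encoder).
src_emb = {"C1.10": [2.0, 0.0], "C1.20": [0.0, 2.0], "C1.30": [1.0, 1.0]}
dst_emb = {"C1.10": [-1.0, -1.0], "C1.20": [1.5, -2.0], "C1.30": [-2.0, 1.5]}

successors = predict_successors(src_emb, dst_emb)
print(successors)
```

With 60 tasks this is only 60 × 59 pairs, so exhaustive scoring is cheap. A plain dot product of a single embedding per node would give the same score in both directions, which is probably not what you want for successor prediction.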

If the best LLMs you have access to can solve specific instances of your problem, then they can probably be used. Most of them have paid APIs for developers or are available in other paid products. I think there are various forms of input that could be experimented with, from providing the entire set of activities to providing inputs of the form (Equipment Type, Activity 1, Activity 2) and asking the model to determine the dependencies. There are also options for paid fine-tuning of these models. Once the dependencies (edges) are known, the final successor list can be obtained using simple graph theory algorithms; see, for example, transitive reduction.
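To illustrate the last point, here is a small sketch of transitive reduction in plain Python, assuming the predicted dependencies form a DAG (no cycles). It drops every edge (u, v) for which some other path from u to v exists, leaving only direct successors. The function name and the toy edge list are mine.

```python
def transitive_reduction(edges):
    """Return the edges of a DAG that are not implied by a longer path."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())

    def reachable(start, target, skip_edge):
        # Depth-first search from start to target that ignores skip_edge.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in succ[node]:
                if (node, nxt) == skip_edge or nxt in seen:
                    continue
                if nxt == target:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return sorted(
        (u, v) for u, v in set(edges) if not reachable(u, v, (u, v))
    )


# Toy example: the third edge is implied by the first two.
edges = [
    ("Scaffolding setup", "Scaffolding inspection"),
    ("Scaffolding inspection", "Complete insulation removal"),
    ("Scaffolding setup", "Complete insulation removal"),
]
reduced = transitive_reduction(edges)
print(reduced)
```

If the predicted edges can contain cycles, they would have to be broken first (for example by dropping the lowest-scoring edge in each cycle), since the reduction of a cyclic graph is not unique.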

If for some reason you can't use these, then you might try to fine-tune your own Transformer models, like T5 (for example, scripts like this one can be applied quite generally), but they will only work if the task is relatively simple and your dataset represents most of the possible transitions very well.

Also, the problem sounds vaguely like something that could be solved using Reinforcement Learning, but maybe someone else can comment on this.
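For a text-to-text model, each task would have to be serialized into an input string and a target string. One possible layout is sketched below, using rows from your example. The field order, the separators, and the `build_examples` helper are assumptions on my part, not a format that T5 or any particular fine-tuning script requires.

```python
def build_examples(tasks, successors):
    """Turn task rows into (input_text, target_text) pairs.

    tasks: list of dicts with keys id, name, equipment, duration.
    successors: dict mapping a task id to a list of successor ids.
    """
    names = {t["id"]: t["name"] for t in tasks}
    examples = []
    for t in tasks:
        source = (
            f"equipment: {t['equipment']} | task: {t['name']} "
            f"| duration: {t['duration']}"
        )
        # Several successors are joined with "; "; tasks without one get "none".
        target = "; ".join(names[s] for s in successors.get(t["id"], [])) or "none"
        examples.append((source, target))
    return examples


tasks = [
    {"id": "C1.20", "name": "Scaffolding setup", "equipment": "TANK", "duration": 8},
    {"id": "C1.30", "name": "Scaffolding inspection", "equipment": "TANK", "duration": 2},
    {"id": "C1.40", "name": "Complete insulation removal", "equipment": "TANK", "duration": 0},
]
successors = {"C1.20": ["C1.30"], "C1.30": ["C1.40"]}

examples = build_examples(tasks, successors)
for source, target in examples:
    print(source, "->", target)
```

Whether the model sees one task at a time, as here, or the whole task list of a project in a single input is something you would have to experiment with.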
