I'm working on an industrial AI use case where I train a graph neural network (a GCN) for link prediction: specifically, to predict successor tasks in project planning graphs (e.g., for construction or maintenance workflows).
For example, the input looks like this (the head of a file of 60 tasks):
| ID activity | Name of activity | Equipment Type | Duration |
|---|---|---|---|
| J2M BALLON 001.C1.10 | ¤¤ TASKS TO BE PERFORMED BEFORE SHUTDOWN ¤¤ | TANK | 0 |
| J2M BALLON 001.C1.20 | Scaffolding setup | TANK | 8 |
| J2M BALLON 001.C1.30 | Scaffolding inspection | TANK | 2 |
| J2M BALLON 001.C1.40 | Complete insulation removal | TANK | 0 |
And we want the AI to return this:
| Source ID | Source Name | Predicted Successor |
|---|---|---|
| J2M BALLON 001.C1.10 | ¤¤ TASKS TO BE PERFORMED BEFORE SHUTDOWN ¤¤ | Scaffolding setup |
| J2M BALLON 001.C1.20 | Scaffolding setup | Scaffolding inspection |
| J2M BALLON 001.C1.30 | Scaffolding inspection | Complete insulation removal;Measurement pit creation |
| J2M BALLON 001.C1.40 | Complete insulation removal | ¤¤ TASKS TO BE PERFORMED DURING SHUTDOWN ¤¤ |
The dataset has the following columns: activity ID, activity name, equipment type, duration, and successor ID. I converted it to GraphML, with tasks as nodes and dependencies as edges.
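As an illustration of the nodes-and-edges mapping (not the exact pipeline used here), the sketch below turns rows with a successor column into an integer edge list. The field names `id` and `successors`, the `;` separator, and the toy task IDs are all assumptions made for the example.

```python
# Toy rows; "id", "successors" and the ";" separator are assumed names,
# not the real column layout of the Excel file.
rows = [
    {"id": "T1", "successors": "T2"},
    {"id": "T2", "successors": "T3"},
    {"id": "T3", "successors": "T4;T5"},
    {"id": "T4", "successors": ""},
    {"id": "T5", "successors": ""},
]


def build_edge_index(rows):
    """Map task IDs to node indices and collect directed edges (task -> successor)."""
    node_index = {row["id"]: i for i, row in enumerate(rows)}
    sources, targets = [], []
    for row in rows:
        for succ in (s.strip() for s in row["successors"].split(";")):
            # Skip empty entries and successors that are not in this file.
            if succ and succ in node_index:
                sources.append(node_index[row["id"]])
                targets.append(node_index[succ])
    return node_index, [sources, targets]


node_index, edge_index = build_edge_index(rows)
```

The result has the two-row layout (sources, targets) that PyTorch Geometric expects once it is wrapped with `torch.tensor(edge_index, dtype=torch.long)`.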
I have trained a GCN on a large graph with thousands of task nodes and dependencies, using PyTorch Geometric (I trained it on only one equipment type to check that it works before training on the full dataset). Everything works fine on the training/validation/test splits (0.90 AUC and 0.81 accuracy).
Now, I want to use this trained model to perform inference on a new graph (i.e., a new set of tasks from a different project, stored in an Excel file), which was not seen during training. I tried a simple `edge_index` in which all tasks are chained in order of their ID, but the predicted successors were completely wrong and repetitive.
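For clarity, the "chained" placeholder graph described above can be sketched roughly as follows. The helper name `chain_edge_index` and the toy IDs are hypothetical, and plain string sorting of the IDs is an assumption.

```python
def chain_edge_index(task_ids):
    """Connect each task to the next one in sorted-ID order (a -> b -> c ...)."""
    # Positions of the tasks, ordered by their ID string.
    order = sorted(range(len(task_ids)), key=lambda i: task_ids[i])
    sources = order[:-1]
    targets = order[1:]
    return [sources, targets]


# Toy IDs, deliberately out of order in the file.
task_ids = ["T.30", "T.10", "T.20"]
edge_index = chain_edge_index(task_ids)
```

Every node ends up with the same structural neighbourhood (one predecessor, one successor), regardless of what the task actually is.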
My key question is:
How can I use my trained GNN model to predict links (i → j) between tasks in a new, unseen graph that has its own nodes and no edges defined yet?
Here are a few sub-points I'm unsure about:
- Should I generate a dummy `edge_index` to allow the GCN to work on the new data?
- Is it better to train a GraphSAGE (which is inductive) or fall back to a pure MLP encoder?
- How do I ensure consistent feature encoding (TF-IDF, OneHot, StandardScaler, etc.) between training and inference?
- Should I re-encode task names using `SentenceTransformer` instead of TF-IDF to capture semantic similarity better?
- Are there any other models that would be better for my project?

Any guidance or real-world experience on inference with GNNs on new data would be helpful.
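To make the feature-encoding sub-point concrete, here is a minimal pure-Python sketch of what "consistent" means: the encoder state is learned once from the training rows, saved, and reused unchanged on the new project. The field names `equipment_type` and `duration`, the helper names, and the `PUMP` and `VALVE` types are assumptions for illustration. With scikit-learn the same pattern would be to fit the transformers once and persist them (for example with `joblib`), rather than refitting on the new file.

```python
import json
import math


def fit_encoder(rows):
    """Learn the one-hot vocabulary and scaling statistics from TRAINING rows only."""
    types = sorted({r["equipment_type"] for r in rows})
    durations = [float(r["duration"]) for r in rows]
    mean = sum(durations) / len(durations)
    var = sum((d - mean) ** 2 for d in durations) / len(durations)
    # Fall back to 1.0 so a constant column does not cause division by zero.
    return {"types": types, "mean": mean, "std": math.sqrt(var) or 1.0}


def transform(rows, enc):
    """Encode rows with the stored state; unseen equipment types become all zeros."""
    features = []
    for r in rows:
        one_hot = [1.0 if r["equipment_type"] == t else 0.0 for t in enc["types"]]
        scaled = (float(r["duration"]) - enc["mean"]) / enc["std"]
        features.append(one_hot + [scaled])
    return features


# Hypothetical training rows.
train_rows = [
    {"equipment_type": "TANK", "duration": 8},
    {"equipment_type": "TANK", "duration": 2},
    {"equipment_type": "PUMP", "duration": 5},
]
enc = fit_encoder(train_rows)
state = json.dumps(enc)          # persist alongside the model weights
enc_loaded = json.loads(state)   # reload at inference time, never refit

# Hypothetical new-project row with an equipment type unseen in training.
new_rows = [{"equipment_type": "VALVE", "duration": 5}]
features = transform(new_rows, enc_loaded)
```

The feature width stays identical between training and inference because the vocabulary and the scaling statistics come from the saved state, not from the new file.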