Shape-Aware Validation in TGraphX: Catching Bugs Before They Matter
If you have spent a few weeks doing GNN research in PyTorch, you have probably encountered something like this:
RuntimeError: shape '[64, 128]' is invalid for input of size 12544
The shape mismatch happened twelve layers deep inside a stack of message-passing operations. The traceback is unhelpful. The data preprocessing looked fine. The model looked fine. Something subtle changed: an edge index referenced a node that no longer exists, a feature tensor lost a dimension, a batch was assembled wrong.
This article is about a small but useful design choice in TGraphX: shape-aware validation at the data layer, so these bugs surface in milliseconds instead of after an hour of training.
The validation primitives
TGraphX exposes three validation utilities for graph data:
import tgraphx as tgx
# Basic — checks shapes, edge index bounds, label alignment
tgx.validate_graph(g)
# Strict — raises on any anomaly instead of returning a result object
tgx.validate_graph(g, strict=True)
# Specific assertion — node features must be at least rank-3
tgx.assert_tensor_native(g, min_rank=3)
# Broader invariants — also checks edge attribute alignment and metadata
tgx.check_graph_invariants(g)
Calling tgx.validate_graph(g, strict=True) once, immediately after constructing the graph object, turns a class of opaque runtime errors into a clear ValueError before any training starts.
What does validate_graph actually check?
The list, paraphrased from the source:
- x (node features) must be a tensor.
- edge_index must be a [2, E] long tensor (or empty).
- Every entry in edge_index must satisfy 0 <= idx < num_nodes.
- If labels is present, its leading dimension must equal num_nodes.
- If edge_attr is present, its leading dimension must equal the edge count.
- If node_mask or edge_mask is present, their leading dimensions must match.
- All tensors must be on the same device.
These are the most common silent failure modes in GNN data assembly.
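The checks listed above can be approximated in a few lines of plain PyTorch. The following is a minimal sketch, not TGraphX's implementation: the function name collect_graph_issues and its return format (a list of strings) are assumptions made for illustration, and the mask checks are left out for brevity.

```python
import torch


def collect_graph_issues(x, edge_index, labels=None, edge_attr=None):
    """Return a list of problems found (hypothetical helper, not the TGraphX API)."""
    if not isinstance(x, torch.Tensor):
        return ["x must be a tensor"]
    issues = []
    num_nodes = x.shape[0]

    # edge_index must be a [2, E] long tensor; E may be zero.
    if (edge_index.dtype != torch.long or edge_index.dim() != 2
            or edge_index.shape[0] != 2):
        return [f"edge_index must be a [2, E] long tensor, "
                f"got shape {tuple(edge_index.shape)} and dtype {edge_index.dtype}"]
    num_edges = edge_index.shape[1]

    # Bounds check: every endpoint must name an existing node.
    if num_edges > 0:
        lo, hi = int(edge_index.min()), int(edge_index.max())
        if lo < 0 or hi >= num_nodes:
            issues.append(
                f"edge_index range [{lo}, {hi}] is invalid for num_nodes={num_nodes}")

    # Leading-dimension alignment for the optional tensors.
    if labels is not None and labels.shape[0] != num_nodes:
        issues.append(f"labels has {labels.shape[0]} rows, expected {num_nodes}")
    if edge_attr is not None and edge_attr.shape[0] != num_edges:
        issues.append(f"edge_attr has {edge_attr.shape[0]} rows, expected {num_edges}")

    # All tensors that were supplied must live on one device.
    devices = {t.device for t in (x, edge_index, labels, edge_attr) if t is not None}
    if len(devices) > 1:
        issues.append(f"tensors are spread across devices: {sorted(map(str, devices))}")
    return issues


x = torch.randn(100, 3, 8, 8)
bad_edges = torch.tensor([[0, 1, 99, 100], [1, 2, 0, 3]], dtype=torch.long)
issues = collect_graph_issues(x, bad_edges, labels=torch.randint(0, 5, (100,)))
print(issues)
```

Returning a list rather than raising keeps the helper usable in both modes: a strict caller can raise when the list is non-empty, while a batch job can simply count entries.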
A bug validate_graph catches
import torch
import tgraphx as tgx
N = 100
x = torch.randn(N, 3, 8, 8)
labels = torch.randint(0, 5, (N,))
# Edge index accidentally includes a node ID >= N
edge_index = torch.tensor([
    [0, 1, 99, 100],  # last index is out of bounds!
    [1, 2, 0, 3],
], dtype=torch.long)
g = tgx.Graph(x=x, edge_index=edge_index, labels=labels)
tgx.validate_graph(g, strict=True)
# → ValueError: edge_index contains node ID 100 but num_nodes is 100
Without validation, this would crash during message passing, typically with a CUDA device-side assertion buried in a long traceback that says nothing about the bad edge.
A bug assert_tensor_native catches
You wrote a pipeline that flattens features at one stage and forgot to remove the flatten:
x = torch.randn(100, 3, 8, 8)
x = x.view(100, -1) # accidentally flattened to [100, 192]
g = tgx.Graph(x=x, edge_index=edge_index, labels=labels)
tgx.assert_tensor_native(g, min_rank=3)
# → AssertionError: node features have rank 2, expected at least 3
If you intended to keep the tensor structure but lost it somewhere in a long preprocessing chain, this fails fast and points directly at the regression.
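The rank check itself is simple to reason about. The sketch below reproduces the idea in plain PyTorch under one stated assumption: the node dimension counts toward the rank, so a [100, 3, 8, 8] tensor has rank 4 and a flattened [100, 192] tensor has rank 2. The helper name require_min_rank is hypothetical and is not part of TGraphX.

```python
import torch


def require_min_rank(x: torch.Tensor, min_rank: int) -> None:
    """Raise if x has fewer than min_rank dimensions (hypothetical helper)."""
    if x.dim() < min_rank:
        raise AssertionError(
            f"node features have rank {x.dim()}, expected at least {min_rank}"
        )


x = torch.randn(100, 3, 8, 8)
require_min_rank(x, 3)        # passes: rank 4

flat = x.view(100, -1)        # shape [100, 192], rank 2
try:
    require_min_rank(flat, 3)
except AssertionError as err:
    message = str(err)
    print(message)
```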
When validation is too expensive
Validation is cheap but not free. For a graph with 10 million nodes, scanning the entire edge_index to check bounds is measurable. Two strategies:
- Validate once at construction, skip during training. This is the typical pattern.
- Use validate_graph(g, strict=False) and check the returned issues programmatically. Useful in batched contexts where you want to count problems instead of raising.
Both are documented in docs/api_stability.md in the TGraphX repository.
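To make the counting pattern concrete, here is a sketch that tallies problems across several graphs instead of stopping at the first one. It does not call TGraphX, because the structure of the object returned by validate_graph is not described in this article; edge_bounds_issue and its issue code are stand-ins invented for the example.

```python
from collections import Counter

import torch


def edge_bounds_issue(edge_index, num_nodes):
    """Return an issue code, or None if every index is in range (hypothetical helper)."""
    if edge_index.numel() == 0:
        return None
    if int(edge_index.min()) < 0 or int(edge_index.max()) >= num_nodes:
        return "edge_index_out_of_bounds"
    return None


# Three toy graphs with 4 nodes each; the second one has a bad edge.
graphs = [
    torch.tensor([[0, 1], [1, 2]]),
    torch.tensor([[0, 4], [1, 2]]),          # node ID 4 does not exist
    torch.empty((2, 0), dtype=torch.long),   # an empty edge list is valid
]

counts = Counter()
for edge_index in graphs:
    issue = edge_bounds_issue(edge_index, num_nodes=4)
    counts[issue or "ok"] += 1

print(dict(counts))  # {'ok': 2, 'edge_index_out_of_bounds': 1}
```

Note that min and max still scan the whole tensor, so this changes how problems are reported, not how much work the check does.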
Beyond validate_graph: leakage and split policy
For supervised learning, two other utilities catch subtler bugs:
tgx.check_leakage(train_mask, val_mask, test_mask, strict=True)
tgx.validate_split_policy(train_mask, val_mask, test_mask, policy="random")
check_leakage catches the classic mistake of having the same node ID in train and test masks. validate_split_policy checks that the split satisfies a declared policy (random, by-node, by-graph, etc.).
These are not necessary for every project but are useful in research code where the split logic is fragile and easy to break during refactoring.
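For boolean node masks, the leakage condition reduces to "no node belongs to more than one split". A minimal sketch in plain PyTorch follows; the helper overlapping_nodes is hypothetical and only illustrates the idea, not how check_leakage is implemented.

```python
import torch


def overlapping_nodes(train_mask, val_mask, test_mask):
    """Return node IDs that appear in more than one split (hypothetical helper)."""
    membership = train_mask.long() + val_mask.long() + test_mask.long()
    return torch.nonzero(membership > 1).flatten().tolist()


train_mask = torch.tensor([1, 1, 1, 0, 0, 0], dtype=torch.bool)
val_mask = torch.tensor([0, 0, 0, 1, 0, 0], dtype=torch.bool)
test_mask = torch.tensor([0, 0, 1, 0, 1, 1], dtype=torch.bool)  # node 2 is also in train

leaked = overlapping_nodes(train_mask, val_mask, test_mask)
print(leaked)  # [2]
```

Summing the masks instead of testing each pair keeps the check to a single pass and reports the offending node IDs, which is usually what you need to find the bug in the split logic.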
Dashboard-level audits
For multi-run experiments, the framework provides directory-level audit utilities:
print(tgx.audit_run_dir("runs/exp_001"))
# → {"ok": True, "files_present": [...], "missing": [], "warnings": []}
print(tgx.dashboard_audit("runs"))
# → {"ok": True, "run_count": 12, "issues": []}
These run after experiments to confirm artifacts are intact and ready for the dashboard or for publication.
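The idea behind a run-directory audit can be sketched with the standard library alone. The example below mirrors the shape of the dictionary shown above, but the required file names are hypothetical: this article does not list which artifacts TGraphX actually expects, and audit_dir is not the library's function.

```python
import tempfile
from pathlib import Path

# Hypothetical artifact names, chosen only for this illustration.
REQUIRED_FILES = ("config.json", "metrics.json")


def audit_dir(run_dir):
    """Report which required files exist in run_dir (hypothetical helper)."""
    run_path = Path(run_dir)
    present = [name for name in REQUIRED_FILES if (run_path / name).is_file()]
    missing = [name for name in REQUIRED_FILES if name not in present]
    return {"ok": not missing, "files_present": present,
            "missing": missing, "warnings": []}


# Build a throwaway run directory that lacks metrics.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    report = audit_dir(tmp)

print(report)
```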
Practical recommendation
Add three lines at the end of your data preparation, right where the graph is built:
g = tgx.Graph(x=x, edge_index=edge_index, labels=labels)
tgx.validate_graph(g, strict=True)
tgx.check_leakage(train_mask, val_mask, test_mask, strict=True)
These three lines have saved real research time in actual TGraphX projects. Adopt them as a default.
FAQ
Q: Does validation slow down training?
A: No, as long as you call it once after constructing the graph and keep it out of the training loop.
Q: Can I skip validation for very large graphs?
A: Yes. Skip the call entirely once you trust the data pipeline. If you still want the checks but not the exception, use strict=False to get a structured result instead of raising.
Q: What does check_graph_invariants add over validate_graph?
A: check_graph_invariants also verifies edge attribute alignment, metadata structure, and optional invariants like graph connectedness if requested.
Q: Will TGraphX validate automatically when I construct a graph?
A: No, it does not validate by default. You have to call tgx.validate_graph() explicitly. This is intentional: validation has a cost and the framework lets you choose when to pay it.
Q: What about debugging the model itself, not just data?
A: Use tgx.debug_batch(batch) and tgx.batch_summary(batch) for inspecting a mini-batch's structure during training. They print human-readable summaries of node/edge counts and feature shapes.
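If you want the same kind of one-line summary outside TGraphX, it takes only a few lines of plain PyTorch. This is a rough sketch with a hypothetical helper name, summarize; it is not the output format of tgx.batch_summary.

```python
import torch


def summarize(x, edge_index):
    """Format node/edge counts and the per-node feature shape (hypothetical helper)."""
    return (
        f"nodes={x.shape[0]} edges={edge_index.shape[1]} "
        f"feature_shape={tuple(x.shape[1:])} device={x.device}"
    )


x = torch.randn(32, 3, 8, 8)
edge_index = torch.randint(0, 32, (2, 90))
summary = summarize(x, edge_index)
print(summary)
```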