Research Integrity in Graph Learning: What Benchmarks Don't Tell You
Graph neural network benchmarks have a credibility problem. The same dataset appears in dozens of papers with widely varying reported accuracies, often without enough detail to reconcile the differences. New models routinely claim "state of the art" on the strength of a single-seed run and a hyperparameter sweep that conveniently lands on the best number.
This note is not a takedown of any specific paper. It is about the evaluation practices that produce inflated numbers, and what to look for when reading benchmark claims.
The most common inflation sources
Reading many GNN papers and trying to reproduce a fair number of them makes the patterns clear:
Single-seed reporting. A model trained once with seed 42 may have an accuracy two points higher than the median across 10 seeds. Reporting the best of three runs without disclosing variance is the most common form of cherry-picking.
Massive hyperparameter sweeps without disclosing the search budget. A model that beats a baseline by 0.5% after exploring 1000 hyperparameter combinations has not shown that it beats the baseline if the baseline did not get the same compute. The honest comparison gives both methods the same hyperparameter search budget, selects each method's configuration on validation data, and reports the mean over seeds for that selected configuration.
Raw versus filtered evaluation in KG completion. Reporting raw ranks where the baselines report filtered ranks makes a model look much weaker than it is. The reverse, reporting filtered ranks while describing the protocol as raw, makes it look stronger. Both happen.
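The difference between the two protocols is easiest to see on a single query. The sketch below is illustrative only: the function name, the toy entities, and the scores are hypothetical, and ties are resolved optimistically (only strictly higher scores count against the target), which is itself a protocol choice worth stating.

```python
def rank_of_true_tail(scores, true_tail, known_tails=frozenset(), filtered=True):
    """Rank (1 = best) of true_tail among candidate tails for one (head, relation) query.

    scores:      dict mapping candidate entity -> model score (higher is better).
    known_tails: other entities that also complete a true triple for this query
                 anywhere in train/valid/test.
    filtered:    if True, those other correct answers are removed from the ranking.
    """
    target = scores[true_tail]
    rank = 1
    for entity, score in scores.items():
        if entity == true_tail:
            continue
        if filtered and entity in known_tails:
            continue  # another correct answer: does not count against the target
        if score > target:
            rank += 1
    return rank


# Hypothetical query (france, has_city, ?) where the test triple's tail is "marseille"
# and "paris" and "lyon" are also true answers seen elsewhere in the data.
scores = {"paris": 0.9, "lyon": 0.8, "berlin": 0.3, "marseille": 0.7}
known = {"paris", "lyon"}

raw_rank = rank_of_true_tail(scores, "marseille", known, filtered=False)
filtered_rank = rank_of_true_tail(scores, "marseille", known, filtered=True)
print(raw_rank, filtered_rank)  # 3 1
```

The same model output yields rank 3 under the raw protocol and rank 1 under the filtered one, which is why the two cannot share a comparison table.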
Split choices that don't match standard practice. "We use a 90/10 train/test split" on a dataset that other papers split 70/30 produces inflated numbers because training on more data helps. If the comparison table cites other papers' numbers under the standard split, this is a problem.
Data leakage through preprocessing. Features normalized with global statistics (e.g., a mean and standard deviation computed over all data, test set included) carry test-set information into training. This is subtle and often unintentional.
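A minimal sketch of the leak-free pattern, using only the standard library: fit the normalization statistics on the training split and reuse them unchanged on the test split. The helper names and the toy values are hypothetical. In transductive node classification, where test-node features are visible during training by design, conventions differ, so the safest course is to state which statistics were used.

```python
import statistics


def fit_standardizer(train_values):
    """Compute mean and std from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against constant features
    return mean, std


def standardize(values, mean, std):
    """Apply previously fitted statistics; never refit on the data being transformed."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0]   # toy feature values
test = [10.0, 12.0]

# Leak-free: statistics come from the training split alone.
mean, std = fit_standardizer(train)
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)

# Leaky variant, for contrast: the test values shift the statistics used in training.
leaky_mean, leaky_std = fit_standardizer(train + test)
print(mean, leaky_mean)
```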
Selecting on the wrong split. Choosing hyperparameters on the test set instead of a held-out validation set is the textbook example of overfitting to the benchmark.
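A small simulation shows how the search-budget and selection-split items interact. It rests on deliberately simple assumptions that are not taken from any real benchmark: every configuration has the same true accuracy (0.80), and validation and test measurements are that value plus independent Gaussian noise of one point. Real validation and test scores are correlated, so selecting on validation reduces the bias in practice rather than removing it.

```python
import random
import statistics


def reported_test_accuracy(n_configs, select_on, rng, true_acc=0.80, noise_std=0.01):
    """Simulate one search; return the test accuracy of the configuration it selects."""
    runs = [
        {"val": rng.gauss(true_acc, noise_std), "test": rng.gauss(true_acc, noise_std)}
        for _ in range(n_configs)
    ]
    best = max(runs, key=lambda r: r[select_on])
    return best["test"]


def average_report(n_configs, select_on, trials=300, seed=0):
    """Average the reported number over many simulated searches."""
    rng = random.Random(seed)
    results = [reported_test_accuracy(n_configs, select_on, rng) for _ in range(trials)]
    return statistics.fmean(results)


for budget in (10, 1000):
    on_val = average_report(budget, "val")
    on_test = average_report(budget, "test")
    print(f"budget={budget:4d}  select on val: {on_val:.4f}  select on test: {on_test:.4f}")
```

Under these assumptions, selecting on the test split inflates the reported number even though no configuration is actually better, and the inflation grows with the number of configurations tried.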
What good benchmark reporting looks like
A defensible benchmark result includes:
- At least 5 seeded runs, reported as mean ± standard deviation.
- The hyperparameter search budget (how many combinations were tried, how many seeds per combination).
- The exact split used and a citation if standard.
- The filtered/unfiltered choice for KG metrics explicitly stated.
- The evaluation protocol (e.g., random negative sampling vs all-other-entities).
- A configuration file or code commit hash that exactly reproduces the result.
If a paper reports a single number with no variance and no methodology details, the number is unreliable.
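The first item in the list is cheap to implement. The following sketch aggregates per-seed accuracies with the standard library; the function name and the five accuracy values are hypothetical, and the sample standard deviation (n - 1 denominator) is used because the seeds are a sample of possible runs.

```python
import statistics


def summarize_runs(accuracies):
    """Aggregate per-seed accuracies into the numbers a results table should carry."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return {
        "n_runs": len(accuracies),
        "mean": statistics.fmean(accuracies),
        "std": statistics.stdev(accuracies),  # sample std, n - 1 denominator
        "min": min(accuracies),
        "max": max(accuracies),
    }


accs = [0.812, 0.798, 0.805, 0.821, 0.794]  # hypothetical accuracies from five seeds
summary = summarize_runs(accs)
print(f"{summary['mean']:.3f} ± {summary['std']:.3f} over {summary['n_runs']} seeds")
```

Keeping the minimum and maximum next to the mean makes it obvious how far a best-of-n number sits from the typical run.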
What TGraphX provides
The framework includes a write_benchmark_results() metadata writer that captures most of the above automatically:
import tgraphx as tgx
from tgraphx.tracking import write_benchmark_results

with tgx.reproducible(seed=42, deterministic=True):
    result = tgx.classify_nodes(
        x=x, edge_index=edge_index, labels=y,
        model="tensor_gcn",
    )

write_benchmark_results(
    out_dir="runs/exp_001",
    metrics=result.metrics,
    seeds=[42, 43, 44, 45, 46],
    hyperparameters={"lr": 2e-3, "hidden_dim": 64},
    split={"strategy": "random", "train_pct": 70, "val_pct": 15, "test_pct": 15},
)
This writes a benchmark_results.json with system fingerprint, hyperparameters, split methodology, all seed results, and aggregate statistics. The file is intended as supplementary material — the kind of artifact reviewers can use to assess reproducibility.
This is tooling, not enforcement. Nothing prevents a researcher from running a single seed and reporting it. The point is that the tooling makes good reporting easier than bad reporting, which over time tends to nudge practice in the right direction.
What this changes about how to read papers
When evaluating a GNN paper's claims, look for:
- Variance. Is mean ± std reported? If not, assume the number comes from a single run.
- Splits. Do they match the cited baselines exactly? If not, the comparison is invalid.
- Compute budget. Was the search budget for the new model comparable to baselines?
- Filtered vs raw. In KG papers, is the choice stated? The two can differ by 5-15 percentage points.
- Code release. Is there a public commit and configuration that produces the headline number?
A paper that does all five well is far more trustworthy than a paper with a higher number that does not.
On "state of the art"
The phrase is overused. State of the art on what dataset, with what evaluation protocol, with what compute budget, with what variance? Without those, the phrase is marketing.
The pragmatic alternative: claim what your model does well, acknowledge what it does not, and let the comparison stand on disclosed methodology.
Practical recommendations for new GNN researchers
- Always run at least 3 seeds, even for exploratory work. Use 5 or more for any number you publish.
- Use the same hyperparameter search budget for your method and the baselines.
- Report variance. Always.
- If your dataset has a standard split, use it. If you depart, justify and document.
- Save a per-run artifact with code commit, config, environment, metrics. The framework's tracking module does this; if you write your own loop, replicate the pattern.
- When you cannot reproduce a published baseline within a reasonable tolerance, say so in your paper.
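For a hand-written training loop, a per-run artifact can be produced with the standard library alone. The sketch below is one possible layout, not the TGraphX artifact format: the function names, the JSON field names, and the example config and metrics are all hypothetical.

```python
import json
import platform
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def current_commit():
    """Return the current git commit hash, or None when git or a checkout is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def write_run_record(out_dir, seed, config, metrics):
    """Write one JSON file per run with commit, config, environment, and metrics."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "commit": current_commit(),
        "seed": seed,
        "config": config,
        "metrics": metrics,
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    path = out_path / f"run_seed{seed}.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path


# Usage with hypothetical values, written to a temporary directory.
run_dir = tempfile.mkdtemp()
path = write_run_record(
    run_dir,
    seed=42,
    config={"lr": 2e-3, "hidden_dim": 64},
    metrics={"test_acc": 0.806},
)
print(path.name)  # run_seed42.json
```

One file per seed keeps failed or partial runs visible instead of silently dropping them from the aggregate.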
The field's benchmark culture is a community problem, but individual researchers can hold themselves to a higher standard. The artifacts the TGraphX framework produces are designed to make that standard the path of least resistance.
FAQ
Q: What variance is acceptable for a published result?
A: There is no universal answer. For typical GNN benchmarks, std of ~1-2 percentage points is common. Anything reported without std should be treated as a single run.
Q: Should I rerun all baselines for my paper?
A: Ideally yes, with the same protocol and compute budget. If not feasible, cite the original numbers exactly and disclose the protocol differences.
Q: Does the framework prevent benchmark inflation?
A: No. It provides tools that make good reporting easier. The discipline is the researcher's.
Q: What about model size? Should I report parameter counts?
A: Yes. Two models with very different parameter counts are not directly comparable on accuracy alone.
Q: Where can I find the benchmark artifact format documentation?
A: See docs/benchmarks.md in the TGraphX repository.