How TGraphX Handles Benchmark Disclaimers Automatically

Most GNN benchmark numbers in published papers are reported as bare accuracies, sometimes with a single standard deviation, occasionally with a configuration file but rarely with full provenance. When the same dataset appears in many papers, the numbers diverge in ways that are hard to explain without the missing context.

TGraphX includes a benchmark artifact system designed to record the missing context automatically. This article explains what gets recorded and why each piece matters.

The provenance gap

Two researchers benchmark the same model on Cora. They get 84.1% and 86.7%. Both ran "with the published hyperparameters." Both used the standard 140/500/1000 split. What explains the gap of 2.6 percentage points?

It could be:

  • Different cuDNN version → different floating-point reduction order
  • Different sampler seeding → different mini-batches
  • Different feature normalization → different gradient magnitudes
  • Different validation criterion (best-val vs last-epoch)
  • Different seed count or aggregation (best-of-3 vs mean-of-3)
  • Different hyperparameter search budget

Each of these is invisible in a paper's results table. Reviewers cannot adjudicate. Reproducers cannot diagnose. The numbers are reported as if they were precise scientific measurements, but they are not.
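
To make two of these effects concrete, here is a minimal sketch using invented numbers (not results from any real run). It shows how far the reported figure can move when only the seed aggregation rule or the checkpoint-selection rule changes. The helper names summarize and select_epoch are illustrative and are not part of TGraphX.

```python
from statistics import mean, stdev

# Hypothetical test accuracies from three seeds (illustrative numbers only).
test_acc_per_seed = [0.841, 0.853, 0.867]


def summarize(accs):
    """Return the summaries that results tables often fail to distinguish."""
    return {
        "best_of_n": max(accs),
        "mean_of_n": mean(accs),
        "std": stdev(accs),  # sample standard deviation (divides by n - 1)
    }


def select_epoch(val_curve, test_curve, criterion):
    """Pick the test accuracy to report under a given validation criterion."""
    if criterion == "best_val":
        epoch = max(range(len(val_curve)), key=val_curve.__getitem__)
    elif criterion == "last_epoch":
        epoch = len(val_curve) - 1
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return test_curve[epoch]


summary = summarize(test_acc_per_seed)
print(f"best-of-3: {summary['best_of_n']:.3f}")
print(f"mean-of-3: {summary['mean_of_n']:.3f} ± {summary['std']:.3f}")

# Hypothetical per-epoch curves for a single seed.
val_curve = [0.70, 0.78, 0.81, 0.80, 0.79]
test_curve = [0.69, 0.77, 0.82, 0.80, 0.78]
best_val_acc = select_epoch(val_curve, test_curve, "best_val")
last_epoch_acc = select_epoch(val_curve, test_curve, "last_epoch")
print("best-val  :", best_val_acc)
print("last-epoch:", last_epoch_acc)
```

With these toy numbers, the same three runs can be summarized by their best seed or by their mean, and the same training curve yields two different test accuracies depending on the selection rule. Neither choice is visible in a bare accuracy.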

The fix is to record the provenance comprehensively as part of every run, so the missing context exists somewhere even if it is not in the paper.

What TGraphX records

The framework's tracking module includes a write_benchmark_results() function that captures:

python
from tgraphx.tracking import write_benchmark_results

# `result` comes from an earlier run, e.g. tgx.classify_nodes(...).
write_benchmark_results(
    out_dir="runs/exp_001",
    metrics=result.metrics,
    seeds=[42, 43, 44, 45, 46],
    hyperparameters={
        "lr": 2e-3, "hidden_dim": 64, "num_layers": 2,
        "dropout": 0.5, "weight_decay": 5e-4,
    },
    split={
        "strategy": "standard_cora",
        # 140/500/1000 of Cora's 2,708 nodes
        "train_pct": 5.2, "val_pct": 18.5, "test_pct": 36.9,
    },
    sampler={"type": "full_batch"},
    notes="Standard Cora split. 5 seeds, mean ± std reported.",
)

The resulting benchmark_results.json contains:

  • System fingerprint. Python version, PyTorch version, CUDA version, GPU model, CPU model, OS.
  • Package versions. Hash of pip freeze for reproducibility verification.
  • Run-level seed and configuration. Per-seed metrics, mean, std, min, max.
  • Hyperparameters. Exact values used.
  • Split methodology. Strategy name, percentages, RNG seed for any split-time randomness.
  • Sampler details. Loader type and parameters.
  • Evaluation protocol. Validation criterion, filtered/raw flag for KG, etc.

This is the kind of metadata that should be supplementary material for any benchmark publication.
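
To give a sense of what a system fingerprint and a package hash involve, here is a minimal standard-library sketch. It is not TGraphX's implementation: the function names and the field layout are assumptions, and the PyTorch, CUDA, and GPU fields are left out so the example runs without a deep learning stack installed.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def package_hash():
    """SHA-256 over a sorted 'name==version' list, in the spirit of hashing pip freeze."""
    pins = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that expose no name
            pins.append(f"{name.lower()}=={dist.version}")
    digest = hashlib.sha256("\n".join(sorted(pins)).encode("utf-8"))
    return digest.hexdigest()


def system_fingerprint():
    """Collect the interpreter, OS, and package information (hypothetical field names)."""
    return {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "package_hash": package_hash(),
    }


fingerprint = system_fingerprint()
print(json.dumps(fingerprint, indent=2))
```

Sorting the pins before hashing keeps the digest independent of the order in which installed packages are discovered, so two environments with the same packages produce the same hash.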

Automatic capture

For one-call workflows, the runner captures most of this automatically:

python
import tgraphx as tgx

with tgx.reproducible(seed=42, deterministic=True):
    result = tgx.classify_nodes(
        x=x, edge_index=ei, labels=y,
        model="tensor_gcn", seed=42,
    )

# Inspect what was written
print(tgx.audit_run_dir(result.run_dir))

audit_run_dir returns a summary of what artifacts exist: run_metadata.json, experiment_config.json, metrics.csv, reproducibility_report.json, and, if the user called the tracking writer, benchmark_results.json.

If something is missing, the audit lists it. This makes it useful as a pre-publication checklist: you can confirm the artifacts are complete before submitting.
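
If you write your own training loop and want the same kind of check, the idea can be sketched in a few lines. This is a hypothetical helper, not the audit_run_dir implementation. The file names come from the list above, and treating all five as required is an assumption.

```python
import tempfile
from pathlib import Path

# File names taken from the article; treating all of them as required is an assumption.
EXPECTED_ARTIFACTS = (
    "run_metadata.json",
    "experiment_config.json",
    "metrics.csv",
    "reproducibility_report.json",
    "benchmark_results.json",
)


def check_artifacts(run_dir, expected=EXPECTED_ARTIFACTS):
    """Hypothetical helper: split the expected file names into present and missing."""
    run_dir = Path(run_dir)
    present = [name for name in expected if (run_dir / name).is_file()]
    missing = [name for name in expected if name not in present]
    return {"present": present, "missing": missing, "complete": not missing}


# Demo on a throwaway directory that contains only two of the five files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("run_metadata.json", "metrics.csv"):
        (Path(tmp) / name).write_text("{}", encoding="utf-8")
    report = check_artifacts(tmp)

print("missing:", report["missing"])
```

A check like this only confirms that the files exist. It says nothing about whether their contents are complete or consistent with each other.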

The honest limit of what artifacts can do

Comprehensive artifacts make reproduction easier. They do not guarantee it. A reviewer who downloads your artifacts and reruns your code can verify:

  • Was the code as published?
  • Was the configuration as reported?
  • Do the metrics match the table?

They cannot easily verify:

  • Was the dataset preprocessed identically?
  • Was the hardware family similar?
  • Is the random seed actually controlling all the randomness?

For deeper reproducibility, you also need a public code repository, a clear description of preprocessing, and a pinned or containerized environment (a conda lock file or a Docker image) so the runtime matches as closely as possible.
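
One way to narrow the preprocessing gap is to publish a checksum of the preprocessed data next to the written description. The sketch below is an illustration under stated assumptions: the source does not say that TGraphX records such a checksum, and dataset_checksum is a hypothetical helper working on plain Python lists rather than tensors.

```python
import hashlib
import struct


def dataset_checksum(features, labels):
    """SHA-256 over features and labels in a fixed order and binary layout."""
    digest = hashlib.sha256()
    for row in features:
        digest.update(struct.pack(f"<{len(row)}d", *row))  # little-endian float64
    digest.update(struct.pack(f"<{len(labels)}q", *labels))  # little-endian int64
    return digest.hexdigest()


raw = [[1.0, 3.0], [2.0, 2.0]]
labels = [0, 1]

# The same raw data under two plausible preprocessing choices.
row_normalized = [[value / sum(row) for value in row] for row in raw]
unnormalized = raw

checksum_normalized = dataset_checksum(row_normalized, labels)
checksum_unnormalized = dataset_checksum(unnormalized, labels)
print(checksum_normalized)
print(checksum_unnormalized)
```

The two digests differ, so a reproducer who normalizes differently finds out before training rather than after. A bit-exact checksum is strict: preprocessing that differs only in floating-point rounding will also change it, so it works best alongside a prose description and not as a replacement for one.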

The framework provides the tooling to make good practice easy. The discipline is still the researcher's.

A model benchmark report

A defensible benchmark report from a paper using TGraphX might look like:

We trained tensor_gcn on cifar10_patch with tgx.classify_nodes(), using
TGraphX 1.4.2 (pip install tgraphx==1.4.2). Hyperparameters: learning rate
2e-3, hidden dim 64, 2 layers, dropout 0.5. Standard 70/15/15 random split
with seed 42. Five training seeds (42-46). All artifacts at
github.com/[user]/[repo]/runs. Mean test accuracy: 78.4% ± 1.1% (5 seeds).

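That paragraph states the model, dataset, library version, hyperparameters, split, seeds, and artifact location: what a reviewer needs to assess the result and attempt a reproduction. The artifacts in the linked repository can be downloaded to verify the run.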

This is the standard the framework's tooling is designed to make easy. It is the standard the field should hold itself to.

When you should NOT report a benchmark

  • A single-seed run. Report it as a "preliminary single-seed result" or do not report it.
  • A model selected on test set performance. This is overfitting to the benchmark and should not be reported.
  • A model with a hyperparameter budget that the baselines did not have. Unfair comparison.
  • A result you cannot reproduce yourself on the same machine. If you can't reproduce it, neither can anyone else.
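
The last item can be checked mechanically before you report anything: rerun with the same seed and compare. The sketch below uses toy stand-ins for a training run. train_seeded, train_leaky, and is_reproducible are hypothetical names, and a real GPU run would likely need a numeric tolerance instead of exact equality.

```python
import random


def train_seeded(seed):
    """Toy 'training run' that draws all randomness from a local, seeded RNG."""
    rng = random.Random(seed)
    return [round(rng.random(), 6) for _ in range(5)]


def train_leaky(seed):
    """Toy run in which one source of randomness bypasses the seed."""
    rng = random.Random(seed)
    values = [rng.random() for _ in range(4)]
    values.append(random.SystemRandom().random())  # OS entropy, not controlled by `seed`
    return values


def is_reproducible(run_fn, seed, repeats=3):
    """Rerun with the same seed and require identical outputs every time."""
    first = run_fn(seed)
    return all(run_fn(seed) == first for _ in range(repeats - 1))


seeded_ok = is_reproducible(train_seeded, seed=42)
leaky_ok = is_reproducible(train_leaky, seed=42)
print("seeded run reproducible:", seeded_ok)
print("leaky run reproducible: ", leaky_ok)
```

The leaky run fails the check even though it was given a seed, which is the situation where a seed is reported but does not control all the randomness.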

The benchmark culture in ML rewards numbers over rigor. The artifact system is one small lever toward shifting that balance. Individual researchers using it consistently can model the practice the field should adopt.


FAQ

Q: Does write_benchmark_results need to be called for every run?
A: No. The TGraphX experiment runner calls it automatically. Manual calls are useful when you write your own training loop.

Q: What is in reproducibility_report.json?
A: System fingerprint, seed, deterministic flag, RNG library versions, hardware info. See docs/reproducibility.md in the TGraphX repo.

Q: Where do these artifacts get written?
A: Per-run subdirectories under runs/ by default. The location is configurable via the experiment runner's out_dir parameter.

Q: Can I include benchmark artifacts in a Git repository?
A: Yes, but they can be large. Many projects use Git LFS for the larger ones (e.g., model checkpoints) and keep the metadata JSON files in regular Git.

Q: How do I include the benchmark report in a paper?
A: The artifact format is JSON. Convert relevant fields to a LaTeX table or markdown summary for inclusion. See docs/benchmarks.md for examples.
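
As a rough illustration of that conversion, the sketch below formats one LaTeX table row from a JSON document. The JSON layout shown here is an assumption made for the example. The actual benchmark_results.json schema may differ, so check the file your run produced before adapting the field names.

```python
import json

# Assumed layout for illustration; the real benchmark_results.json schema may differ.
example_json = """
{
  "model": "tensor_gcn",
  "seeds": [42, 43, 44, 45, 46],
  "metrics": {"test_accuracy": {"mean": 0.784, "std": 0.011}}
}
"""


def latex_row(results, metric="test_accuracy"):
    """Format one table row: model, number of seeds, mean ± std in percent."""
    stats = results["metrics"][metric]
    model = results["model"].replace("_", r"\_")  # escape underscores for LaTeX
    mean_pct = 100 * stats["mean"]
    std_pct = 100 * stats["std"]
    n_seeds = len(results["seeds"])
    return f"{model} & {n_seeds} & ${mean_pct:.1f} \\pm {std_pct:.1f}$ \\\\"


row = latex_row(json.loads(example_json))
print(row)
```

Generating the row from the JSON file, instead of retyping the numbers, keeps the table in the paper tied to the artifact a reviewer can download.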