Research Reproducibility in GNN Projects: Lessons from TGraphX

Try reproducing a published GNN paper that uses PyTorch. Even with the right code, the right dataset, and the right hyperparameters, you will often see metrics that drift across runs by a few points. Sometimes a lot. This note is about why, and what tooling actually helps.

What "reproducible" needs to mean

A reproducible experiment is one where:

  1. The same code, with the same data and the same seeds, on similar hardware, produces the same numbers.
  2. A reader can audit how the experiment was run from artifacts left behind.
  3. Variation across runs is bounded by intentional sources (e.g. dropout) and not by hidden non-determinism.

Most GNN papers do not meet (1). Many do not meet (2). Almost none provide enough artifacts to make (3) checkable.

Hidden sources of non-determinism

Setting torch.manual_seed(42) is not enough. Other RNGs and non-deterministic ops are at play:

  • NumPy and Python random. Many data loaders, augmentations, and sampling functions use these. They are not seeded by torch.manual_seed.
  • CUDA non-determinism. Some CUDA kernels are non-deterministic by default (e.g., scatter operations). torch.use_deterministic_algorithms(True) forces deterministic variants, often at a speed cost.
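  • cuDNN autotuning. With torch.backends.cudnn.benchmark = True, cuDNN times several convolution algorithms for each new input shape and keeps the fastest, so different runs can pick different algorithms.
  • DataLoader workers. With num_workers > 0, each worker process has its own RNG state. Random augmentation or sampling inside the dataset can then differ across runs unless the worker seeds are derived from a fixed seed (a seeded generator plus a worker_init_fn).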
  • Graph sampling. Random walks, neighbor sampling, and negative sampling all use RNG that needs to be seeded.
  • Hardware. Floating-point results can differ slightly across GPU models, because each architecture may select different kernels, accumulate reductions in a different order, and fuse multiply-adds differently.

For a paper to be reproducible, all of these need to be controlled — or acknowledged in the methodology.
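For readers who do not use TGraphX, the controls listed above can be sketched in plain PyTorch. This is a minimal sketch, not the TGraphX implementation: the names seed_everything and seed_worker are hypothetical, the NumPy and PyTorch imports are guarded so the function also runs where they are not installed, and the warn_only argument assumes PyTorch 1.11 or newer. Graph sampling libraries that keep their own RNG still need to be seeded separately.

```python
import os
import random


def seed_everything(seed: int, deterministic: bool = True, warn_only: bool = False) -> None:
    """Seed the common RNGs and, optionally, set PyTorch determinism flags."""
    random.seed(seed)  # Python's built-in RNG

    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's legacy global RNG
    except ImportError:
        pass

    try:
        import torch
    except ImportError:
        return

    torch.manual_seed(seed)           # CPU generator; also seeds the CUDA generators
    torch.cuda.manual_seed_all(seed)  # explicit for clarity; ignored without CUDA

    if deterministic:
        # cuBLAS needs this on CUDA 10.2+ for deterministic matmuls.
        # Set it before any CUDA work starts.
        os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.use_deterministic_algorithms(True, warn_only=warn_only)


def seed_worker(worker_id: int) -> None:
    """worker_init_fn for a DataLoader: reseed Python and NumPy in each worker."""
    import torch  # always importable inside a DataLoader worker

    # PyTorch has already given this worker its own torch seed; reuse it.
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    try:
        import numpy as np
        np.random.seed(worker_seed)
    except ImportError:
        pass


# Usage sketch (assumes a `dataset` object exists):
#   seed_everything(42)
#   g = torch.Generator()
#   g.manual_seed(42)
#   loader = torch.utils.data.DataLoader(
#       dataset, batch_size=32, shuffle=True, num_workers=4,
#       worker_init_fn=seed_worker, generator=g)
```

Passing warn_only=True turns the error for operations without a deterministic implementation into a warning, which is useful for finding out which operations in a model are affected.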

What TGraphX provides

The framework includes a reproducible() context manager and a set_seed() utility that together address the seeding part comprehensively:

python
import tgraphx as tgx

# x, ei, and y (node features, edge index, labels) are assumed to be loaded already.
with tgx.reproducible(seed=42, deterministic=True):
    result = tgx.classify_nodes(x=x, edge_index=ei, labels=y)

What this does:

  • Seeds Python's random, NumPy, the PyTorch CPU RNG, and the PyTorch CUDA RNG (per device).
  • Sets torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False.
  • Calls torch.use_deterministic_algorithms(True).
  • Writes a reproducibility_report.json alongside other run artifacts when used inside the TGraphX experiment runner.

The reproducibility_report.json records the system fingerprint: Python version, PyTorch version, CUDA version, GPU model, seed, deterministic flag, and a hash of installed package versions.
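If you write your own training loop, a comparable fingerprint can be assembled from the standard library. The sketch below is an assumption about what such a report could look like, not the TGraphX file format: the function names, the field names, and the choice to hash only a named list of packages (rather than every installed package) are all hypothetical.

```python
import hashlib
import json
import platform
from importlib import metadata


def package_versions(names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def system_fingerprint(seed, deterministic, packages=("torch", "numpy")):
    """Collect the facts a reader needs to judge whether two runs are comparable."""
    versions = package_versions(packages)
    # Hash a canonical JSON dump so the digest does not depend on dict order.
    digest = hashlib.sha256(
        json.dumps(versions, sort_keys=True).encode("utf-8")
    ).hexdigest()
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "seed": seed,
        "deterministic": deterministic,
        "packages": versions,
        "packages_sha256": digest,
        "cuda": None,
        "gpu": None,
    }
    try:
        import torch
        report["cuda"] = torch.version.cuda  # None on CPU-only builds
        if torch.cuda.is_available():
            report["gpu"] = torch.cuda.get_device_name(0)
    except ImportError:
        pass
    return report


if __name__ == "__main__":
    print(json.dumps(system_fingerprint(seed=42, deterministic=True), indent=2))
```

Writing this dictionary to a JSON file next to the metrics gives a reviewer one place to check the seed and the environment a number came from.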

What the framework does NOT solve

tgx.reproducible() controls only what it can reach: RNG seeds and PyTorch's determinism flags. It cannot:

  • Guarantee bit-identical results across different GPU models or CUDA versions.
  • Eliminate floating-point non-associativity in reductions on GPU.
  • Reproduce results from code that uses external libraries with their own RNG state.

These are fundamental properties of the hardware and broader ecosystem. The pragmatic standard for "reproducible" in GNN research is: same code, same data, same hardware family — same numbers within a small bounded tolerance.
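"Within a small bounded tolerance" is easier to enforce when the tolerance is written down and checked. The following is a minimal sketch of such a check; the function name metric_mismatches, the metric names, and the default tolerance of 1e-3 are hypothetical and should be chosen per metric and per project.

```python
import math


def metric_mismatches(reference, candidate, abs_tol=1e-3):
    """Return the metrics that are missing or differ by more than abs_tol."""
    mismatches = {}
    for name in sorted(set(reference) | set(candidate)):
        ref = reference.get(name)
        cand = candidate.get(name)
        if ref is None or cand is None:
            mismatches[name] = (ref, cand)  # metric missing on one side
        elif not math.isclose(ref, cand, rel_tol=0.0, abs_tol=abs_tol):
            mismatches[name] = (ref, cand)  # outside the agreed tolerance
    return mismatches


# Hypothetical metrics from an original run and a rerun with the same seed.
original = {"accuracy": 0.812, "macro_f1": 0.790}
rerun = {"accuracy": 0.8125, "macro_f1": 0.7904}
print(metric_mismatches(original, rerun))
```

An empty dictionary means the rerun reproduced the original within tolerance; any entry names the metric that drifted and both values.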

A practical reproducibility checklist

Before publishing or sharing a GNN result, check:

  • Seeds are set for Python, NumPy, PyTorch, CUDA, and any data loaders.
  • torch.use_deterministic_algorithms(True) is called or the methodology explicitly acknowledges non-deterministic kernels.
  • The exact package versions are recorded (e.g., pip freeze > requirements_lock.txt).
  • The data preprocessing pipeline is in version control.
  • A run-level artifact is generated that captures system info, hyperparameters, and metrics.
  • The model checkpoint is small enough to share, or its hash is published so others can verify it.
  • At least 3 seeded runs are reported with mean and standard deviation.

The TGraphX experiment runner produces most of these artifacts automatically. If you write your own training loop, replicate the pattern.
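For the checkpoint item in the checklist, a hash can be computed with the standard library alone. This is a minimal sketch; the function name sha256_file and the example path are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage sketch (hypothetical path):
#   print(sha256_file("runs/example/model.pt"))
```

Publishing the digest alongside the reported metrics lets a reader confirm that a downloaded checkpoint is the one the numbers were measured on.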

Code snippet: a fully auditable run

python
import tgraphx as tgx
from tgraphx import set_seed

# x, ei, and y (node features, edge index, labels) are assumed to be loaded already.

# Seed the RNGs before anything else runs.
set_seed(42, deterministic=True)

with tgx.reproducible(seed=42, deterministic=True):
    # Build and validate the graph before training on it.
    g = tgx.Graph(x=x, edge_index=ei, labels=y)
    tgx.validate_graph(g, strict=True)
    result = tgx.classify_nodes(
        x=x, edge_index=ei, labels=y,
        model="tensor_gcn", seed=42, device="auto",
    )

# Inspect the run directory
print(tgx.audit_run_dir(result.run_dir))

audit_run_dir returns a summary of which expected artifacts are present and which are missing — a useful pre-submission check.

On benchmark claims

If you cannot reproduce your own result three times on your own hardware, do not publish it as a benchmark number. Report the mean and the spread across runs as the headline, not the best of three.

This is not a TGraphX-specific principle. It is a basic standard the field should adopt more consistently, and one that the framework's tooling makes easier to follow.
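Reporting the mean and spread takes only a few lines. The sketch below uses the standard library; the function names and the accuracy values are hypothetical, and the standard deviation is the sample standard deviation (dividing by n - 1), which is worth stating in the paper.

```python
import statistics


def summarize_runs(scores, min_runs=3):
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < min_runs:
        raise ValueError(f"need at least {min_runs} runs, got {len(scores)}")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    return mean, std


def format_summary(scores):
    """Format scores as 'mean ± std (n=...)' for a results table."""
    mean, std = summarize_runs(scores)
    return f"{mean:.3f} ± {std:.3f} (n={len(scores)})"


# Hypothetical test accuracies from three different seeds.
print(format_summary([0.80, 0.82, 0.81]))
```

Refusing to summarize fewer than three runs keeps a single lucky seed from ending up in a results table.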


FAQ

Q: Does deterministic=True always work?
A: Not always. Some PyTorch operations have no deterministic implementation and will raise a RuntimeError when you enable strict determinism. The error message names the operation and suggests a workaround.

Q: Why does my run still differ slightly across machines?
A: Floating-point addition is not associative, and different GPU models run parallel reductions in different orders, so the same operation can produce slightly different sums. This is a hardware fact, not a software bug.

Q: Is there a recommended way to report variance?
A: Run with at least 3 different seeds (5 is better) and report mean ± standard deviation. Most TGraphX example scripts demonstrate this pattern.

Q: Where do reproducibility artifacts get written?
A: When you use the experiment runner or tgx.classify_nodes(), artifacts are written to a per-run directory under runs/. The reproducibility_report.json and experiment_config.json are at the root of that directory.

Q: What if my data preprocessing is non-deterministic?
A: Cache the preprocessed data with a hash of the inputs and seed. Use the cached version in subsequent runs. Document the preprocessing seed.
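The caching approach from the last answer can be sketched as follows. This is one possible layout under stated assumptions, not a TGraphX feature: cache_key, load_or_build, and the file names are hypothetical, and pickle is used only for brevity (load only cache files you wrote yourself).

```python
import hashlib
import json
import pickle
from pathlib import Path


def cache_key(input_paths, seed, params=None):
    """Hash the raw input bytes together with the seed and preprocessing settings."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in input_paths):
        digest.update(path.read_bytes())
    settings = {"seed": seed, "params": params or {}}
    digest.update(json.dumps(settings, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def load_or_build(cache_dir, key, build_fn):
    """Return the cached result for key, running build_fn only on a cache miss."""
    path = Path(cache_dir) / f"{key}.pkl"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = build_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


# Usage sketch (hypothetical file and preprocessing function):
#   key = cache_key(["data/edges.csv"], seed=42, params={"self_loops": True})
#   graph = load_or_build("cache", key, lambda: preprocess("data/edges.csv", seed=42))
```

Because the key changes whenever the inputs, the seed, or the settings change, a stale cache cannot be picked up silently, and the key itself can be recorded in the run artifacts.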