Tutorial · June 01, 2026 · 5 min read

Graph Reinforcement Learning with TGraphX: A Practical Introduction

Graph reinforcement learning (graph RL) addresses problems where the action space is defined on a graph: choose an edge to add, a node to visit, a vertex to color. Classic combinatorial problems — MaxCut, vertex cover, graph coloring, routing — fit this framework naturally, and RL approaches to them have become an active research area.

TGraphX includes a graph RL subsystem with 13 algorithms across 9 environments. This article walks through how to use it, with the important caveat that the entire subsystem is labeled Experimental. Expect rough edges, evaluate carefully before drawing strong conclusions, and treat any benchmark numbers it produces as preliminary.

What the subsystem includes

Environments shipped with the framework:

GraphNavigationEnv — navigate from source to target node
GraphColoringEnv — assign colors to minimize conflicts
MaxCutEnv — partition nodes to maximize cut edges
ShortestPathEnv — find shortest path
VertexCoverEnv — minimum vertex cover
GraphGenerationEnv — sequential graph generation
KGPathReasoningEnv — KG path finding
ContinuousNavigationEnv and ContinuousGraphEditEnv — continuous-action variants

Algorithms:

Discrete: REINFORCE, Actor-Critic, A2C, DQN, Double DQN, Dueling DQN, PPO
Continuous: DDPG, Delayed DDPG, TD3, SAC
Baselines: Random, Greedy

A first experiment

The one-call API runs a complete training run:

python

import tgraphx as tgx

result = tgx.train_graph_rl(
    env="maxcut",
    algorithm="dqn",
    num_nodes=20,
    episodes=100,
    seed=42,
)

print(result.metrics)        # {"final_return": ..., "best_return": ..., ...}
print(result.history[-5:])   # last 5 episode returns

This trains a DQN agent on randomly generated 20-node MaxCut instances for 100 episodes and returns a result object with metrics and training history.

Comparing algorithms

For a small environment, you can compare multiple algorithms quickly:

python

import tgraphx as tgx

algorithms = ["random", "greedy", "dqn", "ppo"]
results = {}

for alg in algorithms:
    res = tgx.train_graph_rl(
        env="maxcut", algorithm=alg,
        num_nodes=15, episodes=50, seed=42,
    )
    results[alg] = res.metrics["best_return"]

for alg, score in sorted(results.items(), key=lambda x: -x[1]):
    print(f"  {alg:10s}: {score:.2f}")

Random and greedy serve as baselines. With only 50 episodes on a 15-node problem, do not expect the learned algorithms to beat the heuristic baselines; learned policies generally need larger instances and longer training before they become competitive.
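
To make the two baselines concrete, here is a minimal sketch in plain Python, independent of TGraphX. It assumes "random" means a uniformly random side for each node and "greedy" means placing each node on the side that cuts more edges to its already-placed neighbors. The helper names (cut_value, random_partition, greedy_partition) are illustrative, and TGraphX's own baselines may be defined differently.

```python
import random


def cut_value(edges, side):
    """Number of edges whose endpoints lie on different sides."""
    return sum(1 for u, v in edges if side[u] != side[v])


def random_partition(num_nodes, rng):
    """Assumed 'random' baseline: each node picks side 0 or 1 uniformly."""
    return [rng.randint(0, 1) for _ in range(num_nodes)]


def greedy_partition(num_nodes, edges):
    """Assumed 'greedy' baseline: place nodes one at a time on the side
    that cuts more edges to the neighbors placed so far."""
    neighbors = {u: [] for u in range(num_nodes)}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    side = [None] * num_nodes
    for u in range(num_nodes):
        placed = [side[v] for v in neighbors[u] if side[v] is not None]
        # Choosing side 0 cuts the edges to neighbors on side 1, and vice versa.
        gain_if_0 = sum(1 for s in placed if s == 1)
        gain_if_1 = sum(1 for s in placed if s == 0)
        side[u] = 0 if gain_if_0 >= gain_if_1 else 1
    return side


# A 4-cycle 0-1-2-3-0; the best cut separates {0, 2} from {1, 3}.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
rng = random.Random(42)
random_cut = cut_value(edges, random_partition(4, rng))
greedy_cut = cut_value(edges, greedy_partition(4, edges))
print(f"random: {random_cut}, greedy: {greedy_cut}, edges: {len(edges)}")
```

This greedy rule always cuts at least half of the edges, which is why it is a hard baseline to beat on tiny instances with short training budgets.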

Lower-level API

The one-call function is convenient but hides the training loop. For research, drop to the lower-level API:

python

from tgraphx.rl import DQNAgent, MaxCutEnv, GraphPolicyNetwork

env = MaxCutEnv(num_nodes=20, seed=42)
agent = DQNAgent(
    state_dim=env.observation_space.shape[0],
    action_dim=env.action_space.n,
    hidden_dim=128,
    lr=1e-3,
)

for episode in range(100):
    state = env.reset()
    done = False
    while not done:
        # Act, observe the transition, store it, then update the agent.
        action = agent.select_action(state)
        next_state, reward, done, info = env.step(action)
        agent.store_transition(state, action, reward, next_state, done)
        agent.train_step()
        state = next_state

This is closer to standard RL code and gives you control over the training loop, exploration strategy, and replay buffer.
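
Exploration is one of the pieces you control once you own the loop. The sketch below shows a linearly decaying epsilon-greedy rule in plain Python. Whether DQNAgent.select_action already does something like this internally, or accepts an epsilon argument, is not stated here, so treat epsilon_at and epsilon_greedy as hypothetical helpers and not as TGraphX API.

```python
import random


def epsilon_at(episode, start=1.0, end=0.05, decay_episodes=80):
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    fraction = min(episode / decay_episodes, 1.0)
    return start + fraction * (end - start)


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.7, 0.3]  # stand-in for one Q-network output

# With epsilon = 0 the rule is purely greedy and returns the argmax index.
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)

# Epsilon at the start, midway, at the end of the decay, and afterwards.
schedule = [epsilon_at(ep) for ep in (0, 40, 80, 100)]
print(greedy_action, [round(e, 3) for e in schedule])
```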

A continuous-action example: SAC on continuous graph navigation

python

import tgraphx as tgx

result = tgx.train_graph_rl(
    env="continuous_navigation",
    algorithm="sac",
    num_nodes=30,
    episodes=200,
    seed=42,
)

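The continuous environments pair vector observations with continuous action spaces. The continuous algorithms listed above (DDPG, Delayed DDPG, TD3, SAC) are appropriate; DQN is not, because it selects actions by taking an argmax over a finite set of Q-values.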

What works well

The discrete environments and algorithms are well-integrated and produce sensible learning curves for small problems.
The baselines (random, greedy) are useful sanity checks.
The one-call API is suitable for quick experimentation and demos.
Results are reproducible when the training call is wrapped in tgx.reproducible(seed=42).

What does not work as well (yet)

Scaling to large graphs (1000+ nodes) is rough. The default policy networks were not designed for very large action spaces.
Continuous environments are less tested than discrete ones.
Hyperparameter defaults are general-purpose, not tuned for any specific benchmark.
There are no published large-scale comparisons against established RL libraries (Stable-Baselines3, RLlib).

For research where you compare against published graph RL benchmarks, the established RL libraries combined with custom environments may produce better and more comparable results. TGraphX's graph RL is best for exploring problem formulations and rapid prototyping.

When this subsystem is the right fit

You are exploring graph RL formulations and want a single library with environments and algorithms.
You need a quick baseline for a combinatorial problem with graph structure.
You are teaching graph RL and want students to focus on the formulation, not on environment plumbing.

When it is not

Production-grade graph RL deployment.
Benchmarking against published numbers, which requires matching the original hyperparameters precisely.
Very large action spaces or continuous control with fine reward shaping.

The framework's stability documentation in docs/api_stability.md states the Experimental label plainly. Plan around it: APIs and behavior in this subsystem may change.

A note on reward shaping

For combinatorial problems, dense reward signals (per-step improvement) train much faster than sparse ones (terminal reward only). Most environments in TGraphX support both via an intermediate_reward parameter. Start with intermediate rewards on; turn them off only if you specifically need sparse-reward research.
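
One way to see how the two settings relate: if the dense reward is the per-step change in cut value, the per-step rewards sum to the terminal cut value, so the undiscounted episode return is the same in both settings and only the timing of the signal differs. The sketch below checks this on a small MaxCut episode in plain Python. It assumes an episode that starts with all nodes on side 0 and moves one node to side 1 per step; it illustrates the idea and is not TGraphX's reward code, and the exact semantics of intermediate_reward may differ.

```python
def cut_value(edges, side):
    """Number of edges whose endpoints lie on different sides."""
    return sum(1 for u, v in edges if side[u] != side[v])


def run_episode(edges, num_nodes, moves, dense):
    """Move the listed nodes to side 1 one at a time and collect rewards.

    dense=True  -> reward is the change in cut value at every step
    dense=False -> reward is 0 until the last step, then the final cut value
    """
    side = [0] * num_nodes
    rewards = []
    for i, node in enumerate(moves):
        before = cut_value(edges, side)
        side[node] = 1
        after = cut_value(edges, side)
        is_last = i == len(moves) - 1
        if dense:
            rewards.append(after - before)
        else:
            rewards.append(after if is_last else 0)
    return rewards


# A 4-cycle plus the chord 0-2.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
moves = [0, 2]

dense_rewards = run_episode(edges, 4, moves, dense=True)
sparse_rewards = run_episode(edges, 4, moves, dense=False)
print(dense_rewards, sparse_rewards)
```

Both lists sum to the same final cut value, but in the dense case the agent learns after the first move that moving node 0 was useful, while in the sparse case that credit has to be propagated back from the final step.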

FAQ

Q: Are the environments Gymnasium-compatible?
A: They follow the same API conventions but are not registered as Gymnasium environments. You can wrap them with a Gymnasium adapter if needed.
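
A minimal sketch of such an adapter follows. It assumes the environment's reset() returns only an observation and step() returns the 4-tuple (obs, reward, done, info), as in the training loop earlier in this article. Gymnasium instead expects reset() to return (obs, info) and step() to return (obs, reward, terminated, truncated, info). GymnasiumStyleAdapter and CountdownEnv are hypothetical names. The sketch avoids importing gymnasium so it runs standalone; a real adapter would subclass gymnasium.Env and declare observation_space and action_space.

```python
class CountdownEnv:
    """Toy stand-in for an environment with the 4-tuple step API."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, float(action), done, {}


class GymnasiumStyleAdapter:
    """Hypothetical wrapper exposing Gymnasium-style reset/step signatures."""

    def __init__(self, env, max_steps=None):
        self.env = env
        self.max_steps = max_steps  # optional time limit, reported as truncation
        self._steps = 0

    def reset(self, *, seed=None, options=None):
        # seed and options are accepted for signature compatibility only;
        # the toy environment above has no randomness to seed.
        self._steps = 0
        return self.env.reset(), {}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._steps += 1
        terminated = bool(done)
        hit_limit = self.max_steps is not None and self._steps >= self.max_steps
        truncated = bool(hit_limit and not terminated)
        return obs, reward, terminated, truncated, info


env = GymnasiumStyleAdapter(CountdownEnv(horizon=3))
obs, info = env.reset(seed=0)
trajectory = []
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(1)
    trajectory.append((obs, reward, terminated, truncated))
print(trajectory)
```

Splitting done into terminated and truncated matters for value-based methods, since a time-limit cutoff should not be treated as a true terminal state when bootstrapping.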

Q: Can I bring my own environment?
A: Yes. Subclass tgraphx.rl.GraphEnv and implement reset() and step(). Pass it to any agent that accepts the environment.
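
As a rough sketch of the two methods, here is a tiny environment on a path graph, written as a standalone class so it runs without TGraphX. In practice it would subclass tgraphx.rl.GraphEnv; what else that base class requires (constructor arguments, observation_space and action_space attributes) is not covered here, so check the library before relying on this shape. PathGraphEnv and its reward values are made up for illustration.

```python
class PathGraphEnv:
    """Walk along the path graph 0 - 1 - ... - (n - 1) to reach the last node.

    Hypothetical example: in TGraphX this class would subclass
    tgraphx.rl.GraphEnv instead of being standalone.
    """

    def __init__(self, num_nodes=5, max_steps=20):
        self.num_nodes = num_nodes
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        """Start a new episode at node 0 and return the initial observation."""
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        """Action 1 moves right, any other action moves left (clipped at the ends)."""
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.num_nodes - 1)
        self.steps += 1
        reached = self.position == self.num_nodes - 1
        done = reached or self.steps >= self.max_steps
        reward = 1.0 if reached else -0.1  # small step cost, bonus at the target
        return self.position, reward, done, {"steps": self.steps}


env = PathGraphEnv(num_nodes=4)
state = env.reset()
done = False
total_reward = 0.0
while not done:
    state, reward, done, info = env.step(1)  # always move right
    total_reward += reward
print(state, info["steps"], round(total_reward, 2))
```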

Q: What about multi-agent RL?
A: Not currently supported as a first-class abstraction. You can approximate a multi-agent setup by running multiple environments and agents in parallel.

Q: Do the DQN variants support prioritized replay?
A: Standard DQN uses uniform replay. Prioritized replay is on the roadmap but not currently shipped.
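
For reference, uniform replay means every stored transition is equally likely to be sampled. A minimal sketch in plain Python follows; UniformReplayBuffer is a hypothetical class for illustration and is not the buffer TGraphX's DQNAgent uses internally.

```python
import random
from collections import deque


class UniformReplayBuffer:
    """Fixed-capacity buffer that samples stored transitions uniformly."""

    def __init__(self, capacity, seed=None):
        # A bounded deque evicts the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement; every transition is equally likely.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = UniformReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.push(t, 0, 1.0, t + 1, False)

batch = buffer.sample(2)
stored_states = [transition[0] for transition in buffer.buffer]
print(len(buffer), stored_states)
```

Prioritized replay would replace the uniform draw in sample with one weighted by each transition's TD error.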

Q: Is there a documented benchmark comparison against PPO-Clip from SB3?
A: No. Comparison against external RL libraries is left to the user; the framework's focus is on graph RL formulations and not on competing with general-purpose RL libraries.