Graph Reinforcement Learning with TGraphX: A Practical Introduction
Graph reinforcement learning (graph RL) addresses problems where the action space is defined on a graph: choose an edge to add, a node to visit, a vertex to color. Classic combinatorial problems — MaxCut, vertex cover, graph coloring, routing — fit this framework naturally, and RL approaches to them have become an active research area.
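To make the MaxCut objective concrete, here is a minimal sketch in plain Python, independent of TGraphX. The function name `cut_value` and the edge-list representation are choices made for this illustration, not part of the library.

```python
def cut_value(edges, side):
    """Count edges whose endpoints fall on opposite sides of the partition."""
    return sum(1 for u, v in edges if side[u] != side[v])


# 4-cycle 0-1-2-3-0: alternating sides cuts every edge.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
alternating = {0: 0, 1: 1, 2: 0, 3: 1}
lopsided = {0: 0, 1: 0, 2: 0, 3: 1}

print(cut_value(edges, alternating))  # 4
print(cut_value(edges, lopsided))     # 2
```

An RL agent for MaxCut is, in effect, learning which node assignments push this number up.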
TGraphX includes a graph RL subsystem with 13 algorithms across 9 environments. This article walks through how to use it, with the important caveat that the entire subsystem is labeled Experimental. Expect rough edges, evaluate carefully before drawing strong conclusions, and treat any benchmark numbers it produces as preliminary.
What the subsystem includes
Environments shipped with the framework:
- GraphNavigationEnv: navigate from source to target node
- GraphColoringEnv: assign colors to minimize conflicts
- MaxCutEnv: partition nodes to maximize cut edges
- ShortestPathEnv: find shortest path
- VertexCoverEnv: minimum vertex cover
- GraphGenerationEnv: sequential graph generation
- KGPathReasoningEnv: KG path finding
- ContinuousNavigationEnv and ContinuousGraphEditEnv: continuous-action variants
Algorithms:
- Discrete: REINFORCE, Actor-Critic, A2C, DQN, Double DQN, Dueling DQN, PPO
- Continuous: DDPG, Delayed DDPG, TD3, SAC
- Baselines: Random, Greedy
A first experiment
The one-call API runs a complete training:
import tgraphx as tgx

result = tgx.train_graph_rl(
    env="maxcut",
    algorithm="dqn",
    num_nodes=20,
    episodes=100,
    seed=42,
)
print(result.metrics)       # {"final_return": ..., "best_return": ..., ...}
print(result.history[-5:])  # last 5 episode returns
This trains a DQN agent on randomly generated 20-node MaxCut instances for 100 episodes and returns a result object with metrics and training history.
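Per-episode returns are usually noisy, so it helps to smooth them before reading a trend. The sketch below assumes `result.history` behaves like a list of floats (the slicing above suggests this, but it is not confirmed here). The helper `moving_average` is written for this article and is not a TGraphX function.

```python
def moving_average(values, window=10):
    """Trailing mean over up to `window` most recent values at each index."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


history = [1.0, 3.0, 5.0, 7.0]  # stand-in for result.history
print(moving_average(history, window=2))  # [1.0, 2.0, 4.0, 6.0]
```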
Comparing algorithms
For a small environment, you can compare multiple algorithms quickly:
import tgraphx as tgx

algorithms = ["random", "greedy", "dqn", "ppo"]
results = {}
for alg in algorithms:
    res = tgx.train_graph_rl(
        env="maxcut", algorithm=alg,
        num_nodes=15, episodes=50, seed=42,
    )
    results[alg] = res.metrics["best_return"]

for alg, score in sorted(results.items(), key=lambda x: -x[1]):
    print(f" {alg:10s}: {score:.2f}")
Random and greedy serve as baselines. With only 50 episodes on a tiny problem, do not expect the learned algorithms to dominate the heuristic baselines. Learned policies typically need larger instances and more training before they become competitive.
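A single seed can mislead on a comparison this small, so it is worth repeating each run over several seeds and reporting mean and spread. The sketch below shows only the aggregation step. The helper `summarize` is hypothetical, and the numbers are made-up placeholders, not measured results. In practice the lists would be filled with `res.metrics["best_return"]` from runs with different `seed` values.

```python
from statistics import mean, stdev


def summarize(scores_by_alg):
    """Return {algorithm: (mean, sample std)} from per-seed scores."""
    return {
        alg: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for alg, scores in scores_by_alg.items()
    }


# Placeholder values for illustration only; NOT real benchmark output.
scores_by_alg = {
    "greedy": [10.0, 12.0, 11.0],
    "dqn": [9.0, 13.0, 11.0],
}
for alg, (mu, sigma) in summarize(scores_by_alg).items():
    print(f" {alg:10s}: {mu:.2f} +/- {sigma:.2f}")
```

If the intervals overlap heavily, the ranking from a single seed should not be trusted.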
Lower-level API
The one-call function is convenient but hides the training loop, exploration, and replay details. For research, drop to the lower-level API:
from tgraphx.rl import DQNAgent, MaxCutEnv, GraphPolicyNetwork

env = MaxCutEnv(num_nodes=20, seed=42)
agent = DQNAgent(
    state_dim=env.observation_space.shape[0],
    action_dim=env.action_space.n,
    hidden_dim=128,
    lr=1e-3,
)

for episode in range(100):
    state = env.reset()
    done = False
    while not done:
        action = agent.select_action(state)
        next_state, reward, done, info = env.step(action)
        agent.store_transition(state, action, reward, next_state, done)
        agent.train_step()
        state = next_state
This is closer to standard RL code and gives you control over the training loop, exploration strategy, and replay buffer.
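As one example of what owning the exploration strategy can look like, here is a generic epsilon-greedy rule with linear decay, written in plain Python. It is a sketch of the standard technique, not TGraphX's implementation. Whether `DQNAgent` exposes a hook for a custom rule like this is not something this article confirms, and the names `epsilon_at` and `epsilon_greedy` are hypothetical.

```python
import random


def epsilon_at(step, start=1.0, end=0.05, decay_steps=1000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps`."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9, 0.3]  # stand-in Q-values for three actions
print(epsilon_at(0))                  # 1.0 at the start of training
print(epsilon_greedy(q_values, 0.0))  # 1, the greedy action
```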
A continuous-action example: SAC on continuous graph navigation
import tgraphx as tgx

result = tgx.train_graph_rl(
    env="continuous_navigation",
    algorithm="sac",
    num_nodes=30,
    episodes=200,
    seed=42,
)
The continuous environments use vector observations with continuous action spaces. SAC and TD3 are appropriate; DQN is not.
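A small guard can catch a mismatched pairing before a long run starts. The sketch below groups algorithms the way this article lists them. Only "dqn", "ppo", and "sac" appear as string identifiers in the examples here, so the other spellings are assumptions, and `check_pairing` is a hypothetical helper rather than a TGraphX function.

```python
# Families follow the algorithm list in this article. Identifier spellings
# other than "dqn", "ppo", and "sac" are assumed for illustration.
DISCRETE_ALGOS = {"reinforce", "actor_critic", "a2c", "dqn",
                  "double_dqn", "dueling_dqn", "ppo"}
CONTINUOUS_ALGOS = {"ddpg", "delayed_ddpg", "td3", "sac"}


def check_pairing(algorithm, continuous_env):
    """Raise ValueError if the algorithm family does not match the action space."""
    if continuous_env and algorithm in DISCRETE_ALGOS:
        raise ValueError(f"{algorithm!r} expects a discrete action space")
    if not continuous_env and algorithm in CONTINUOUS_ALGOS:
        raise ValueError(f"{algorithm!r} expects a continuous action space")


check_pairing("sac", continuous_env=True)   # passes silently
check_pairing("dqn", continuous_env=False)  # passes silently
```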
What works well
- The discrete environments and algorithms are well-integrated and produce sensible learning curves for small problems.
- The baselines (random, greedy) are useful sanity checks.
- The one-call API is suitable for quick experimentation and demos.
- Results are reproducible with tgx.reproducible(seed=42) wrapping the training.
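For intuition about what a reproducibility wrapper generally does, here is a minimal seeding context manager using only Python's standard library. It is not the implementation of tgx.reproducible, whose internals are not described here. A real training run would also need to seed whichever numerical backend the agents use. The name `seeded` is hypothetical.

```python
import random
from contextlib import contextmanager


@contextmanager
def seeded(seed):
    """Seed Python's RNG for the block, then restore the previous state."""
    state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(state)


with seeded(42):
    first = [random.random() for _ in range(3)]
with seeded(42):
    second = [random.random() for _ in range(3)]

print(first == second)  # True
```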
What does not work as well (yet)
- Scaling to large graphs (1000+ nodes) is rough. The default policy networks were not designed for very large action spaces.
- Continuous environments are less tested than discrete ones.
- Hyperparameter defaults are general-purpose, not tuned for any specific benchmark.
- No published large-scale comparisons against established RL libraries (Stable Baselines 3, RLlib).
For research where you compare against published graph RL benchmarks, the established RL libraries combined with custom environments may produce better and more comparable results. TGraphX's graph RL is best for exploring problem formulations and rapid prototyping.
When this subsystem is the right fit
- You are exploring graph RL formulations and want a single library with environments and algorithms.
- You need a quick baseline for a combinatorial problem with graph structure.
- You are teaching graph RL and want students to focus on the formulation, not on environment plumbing.
When it is not
- Production-grade graph RL deployment.
- Benchmarking against published numbers requiring precise hyperparameters.
- Very large action spaces or continuous control with fine reward shaping.
The framework's stability documentation in docs/api_stability.md marks this subsystem as Experimental. Treat it accordingly.
A note on reward shaping
For combinatorial problems, dense reward signals (per-step improvement) train much faster than sparse ones (terminal reward only). Most environments in TGraphX support both via an intermediate_reward parameter. Start with intermediate rewards on; turn them off only if you specifically need sparse-reward research.
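To illustrate the difference, the sketch below computes both reward styles for a MaxCut episode in plain Python. It uses one common dense formulation, the change in cut value after each move. How TGraphX defines its intermediate reward is not specified in this article, so treat this as an assumption. The function names are hypothetical.

```python
def cut_value(edges, side):
    """Count edges whose endpoints are on opposite sides."""
    return sum(1 for u, v in edges if side[u] != side[v])


def dense_rewards(edges, num_nodes, flips):
    """Per-step reward = change in cut value after moving one node across."""
    side = [0] * num_nodes  # every node starts on side 0, so the cut is 0
    current = cut_value(edges, side)
    rewards = []
    for node in flips:
        side[node] = 1 - side[node]
        new = cut_value(edges, side)
        rewards.append(new - current)
        current = new
    return rewards, current


edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # 4-cycle
rewards, final_cut = dense_rewards(edges, 4, flips=[1, 3])
sparse = [0] * (len(rewards) - 1) + [final_cut]

print(rewards)  # [2, 2]
print(sparse)   # [0, 4]
```

Both sequences sum to the same final cut value, but the dense version tells the agent which individual moves helped.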
FAQ
Q: Are the environments Gymnasium-compatible?
A: They follow the same API conventions but are not registered as Gymnasium environments. You can wrap them with a Gymnasium adapter if needed.
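A minimal adapter sketch, assuming the environment has the `reset()` and four-value `step()` shape used in the lower-level example above. It converts to the Gymnasium convention where `reset` returns `(obs, info)` and `step` returns `(obs, reward, terminated, truncated, info)`. The class name and the `max_steps` option are hypothetical. A full `gymnasium.Env` subclass would also need to declare `observation_space` and `action_space`, which is omitted here.

```python
class GymnasiumStyleAdapter:
    """Wrap an env with reset() -> obs and step() -> (obs, reward, done, info)."""

    def __init__(self, env, max_steps=None):
        self.env = env
        self.max_steps = max_steps
        self._steps = 0

    def reset(self, *, seed=None, options=None):
        # seed/options are accepted for signature compatibility and ignored here.
        self._steps = 0
        return self.env.reset(), {}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._steps += 1
        # Report truncation only when the step limit ends an unfinished episode.
        truncated = (
            self.max_steps is not None
            and self._steps >= self.max_steps
            and not done
        )
        return obs, reward, done, truncated, info


class CountdownEnv:
    """Stand-in environment used only to exercise the adapter."""

    def reset(self):
        self.remaining = 2
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        return self.remaining, 1.0, self.remaining == 0, {}


env = GymnasiumStyleAdapter(CountdownEnv())
obs, info = env.reset()
print(env.step(0))  # (1, 1.0, False, False, {})
```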
Q: Can I bring my own environment?
A: Yes. Subclass tgraphx.rl.GraphEnv and implement reset() and step(). Pass it to any agent that accepts the environment.
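As a rough sketch of the shape such an environment might take, here is a toy independent-set environment with `reset()` and `step()` in the same four-value style as the lower-level example above. It is written as a plain class so it runs without TGraphX. In real use it would subclass tgraphx.rl.GraphEnv, which may require attributes (such as observation and action spaces) not shown here. The class and its reward scheme are invented for illustration.

```python
class IndependentSetEnv:  # in TGraphX this would subclass tgraphx.rl.GraphEnv
    """Pick nodes one at a time; no two picked nodes may share an edge."""

    def __init__(self, num_nodes, edges):
        self.num_nodes = num_nodes
        self.neighbors = {n: set() for n in range(num_nodes)}
        for u, v in edges:
            self.neighbors[u].add(v)
            self.neighbors[v].add(u)
        self.selected = set()

    def _blocked(self, node):
        # A node is unavailable if it, or any neighbor, is already selected.
        return node in self.selected or bool(self.neighbors[node] & self.selected)

    def _observation(self):
        return [1 if n in self.selected else 0 for n in range(self.num_nodes)]

    def reset(self):
        self.selected = set()
        return self._observation()

    def step(self, action):
        if self._blocked(action):
            # Invalid move: penalize and end the episode.
            return self._observation(), -1.0, True, {"valid": False}
        self.selected.add(action)
        done = all(self._blocked(n) for n in range(self.num_nodes))
        return self._observation(), 1.0, done, {"valid": True}


env = IndependentSetEnv(3, edges=[(0, 1), (1, 2)])  # path 0-1-2
env.reset()
print(env.step(0))  # ([1, 0, 0], 1.0, False, {'valid': True})
print(env.step(2))  # ([1, 0, 1], 1.0, True, {'valid': True})
```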
Q: What about multi-agent RL?
A: Not currently supported as a first-class abstraction. You can approximate multi-agent setups by running multiple environments and agents in parallel.
Q: Do the DQN variants support prioritized replay?
A: Standard DQN uses uniform replay. Prioritized replay is on the roadmap but not currently shipped.
Q: Is there a documented benchmark comparison against PPO-Clip from SB3?
A: No. Comparison against external RL libraries is left to the user; the framework's focus is on graph RL formulations and not on competing with general-purpose RL libraries.