Graph Mining with TGraphX: Motif Discovery and Structural Patterns
Graph mining is the set of techniques for discovering structural patterns in a graph: motifs (small recurring subgraph patterns), centrality measures (which nodes are important), and structural features (color refinement, WL features, anonymous walks). These are classical graph analytics that predate deep learning by decades but remain useful as features for graph learning, as analysis tools, and as inputs to downstream models.
TGraphX includes a graph mining subsystem in tgraphx.mining. This article walks through what it provides and how it fits alongside dedicated mining libraries.
What the subsystem includes
The mining module covers:
- Motif counting. Find and count small subgraph patterns.
- Centrality measures. Degree, betweenness (approximate), closeness, PageRank, eigenvector.
- Weisfeiler-Lehman (WL) features. Color-refinement-based graph features for graph classification.
- Structural role identification. Approximate equivalence classes for nodes based on structural similarity.
- Anonymous walks. Distribution of anonymized random walks of fixed length.
The implementations are PyTorch-friendly (most operate on tgx.Graph and return tensors) but are not generally as fast as specialized C++ libraries like NetworKit or graph-tool. For research-scale graphs (up to a few hundred thousand nodes), the framework's implementations are usually fast enough.
A first motif example
import tgraphx as tgx
from tgraphx.mining import count_motifs
# Construct a graph
g = tgx.generate_graph("ba", num_nodes=100, m=3, seed=42)
# Count triangles, 4-paths, etc.
counts = count_motifs(g, motif_sizes=[3, 4])
print(counts)
# {3: {"triangle": 134, ...}, 4: {"4-path": 89, "4-cycle": 12, ...}}
The exact set of motifs depends on motif_sizes. For size 3, there is essentially one motif (the triangle). For size 4, there are several (4-cycle, 4-path, claw, etc.). For size 5+, the number of distinct motifs grows quickly and counting becomes expensive.
For very large graphs, exact motif counting is intractable; the framework includes sampling-based approximations.
Centrality measures
from tgraphx.mining import degree_centrality, betweenness_centrality, pagerank
g = tgx.generate_graph("ba", num_nodes=100, m=3, seed=42)
degrees = degree_centrality(g) # [N]
between = betweenness_centrality(g, k=50) # approximate, sampling 50 source nodes
ranks = pagerank(g, alpha=0.85) # [N]
print(f"Top-5 by degree: {degrees.argsort(descending=True)[:5].tolist()}")
print(f"Top-5 by betweenness: {between.argsort(descending=True)[:5].tolist()}")
print(f"Top-5 by PageRank: {ranks.argsort(descending=True)[:5].tolist()}")
These return PyTorch tensors so you can use them as node features in a GNN:
features = torch.stack([degrees, between, ranks], dim=1) # [N, 3]
g.node_features = torch.cat([g.node_features, features], dim=1)
This is a common pattern in graph learning: enrich the raw node features with classical centrality measures so the GNN does not have to learn them from scratch.
Weisfeiler-Lehman features
WL features are color-refinement-based: at each iteration, every node's color is updated based on the multiset of its neighbors' colors. After a few iterations, the histogram of colors is a fingerprint of the graph's local structure.
from tgraphx.mining import wl_features
g = tgx.generate_graph("ba", num_nodes=100, m=3, seed=42)
features = wl_features(g, num_iterations=3)
print(features.shape) # [num_distinct_colors]
WL features are useful for graph classification benchmarks where structural patterns matter more than node feature values. They are interpretable, deterministic, and fast.
Comparison to NetworKit and graph-tool
For dedicated graph mining:
- NetworKit (Python bindings over C++) implements motif counting, centrality, community detection, and more. For graphs with millions of nodes, it is much faster than NetworkX or TGraphX.
- graph-tool (C++ with Python bindings) is similarly fast and has excellent visualization.
TGraphX's mining is intended for medium-scale research graphs where the integration with PyTorch is more valuable than raw speed. For mining-heavy research on very large graphs, NetworKit is usually a better fit.
When TGraphX's mining is the right fit
- You want mining results as PyTorch tensors for use in downstream models.
- You are working with research-scale graphs (up to ~100K nodes).
- You want a single Python library for both mining and learning.
When dedicated mining libraries are better
- Very large graphs (>1M nodes).
- Mining-only workflows that do not need PyTorch integration.
- Need for specific algorithms not in TGraphX (e.g., Louvain community detection).
A practical workflow: mining-as-features
import tgraphx as tgx
from tgraphx.mining import degree_centrality, pagerank, wl_features
import torch
g = tgx.Graph(x=raw_features, edge_index=edge_index, labels=y)
tgx.validate_graph(g, strict=True)
# Compute classical features
deg = degree_centrality(g)
pr = pagerank(g, alpha=0.85)
# Stack as additional node features
classical_features = torch.stack([deg, pr], dim=1)
g.node_features = torch.cat([g.node_features, classical_features], dim=1)
# Train
result = tgx.classify_nodes(
x=g.node_features,
edge_index=g.edge_index,
labels=g.node_labels,
model="gcn",
seed=42,
)
This pattern is well-established in graph learning research: GNN models often benefit from explicit structural features, especially when the raw node features are sparse or low-dimensional.
Honest limitations
- The exact motif counting routines do not scale beyond moderate graph sizes.
- Approximate betweenness centrality samples
ksource nodes; for smallkthe result is noisy. - The mining module's API is stable but is a smaller subset than NetworKit's.
- No graph isomorphism algorithm is shipped; for isomorphism-based research, use NetworkX's VF2.
Reproducibility
All sampling-based mining algorithms accept a seed argument. Wrap experiments in the standard reproducibility context:
with tgx.reproducible(seed=42, deterministic=False):
counts = count_motifs(g, motif_sizes=[3, 4], sample_size=1000, seed=42)
FAQ
Q: What size motifs can I count?
A: Sizes 3 and 4 are well-supported and reasonable on graphs up to ~10K edges. Size 5 is supported but slow. Size 6+ requires sampling.
Q: Is there a community detection algorithm?
A: Not natively. Use NetworkX's Louvain or NetworKit's faster equivalent.
Q: Can I use WL features as a kernel?
A: Yes. WL features can be compared with cosine similarity or any kernel function. For SVM-with-WL-kernel, you can compute WL features in TGraphX and pass them to scikit-learn.
Q: How does graph mining interact with mini-batch training?
A: Mining is typically computed once per graph as preprocessing. For mini-batch training on subgraphs, you can compute mining features on the full graph and gather per-batch.
Q: Where can I find more examples?
A: The TGraphX repository's examples/graph_mining_structural_demo.py shows the basic patterns.