GNN Training with Graph Summarization (MLSys 2025 Poster)
Problem #
Real-world graphs keep growing, making single-machine GNN training hard. While faster samplers help, we target the root cause: graph size. Many large graphs contain redundant connections. We define semi-metric edges as edges that lie on no shortest path: the edge itself is longer than some alternative path between its endpoints, so it necessarily sits inside a cycle. Removing these edges reduces work while preserving shortest-path distances and reachability.
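The definition above can be made concrete with a minimal sketch. This is an illustrative single-threaded Python version, not the project's parallel C++ implementation: an edge (u, v) with weight w is semi-metric when some alternative path between u and v is strictly shorter than w.

```python
import heapq

def dijkstra(adj, src):
    """Shortest-path distances from src over an adjacency dict."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def semi_metric_edges(nodes, edges):
    """Return edges whose weight strictly exceeds the shortest-path
    distance between their endpoints (computed over the full graph)."""
    adj = {u: [] for u in nodes}
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    return [(u, v, w) for u, v, w in edges
            if dijkstra(adj, u).get(v, float("inf")) < w]

# Triangle: the heavy edge (0, 2) with weight 5 is bypassed by the
# path 0-1-2 with cost 2, so it is semi-metric and can be dropped.
nodes = [0, 1, 2]
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 5.0)]
print(semi_metric_edges(nodes, edges))  # -> [(0, 2, 5.0)]
```

Running all-pairs Dijkstra like this is quadratic and only serves to illustrate the criterion; the pipeline's parallel detection is the point of the project.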
Our Solution #
Building on our MLSys 2025 poster, we show that removing a small number of non-essential edges has minimal accuracy impact while speeding up training and lowering memory use. We are developing an efficient preprocessing pipeline that removes semi-metric edges, so any downstream GNN can train on the summarized graph unchanged.
Implementation (in progress) #
- C++ core that loads NumPy-format graph datasets into memory
- Parallel summarization to detect and drop semi-metric edges
- Pybind-based Python API that returns a NumPy graph for direct integration in training pipelines
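To show where the API would sit in a training pipeline, here is a hedged sketch. The pybind module and its function names are assumptions, not the released interface, so the summarization call is mimicked with a plain NumPy edge filter that runs end to end.

```python
import numpy as np

def drop_edges(edge_index, edge_weight, drop_mask):
    """Return a summarized edge list with masked edges removed.
    edge_index: (2, E) int array; edge_weight: (E,) float array.
    Stands in for the pipeline's semi-metric removal step."""
    keep = ~drop_mask
    return edge_index[:, keep], edge_weight[keep]

# Toy graph in COO format, as a NumPy-based pipeline would pass it.
edge_index = np.array([[0, 1, 0],
                       [1, 2, 2]])
edge_weight = np.array([1.0, 1.0, 5.0])

# Suppose summarization flagged the last edge as semi-metric.
mask = np.array([False, False, True])

ei, ew = drop_edges(edge_index, edge_weight, mask)
# The summarized arrays feed directly into any GNN framework that
# consumes COO edge lists (e.g. converted to framework tensors).
print(ei.shape, ew)  # (2, 2) [1. 1.]
```

Because both input and output are plain NumPy arrays, the summarized graph drops into existing sampling and training code without format conversion.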
This approach benefits both server-scale training and low-end edge devices by shrinking graphs up front.