GNN Training with Graph Summarization (MLSys 2025 Poster)
Problem #
Real-world graphs keep growing, making single-machine GNN training hard. While faster samplers help, we target the root cause: graph size. Many large graphs contain redundant connections. We define semi-metric edges as edges that lie on no shortest path: the edge itself is longer than some alternative path between its endpoints, so it necessarily sits inside a cycle. Removing these edges reduces work while preserving shortest-path distances and reachability.
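The definition above can be made concrete with a minimal sketch. This is an illustrative single-threaded Python version, not the project's parallel C++ implementation: an edge (u, v) with weight w is semi-metric when some alternative path between u and v is strictly shorter than w.

```python
import heapq

def dijkstra(adj, src):
    """Shortest-path distances from src over an adjacency dict."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def semi_metric_edges(nodes, edges):
    """Return edges whose weight strictly exceeds the shortest-path
    distance between their endpoints (computed over the full graph)."""
    adj = {u: [] for u in nodes}
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    return [(u, v, w) for u, v, w in edges
            if dijkstra(adj, u).get(v, float("inf")) < w]

# Triangle: the heavy edge (0, 2) with weight 5 is bypassed by the
# path 0-1-2 with cost 2, so it is semi-metric and can be dropped.
nodes = [0, 1, 2]
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 5.0)]
print(semi_metric_edges(nodes, edges))  # -> [(0, 2, 5.0)]
```

Running all-pairs Dijkstra like this is quadratic and only serves to illustrate the criterion; the pipeline's parallel detection is the point of the project.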
Our Solution #
Building on our MLSys 2025 poster, we show that removing a small number of non-essential edges has minimal accuracy impact while speeding up training and lowering memory use. We are developing an efficient preprocessing pipeline that removes semi-metric edges, so any downstream GNN can train on the summarized graph unchanged.
Implementation (in progress) #
- C++ core that loads NumPy-format graph datasets into memory
- Parallel summarization to detect and drop semi-metric edges
- Pybind-based Python API that returns a NumPy graph for direct integration in training pipelines
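To show where the API would sit in a training pipeline, here is a hedged sketch. The pybind module and its function names are assumptions, not the released interface, so the summarization call is mimicked with a plain NumPy edge filter that runs end to end.

```python
import numpy as np

def drop_edges(edge_index, edge_weight, drop_mask):
    """Return a summarized edge list with masked edges removed.
    edge_index: (2, E) int array; edge_weight: (E,) float array.
    Stands in for the pipeline's semi-metric removal step."""
    keep = ~drop_mask
    return edge_index[:, keep], edge_weight[keep]

# Toy graph in COO format, as a NumPy-based pipeline would pass it.
edge_index = np.array([[0, 1, 0],
                       [1, 2, 2]])
edge_weight = np.array([1.0, 1.0, 5.0])

# Suppose summarization flagged the last edge as semi-metric.
mask = np.array([False, False, True])

ei, ew = drop_edges(edge_index, edge_weight, mask)
# The summarized arrays feed directly into any GNN framework that
# consumes COO edge lists (e.g. converted to framework tensors).
print(ei.shape, ew)  # (2, 2) [1. 1.]
```

Because both input and output are plain NumPy arrays, the summarized graph drops into existing sampling and training code without format conversion.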
This approach benefits both server-scale training and low-end edge devices by shrinking graphs up front.