Design of Streaming Sampler

Table of Contents

Relative Posts #

Preprocessing #

Our proposed streaming sampler requires preprocessing the edge files into chunks. Below is the format of one chunk:

[Number of Nodes] [Start Node ID] [Offsets] [Neighbor of Node1] [Neighbor of Node2] ...

Number of nodes: Describes the total number of frontiers in this chunk.
Start Node ID: As every chunk contains only part of the edge information in sorted order, we need to record the start node ID in each chunk. This way, when we look up this chunk, we know which part of the edge is contained in this chunk.
Offsets: Pointers to the first neighbor of each node.
Neighbors: All the edges are stored here. The edges are ascending sorted by their frontiers.

For example, consider the following chunk:

[200], [1000], [1003], [1007], [1011], [1013], …[9000], [0, 1, 2, 3], [4, 5, 6, 7], [8, 9], [12, 13, 14],…

From this, we can infer that this chunk has edge information for 200 nodes. The first node in this chunk’s ID is 1000. To look up the neighbors of node 1001, we refer to offsets 1 and 2, inferring that the neighbor of node 1001 is stored between chunk[1003] (inclusive) and chunk[1007] (exclusive). Then, we can jump to this part of the chunk to look up the node information.

During preprocessing, our script aims to fit as many nodes as possible below the maximum chunk size. In our design, we use a chunk size of 512 MB, which can store 128M unsigned integers. When the remaining space in a chunk cannot store one more node, we fill it up with 0s and start a new chunk. Additionally, the FPGA P2P read requires that each read size is a multiple of 512 bytes. Therefore, for the last chunk, we pad the size to a multiple of 512 bytes.

It is evident that a higher chunk size leads to better chunk utilization and also reduces the total overhead of reading. However, the size of the chunk is still limited by the memory of the FPGA, as we will store both the chunk and its sampled result in the FPGA.

Here are the padded chunk informations:

start	end	n_nodes	size(bytes)	padding	fill up
0	10521746	123695935	536870740	43	100.00%
10521747	16564523	128167549	536841316	7399	99.99%
16564524	21486461	129295773	536870856	14	100.00%
21486462	26399870	129304299	536870844	17	100.00%
26399871	31310495	129306910	536870152	190	100.00%
31310496	36218482	129309715	536870820	23	100.00%
36218483	41128311	129307864	536870784	32	100.00%
41128312	45803992	129540167	536863404	1877	100.00%
45803993	49503592	130518090	536870772	35	100.00%
49503593	53191139	130530149	536870796	29	100.00%
53191140	63352413	124056435	536870848	16	100.00%
63352414	80917735	116652065	536869560	338	100.00%
80917736	111059954	86000921	464572572	89	99.99%

We also store the start ID of each node in a separate file as the chunk information. This data is loaded with the sampler in the host program to split the target nodes into chunks.

Initialization #

When initializing the streaming sampler, we will:

Open the XRT devices and load the XRT kernel.
Allocate buffer objects for each device, which will be used to store the edges, target nodes, and sampled results. All these buffer objects are allocated with the maximum possible size that they need to use.
Create a mapping of the above buffer objects to host memory.
Open the edge file and store the file handler.
Load the chunk information into memory, along with all the training nodes.

APIs #

In DGL, the customized sampler only exposes the initializer and sample() method to the public. However, we need to expose more APIs to make our sampler functional.

Initializer #

The initializer will do the initialization process described above.

sample() #

The sample() method will take a list of nodes as the frontier, assemble the sampled result from memory, and return them according to the framework’s requirements.

sampleNextEpoch() #

Since we are sampling once per epoch, not once per batch, this additional method is necessary to instruct the sampler to prepare for the next epoch’s sample results from disk and transfer them into memory. This process does not need to occur at the beginning of an epoch. For example, we can trigger the next epoch’s sampling when we are 80% through the current epoch, allowing us to overlap the sampling time with the training time. This method will also be called asynchronously when we initialize the sampler so that we can prepare the sample result for the first layer.

newEpochStart() #

There will be sample results from two epochs existing simultaneously in memory. Once an epoch finishes, this function will be called so that we can switch to the next epoch’s sample results.

FPGA kernel #

Our FPGA kernel accepts the following arguments:

Edge chunk buffer object
This buffer object is used to store the chunk of the preprocessed edge files. We use Peer-to-Peer (P2P) reading to transfer part of edge file from SSD into this buffer object in the FPGA, without sending it to the host.
Sample result buffer object
This buffer object is used to store the sampled result. Once a run is finished, the contents of this buffer object will be transferred to the host.
Target nodes buffer object
This buffor object is used to store the target nodes that we will use as fontier to sample with current chunk. We transfer them from memory.
Number of the target nodes
Since the target nodes buffer object is only a pointer, we also need to pass the total number of target nodes into the FPGA.
Fanout
This is the number of neighbors that we are going to sample.

After preparing the above arguments, the FPGA kernel is triggered asynchronously. For each kernel run, the FPGA kernel will perform the following steps to sample its neighbors:

Find its edge offset information, based on the target node ID and the start node ID of this chunk.
Subtract the offsets of its neighbor’s start and end, obtaining the node’s degree.
If the degree is less than the desired number of samples, then copy all its neighbors to the sampled result and fill up the remaining space with -1. Otherwise, generate a list of offsets between its neighbor’s start and end offsets, according to the fanout. Then, copy the corresponding sampled results to the result buffer object.

The FPGA kernel is unrolled by a factor of 32 to utilize FPGA parallelization.

Host program #

Multi-layer sampling #

Our FPGA program is designed to operate on a chunk-by-chunk basis. Thus, for each FPGA run, we need to send edge chunks along with their corresponding frontiers. Before sampling each layer, it’s necessary to divide the frontiers into chunks. Since we’ve stored the start node ID of each chunk in a separate file during preprocessing, we now only need to divide the frontiers according to this chunk information.

We will store the frontier of each sampling in a separate vector for future subgraph reconstruction.

After obtaining the split frontiers, we transfer them into the FPGA memory using a buffer object and initiate the P2P read for the FPGA to load a chunk of the edge file from the SSD into FPGA memory.

Once the transfer is complete, we activate the FPGA kernel to perform the sampling. After it’s done, we transfer the sampled results stored in the result buffer object into the memory.

We then make a copy of the sampled results, remove all the -1s (which indicate a nonexistent neighbor), and sort it. This sorted copy will serve as the frontier for the next layer. We repeat this process until the sampling of all layers is completed.

Assembling Sampled Results #

As we sample the results for an entire epoch, when we receive a query for sampling the next minibatch, we need to assemble the result from the whole epoch’s sampled results stored in memory.

For each target node, we use binary search to locate its position in all the frontiers. Then, we can calculate the offsets of its sampled neighbors by multiplying its index by the fanout for that layer. We will only copy the non-negative results, which represent the real neighbor nodes. Since we’ve also kept the sorted frontier for the next layer, we can use binary search to find the index of each neighbor node in the next layer’s frontier, repeating this process to calculate the offsets of its neighbors, until we assembled all the layers of neighbors.