Related Posts

  1. Overview
  2. Introduction to SmartSSD
  3. Installation and Environment Setup
  4. Design of Streaming Sampler
  5. Design of Random Read Sampler (This Post)

The main idea of the random read sampler is to generate random offsets on the host from the frontier’s edge offsets and degrees, and then transfer only the required parts of the edge file from SmartSSD NAND to FPGA DRAM. The SmartSSD’s FPGA then gathers exactly the entries we want from the data it has read and transfers them back to the host.


The edge file that undergoes the random reads must be sorted by destination node ID, with each node’s neighbors stored sequentially. Preprocessing therefore produces two files: an offset file and an edge file. The FPGA reads the edge file directly during sampling, and both the edge file and the offset file are loaded on the host to generate random offsets. Additionally, because FPGA P2P reads strictly require both the read size and the read offset to be multiples of 512 bytes, we pad the end of the edge file with zeros so that its size is a multiple of 512 bytes.

[Neighbors of Node1] [Neighbors of Node2] ... [Neighbors of NodeN] [Padding 0s]
  • Neighbors: All the source nodes are stored here, sorted by their respective destination nodes.
  • Padding 0s: Pads the file size to a multiple of 512 bytes.

For example, after preprocessing, we could have the following edge file:

[1, 2, 3], [0, 4, 6, 7], [8, 9], [2, 3, 14],… [0, 0, 0, 0, 0]

Here, source nodes 1, 2, and 3 are connected to destination node 0, source nodes 0, 4, 6, and 7 are connected to destination node 1, and so on. Five 0s are appended so that the file size is a multiple of 512 bytes. The edge file contains no separators, and all node IDs are stored in little-endian binary format.
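The preprocessing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual preprocessing code: the function name `build_edge_file`, the in-memory edge list, and the choice to return offsets and degrees as Python lists are all hypothetical. Node IDs are assumed to be uint32 values, as described in the FPGA kernel section.

```python
import struct

SECTOR = 512    # P2P reads need 512-byte aligned sizes and offsets
ID_BYTES = 4    # assumption: every node ID is a little-endian uint32


def build_edge_file(edges, num_nodes):
    """Group source IDs by destination and pad the result to 512 bytes.

    edges is an iterable of (src, dst) pairs. Returns (edge_bytes, offsets,
    degrees), where offsets[d] is the index (in uint32 units) of the first
    neighbor of destination d inside edge_bytes.
    """
    neighbors = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        neighbors[dst].append(src)

    offsets, degrees, flat = [], [], []
    for srcs in neighbors:
        offsets.append(len(flat))
        degrees.append(len(srcs))
        flat.extend(srcs)

    payload = struct.pack(f"<{len(flat)}I", *flat)
    padding = (-len(payload)) % SECTOR   # bytes needed to reach a 512 multiple
    return payload + b"\x00" * padding, offsets, degrees


# First three destination nodes of the example above.
edges = [(1, 0), (2, 0), (3, 0), (0, 1), (4, 1), (6, 1), (7, 1), (8, 2), (9, 2)]
edge_bytes, offsets, degrees = build_edge_file(edges, num_nodes=3)
print(len(edge_bytes), offsets, degrees)
```

Keeping offsets in uint32 units rather than bytes is a design choice of this sketch; multiplying by 4 gives the byte position in the file.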


When initializing the random read sampler, we will:

  • Open the XRT devices and load the XRT kernel.
  • Allocate buffer objects for each device: one to hold the raw data read P2P from SSD NAND into FPGA memory, plus buffers for the offsets and the sampled target nodes. Each buffer object is allocated at the maximum size it might need.
  • Create a mapping of the above buffer objects to host memory.
  • Load the offsets and edge degree information into host memory.


In DGL, a customized sampler exposes only its initializer and its sample() method publicly, so these are the two APIs we need to provide.


The initializer creates a sampler instance, opens the XRT device, and sets the fanouts for later use.


The sample() method takes a list of nodes as the input frontier, samples the result layer by layer according to the fanouts, and returns the sampled result.
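The shape of this API can be sketched as follows. The class below is a hypothetical stand-in and uses neither DGL nor XRT: the name `RandomReadSamplerSketch` and the injected `sample_layer` callable (which would hide the device work) are assumptions made for illustration, and the fanouts are simply applied in the order given.

```python
class RandomReadSamplerSketch:
    """Hypothetical sampler skeleton: an initializer plus sample()."""

    def __init__(self, fanouts, sample_layer):
        # fanouts: neighbors to sample per node, one entry per layer.
        # sample_layer: callable(frontier, fanout) -> list of sampled source
        # IDs; in the real sampler this is where the SmartSSD is used.
        self.fanouts = list(fanouts)
        self._sample_layer = sample_layer

    def sample(self, seed_nodes):
        frontier = list(seed_nodes)
        layers = []
        for fanout in self.fanouts:
            sampled = self._sample_layer(frontier, fanout)
            layers.append((frontier, sampled))
            frontier = sampled   # sampled nodes become the next frontier
        return layers


# Toy layer function for demonstration: every node n "samples" n + 1.
def toy_layer(frontier, fanout):
    return [n + 1 for n in frontier for _ in range(fanout)]


sampler = RandomReadSamplerSketch(fanouts=[2, 1], sample_layer=toy_layer)
layers = sampler.sample([0])
print(layers)
```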

FPGA kernel

Our FPGA kernel accepts the following arguments:

  • Raw Sample Result

    P2P transfers from SSD NAND to FPGA memory must be read in multiples of 512 bytes, so the raw data read from SSD NAND is stored in a buffer object for later extraction. Because each sampled uint32 takes 4 bytes but requires a 512-byte read, this buffer object is 128 times the size of the aggregated result.

  • Aggregated Sample Result

    The FPGA kernel is responsible for extracting the exact sample result from the 512-byte reading chunk and storing the extracted result in a separate buffer object.

  • Offsets

    FPGA P2P reads also require the read offset to be a multiple of 512 bytes, so we can only read the 512-byte-aligned chunk that contains the target offset. The edge we want is therefore not necessarily the first integer of the chunk, and we send its offset to the FPGA separately so that the kernel can extract it when aggregating the sample result.
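The alignment arithmetic behind these three buffers can be made concrete with a short sketch. The helper name `locate` is hypothetical, and edge positions are assumed to be expressed as indices in uint32 units.

```python
SECTOR = 512                           # P2P read granularity in bytes
ID_BYTES = 4                           # one uint32 node ID
IDS_PER_SECTOR = SECTOR // ID_BYTES    # 128 IDs fit in one chunk


def locate(edge_index):
    """Map an edge index to (aligned byte offset to read, slot in the chunk)."""
    byte_offset = edge_index * ID_BYTES
    aligned = byte_offset - byte_offset % SECTOR   # round down to a 512 multiple
    slot = (byte_offset - aligned) // ID_BYTES     # position inside the chunk
    return aligned, slot


# Edge 300 lives at byte 1200, so we read the chunk starting at byte 1024
# and the wanted ID is the 45th uint32 (slot 44) of that chunk.
print(locate(300))
```

Reading one 512-byte chunk per 4-byte sample is also where the factor of 128 between the raw and the aggregated buffer comes from.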

We first read the raw data for an entire layer of a mini-batch and then trigger the FPGA kernel. The kernel iterates over the offsets array, locates each requested entry, and copies it from the raw sample result to the aggregated sample result. An offset of -1 indicates that the frontier node does not have that many neighbors, so the kernel writes -1 to the aggregated sample result to fill the space pre-allocated for that node’s neighbors.
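The gather loop described above can be emulated on the host to show the logic. This is a software sketch, not the HLS kernel. It assumes that the raw buffer reserves one 512-byte chunk per entry of the offsets array (including -1 entries) and that each non-negative offset is the slot of the wanted uint32 inside its chunk. The source does not spell out this encoding, so treat it as an assumption.

```python
import struct

IDS_PER_SECTOR = 128   # uint32 values per 512-byte chunk
ID_BYTES = 4


def gather(raw, slots):
    """Emulate the kernel: pick one uint32 per chunk, or -1 for empty slots."""
    aggregated = []
    for i, slot in enumerate(slots):
        if slot < 0:
            aggregated.append(-1)   # node had fewer neighbors than the fanout
        else:
            pos = (i * IDS_PER_SECTOR + slot) * ID_BYTES
            aggregated.append(struct.unpack_from("<I", raw, pos)[0])
    return aggregated


# Three chunks: IDs 0..127, IDs 1000..1127, and an unused chunk of zeros.
raw = (struct.pack("<128I", *range(128))
       + struct.pack("<128I", *range(1000, 1128))
       + bytes(512))
result = gather(raw, [5, 127, -1])
print(result)
```

On the FPGA the iterations are independent of each other, which is what makes the loop a natural candidate for unrolling.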

The P2P reading process uses the pread function and is parallelized with OpenMP to reduce SmartSSD NAND idle time. Additionally, the FPGA kernel’s loop is unrolled by a factor of 32 to exploit FPGA parallelism.
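As a rough host-only analogue of the parallel reads, the sketch below issues one 512-byte pread per aligned offset from a thread pool. It is an illustration under stated assumptions: the real implementation uses pread with OpenMP and reads into a P2P-mapped FPGA buffer, whereas this version uses Python's `os.pread` (Unix only) with threads and returns ordinary host bytes. The function name `read_chunks` and the temporary demo file are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

SECTOR = 512


def read_chunks(fd, aligned_offsets, workers=8):
    """Read one 512-byte chunk per aligned offset, keeping request order."""
    def read_one(offset):
        if offset % SECTOR != 0:
            raise ValueError("offset must be a multiple of 512 bytes")
        return os.pread(fd, SECTOR, offset)

    # pool.map returns results in the order of aligned_offsets.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(read_one, aligned_offsets))


# Demo on a small temporary file made of three distinguishable chunks.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(bytes([1]) * SECTOR + bytes([2]) * SECTOR + bytes([3]) * SECTOR)
    tmp.flush()
    fd = os.open(tmp.name, os.O_RDONLY)
    try:
        demo_data = read_chunks(fd, [1024, 0])
    finally:
        os.close(fd)
print(len(demo_data))
```

Because pread takes an explicit offset and does not move a shared file position, concurrent reads on the same descriptor do not interfere with each other.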

Host program

Random Offsets Generation

For this FPGA kernel, the host program is responsible for generating the offsets and completing the sampling process. The FPGA only retrieves the sampled result according to the address given by the host.

For each mini-batch, we receive a list of frontier nodes. We look up the offset and degree information pre-loaded into memory to generate the offsets of the sample result. For a node whose number of neighbors is less than or equal to the fanout, we copy the offsets of all its neighbors into the offset list that is fed to the FPGA and fill the remaining slots with -1, so that the node still occupies its full pre-allocated share of the sampled result. For a node with more neighbors than the fanout, the host generates as many random offsets as the fanout, based on the degree information, and adds them to the offset list for later extraction by the FPGA.
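A minimal sketch of this offset generation is shown below. The function name `layer_offsets` is hypothetical, and the offsets are edge indices in uint32 units. It also assumes sampling without replacement, which the description above does not specify.

```python
import random


def layer_offsets(frontier, offsets, degrees, fanout, rng=random):
    """Return exactly `fanout` edge indices per frontier node (-1 = empty)."""
    out = []
    for node in frontier:
        start, degree = offsets[node], degrees[node]
        if degree <= fanout:
            # Take every neighbor, then fill the unused slots with -1.
            picks = list(range(start, start + degree))
            picks += [-1] * (fanout - degree)
        else:
            # Assumption: draw `fanout` distinct neighbors uniformly at random.
            picks = [start + k for k in rng.sample(range(degree), fanout)]
        out.extend(picks)
    return out


# Offsets and degrees of destination nodes 0, 1, 2 from the example edge file.
node_offsets = [0, 3, 7]
node_degrees = [3, 4, 2]
picked = layer_offsets([2, 1], node_offsets, node_degrees, fanout=3,
                       rng=random.Random(0))
print(picked)
```

Each frontier node always contributes exactly `fanout` entries, so the position of an entry in the list identifies the node it belongs to.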

After this offset generation has completed for the entire layer of the current mini-batch and the FPGA has finished reading from the SSD NAND, the host transfers the offset array from host memory to FPGA DRAM and then triggers the FPGA kernel.

Retrieve the Aggregated Sample Result

Once the FPGA kernel has finished executing, the host program will transfer the aggregated sampled result from FPGA DRAM to host memory, copy all non-negative results as the final sample result, and also use them as the frontier if we need to sample the next layer.
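The final filtering step can be sketched as follows. The function name `compact` is hypothetical. Returning the destination node alongside each kept source is an addition of this sketch, not something the description above requires; it relies on the assumption that every frontier node owns exactly `fanout` consecutive slots.

```python
def compact(aggregated, frontier, fanout):
    """Drop -1 fillers and pair each kept source with its destination node."""
    sources, destinations = [], []
    for i, value in enumerate(aggregated):
        if value >= 0:
            sources.append(value)
            destinations.append(frontier[i // fanout])   # owner of slot i
    return sources, destinations


# Frontier [2, 1] with fanout 3: node 2 only had two neighbors.
sources, destinations = compact([8, 9, -1, 0, 6, 7], frontier=[2, 1], fanout=3)
next_frontier = sources   # used as the frontier when another layer is sampled
print(sources, destinations)
```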