
About Me

🌟 Excited to create real impact next summer (2026)!
I’m seeking an internship where I can contribute, learn, and grow at the intersection of systems and AI.
👉 Get in touch here


📚 Education

Boston University
📝 PhD Student in Computer Science
🗓 Sep 2024 – Present · 📍 Boston, MA

Boston University
🎓 Master of Science in Computer Science · GPA: 3.96/4.0
🗓 Sep 2022 – May 2024 · 📍 Boston, MA

South China University of Technology
🎓 Bachelor of Engineering in Software Engineering
🗓 Sep 2017 – Jun 2021 · 📍 Guangzhou, China


🧰 Work & Internship

Intel Asia-Pacific Research & Development Ltd – AI Framework Engineer
🗓 Aug 2021 – Aug 2022 · 📍 Shanghai, China

  • Contributed fusion, loop rescheduling, inlining, and SIMD-intrinsic optimizations to the oneDNN Graph Compiler, delivering measurable speedups for deep learning workloads on AVX-512 and AVX-512 BF16 platforms.
  • Implemented a range of elementwise and graph operators, including normalization and VNNI reordering, and developed graph patterns such as multi-head attention (MHA) to identify and analyze performance bottlenecks.

Tencent (WeChat Group) – Software Engineer Intern
🗓 Jun 2020 – Sep 2020 · 📍 Guangzhou, China

  • Built an internal test case management system using Python Tornado + React.
  • Developed Grafana dashboards for internal statistical data visualization.

📖 Research & Publications

[Paper] RingSampler: GNN Sampling on Large-Scale Graphs with io_uring – HotStorage ’25
🗓 Jul 2025 (Second Author) · 🔗 ACM Link

  • Introduced an io_uring-based out-of-core sampling method for graphs larger than memory.
  • Fully utilized CPU resources while achieving sampling performance close to that of in-memory sampling.

[Poster] GNN Training with Graph Summarization – MLSys ’25
🗓 May 2025

  • Explored graph summarization as preprocessing to shrink graph size before training.
  • Demonstrated lower sampling time vs. random-access baseline and significant PCIe bandwidth savings.

SmartSSD Code Repository
🗓 Dec 2024 · GitHub

  • Published an updated implementation of SmartSSD-based GNN sampling.
  • Refactored the host-side design to make results clearer and the code easier to use.

[Paper] In Situ Neighborhood Sampling for Large-Scale GNN Training – DaMoN ’24
🗓 Jun 2024 · 🔗 ACM Link

  • Proposed a SmartSSD-based streaming sampler for efficient GNN training.
  • Demonstrated lower sampling time than a random-access baseline and significant PCIe bandwidth savings.
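Both sampling papers above accelerate the same basic operation: GNN neighborhood sampling. For context, here is a minimal in-memory Python sketch of uniform, GraphSAGE-style multi-hop sampling over a CSR adjacency (`indptr`/`indices`). It is only an illustration of the generic operation. It does not reproduce the SmartSSD streaming or io_uring designs, and the function names and fanout values are illustrative assumptions.

```python
import random


def sample_neighbors(indptr, indices, seeds, fanout, rng):
    """Uniformly sample up to `fanout` neighbors of each seed node.

    In CSR form, the neighbors of node v are indices[indptr[v]:indptr[v + 1]].
    """
    sampled = {}
    for v in seeds:
        neigh = indices[indptr[v]:indptr[v + 1]]
        if len(neigh) <= fanout:
            sampled[v] = list(neigh)  # keep every neighbor of low-degree nodes
        else:
            sampled[v] = rng.sample(neigh, fanout)  # sample without replacement
    return sampled


def sample_khop(indptr, indices, seeds, fanouts, seed=0):
    """Multi-hop sampling: nodes sampled at hop k become the seeds of hop k + 1."""
    rng = random.Random(seed)
    layers = []
    frontier = list(seeds)
    for fanout in fanouts:
        layer = sample_neighbors(indptr, indices, frontier, fanout, rng)
        layers.append(layer)
        frontier = sorted({u for nbrs in layer.values() for u in nbrs})
    return layers
```

On graphs larger than memory, each `indices[...]` slice turns into a small random read from storage. Systems like the ones listed above target exactly this access pattern.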

[Master’s Thesis] Accelerating Large-Scale GNN Training with Programmable SSDs
🗓 May 2024 · 📍 Boston University · 🔗 Thesis Link

  • Designed FPGA kernels with Vitis HLS to offload computation onto the SSD and reduce PCIe transfers.
  • Compared preprocessing costs, file layouts, and sampling speeds on the Papers100M and Yahoo datasets.

🔬 Projects

Advanced Matrix Multiplication – Strassen Algorithm
🗓 Feb 2023 – May 2023 · 📍 Boston, MA

  • Implemented and optimized the Strassen algorithm for fast matrix multiplication.
  • Leveraged OpenMP parallelism and SIMD intrinsics to accelerate computation.
  • Conducted detailed performance analysis and compared against baseline algorithms.
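For readers unfamiliar with Strassen's method, here is a short pure-Python sketch of the recursion it is built on. It is not the project's OpenMP/SIMD code. It assumes square matrices whose size is a power of two. The `cutoff` parameter is a hypothetical name: below it, the code falls back to the naive triple loop, as practical implementations typically do.

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def split(M):
    """Split a square matrix into its four equally sized quadrants."""
    h = len(M) // 2
    return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
            [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])


def strassen(A, B, cutoff=2):
    """Multiply two n x n matrices (n a power of two) with Strassen's recursion."""
    n = len(A)
    if n <= cutoff:
        # Small blocks: the naive O(n^3) product is faster in practice.
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven recursive products instead of the eight used by block multiplication.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    # Recombine the products into the four quadrants of C.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```

Trading one multiplication for extra additions at every level gives O(n^2.81) instead of O(n^3). The cutoff matters because the extra additions and allocations dominate on small blocks.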

Fine-Tuning Pretrained Vision Transformers
🗓 Feb 2023 – May 2023 · 📍 Boston, MA

  • Fine-tuned NAT, DiNAT, MaxViT, and DaViT models on the iNaturalist dataset.
  • Achieved accuracy competitive with state-of-the-art models.
  • Analyzed architectural differences and compared fine-tuning vs. training from scratch.

Online Signature Verification (CNN + GRU)
🗓 Jan 2021 – Jun 2021 · 📍 Guangzhou, China

  • Extracted static features from signature images via a CNN-based autoencoder.
  • Captured dynamic features from signature trajectories using a GRU-based autoencoder.
  • Applied early/late fusion and a CNN-based Siamese network to verify authenticity.

Parallel Group-By Operation (MPI + OpenMP)
🗓 Mar 2020 – Jun 2020 · 📍 Guangzhou, China

  • Simulated the database “Group-By” operation on large datasets.
  • Used OpenMP for node-level parallelism and MPI for distributed cross-node processing.
  • Achieved multi-fold efficiency improvements over serial implementations.
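The two-level scheme above, with node-level parallelism followed by cross-node processing, maps onto the common two-phase aggregation pattern. Here is a minimal sequential Python sketch of that pattern. It assumes a SUM aggregate over `(key, value)` rows. Each chunk stands in for one OpenMP thread or MPI rank, and the function names are hypothetical rather than taken from the project.

```python
from collections import defaultdict


def local_group_by(rows):
    """Phase 1: each worker aggregates its own chunk into a small partial table."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return partial


def merge_partials(partials):
    """Phase 2: combine partial tables (the role a cross-node gather/reduce would play)."""
    result = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            result[key] += value
    return dict(result)


def parallel_group_by_sum(rows, num_workers=4):
    # Round-robin split of the input; each chunk stands in for one thread or rank.
    chunks = [rows[i::num_workers] for i in range(num_workers)]
    # In a real system these local aggregations run concurrently.
    partials = [local_group_by(chunk) for chunk in chunks]
    return merge_partials(partials)
```

Each partial table holds at most one entry per distinct key. As a result, the merge step usually moves far less data than shipping raw rows, which is what makes the cross-node phase cheap.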

Short Video Copyright Detection
🗓 Sep 2019 – Nov 2019 · 📍 Guangzhou, China

  • Designed an approximate detection system to find short clips within long videos.
  • Extracted features with SIFT and indexed them using the FAISS similarity-search library.
  • Applied hierarchical matching with a sliding window to locate clips accurately, within a 5 s tolerance.

UWP-Based Property Management System
🗓 Jun 2019 – Sep 2019 · 📍 Guangzhou, China

  • Built a C++/UWP frontend and a Python/Django backend.
  • Designed and implemented RESTful APIs for client-server communication.
  • Delivered a complete property management solution for internal use.

πŸ† Honors & Awards #

  • 1st Scholarship, SCUT: awarded for academic excellence (Oct 2020)
  • 2nd Prize, CCF NOIP (Nov 2015): a prestigious algorithmic programming competition comparable to ACM-ICPC

📱 Contact