GPU Performance Engineer | CUDA, Low-Latency Systems, LLM Inference

Yupeng Han

GPU performance engineer focused on CUDA, low-latency systems, and large-scale compute optimization. Recently expanding into LLM inference systems, with emphasis on transformer serving, KV-cache trade-offs, batching, roofline analysis, and distributed communication.

Resume | GitHub | LLM Inference Notes | Harness Engineering for Human-in-the-Loop CUDA Kernel Optimization | Email

Experience

Staff Software Engineer

Plus AI

Senior GPU Engineer

EBots

R&D Engineer

Trifo

Research Engineer

CMU Robotics Institute

Featured

Harness Engineering for Human-in-the-Loop CUDA Kernel Optimization

CUDA | BF16 GEMM | RTX 3070 Laptop GPU

CUDA | Profiling | BF16 GEMM | Performance

Engineered a profiling-driven, human-in-the-loop CUDA matmul optimization harness for a fixed BF16 GEMM on an RTX 3070 Laptop GPU, with correctness-gated benchmarking and structured iteration.
Reduced the runtime of a shape-specialized custom kernel from 802.8 ms to 24.2 ms, a roughly 33x speedup; outperformed the local CUTLASS baseline (25.9 ms) by about 7% and reached 91.1% of the best local cuBLAS result (22.0 ms).
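
The idea of correctness-gated benchmarking can be illustrated with a minimal sketch. This is not the project's harness: it is a pure-Python stand-in, and every name in it (`gated_benchmark`, `candidate_matmul`, `broken_matmul`, the tolerance and iteration counts) is a hypothetical chosen for illustration. A real GPU harness would time kernels with CUDA events and would likely need a looser tolerance for BF16. The point is only the ordering: a candidate is timed only after its output matches a reference.

```python
import statistics
import time


def reference_matmul(a, b):
    """Plain matmul used as the correctness oracle."""
    k, m = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(m)] for row in a]


def candidate_matmul(a, b):
    """Hypothetical candidate: transposes b first so both operands are read row-wise."""
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def broken_matmul(a, b):
    """Deliberately wrong kernel, used to show the gate rejecting it."""
    return [[0.0 for _ in b[0]] for _ in a]


def max_abs_error(x, y):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))


def gated_benchmark(candidate, reference, a, b, tol=1e-6, warmup=2, iters=5):
    """Return the median runtime in ms, or None if the correctness gate fails."""
    expected = reference(a, b)
    if max_abs_error(candidate(a, b), expected) > tol:
        return None  # never report a timing for an incorrect kernel
    for _ in range(warmup):
        candidate(a, b)  # warm-up runs are not timed
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        candidate(a, b)
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


# Small fixed-shape inputs (assumed 16x16 for the sketch).
a = [[float(i + j) for j in range(16)] for i in range(16)]
b = [[float(i - j) for j in range(16)] for i in range(16)]

good_ms = gated_benchmark(candidate_matmul, reference_matmul, a, b)
bad_ms = gated_benchmark(broken_matmul, reference_matmul, a, b)
```

Here `good_ms` holds a median timing while `bad_ms` is `None`, so an incorrect variant can never show up as a performance win during iteration.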

GitHub | LinkedIn Post

SJTU Outstanding Individual

Recognition featured by SJTU Academic News.

View Article