My recent focus is on GPU performance and LLM inference systems, especially transformer serving efficiency, KV-cache design, batching strategies, prefill/decode trade-offs, tensor parallelism, roofline analysis, and distributed communication.
Research & Projects
Current Focus
LLM Inference Systems & Performance Notes
GPU
LLM
Systems
- Built a structured technical repository on LLM inference systems, GPU performance, transformer serving, roofline analysis, and distributed communication.
- Documented and analyzed trade-offs across KV-cache layout, paged KV cache, continuous batching, chunked prefill, tensor parallelism, and llama2.cpp runtime behavior.
- Organized implementation-oriented notes to connect systems-level bottlenecks with end-to-end inference efficiency.
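As a minimal sketch of the roofline reasoning these notes rely on, the snippet below bounds attainable throughput by the smaller of the compute roof and memory bandwidth times arithmetic intensity. The hardware numbers (`PEAK_FLOPS`, `MEM_BANDWIDTH`) and the two intensity values are hypothetical and are not measurements from the repository.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_flops, mem_bandwidth * intensity)


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which a kernel stops being memory-bound."""
    return peak_flops / mem_bandwidth


# Hypothetical accelerator, chosen only to make the arithmetic easy to follow.
PEAK_FLOPS = 100e12     # FLOP/s
MEM_BANDWIDTH = 1e12    # bytes/s

# Decode-like matrix-vector product with 16-bit weights: each 2-byte weight
# feeds one multiply-add (2 FLOPs), so intensity is about 1 FLOP/byte.
decode_bound = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, 1.0)

# Prefill-like large GEMM with heavy weight reuse (assumed 500 FLOP/byte).
prefill_bound = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, 500.0)

ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
```

Under these assumed numbers the decode-like case sits far below the ridge point and is limited by bandwidth, while the prefill-like case hits the compute roof. This is the usual motivation for treating prefill and decode as separate regimes.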
Harness Engineering for Human-in-the-Loop CUDA Kernel Optimization
CUDA
Profiling
BF16 GEMM
Performance
- Engineered a profiling-driven, human-in-the-loop CUDA matmul optimization harness for a fixed BF16 GEMM on an RTX 3070 Laptop GPU, with correctness-gated benchmarking and structured iteration.
- Reduced the runtime of a shape-specialized custom kernel from 802.8 ms to 24.2 ms, roughly a 33x speedup; the final kernel outperformed the local CUTLASS baseline (25.9 ms) by about 7% and reached 91.1% of the best local cuBLAS result (22.0 ms).
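The idea of correctness-gated benchmarking can be sketched in a few lines: a candidate is timed only after its output matches a trusted reference within a tolerance. This is an illustrative pure-Python stand-in, not the actual CUDA harness. The names `gated_benchmark` and `reference_matmul`, the tolerance, and the repeat count are all assumptions.

```python
import time
from typing import Callable, List, Optional

Matrix = List[List[float]]


def reference_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive triple-loop matmul used as the trusted reference."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def max_abs_error(x: Matrix, y: Matrix) -> float:
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(u - v) for rx, ry in zip(x, y) for u, v in zip(rx, ry))


def gated_benchmark(candidate: Callable[[Matrix, Matrix], Matrix],
                    a: Matrix, b: Matrix,
                    tol: float = 1e-2, repeats: int = 5) -> Optional[float]:
    """Return the best-of-N runtime in ms, or None if the output is wrong."""
    expected = reference_matmul(a, b)
    if max_abs_error(candidate(a, b), expected) > tol:
        return None  # gate: incorrect kernels never get a timing
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        candidate(a, b)
        best = min(best, time.perf_counter() - start)
    return best * 1e3


def broken_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Deliberately wrong candidate: returns all zeros."""
    return [[0.0] * len(b[0]) for _ in a]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
good_ms = gated_benchmark(reference_matmul, a, b)
bad_ms = gated_benchmark(broken_matmul, a, b)
```

Returning `None` for a failing candidate keeps a fast but incorrect variant from ever appearing in the timing table, which is the point of gating on correctness before measuring.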
Selected Research
Vehicle-PERCH for Outdoor Vehicle Detection
Robotics
3D Perception
Vehicle Detection
- Proposed Vehicle-PERCH, a 3D vehicle detection framework that estimates each vehicle's 3D pose through an analysis-by-synthesis pipeline. The method combines 2D and 3D information and runs in real time.
- Applied unsupervised clustering with Gaussian mixture models to group vehicles into twelve categories by size, then constructed twelve 3D vehicle models, including microcar, sedan, compact car, and SUV.
- Evaluated the method on the KITTI dataset. Results show that Vehicle-PERCH achieves 3D detection and localization performance on par with state-of-the-art learning-based methods without using 3D pose annotation data.
- Submitted to ICRA 2021.
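The size-based clustering step can be illustrated with a small expectation-maximization loop for a one-dimensional Gaussian mixture. This is a simplified sketch under stated assumptions, not the project's implementation: it uses a single feature (length in meters), two components instead of twelve, made-up sample values, and a hypothetical helper `fit_gmm_1d`.

```python
import math
from typing import List, Sequence, Tuple


def fit_gmm_1d(data: Sequence[float], init_means: Sequence[float],
               iters: int = 50) -> Tuple[List[float], List[int]]:
    """Fit a 1D Gaussian mixture with EM; return component means and hard labels."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp: List[List[float]] = []
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-3)  # floor avoids a collapsing component
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, labels


# Made-up vehicle lengths in meters: three short vehicles, three longer ones.
lengths = [2.6, 2.7, 2.8, 4.5, 4.6, 4.7]
means, labels = fit_gmm_1d(lengths, init_means=[2.0, 5.0])
```

Each resulting component would then correspond to one size category, for which a representative 3D model can be built.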
Indoor Object 6-DOF Pose Estimation
GPU
Robotics
Pose Estimation
- Studied PERCH, a perception-via-search family of algorithms that renders scenes with different object poses and searches for the best explanation of the observed scene while accounting for occlusion.
- Studied 3D rotation formalisms and exploited object geometric symmetry to eliminate redundant rotation proposals, speeding up the algorithm by more than 50%.
- Tested on the YCB dataset. Results show that the algorithm surpasses state-of-the-art 6-DOF pose estimation methods by a large margin without requiring any ground-truth pose annotations.
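One way symmetry can remove redundant rotation proposals is sketched below, assuming a discrete n-fold symmetry about the yaw axis. An object that looks identical after a rotation of 360/n degrees only needs yaw candidates in one period. The function name, the step size, and the restriction to yaw are illustrative assumptions, not the method as implemented.

```python
from typing import List


def yaw_proposals(step_deg: float, symmetry_order: int = 1) -> List[float]:
    """Yaw angles to evaluate for an object with n-fold symmetry about the yaw axis.

    With symmetry_order == 1 the full circle is sampled; with order n the
    rendered appearance repeats every 360/n degrees, so one period suffices.
    """
    if symmetry_order < 1:
        raise ValueError("symmetry_order must be >= 1")
    period = 360.0 / symmetry_order
    count = int(round(period / step_deg))
    return [i * step_deg for i in range(count)]


full = yaw_proposals(10.0)                         # asymmetric object
halved = yaw_proposals(10.0, symmetry_order=2)     # e.g. a box-like object
quartered = yaw_proposals(10.0, symmetry_order=4)  # e.g. a square prism
```

Because every proposal requires rendering and scoring a scene, the number of proposals translates fairly directly into search cost in this sketch.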
Featured Publication
Aditya Agarwal, Yupeng Han, and Maxim Likhachev, "PERCH 2.0: Fast and Accurate GPU-based Perception via Search for Object Pose Estimation," IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020
Modeling and Analysis of Complex Systems
Modeling
Systems
Analysis
- Addressed the difficulty that service seekers face when choosing among many service providers, and improved on the first-in, first-out matching mechanism by developing a stable matching system based on utility theory.
- Generated preference lists for service providers and service seekers based on different utility interests and studied the optimal matching frequency for repeated matching.
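The step from utilities to a stable matching can be sketched with the classic deferred-acceptance (Gale-Shapley) procedure. This is a standard textbook algorithm used here for illustration, not necessarily the mechanism in the paper. The sketch assumes one-to-one matching, equal numbers of seekers and providers, complete preference lists, and made-up utility values.

```python
from typing import Dict, List


def rank_by_utility(utilities: Dict[str, Dict[str, float]]) -> Dict[str, List[str]]:
    """Turn per-agent utility scores into preference lists (highest utility first)."""
    return {agent: sorted(scores, key=scores.get, reverse=True)
            for agent, scores in utilities.items()}


def deferred_acceptance(seeker_prefs: Dict[str, List[str]],
                        provider_prefs: Dict[str, List[str]]) -> Dict[str, str]:
    """Seeker-proposing Gale-Shapley; returns a seeker -> provider matching."""
    rank = {p: {s: i for i, s in enumerate(prefs)}
            for p, prefs in provider_prefs.items()}
    next_choice = {s: 0 for s in seeker_prefs}
    held: Dict[str, str] = {}  # provider -> seeker currently held
    free = list(seeker_prefs)
    while free:
        s = free.pop()
        p = seeker_prefs[s][next_choice[s]]
        next_choice[s] += 1
        current = held.get(p)
        if current is None:
            held[p] = s
        elif rank[p][s] < rank[p][current]:
            held[p] = s           # provider prefers the new proposer
            free.append(current)  # the displaced seeker proposes again later
        else:
            free.append(s)        # rejected; try the next provider
    return {s: p for p, s in held.items()}


# Made-up utilities: both seekers prefer p1, but p1 values s2 more highly.
seeker_utils = {"s1": {"p1": 0.9, "p2": 0.4}, "s2": {"p1": 0.8, "p2": 0.3}}
provider_utils = {"p1": {"s1": 0.2, "s2": 0.7}, "p2": {"s1": 0.6, "s2": 0.5}}

matching = deferred_acceptance(rank_by_utility(seeker_utils),
                               rank_by_utility(provider_utils))
```

In a repeated setting, a procedure like this would be rerun on whichever agents have accumulated since the last round, which is where the choice of matching frequency comes in.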
Featured Publication
J. Thekinen, Yupeng Han, and J. H. Panchal, "Designing market thickness and optimal frequency of multi-period stable matching in CBDM," ASME International Design Engineering Technical Conferences and Computers and Information in Engineering Conference (IDETC/CIE), 2018