We build and optimize AI systems, working on everything from GPU kernels to distributed training.
Latest Work
→ Porting CUDA FFT to Mojo: Achieving Bit-Exact Precision
→ Optimizing AlphaFold's Triangle Multiplicative Update: A First Look at GPU Performance Engineering
→ Multi-GPU Programming with AMD's Iris Framework for Triton
→ Gluon: When Triton Isn't Low-Level Enough
→ The Hidden Math Bug That Makes AI Unpredictable
→ Building Agents for Small Language Models: A Deep Dive into Lightweight AI
→ AMD GPU Support in Triton Gluon Framework
→ RustBPE: High-Performance BPE Tokenizer Training in Rust