His research interests focus on the development of efficient algorithms and systems for deep learning, specifically large foundation models. His work has received over 8000 stars on GitHub. His work has a real-world impact: SmoothQuant has been integrated into NVIDIA's TensorRT-LLM, FasterTransformer and Intel's NeuralCompressor and is utilized in the LLMs of industry companies like Amazon, Meta, and Huggingface. StreamingLLM has been integrated into NVIDIA's TensorRT-LLM, Huggingface's transformers, and Intels' Extension for Transformers.