Greetings! I am a first-year PhD student in Computer Science at Cornell Tech in New York City. Previously, I was an AI Research Resident at VinAI, where I was fortunate to work under the mentorship of Dr. Anh Tran. Before that, I received my Bachelor of Computer Science from Ho Chi Minh City University of Science (HCMUS) in Vietnam.
My research focuses on generative models, with a particular emphasis on diffusion models.
Simple Guidance Mechanisms for Discrete Diffusion Models
Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-Torre, Bernardo P. de Almeida, Alexander Rush, Thomas Pierrot, and Volodymyr Kuleshov
Diffusion models for continuous data have gained widespread adoption owing to their high-quality generation and control mechanisms. However, controllable diffusion on discrete data faces challenges, since continuous guidance methods do not directly apply to discrete diffusion. Here, we provide a straightforward derivation of classifier-free and classifier-based guidance for discrete diffusion, as well as a new class of diffusion models that leverage uniform noise and are more guidable because they can continuously edit their outputs. We improve the quality of these models with a novel continuous-time variational lower bound that yields state-of-the-art performance, especially in settings involving guidance or fast generation. Empirically, we demonstrate that our guidance mechanisms, combined with uniform-noise diffusion, improve controllable generation relative to autoregressive and diffusion baselines on several discrete data domains, including genomic sequences, small-molecule design, and discretized image generation.
@inproceedings{schiff2024discreteguidance,booktitle={The Thirteenth International Conference on Learning Representations},title={Simple Guidance Mechanisms for Discrete Diffusion Models},author={Schiff, Yair and Sahoo, Subham Sekhar and Phung, Hao and Wang, Guanghan and Boshar, Sam and Dalla-torre, Hugo and de Almeida, Bernardo P and Rush, Alexander and Pierrot, Thomas and Kuleshov, Volodymyr},year={2025},}
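To make the guidance recipe concrete, below is a minimal sketch of classifier-free guidance for categorical token distributions. The `cond_logits`/`uncond_logits` tensors, the helper name, and the exact weighting are illustrative assumptions of mine, not the paper's derivation.

```python
import torch

def guided_sample(cond_logits: torch.Tensor,
                  uncond_logits: torch.Tensor,
                  gamma: float = 2.0) -> torch.Tensor:
    """Sample tokens from a guided categorical distribution.

    cond_logits, uncond_logits: (batch, seq_len, vocab) denoiser outputs.
    Returns sampled token ids of shape (batch, seq_len).
    """
    log_p_cond = torch.log_softmax(cond_logits, dim=-1)
    log_p_uncond = torch.log_softmax(uncond_logits, dim=-1)
    # Guided distribution ~ p(x|y)^(1+gamma) / p(x)^gamma, renormalized per token.
    guided = (1.0 + gamma) * log_p_cond - gamma * log_p_uncond
    probs = torch.softmax(guided, dim=-1)
    return torch.multinomial(probs.flatten(0, 1), 1).view(probs.shape[:-1])
```

Setting gamma = 0 recovers plain conditional sampling; larger values sharpen conditioning at the cost of diversity.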
Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Image Generation
Flow matching has emerged as a promising framework for training generative models, demonstrating impressive empirical performance while being easier to train than diffusion-based models. However, the method still requires numerous function evaluations during sampling. To address these limitations, we introduce a self-corrected flow distillation method that effectively integrates consistency models and adversarial training within the flow-matching framework. This work is the first to achieve consistent generation quality in both few-step and one-step sampling. Our extensive experiments validate the effectiveness of our method, yielding superior results both quantitatively and qualitatively on CelebA-HQ and zero-shot benchmarks on the COCO dataset. Our implementation is released at https://github.com/VinAIResearch/SCFlow.
@inproceedings{dao2024scflow,title={Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Image Generation},author={Dao*, Quan and Phung*, Hao and Dao, Trung and Metaxas, Dimitris N. and Tran, Anh},booktitle={The 39th Annual AAAI Conference on Artificial Intelligence (AAAI)},year={2025},}
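As a rough illustration of the sampling-cost gap the paper targets, the sketch below contrasts many-step Euler integration of a flow-matching model with a single evaluation of a distilled student. `velocity` and `student` are hypothetical callables standing in for trained networks; this is not the released SCFlow API.

```python
import torch

def euler_sample(velocity, x0: torch.Tensor, num_steps: int = 50) -> torch.Tensor:
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * velocity(x, t)  # one Euler step along the learned flow
    return x

def one_step_sample(student, x0: torch.Tensor) -> torch.Tensor:
    """A distilled student maps noise straight to data in a single evaluation."""
    return student(x0, torch.zeros(x0.shape[0], device=x0.device))
```

Distillation collapses the 50 network calls of the Euler loop into one, which is where the speedup comes from.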
DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation
We introduce a novel state-space architecture for diffusion models that effectively harnesses spatial and frequency information to strengthen the inductive bias towards local features in input images. State-space networks such as Mamba, a recent advance in recurrent neural networks, typically scan input sequences from left to right, which makes it difficult to design effective scanning strategies, especially for image data. We show that integrating wavelet transformation into Mamba enhances the local structure awareness of visual inputs and better captures long-range frequency relations by disentangling inputs into wavelet subbands representing both low- and high-frequency components. These wavelet-based outputs are then processed and fused with the original Mamba outputs through a cross-attention fusion layer, combining spatial and frequency information to improve the order awareness of state-space models, which is essential for the details and overall quality of image generation. In addition, we introduce a globally-shared transformer to supercharge the performance of Mamba, harnessing its exceptional power to capture global relationships. Through extensive experiments on standard benchmarks, our method demonstrates superior results compared to DiT and DiffuSSM, achieving faster training convergence and delivering high-quality outputs. The code and pretrained models are released at https://github.com/VinAIResearch/DiMSUM.
@inproceedings{phung2024dimsum,title={DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation},author={Phung*, Hao and Dao*, Quan and Dao, Trung and Phan, Hoang and Metaxas, Dimitris N. and Tran, Anh},booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS)},year={2024},}
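To picture the fusion step, here is a minimal sketch, assuming both the spatial scan and the wavelet-subband scan emit token sequences of the same width; the module and its layout are my own illustration, not the released code.

```python
import torch
import torch.nn as nn

class SpatialFrequencyFusion(nn.Module):
    """Let spatial tokens attend to wavelet-subband tokens, then mix residually."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, spatial: torch.Tensor, freq: torch.Tensor) -> torch.Tensor:
        # spatial: (B, N, D) tokens from the pixel-order scan;
        # freq:    (B, M, D) tokens from the wavelet-subband scan.
        kv = self.norm_kv(freq)
        fused, _ = self.attn(self.norm_q(spatial), kv, kv, need_weights=False)
        return spatial + fused  # residual keeps the spatial stream primary
```

The residual connection keeps the spatial branch primary while letting frequency cues modulate it.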
Anti-DreamBooth: Protecting users from personalized text-to-image synthesis
Text-to-image diffusion models are nothing short of a revolution, allowing anyone, even without design skills, to create realistic images from simple text inputs. With powerful personalization tools like DreamBooth, they can generate images of a specific person from just a few reference images. However, when misused, such a powerful and convenient tool can produce fake news or disturbing content targeting any individual victim, posing a severe negative social impact. In this paper, we explore a defense system called Anti-DreamBooth against such malicious use of DreamBooth. The system aims to add subtle noise perturbations to each user's images before publishing in order to disrupt the generation quality of any DreamBooth model trained on those perturbed images. We investigate a wide range of algorithms for perturbation optimization and extensively evaluate them on two facial datasets across various text-to-image model versions. Despite the complicated formulation of DreamBooth and diffusion-based text-to-image models, our methods effectively defend users from the malicious use of those models. Their effectiveness withstands even adverse conditions, such as model or prompt/term mismatch between training and testing. Our code will be available at https://github.com/VinAIResearch/Anti-DreamBooth.
@inproceedings{le_etal2023antidreambooth,title={Anti-DreamBooth: Protecting users from personalized text-to-image synthesis},author={Le*, Thanh Van and Phung*, Hao and Nguyen*, Thuan Hoang and Dao, Quan and Tran, Ngoc and Tran, Anh},booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},year={2023},}
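Conceptually, the defense is adversarial perturbation within an imperceptibility budget: push the published photos toward points where diffusion fine-tuning fails. The PGD-style sketch below, with a hypothetical `diffusion_loss` callable, illustrates that idea only; the paper studies several concrete perturbation objectives.

```python
import torch

def cloak(images: torch.Tensor, model, diffusion_loss,
          eps: float = 8 / 255, alpha: float = 1 / 255, steps: int = 40):
    """Add an imperceptible perturbation that *maximizes* the denoising loss
    a DreamBooth-style model would minimize when fine-tuned on the images."""
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):
        loss = diffusion_loss(model, (images + delta).clamp(0, 1))
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient *ascent* on the loss
            delta.clamp_(-eps, eps)             # stay inside the eps-ball
            delta.grad.zero_()
    return (images + delta.detach()).clamp(0, 1)
```

The sign-of-gradient steps and the eps-ball projection are the standard PGD ingredients; only the objective (a diffusion training loss) is specific to this setting.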
Wavelet Diffusion Models are fast and scalable Image Generators
Diffusion models are emerging as a powerful solution for high-fidelity image generation, exceeding GANs in quality in many circumstances. However, their slow training and inference speed is a major bottleneck that blocks them from real-time applications. The recent DiffusionGAN method significantly reduces running time by cutting the number of sampling steps from thousands to several, but its speed still lags far behind GAN counterparts. This paper aims to close the speed gap by proposing a novel wavelet-based diffusion scheme. We extract low- and high-frequency components at both the image and feature levels via wavelet decomposition and adaptively handle these components for faster processing while maintaining good generation quality. Furthermore, we propose a reconstruction term that effectively boosts training convergence. Experimental results on the CelebA-HQ, CIFAR-10, LSUN-Church, and STL-10 datasets show that our solution is a stepping stone toward real-time, high-fidelity diffusion models. Our code and pre-trained checkpoints will be available at https://github.com/VinAIResearch/WaveDiff.git.
@inproceedings{phung2023wavelet,title={Wavelet Diffusion Models are fast and scalable Image Generators},author={Phung*, Hao and Dao*, Quan and Tran, Anh},booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},year={2023},}
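The speedup hinges on moving diffusion from pixels to wavelet subbands: a level-1 transform turns a (B, C, H, W) image into a (B, 4C, H/2, W/2) tensor, giving the denoiser 4x fewer spatial positions to process. The Haar pair below is a self-contained sketch of mine, not the released code.

```python
import torch

def haar_dwt(x: torch.Tensor) -> torch.Tensor:
    """(B, C, H, W) -> (B, 4C, H/2, W/2) subbands ordered LL, LH, HL, HH."""
    a, b = x[..., ::2, :], x[..., 1::2, :]  # even/odd rows
    lo, hi = (a + b) / 2, (a - b) / 2       # row-wise low/high pass
    ll, lh = (lo[..., ::2] + lo[..., 1::2]) / 2, (lo[..., ::2] - lo[..., 1::2]) / 2
    hl, hh = (hi[..., ::2] + hi[..., 1::2]) / 2, (hi[..., ::2] - hi[..., 1::2]) / 2
    return torch.cat([ll, lh, hl, hh], dim=1)

def haar_idwt(sub: torch.Tensor) -> torch.Tensor:
    """Exact inverse: (B, 4C, H/2, W/2) -> (B, C, H, W)."""
    ll, lh, hl, hh = sub.chunk(4, dim=1)
    # Undo the column-wise transform, re-interleaving even/odd columns.
    lo = torch.stack([ll + lh, ll - lh], dim=-1).flatten(-2)
    hi = torch.stack([hl + hh, hl - hh], dim=-1).flatten(-2)
    # Undo the row-wise transform, re-interleaving even/odd rows.
    return torch.stack([lo + hi, lo - hi], dim=-2).flatten(-3, -2)

img = torch.randn(2, 3, 64, 64)
assert torch.allclose(haar_idwt(haar_dwt(img)), img, atol=1e-6)  # lossless
```

Because the transform is lossless and cheap, generation can run entirely in subband space and invert back to pixels only at the end.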
Contact: hao (at) cs (dot) cornell (dot) edu | tienhaophung (at) gmail (dot) com