Greetings! I am a PhD student in Computer Science at Cornell Tech, NYC, advised by Hadar Averbuch-Elor. My research interests are in generative models and multimodal computer vision, with a current focus on controllable generation.
Previously, I was an AI Research Resident at VinAI, where I was fortunate to work under the mentorship of Dr. Anh Tran. Before that, I received my Bachelor of Computer Science from Ho Chi Minh City University of Science (HCMUS) in Vietnam.
During my PhD, I have been fortunate to intern at Apple AI/ML.
Prox-E: Fine-Grained 3D Shape Editing via Primitive-Based Abstractions
Etai Sella, Hao Phung, Nitay Amiel, Or Litany, Or Patashnik, and Hadar Averbuch-Elor
In ACM SIGGRAPH Conference Papers, 2026
Text-based 2D image editing models have recently reached an impressive level of maturity, motivating a growing body of work that heavily depends on these models to drive 3D edits. While effective for appearance-based modifications, such 2D-centric 3D editing pipelines often struggle with fine-grained 3D editing, where localized structural changes must be applied while strictly preserving an object’s overall identity. To address this limitation, we propose Prox-E, a training-free framework that enables fine-grained 3D control through an explicit, primitive-based geometric abstraction. Our framework first abstracts an input 3D shape into a compact set of geometric primitives. A pretrained vision–language model (VLM) then edits this abstraction to specify primitive-level changes. These structural edits are subsequently used to guide a 3D generative model, enabling fine-grained, localized modifications while preserving unchanged regions of the original shape. Through extensive experiments, we demonstrate that our method consistently balances identity preservation, shape quality, and instruction fidelity more effectively than various existing approaches, including 2D-based 3D editors and training-based methods.
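To make the three-stage pipeline concrete, here is a minimal sketch in Python. All three functions are placeholders invented for illustration (the paper's actual primitive fitter, VLM interface, and guided generator are not shown), and the box parameterization is an assumption.

```python
# Hypothetical stand-ins for the three stages of the pipeline described above.
import copy

def abstract_to_primitives(mesh):
    """Stage 1 (placeholder): fit a compact set of box primitives,
    here parameterized by (center, size)."""
    return [{"name": "seat", "center": [0.0, 0.4, 0.0], "size": [1.0, 0.1, 1.0]},
            {"name": "back", "center": [0.0, 0.9, -0.45], "size": [1.0, 0.9, 0.1]}]

def vlm_edit(primitives, instruction):
    """Stage 2 (placeholder): a VLM would rewrite primitive parameters to
    follow the instruction; one plausible edit is hand-coded here."""
    edited = copy.deepcopy(primitives)
    if "taller" in instruction:
        edited[1]["size"][1] *= 1.5        # e.g. stretch the backrest box
    return edited

def guided_generation(mesh, original, edited):
    """Stage 3 (placeholder): condition a 3D generative model on the edited
    abstraction while preserving parts whose primitives did not change."""
    changed = [e["name"] for o, e in zip(original, edited) if o != e]
    return {"mesh": mesh, "regenerated_parts": changed}

mesh = "chair.obj"                          # illustrative input
prims = abstract_to_primitives(mesh)
result = guided_generation(mesh, prims, vlm_edit(prims, "make the backrest taller"))
```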
@inproceedings{sella2026proxe,booktitle={Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},title={Prox-E: Fine-Grained 3D Shape Editing via Primitive-Based Abstractions},author={Etai Sella and Hao Phung and Nitay Amiel and Or Litany and Or Patashnik and Hadar Averbuch-Elor},year={2026},published={true},}
Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction
Hao Phung and Hadar Averbuch-Elor
In ACM SIGGRAPH Conference Papers, 2026
Reconstructing a structured vector-graphics representation from a rasterized floorplan image is typically an important prerequisite for computational tasks involving floorplans, such as automated understanding or CAD workflows. However, existing techniques struggle to faithfully generate the structure and semantics conveyed by complex floorplans that depict large indoor spaces with many rooms and a varying number of polygon corners. To this end, we propose Raster2Seq, framing floorplan reconstruction as a sequence-to-sequence task in which floorplan elements, such as rooms, windows, and doors, are represented as labeled polygon sequences that jointly encode geometry and semantics. Our approach introduces an autoregressive decoder that learns to predict the next corner conditioned on image features and previously generated corners, using guidance from learnable anchors. These anchors represent spatial coordinates in image space, effectively directing the attention mechanism to focus on informative image regions. By embracing the autoregressive mechanism, our method offers flexibility in the output format, enabling it to efficiently handle complex floorplans with numerous rooms and diverse polygon structures. Our method achieves state-of-the-art performance on standard benchmarks such as Structured3D, CubiCasa5K, and Raster2Graph, while also demonstrating strong generalization to more challenging datasets like WAFFLE, which contain diverse room structures and complex geometric variations.
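As an illustration of the labeled polygon sequences described above, here is a toy serialization in Python. The token vocabulary, coordinate quantization, and separator scheme are assumptions for illustration and may differ from the paper's actual format.

```python
# Toy serialization of floorplan elements into one flat token sequence that an
# autoregressive decoder could predict corner by corner.
elements = [
    ("room:kitchen", [(12, 40), (60, 40), (60, 90), (12, 90)]),
    ("door",         [(36, 40), (44, 40)]),
]

SEP, EOS = "<sep>", "<eos>"

def serialize(elements):
    """Flatten labeled polygons into tokens that jointly encode semantics
    (element labels) and geometry (quantized corner coordinates)."""
    tokens = []
    for label, corners in elements:
        tokens.append(label)
        for x, y in corners:
            tokens += [f"x{x}", f"y{y}"]
        tokens.append(SEP)
    tokens[-1] = EOS                      # terminate the whole sequence
    return tokens

print(serialize(elements))
```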
@inproceedings{phung2026raster2seq,booktitle={Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},title={Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction},author={Hao Phung and Hadar Averbuch-Elor},year={2026},published={true},}
HyperCT: Low-Rank Hypernet for Unified Chest CT Analysis
Fengbei Liu, Sunwoo Kwak, Hao Phung, Nusrat Binta Nizam, Ilan Richter, Nir Uriel, Hadar Averbuch-Elor, Deborah Estrin, and Mert R. Sabuncu
In Medical Imaging with Deep Learning, 2026
Non-contrast chest CTs offer a rich opportunity for both conventional pulmonary and opportunistic extra-pulmonary screening. While Multi-Task Learning (MTL) can unify these diverse tasks, standard hard-parameter sharing approaches are often suboptimal for modeling distinct pathologies. We propose HyperCT, a framework that dynamically adapts a Vision Transformer backbone via a Hypernetwork. To ensure computational efficiency, we integrate Low-Rank Adaptation (LoRA), allowing the model to regress task-specific low-rank weight updates rather than full parameters. Validated on a large-scale dataset of radiological and cardiological tasks, HyperCT outperforms various strong baselines, offering a unified, parameter-efficient solution for holistic patient assessment. Our code will be made publicly available.
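A minimal sketch of the central mechanism, a hypernetwork regressing LoRA-style low-rank updates from a task embedding, is given below. Layer sizes, rank, and module names are illustrative assumptions rather than the paper's configuration.

```python
# Sketch: a hypernetwork maps a task embedding to the low-rank factors of a
# LoRA update applied on top of a frozen linear layer.
import torch, torch.nn as nn

class HyperLoRALinear(nn.Module):
    def __init__(self, d_in=768, d_out=768, rank=8, task_dim=64):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():       # frozen backbone weights
            p.requires_grad_(False)
        self.rank, self.d_in, self.d_out = rank, d_in, d_out
        # The hypernetwork regresses A (rank x d_in) and B (d_out x rank)
        # from the task embedding, instead of predicting full weights.
        self.hyper = nn.Linear(task_dim, rank * d_in + d_out * rank)

    def forward(self, x, task_emb):
        ab = self.hyper(task_emb)
        A = ab[: self.rank * self.d_in].view(self.rank, self.d_in)
        B = ab[self.rank * self.d_in:].view(self.d_out, self.rank)
        return self.base(x) + x @ A.t() @ B.t()   # W x + B A x

layer = HyperLoRALinear()
y = layer(torch.randn(2, 16, 768), torch.randn(64))  # (batch, tokens, dim)
```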
@inproceedings{liu2026hyperct,booktitle={Medical Imaging with Deep Learning},title={Hyper{CT}: Low-Rank Hypernet for Unified Chest {CT} Analysis},author={Fengbei Liu and Sunwoo Kwak and Hao Phung and Nusrat Binta Nizam and Ilan Richter and Nir Uriel and Hadar Averbuch-Elor and Deborah Estrin and Mert R. Sabuncu},year={2026},published={true},}
Encoder-Decoder Diffusion Language Models for Efficient Training and Inference
Marianne Arriola, Yair Schiff, Hao Phung, Aaron Gokaslan, and Volodymyr Kuleshov
In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
Discrete diffusion models enable parallel token sampling for faster inference than autoregressive approaches. However, prior diffusion models use a decoder-only architecture, which requires sampling algorithms that invoke the full network at every denoising step and incur high computational cost. Our key insight is that discrete diffusion models perform two types of computation: 1) representing clean tokens and 2) denoising corrupted tokens, which enables us to use separate modules for each task. We propose an encoder-decoder architecture to accelerate discrete diffusion inference, which relies on an encoder to represent clean tokens and a lightweight decoder to iteratively refine a noised sequence. We also show that this architecture enables faster training of block diffusion models, which partition sequences into blocks for better quality and are commonly used in diffusion language model inference. We introduce a framework for Efficient Encoder-Decoder Diffusion (E2D2), consisting of an architecture with specialized training and sampling algorithms, and we show that E2D2 achieves superior trade-offs between generation quality and inference throughput on summarization, translation, and mathematical reasoning tasks.
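The sketch below illustrates the encoder-decoder split under assumed shapes: a deep encoder embeds clean tokens once, and only a shallow decoder runs inside the iterative denoising loop. The modules and refinement schedule are illustrative, not the paper's exact architecture.

```python
# Conceptual sketch: encode clean context once, refine the noised block with a
# lightweight decoder at every denoising step.
import torch, torch.nn as nn

d = 256
enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, 4, batch_first=True), num_layers=8)  # deep
dec = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d, 4, batch_first=True), num_layers=2)  # light

clean = torch.randn(1, 64, d)        # embeddings of already-clean tokens
memory = enc(clean)                  # computed once, reused at every step

noisy_block = torch.randn(1, 16, d)  # current noised block being denoised
for step in range(8):                # the loop only invokes the small decoder
    noisy_block = dec(noisy_block, memory)
```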
@inproceedings{arriola2025e2d2,booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},title={Encoder-Decoder Diffusion Language Models for Efficient Training and Inference},author={Marianne Arriola and Yair Schiff and Hao Phung and Aaron Gokaslan and Volodymyr Kuleshov},year={2025},published={true},}
Simple Guidance Mechanisms for Discrete Diffusion Models
Yair Schiff*, Subham Sekhar Sahoo*, Hao Phung*, Guanghan Wang*, Sam Boshar, Hugo Dalla-torre, Bernardo P de Almeida, Alexander Rush, Thomas Pierrot, and Volodymyr Kuleshov
In The Thirteenth International Conference on Learning Representations, 2025
Diffusion models for continuous data have gained widespread adoption owing to their high-quality generation and control mechanisms. However, controllable diffusion on discrete data faces challenges, given that continuous guidance methods do not directly apply to discrete diffusion. Here, we provide a straightforward derivation of classifier-free and classifier-based guidance for discrete diffusion, as well as a new class of diffusion models that leverage uniform noise and that are more guidable because they can continuously edit their outputs. We improve the quality of these models with a novel continuous-time variational lower bound that yields state-of-the-art performance, especially in settings involving guidance or fast generation. Empirically, we demonstrate that our guidance mechanisms combined with uniform noise diffusion improve controllable generation relative to autoregressive and diffusion baselines on several discrete data domains, including genomic sequences, small molecule design, and discretized image generation. Code is available at https://github.com/kuleshov-group/discrete-diffusion-guidance.
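As a minimal illustration of classifier-free guidance on categorical outputs, the sketch below combines conditional and unconditional logits in log-probability space. The function and tensor shapes are assumptions for illustration; the paper's full derivation applies guidance within the reverse diffusion process.

```python
# Minimal sketch: classifier-free guidance for a categorical denoiser.
# `cond_logits` / `uncond_logits` are hypothetical model outputs evaluated
# with and without conditioning.
import torch

def cfg_sample_step(cond_logits, uncond_logits, gamma):
    """cond_logits, uncond_logits: (batch, seq_len, vocab).
    gamma: guidance scale; gamma = 1 recovers the conditional model."""
    # Guidance acts in log-probability space:
    #   log p_guided ∝ log p_uncond + gamma * (log p_cond - log p_uncond)
    log_p_cond = cond_logits.log_softmax(dim=-1)
    log_p_uncond = uncond_logits.log_softmax(dim=-1)
    guided = log_p_uncond + gamma * (log_p_cond - log_p_uncond)
    probs = guided.softmax(dim=-1)
    return torch.distributions.Categorical(probs=probs).sample()

# Toy usage with random logits:
tokens = cfg_sample_step(torch.randn(2, 8, 100), torch.randn(2, 8, 100), gamma=2.0)
```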
@inproceedings{schiff2024discreteguidance,booktitle={The Thirteenth International Conference on Learning Representations},title={Simple Guidance Mechanisms for Discrete Diffusion Models},author={Yair Schiff and Subham Sekhar Sahoo and Hao Phung and Guanghan Wang and Sam Boshar and Hugo Dalla-torre and Bernardo P de Almeida and Alexander Rush and Thomas Pierrot and Volodymyr Kuleshov},year={2025},published={true},}
Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Image Generation
Quan Dao, Hao Phung, Trung Dao, Dimitris N. Metaxas, and Anh Tran
In The 39th Annual AAAI Conference on Artificial Intelligence (AAAI), 2025
Flow matching has emerged as a promising framework for training generative models, demonstrating impressive empirical performance while offering relative ease of training compared to diffusion-based models. However, this method still requires numerous function evaluations in the sampling process. To address this limitation, we introduce a self-corrected flow distillation method that effectively integrates consistency models and adversarial training within the flow-matching framework. Our work is among the first to achieve consistent generation quality in both few-step and one-step sampling. Our extensive experiments validate the effectiveness of our method, yielding superior results both quantitatively and qualitatively on CelebA-HQ and zero-shot benchmarks on the COCO dataset. Our implementation is released at https://github.com/VinAIResearch/SCFlow.
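To give a flavor of flow distillation, here is a generic consistency-style training step under a linear interpolation path. The toy networks and single Euler teacher step are assumptions; the paper's objective additionally includes self-correction and adversarial terms not shown here.

```python
# Generic consistency distillation in a flow-matching setup: student endpoint
# predictions at adjacent points along a teacher trajectory should agree.
import torch, torch.nn as nn

student = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))

def x1_pred(net, xt, t):
    """Predict the clean endpoint from velocity: x1 ≈ xt + (1 - t) * v(xt, t)."""
    v = net(torch.cat([xt, t], dim=-1))
    return xt + (1 - t) * v

def consistency_loss(x1, teacher_velocity):
    x0 = torch.randn_like(x1)                   # noise endpoint
    t = torch.rand(x1.size(0), 1) * 0.9         # time t, and t' = t + 0.1
    xt = (1 - t) * x0 + t * x1                  # linear interpolation path
    xt2 = xt + 0.1 * teacher_velocity(xt, t)    # one Euler step of teacher ODE
    target = x1_pred(student, xt2, t + 0.1).detach()  # stop-gradient target
    return (x1_pred(student, xt, t) - target).pow(2).mean()

# Toy usage with 2-D data and a dummy zero-velocity teacher:
loss = consistency_loss(torch.randn(8, 2), lambda xt, t: torch.zeros_like(xt))
loss.backward()
```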
@inproceedings{dao2024scflow,title={Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Image Generation},author={Quan Dao and Hao Phung and Trung Dao and Dimitris N. Metaxas and Anh Tran},booktitle={The 39th Annual AAAI Conference on Artificial Intelligence (AAAI)},year={2025},published={true},}
DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation
Hao Phung, Quan Dao, Trung Dao, Hoang Phan, Dimitris N. Metaxas, and Anh Tran
In The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
We introduce a novel state-space architecture for diffusion models that effectively harnesses spatial and frequency information to strengthen the inductive bias towards local features for image generation. While state-space networks such as Mamba, a major advance in recurrent neural networks, typically scan input sequences from left to right, designing effective scanning strategies remains difficult, especially for image data. We demonstrate that integrating wavelet transforms into Mamba enhances the local structure awareness of visual inputs and better captures long-range frequency relations by disentangling inputs into wavelet subbands, representing both low- and high-frequency components. These wavelet-based outputs are then processed and fused with the original Mamba outputs through a cross-attention fusion layer, combining spatial and frequency information to improve the order awareness of state-space models, which is essential for the details and overall quality of image generation. In addition, we introduce a globally shared transformer to capture global relationships, further boosting Mamba's performance. Through extensive experiments on standard benchmarks, our method demonstrates superior results compared to DiT and DIFFUSSM, achieving faster training convergence and delivering high-quality outputs. The code and pretrained models are released at https://github.com/VinAIResearch/DiMSUM.
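The sketch below shows one plausible form of the cross-attention fusion described above, with spatial-branch tokens querying wavelet-branch tokens. The shapes and the use of nn.MultiheadAttention are illustrative assumptions, not the paper's exact layer.

```python
# Sketch: fuse two token streams (spatial-branch and wavelet-branch outputs)
# with cross-attention plus a residual connection.
import torch, torch.nn as nn

d = 256
fuse = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

spatial_tokens = torch.randn(1, 196, d)  # e.g. Mamba outputs over image patches
wavelet_tokens = torch.randn(1, 196, d)  # e.g. tokens from wavelet subbands

# Spatial tokens query frequency information; the residual keeps the original path.
fused, _ = fuse(query=spatial_tokens, key=wavelet_tokens, value=wavelet_tokens)
out = spatial_tokens + fused
```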
@inproceedings{phung2024dimsum,title={DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation},author={Hao Phung and Quan Dao and Trung Dao and Hoang Phan and Dimitris N. Metaxas and Anh Tran},booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS)},year={2024},published={true},}
Flow Matching in Latent Space
Quan Dao, Hao Phung, Binh Nguyen, and Anh Tran
arXiv preprint arXiv:2307.08698, 2023
Flow matching is a recent framework for training generative models that exhibits impressive empirical performance while being relatively easier to train than diffusion-based models. Despite its advantageous properties, prior methods operating in pixel space still face expensive computation and a large number of function evaluations from off-the-shelf solvers. Furthermore, although latent-based generative methods have shown great success in recent years, flow matching in latent spaces remains underexplored. In this work, we propose to apply flow matching in the latent spaces of pretrained autoencoders, which offers improved computational efficiency and scalability for high-resolution image synthesis. This enables flow-matching training on constrained computational resources while maintaining quality and flexibility. Additionally, our work is among the first to integrate various conditions into flow matching for conditional generation tasks, including label-conditioned image generation, image inpainting, and semantic-to-image generation. Through extensive experiments, our approach demonstrates its effectiveness in both quantitative and qualitative results on various datasets, such as CelebA-HQ, FFHQ, LSUN Church & Bedroom, and ImageNet. We also provide a theoretical bound on the Wasserstein-2 distance between the reconstructed latent flow distribution and the true data distribution, showing it is upper-bounded by the latent flow matching objective.
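For intuition, here is a minimal flow-matching training step on latents, assuming a hypothetical frozen autoencoder has already produced them. The toy velocity network and linear interpolation path follow the standard flow-matching formulation rather than the paper's exact setup.

```python
# Illustrative flow-matching loss on latents z1 (e.g. z1 = encode(x) for a
# frozen, pretrained autoencoder `encode`, assumed and not shown here).
import torch, torch.nn as nn

v_theta = nn.Sequential(nn.Linear(4 + 1, 256), nn.SiLU(), nn.Linear(256, 4))

def flow_matching_loss(z1):
    """z1: (batch, dim) latents of real images."""
    z0 = torch.randn_like(z1)                 # noise endpoint of the path
    t = torch.rand(z1.size(0), 1)             # uniform time in [0, 1]
    zt = (1 - t) * z0 + t * z1                # linear interpolation path
    target = z1 - z0                          # constant velocity along the path
    pred = v_theta(torch.cat([zt, t], dim=-1))
    return (pred - target).pow(2).mean()

loss = flow_matching_loss(torch.randn(16, 4))  # toy latents
loss.backward()
```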
@article{dao2023lfm,title={Flow Matching in Latent Space},author={Quan Dao and Hao Phung and Binh Nguyen and Anh Tran},journal={arXiv preprint arXiv:2307.08698},year={2023},published={false},}
Anti-DreamBooth: Protecting users from personalized text-to-image synthesis
Thanh Van Le, Hao Phung, Thuan Hoang Nguyen, Quan Dao, Ngoc Tran, and Anh Tran
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023
Text-to-image diffusion models have revolutionized image creation, allowing anyone, even without design skills, to create realistic images from simple text inputs. With powerful personalization tools like DreamBooth, they can generate images of a specific person from just a few of that person's reference images. However, when misused, such a powerful and convenient tool can produce fake news or disturbing content targeting any individual victim, posing a severe negative social impact. In this paper, we explore a defense system called Anti-DreamBooth against such malicious use of DreamBooth. The system adds subtle noise perturbations to each user's images before publishing in order to disrupt the generation quality of any DreamBooth model trained on these perturbed images. We investigate a wide range of algorithms for perturbation optimization and extensively evaluate them on two facial datasets over various text-to-image model versions. Despite the complicated formulation of DreamBooth and diffusion-based text-to-image models, our methods effectively defend users from the malicious use of those models. Their effectiveness withstands even adverse conditions, such as model or prompt/term mismatching between training and testing. Our code is available at https://github.com/VinAIResearch/Anti-DreamBooth.
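The core idea can be sketched as projected gradient ascent on a surrogate loss, as below. Here `surrogate_loss` is a hypothetical stand-in for the diffusion training loss a DreamBooth run would minimize, and the budget and step sizes are illustrative choices.

```python
# Sketch: craft a bounded perturbation that *maximizes* the loss a later
# personalization run would try to minimize, degrading its results.
import torch

def protect(image, surrogate_loss, eps=8 / 255, step=1 / 255, iters=20):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        loss = surrogate_loss(image + delta)    # stand-in denoising loss
        loss.backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()   # gradient *ascent* step
            delta.clamp_(-eps, eps)             # keep perturbation imperceptible
            delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()

# Toy usage with a dummy surrogate loss:
protected = protect(torch.rand(1, 3, 64, 64), lambda x: (x ** 2).mean())
```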
@inproceedings{le_etal2023antidreambooth,title={Anti-DreamBooth: Protecting users from personalized text-to-image synthesis},author={Thanh Van Le and Hao Phung and Thuan Hoang Nguyen and Quan Dao and Ngoc Tran and Anh Tran},booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},year={2023},published={true},}
Wavelet Diffusion Models are fast and scalable Image Generators
Hao Phung, Quan Dao, and Anh Tran
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
Diffusion models have emerged as a powerful solution for high-fidelity image generation, exceeding GANs in quality in many circumstances. However, their slow training and inference speed is a major bottleneck that blocks them from real-time applications. The recent DiffusionGAN method significantly decreases running time by reducing the number of sampling steps from thousands to several, but its speed still lags far behind GAN counterparts. This paper aims to close that speed gap with a novel wavelet-based diffusion scheme. We extract low- and high-frequency components at both the image and feature levels via wavelet decomposition and adaptively handle these components for faster processing while maintaining good generation quality. Furthermore, we propose a reconstruction term, which effectively boosts training convergence. Experimental results on the CelebA-HQ, CIFAR-10, LSUN-Church, and STL-10 datasets show that our solution is a stepping stone toward real-time, high-fidelity diffusion models. Our code and pre-trained checkpoints are available at https://github.com/VinAIResearch/WaveDiff.git.
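For reference, a single-level 2D Haar decomposition, the kind of frequency split used here, can be written in a few lines of torch. This is an independent sketch; normalization and subband naming conventions may differ from the paper's implementation.

```python
# Single-level 2D Haar wavelet transform: split an image into one
# low-frequency approximation and three high-frequency detail subbands.
import torch

def haar_dwt(x):
    """x: (batch, channels, H, W) with even H and W.
    Returns LL, LH, HL, HH subbands, each (batch, channels, H/2, W/2)."""
    a = x[..., 0::2, 0::2]  # top-left of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2    # low-frequency approximation
    lh = (a - b + c - d) / 2    # detail subband
    hl = (a + b - c - d) / 2    # detail subband
    hh = (a - b - c + d) / 2    # diagonal detail subband
    return ll, lh, hl, hh

ll, lh, hl, hh = haar_dwt(torch.randn(1, 3, 64, 64))
```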
@inproceedings{phung2023wavelet,title={Wavelet Diffusion Models are fast and scalable Image Generators},author={Hao Phung and Quan Dao and Anh Tran},booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},year={2023},published={true},pages={10199-10208},}
VQASTO: Visual Question Answering System for Action Surveillance based on Task Ontology
Huy Quoc Vo, Tien-Hao Phung, and Ngoc Quoc Ly
In 2020 7th NAFOSTED Conference on Information and Computer Science (NICS), 2020
Question answering (QA) is a popular research topic owing to its real-world applications. Going further, Visual Question Answering (VQA) research aims to combine visual and textual information for question answering. A drawback of such work is its dependence on learned models, which impedes human intervention and interpretation. To the best of our knowledge, most prior work targets the general problem or other specific contexts; none places the QA system in an action-surveillance context. In this paper, we propose a QA system based on Task Ontology, which is mainly responsible for mapping a question sentence to the corresponding tasks carried out to reach the appropriate answer. The advantages of task ontology are the adoption of human knowledge to solve a specific problem and its reusability. The performance of the system thus depends heavily on its subtasks and models. In our scope, we focus on two main subtasks: pose estimation/tracking and skeleton-based action recognition. In addition, we improve the time efficiency of pose estimation/tracking and propose a new spatial-temporal feature, based on drawing a skeleton sequence into an image, for skeleton-based action recognition on videos in the wild. This method can, to some extent, overcome the challenge of poorly shaped poses/skeletons produced by pose estimation on real-world videos. The hard part of action recognition in VQASTO is that it must take its input from pose estimation and tracking, which is markedly different from having good skeletons readily available and merely performing recognition.
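The spatial-temporal feature can be illustrated with a toy example that draws a joint trajectory into an image, encoding time as pixel intensity. The canvas size, joint count, and intensity encoding are assumptions for illustration only.

```python
# Toy rendition of the "skeleton sequence drawn to image" feature: each joint
# position over time is stamped onto a canvas, with later frames brighter.
import numpy as np

def skeleton_sequence_to_image(seq, size=64):
    """seq: (T, J, 2) array of normalized joint coordinates in [0, 1]."""
    canvas = np.zeros((size, size), dtype=np.float32)
    T = len(seq)
    for t, frame in enumerate(seq):
        for x, y in frame:
            px, py = int(x * (size - 1)), int(y * (size - 1))
            canvas[py, px] = (t + 1) / T   # encode time as intensity
    return canvas

img = skeleton_sequence_to_image(np.random.rand(30, 17, 2))  # 30 frames, 17 joints
```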
@inproceedings{vo2020vqasto,author={Huy Quoc Vo and Tien-Hao Phung and Ngoc Quoc Ly},booktitle={2020 7th NAFOSTED Conference on Information and Computer Science (NICS)},title={VQASTO: Visual Question Answering System for Action Surveillance based on Task Ontology},year={2020},volume={},number={},pages={273-279},doi={10.1109/NICS51282.2020.9335891},published={true},}
Contact: hao (at) cs (dot) cornell (dot) edu | tienhaophung (at) gmail (dot) com