MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on

1College of Computer Science and Technology, Zhejiang University
2vivo Mobile Communication Co., Ltd
3Innovation Research & Development, BoardWare Information System Limited
† These authors contributed equally, * Corresponding authors

Abstract

Video Virtual Try-On (VVT) aims to simulate the natural appearance of garments across consecutive video frames, capturing their dynamic variations and interactions with human body motion. However, current VVT methods still face challenges in spatiotemporal consistency and garment content preservation. First, they rely on U-Net-based diffusion models, which have limited expressive capability and struggle to reconstruct complex details. Second, they model spatial and temporal attention separately, which hinders the effective capture of structural relationships and dynamic consistency across frames. Third, their representation of garment details remains insufficient, degrading the realism and stability of the synthesized results, especially during human motion. To address these challenges, we propose MagicTryOn, a video virtual try-on framework built upon a large-scale video diffusion Transformer. We replace the U-Net architecture with a diffusion Transformer and employ full self-attention to jointly model the spatiotemporal consistency of videos. We further design a coarse-to-fine garment preservation strategy: the coarse stage integrates garment tokens during the embedding phase, while the fine stage incorporates multiple garment-based conditions, such as semantics, textures, and contour lines, during the denoising phase. Moreover, we introduce a mask-aware loss to further optimize garment-region fidelity. Extensive experiments on both image and video try-on datasets demonstrate that our method outperforms existing SOTA methods in comprehensive evaluations and generalizes well to in-the-wild scenarios.
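
For concreteness, the mask-aware loss can be thought of as a standard denoising objective with an additional term restricted to the garment region. The snippet below is a minimal sketch of this idea only; the weighting factor lambda_mask, the tensor shapes, and the function name are illustrative assumptions, not the paper's released implementation.

    import torch
    import torch.nn.functional as F

    def mask_aware_loss(pred_noise, target_noise, garment_mask, lambda_mask=2.0):
        """Denoising MSE plus an extra term restricted to the garment region (sketch).

        pred_noise, target_noise: (B, C, T, H, W) latent-space noise tensors.
        garment_mask: (B, 1, T, H, W) binary mask of the garment region in latent space.
        lambda_mask: weight of the garment-region term (hypothetical value).
        """
        base = F.mse_loss(pred_noise, target_noise)
        masked = F.mse_loss(pred_noise * garment_mask, target_noise * garment_mask)
        return base + lambda_mask * masked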

Method


The overall pipeline of MagicTryOn. The inputs include person videos, pose representations, clothing-agnostic masks, and target garment images. Videos and poses are encoded into agnostic and pose latents by the Wan Video Encoder, while masks are resized into mask latents. These, combined with random noise, are fed into the DiT backbone. Meanwhile, the garment image yields multi-level features, including text, CLIP, garment, and line tokens. The garment token provides coarse guidance via sequence concatenation, and all tokens are injected into the DiT blocks for fine-grained conditioning. After n denoising steps, the DiT backbone produces try-on latents, which are decoded into the output video by the Wan Video Decoder.
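
The sketch below illustrates, at a high level, how the conditioning inputs described above could be assembled for the DiT backbone. All module and function names (prepare_dit_inputs, wan_encoder, clip_encoder, garment_tokenizer, line_tokenizer) and the concatenation layout are illustrative placeholders under assumed tensor shapes, not the released code.

    import torch
    import torch.nn.functional as F

    def prepare_dit_inputs(person_video, pose_video, agnostic_mask, garment_image,
                           wan_encoder, clip_encoder, garment_tokenizer, line_tokenizer):
        """Assemble latents and condition tokens for the DiT backbone (illustrative sketch)."""
        # Encode the clothing-agnostic person video and the pose video into latents.
        agnostic_latent = wan_encoder(person_video * (1 - agnostic_mask))
        pose_latent = wan_encoder(pose_video)
        # Resize the clothing-agnostic mask to the latent resolution.
        mask_latent = F.interpolate(agnostic_mask, size=agnostic_latent.shape[-3:])
        # Random noise initializes the try-on latent.
        noise = torch.randn_like(agnostic_latent)
        # Channel-wise concatenation of noise and conditions forms the DiT input.
        dit_input = torch.cat([noise, agnostic_latent, pose_latent, mask_latent], dim=1)

        # Multi-level garment features: CLIP, garment, and line tokens.
        clip_tokens = clip_encoder(garment_image)
        garment_tokens = garment_tokenizer(garment_image)  # coarse guidance via sequence concatenation
        line_tokens = line_tokenizer(garment_image)        # contour-line tokens for fine conditioning
        return dit_input, (clip_tokens, garment_tokens, line_tokens)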

Experiments


Try-on results in large motion scenarios. Virtual try-on under large body movements, such as dancing, is particularly challenging, as it requires not only garment consistency but also spatiotemporal coherence. To evaluate performance in such cases, we select two dancing videos from the Pexels website.


Try-on results in the doll scenario.


Qualitative comparison of image virtual try-on results on the VITON-HD (1st and 2nd rows) and DressCode (3rd row) datasets.

BibTeX

@misc{li2025magictryon,
      title={MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on}, 
      author={Guangyuan Li and Siming Zheng and Hao Zhang and Jinwei Chen and Junsheng Luan and Binkai Ou and Lei Zhao and Bo Li and Peng-Tao Jiang},
      year={2025},
      eprint={2505.21325},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.21325}, 
}