GarmentDiffusion: 3D Garment Sewing Pattern Generation with Multimodal Diffusion Transformers

Xinyu Li, Qi Yao, Yuanda Wang
¹Zhejiang University    ²Shenfu Research
IJCAI 2025

Abstract

Garment sewing patterns are fundamental design elements that bridge the gap between garment design and practical manufacturing. The generative modeling of sewing patterns is crucial for creating diverse and innovative garment designs. However, existing approaches are limited either by reliance on a single input modality or by suboptimal generation efficiency. In this work, we present GarmentDiffusion, a new generative model capable of producing centimeter-precise, vectorized 3D sewing patterns from multi-modal inputs (text, image, and incomplete sewing pattern). Our method efficiently encodes 3D sewing pattern parameters into compact edge token representations, achieving a sequence length 10× shorter than that of the previous autoregressive modeling approach, i.e., DressCode. By employing a diffusion transformer, we simultaneously denoise all edge tokens along the temporal axis, while maintaining a constant number of denoising steps regardless of dataset-specific edge and panel statistics. We achieve new state-of-the-art results on the largest parametric sewing pattern dataset, namely GarmentCodeData.

🪄 Tokenization



We encode edge-related parameters along the embedding dimension and denoise all edge tokens in parallel along the temporal axis. After each panel is rotated and translated into 3D space, every edge is represented by its starting endpoint, control point(s), arc parameters, stitch tag, and stitch flag. Edges are grouped into panels with zero-padding, and each edge is assigned an edge-order index and a panel-order index that indicate its position within the sequence and the panel it belongs to.
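To make the layout concrete, here is a minimal sketch of how such edge tokens could be assembled. The field widths, the per-panel and per-pattern caps, and all function names are illustrative assumptions, not the paper's exact configuration:

import numpy as np

MAX_EDGES_PER_PANEL = 38   # assumed cap on edges per panel
MAX_PANELS = 26            # assumed cap on panels per pattern

def make_edge_token(start_3d, control_pts, arc_params, stitch_tag, stitch_flag):
    """Concatenate all per-edge parameters along the embedding dimension."""
    return np.concatenate([
        start_3d,        # (3,)  starting endpoint after the panel's 3D rotation/translation
        control_pts,     # (6,)  up to two control points, zero-padded if unused
        arc_params,      # (2,)  arc parameters, zeros for straight edges
        stitch_tag,      # (3,)  tag identifying the stitched counterpart edge
        [stitch_flag],   # (1,)  1 if the edge participates in a stitch, else 0
    ]).astype(np.float32)

def make_pattern_sequence(panels):
    """Flatten a sewing pattern into a fixed-length edge-token sequence.

    `panels` is a list of panels, each a list of edge tokens. Every panel is
    zero-padded to MAX_EDGES_PER_PANEL edges, and each token is paired with an
    edge-order index and a panel-order index.
    """
    dim = 15  # 3 + 6 + 2 + 3 + 1, per the layout above
    n = MAX_PANELS * MAX_EDGES_PER_PANEL
    tokens = np.zeros((n, dim), np.float32)
    edge_idx = np.zeros(n, np.int64)
    panel_idx = np.zeros(n, np.int64)
    for p, edges in enumerate(panels):
        for e, tok in enumerate(edges):
            i = p * MAX_EDGES_PER_PANEL + e
            tokens[i], edge_idx[i], panel_idx[i] = tok, e, p
    return tokens, edge_idx, panel_idx

Because every edge is packed into a single token along the embedding dimension, the resulting sequence is much shorter than one that spells out each parameter as a separate token, which is what enables the 10× reduction in sequence length noted in the abstract.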



🏠 Network architecture



During training, the noised edge sequence is encoded and combined with embeddings of the edge-order indices, panel-order indices, and diffusion timestep before being fed into the transformer for prediction. After the text and image conditions are embedded, decoupled cross-attention computes the attention output for each modality separately while sharing the query, and the two outputs are summed to achieve joint control by both text and image conditions. Automatic pattern completion is achieved during inference by noise replacement with the user-provided panels.
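The following is a minimal sketch of decoupled cross-attention with a shared query, in the spirit of the description above. Module names, dimensions, and the choice of PyTorch's scaled_dot_product_attention are illustrative assumptions rather than the paper's exact implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """Cross-attention where text and image conditions get separate key/value
    projections but share a single query projected from the edge tokens."""
    def __init__(self, dim, cond_dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim)                # shared query projection
        self.to_kv_text = nn.Linear(cond_dim, 2 * dim)
        self.to_kv_image = nn.Linear(cond_dim, 2 * dim)
        self.to_out = nn.Linear(dim, dim)

    def _attend(self, q, cond, to_kv):
        b, n, d = q.shape
        h = self.num_heads
        k, v = to_kv(cond).chunk(2, dim=-1)
        # reshape to (B, heads, seq, head_dim) for scaled dot-product attention
        q, k, v = (t.view(b, -1, h, d // h).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, d)

    def forward(self, x, text_emb, image_emb):
        q = self.to_q(x)                                       # one query for both branches
        out = self._attend(q, text_emb, self.to_kv_text) \
            + self._attend(q, image_emb, self.to_kv_image)     # sum the two attention outputs
        return self.to_out(out)

Pattern completion by noise replacement can be sketched in the same hedged spirit: at every reverse step, the tokens of the user-provided panels are overwritten with their forward-noised ground truth so that the model only has to generate the missing panels. The denoiser and scheduler interfaces below (add_noise, reverse_step) are hypothetical placeholders:

def complete_pattern(denoiser, noise_sched, known_tokens, known_mask, num_steps):
    x = torch.randn_like(known_tokens)                 # start from pure noise
    for t in reversed(range(num_steps)):
        # replace tokens of the user-provided panels with their noised ground truth
        x = torch.where(known_mask, noise_sched.add_noise(known_tokens, t), x)
        eps = denoiser(x, t)                           # predicted noise for every edge token
        x = noise_sched.reverse_step(x, eps, t)        # one reverse (denoising) step
    return torch.where(known_mask, known_tokens, x)    # keep the given panels exactly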



🎆 Results demo

BibTeX

@misc{li2025garmentdiffusion3dgarmentsewing,
      title={GarmentDiffusion: 3D Garment Sewing Pattern Generation with Multimodal Diffusion Transformers},
      author={Xinyu Li and Qi Yao and Yuanda Wang},
      year={2025},
      eprint={2504.21476},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.21476},
}