GarmentDiffusion: 3D Garment Sewing Pattern Generation with Multimodal Diffusion Transformers

Xinyu Li, Qi Yao, Yuanda Wang
¹Zhejiang University    ²Shenfu Research
IJCAI 2025

Abstract

Garment sewing patterns are fundamental design elements that bridge the gap between garment design and practical manufacturing. The generative modeling of sewing patterns is crucial for creating diverse and innovative garment designs. However, existing approaches are limited either by reliance on a single input modality or by suboptimal generation efficiency. In this work, we present GarmentDiffusion, a new generative model capable of producing centimeter-precise, vectorized 3D sewing patterns from multi-modal inputs (text, image, and incomplete sewing pattern). Our method efficiently encodes 3D sewing pattern parameters into compact edge token representations, achieving a sequence length 10× shorter than that of the previous autoregressive modeling approach, i.e., DressCode. By employing a diffusion transformer, we simultaneously denoise all edge tokens along the temporal axis, while maintaining a constant number of denoising steps regardless of dataset-specific edge and panel statistics. We achieve new state-of-the-art results on the largest parametric sewing pattern dataset, namely GarmentCodeData.

🪄 Tokenization



We encode edge-related parameters along the embedding dimension and denoise all edge tokens in parallel along the temporal axis. After each panel is rotated and translated into 3D space, every edge is represented by its starting endpoint, control point(s), arc parameters, stitch tag, and stitch flag. Edges are grouped into panels with zero-padding, and each edge is assigned an edge-order index and a panel-order index that indicate its position within the sequence and the panel it belongs to.
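To make the layout concrete, here is a minimal sketch of how such edge tokens could be assembled. The field widths, the per-panel and per-pattern caps, and all function names are illustrative assumptions, not the paper's exact configuration:

import numpy as np

MAX_EDGES_PER_PANEL = 38   # assumed cap on edges per panel
MAX_PANELS = 26            # assumed cap on panels per pattern

def make_edge_token(start_3d, control_pts, arc_params, stitch_tag, stitch_flag):
    """Concatenate all per-edge parameters along the embedding dimension."""
    return np.concatenate([
        start_3d,        # (3,)  starting endpoint after the panel's 3D rotation/translation
        control_pts,     # (6,)  up to two control points, zero-padded if unused
        arc_params,      # (2,)  arc parameters, zeros for straight edges
        stitch_tag,      # (3,)  tag identifying the stitched counterpart edge
        [stitch_flag],   # (1,)  1 if the edge participates in a stitch, else 0
    ]).astype(np.float32)

def make_pattern_sequence(panels):
    """Flatten a sewing pattern into a fixed-length edge-token sequence.

    `panels` is a list of panels, each a list of edge tokens. Every panel is
    zero-padded to MAX_EDGES_PER_PANEL edges, and each token is paired with an
    edge-order index and a panel-order index.
    """
    dim = 15  # 3 + 6 + 2 + 3 + 1, per the layout above
    n = MAX_PANELS * MAX_EDGES_PER_PANEL
    tokens = np.zeros((n, dim), np.float32)
    edge_idx = np.zeros(n, np.int64)
    panel_idx = np.zeros(n, np.int64)
    for p, edges in enumerate(panels):
        for e, tok in enumerate(edges):
            i = p * MAX_EDGES_PER_PANEL + e
            tokens[i], edge_idx[i], panel_idx[i] = tok, e, p
    return tokens, edge_idx, panel_idx

Because every edge is packed into a single token along the embedding dimension, the resulting sequence is much shorter than one that spells out each parameter as a separate token, which is what enables the 10× reduction in sequence length noted in the abstract.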



🏠 Network architecture



During training, the noised edge sequence is encoded and combined with embeddings of the edge-order indices, panel-order indices, and diffusion timestep before being fed into the transformer for prediction. After the text and image conditions are embedded, decoupled cross-attention computes the attention output for each modality separately while sharing the query, and the two outputs are summed to achieve joint control by both text and image conditions. Automatic pattern completion is achieved during inference by noise replacement with the user-provided panels.
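The following is a minimal sketch of decoupled cross-attention with a shared query, in the spirit of the description above. Module names, dimensions, and the choice of PyTorch's scaled_dot_product_attention are illustrative assumptions rather than the paper's exact implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """Cross-attention where text and image conditions get separate key/value
    projections but share a single query projected from the edge tokens."""
    def __init__(self, dim, cond_dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim)                # shared query projection
        self.to_kv_text = nn.Linear(cond_dim, 2 * dim)
        self.to_kv_image = nn.Linear(cond_dim, 2 * dim)
        self.to_out = nn.Linear(dim, dim)

    def _attend(self, q, cond, to_kv):
        b, n, d = q.shape
        h = self.num_heads
        k, v = to_kv(cond).chunk(2, dim=-1)
        # reshape to (B, heads, seq, head_dim) for scaled dot-product attention
        q, k, v = (t.view(b, -1, h, d // h).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, d)

    def forward(self, x, text_emb, image_emb):
        q = self.to_q(x)                                       # one query for both branches
        out = self._attend(q, text_emb, self.to_kv_text) \
            + self._attend(q, image_emb, self.to_kv_image)     # sum the two attention outputs
        return self.to_out(out)

Pattern completion by noise replacement can be sketched in the same hedged spirit: at every reverse step, the tokens of the user-provided panels are overwritten with their forward-noised ground truth so that the model only has to generate the missing panels. The denoiser and scheduler interfaces below (add_noise, reverse_step) are hypothetical placeholders:

def complete_pattern(denoiser, noise_sched, known_tokens, known_mask, num_steps):
    x = torch.randn_like(known_tokens)                 # start from pure noise
    for t in reversed(range(num_steps)):
        # replace tokens of the user-provided panels with their noised ground truth
        x = torch.where(known_mask, noise_sched.add_noise(known_tokens, t), x)
        eps = denoiser(x, t)                           # predicted noise for every edge token
        x = noise_sched.reverse_step(x, eps, t)        # one reverse (denoising) step
    return torch.where(known_mask, known_tokens, x)    # keep the given panels exactly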



🎆 Results demo

BibTeX

@misc{li2025garmentdiffusion3dgarmentsewing,
      title={GarmentDiffusion: 3D Garment Sewing Pattern Generation with Multimodal Diffusion Transformers},
      author={Xinyu Li and Qi Yao and Yuanda Wang},
      year={2025},
      eprint={2504.21476},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.21476},
}