MergeVQ: A Unified Framework for Visual Generation and Representation with Token Merging and Quantization

Siyuan Li1,3∗ Luyuan Zhang2∗ Zedong Wang4 Juanxi Tian3 Cheng Tan1,3 Zicheng Liu1,3 Chang Yu3 Qingsong Xie5† Haonan Lu5 Haoqian Wang2 Zhen Lei6,7,8†
1Zhejiang University 2Tsinghua University 3Westlake University 4The Hong Kong University of Science and Technology 5OPPO AI Center 6CAIR, HKISI-CAS 7Institute of Automation, CASIA 8University of Chinese Academy of Sciences

Abstract

Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to balance generation quality against representation learning and efficiency in a shared latent space. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. During pre-training, MergeVQ decouples top-k semantics from the latent space with a token merge module after the self-attention blocks in the encoder for subsequent Look-up Free Quantization (LFQ) and global alignment, and recovers their fine-grained details through cross-attention in the decoder for reconstruction. For second-stage generation, we introduce MergeAR, which performs KV Cache compression for efficient raster-order prediction. Extensive experiments on ImageNet verify that MergeVQ as an AR generative model achieves competitive performance in both visual representation learning and image generation tasks, while maintaining favorable token efficiency and inference speed.

Introduction

Vector Quantization (VQ) has emerged as a widely adopted technique in visual tasks. However, existing VQ-based methods typically optimize the latent space for either pixel-level reconstruction or semantic representation, and struggle to serve both at once. We present MergeVQ to address this trade-off. Our approach treats representation learning and generation as complementary, and exploits this synergy by decoupling semantics from the latent space during encoding and recovering fine-grained details during reconstruction.

[Figure: Introduction of MergeVQ.]

The contributions of MergeVQ are threefold. First, it decouples semantic tokens from the source matrix, yielding a compact yet informative representation of visual data. Second, it addresses longstanding challenges in VQ training, improving stability and performance. Third, it achieves competitive results in both visual generation and representation learning tasks.

Learning Paradigm

3.1 MergeVQ Framework

The MergeVQ framework comprises four stages: token merge encoding, quantization, token recovery, and reconstruction. Each component plays a distinct role in enabling efficient and effective visual generation and representation learning.

Token Merge Encoding: Given an image \(X \in \mathbb{R}^{H \times W \times 3}\), we employ a two-stage downsampling encoder \(\mathcal{E}\). First, a CNN layer \(\mathcal{E}_{1}\) extracts features, producing a feature map \(Z \in \mathbb{R}^{\frac{H}{f} \times \frac{W}{f} \times d}\), where \(f\) is the downsampling factor and \(d\) denotes the channel number. The feature \(Z\) is then flattened into an \(L\)-length token sequence \(Z_{L} \in \mathbb{R}^{L \times d}\) as \(Z_{L} = \mathcal{E}_{1}(X)\). Subsequently, we employ attention with a merging operation, denoted \(\mathcal{E}_{2}\), for second-stage extraction. This yields a shorter sequence \(Z_{K} \in \mathbb{R}^{K \times d}\) along with a source matrix \(S \in \mathbb{R}^{K \times L}\) that preserves the spatial relationships of the sequence, expressed as \(Z_{K}, S = \mathcal{E}_{2}(Z_{L})\). To ensure that \(Z_{K}\) possesses rich semantics, we concurrently impose global alignment on \(Z_{K}\).
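As a minimal sketch of this interface (not the exact MergeVQ merge module, which uses ToMe-style attention-based merging), the following PyTorch snippet reduces an \(L\)-length sequence to \(K\) tokens and records a binary source matrix \(S\); the greedy nearest-destination assignment is an illustrative stand-in:

```python
import torch

def merge_tokens(z_l: torch.Tensor, k: int):
    """Reduce an (L, d) token sequence to (K, d) and record the source
    matrix S. The greedy nearest-destination assignment below is only a
    stand-in for the attention-based (ToMe-style) merging in the encoder."""
    L, _ = z_l.shape
    keep = z_l[:k]                                  # K merge destinations
    sim = torch.einsum("ld,kd->lk", z_l, keep)      # token-destination similarity
    assign = sim.argmax(dim=1)                      # destination slot per token
    s = torch.zeros(k, L)                           # S[i, j] = 1 iff token j
    s[assign, torch.arange(L)] = 1.0                # merges into slot i
    z_k = (s @ z_l) / s.sum(dim=1, keepdim=True).clamp(min=1)  # mean per slot
    return z_k, s

z_l = torch.randn(256, 64)            # L=256 tokens from the CNN stage E_1
z_k, s = merge_tokens(z_l, k=36)      # K=36 semantic tokens + source matrix
```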

Quantization: We utilize Look-up Free Quantization (LFQ) as MergeVQ's quantization module to minimize the loss of details. Specifically, the codebook is reduced to an integer set, denoted as \(\mathcal{C} = \times_{i=1}^{N} \{-1, 1\}\) with \(|\mathcal{C}| = 2^{N}\). The quantization can thus be summarized as \(z_{Ki} = \text{sign}(z_{Ki}) = -1 \cdot \mathbb{I}(z_{Ki} < 0) + \mathbb{I}(z_{Ki} > 0)\), where \(z_{Ki}\) denotes the \(i\)-th vector among the \(K\) semantic tokens \(Z_{K}\). The index of the quantized feature \(z_{Ki}\) is then formulated as \(\text{Index}(z_{Ki}) = \sum_{j=1}^{N} 2^{j-1} \cdot \mathbb{I}(z_{Kij} > 0)\). Finally, we obtain the quantized semantic tokens, denoted as \(Z_{Kq} = \mathcal{Q}(Z_{K}, \mathcal{C})\).
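The snippet below illustrates the two formulas above in PyTorch: each channel is signed into \(\{-1, +1\}\), and the index is read off as an \(N\)-bit binary code (the straight-through gradient used in training is omitted):

```python
import torch

def lfq_quantize(z_k: torch.Tensor):
    """LFQ sketch: sign each channel into {-1, +1}, then read the codebook
    index off as an N-bit binary number (N = channel dimension). The
    straight-through gradient trick used in training is omitted."""
    z_q = torch.where(z_k > 0, torch.ones_like(z_k), -torch.ones_like(z_k))
    bits = (z_k > 0).long()                        # I(z_{Kij} > 0)
    powers = 2 ** torch.arange(z_k.shape[-1])      # 2^{j-1} for j = 1..N
    index = (bits * powers).sum(dim=-1)            # Index(z_{Ki})
    return z_q, index

z_k = torch.randn(36, 18)        # K=36 tokens, N=18 bits -> |C| = 2^18
z_kq, idx = lfq_quantize(z_k)    # z_kq in {-1,+1}^{36x18}, idx in [0, 2^18)
```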

Token Recovery and Reconstruction: We first perform token-level recovery with the recovery module \(\mathcal{R}(\cdot, \cdot)\) and the source matrix \(S\), which yields a new \(L\)-length sequence \(\hat{Z}_{L} = \mathcal{R}(Z_{Kq}, S)\). This sequence is then fed into the decoder \(\mathcal{D}\) for pixel-level reconstruction, described as \(\hat{X} = \mathcal{D}(\hat{Z}_{L})\).
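The positional part of this recovery reduces to a matrix product, sketched below: since row \(j\) of \(S^{\top}\) is one-hot on the slot that token \(j\) was merged into, \(S^{\top} Z_{Kq}\) copies each quantized token back to all of its source positions. The learned recovery module then refines these copies (via cross-attention in the decoder); the snippet shows only the scatter:

```python
import torch

K, L, d = 36, 256, 64
z_kq = torch.randn(K, d)                 # quantized semantic tokens Z_Kq
s = torch.zeros(K, L)                    # source matrix from the merge step
s[torch.randint(0, K, (L,)), torch.arange(L)] = 1.0

# S^T is (L, K): row j is one-hot on the slot token j was merged into,
# so S^T @ Z_Kq broadcasts each merged token back to its source positions.
z_hat_l = s.t() @ z_kq                   # (L, d), then refined and fed to D
```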

3.2 Harmonizing Reconstruction and Representation

Inspired by Masked Image Modeling in representation learning, we employ Token Merge and Source Recovery to integrate representation learning into the MergeVQ framework. Additionally, we impose global alignment constraints to further improve performance and training stability.
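As an illustration of what such a constraint can look like (the exact formulation is not spelled out here; Table 1 lists CE as the alignment target, so we sketch a DINO-style cross-entropy between pooled student and teacher tokens, with hypothetical temperatures and prototype dimension):

```python
import torch
import torch.nn.functional as F

def global_alignment_loss(student_logits, teacher_logits, t_s=0.1, t_t=0.04):
    """Cross-entropy alignment sketch: pooled student token logits are
    matched to the distribution of a frozen teacher view. Temperatures
    and the prototype dimension are hypothetical."""
    p_s = F.log_softmax(student_logits.mean(dim=1) / t_s, dim=-1)
    p_t = F.softmax(teacher_logits.mean(dim=1) / t_t, dim=-1)
    return -(p_t * p_s).sum(dim=-1).mean()

# (batch, K tokens, prototype logits); teacher side carries no gradient
loss = global_alignment_loss(torch.randn(8, 36, 4096),
                             torch.randn(8, 36, 4096).detach())
```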

[Figure: Overview of the MergeVQ framework.]

Efficient Generation

4.1 MergeAR with KV Cache Compression

To achieve efficient autoregressive generation, we introduce MergeAR, which exploits token sparsity and a position-recording scheme to accelerate the generation process. During training, we sample a merge ratio \(r\), introduce a Merge Instruction Token \(M\), and construct a causal mask accordingly. At inference, we utilize a KV cache to prune repeated tokens, further improving efficiency.
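A minimal sketch of the inference-time idea, under the assumption that "pruning repeated tokens" means keeping only the first occurrence of each duplicated token id in the KV cache while recording the kept positions:

```python
import torch

def prune_kv_cache(token_ids, k_cache, v_cache):
    """MergeAR inference sketch: keep only the first occurrence of each
    duplicated token id in the KV cache, recording the kept positions so
    the full raster-order layout remains recoverable."""
    seen, keep = set(), []
    for pos, tok in enumerate(token_ids.tolist()):
        if tok not in seen:               # repeats carry no new content
            seen.add(tok)
            keep.append(pos)
    idx = torch.tensor(keep)
    return token_ids[idx], k_cache[:, idx], v_cache[:, idx], idx

ids = torch.tensor([5, 9, 5, 5, 2, 9])
k = torch.randn(8, 6, 16)                 # (heads, seq, head_dim)
v = torch.randn(8, 6, 16)
ids2, k2, v2, pos = prune_kv_cache(ids, k, v)   # keeps positions 0, 1, 4
```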

[Figure: Illustration of the MergeAR pipeline.]

4.2 Randomized Autoregressive Generation with Source Recovery

MergeVQ can also be implemented within the RandAR generative framework, where the \(K\) quantized tokens and the source matrix are used for both training and generation. The source recovery module and decoder then restore the full token sequence, ensuring accurate and efficient generation of visual data.
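For intuition, a RandAR-style input sequence can be formed by shuffling the \(K\) content tokens and interleaving each with a position-instruction token; the snippet below is purely illustrative (the position-token id range `POS_OFFSET` is hypothetical):

```python
import torch

POS_OFFSET = 1000   # hypothetical id range reserved for position tokens

def randar_sequence(token_ids: torch.Tensor):
    """RandAR-style input sketch: shuffle the K content tokens and interleave
    each with a position-instruction token telling the AR model where the
    next content token belongs in the original grid."""
    K = token_ids.shape[0]
    order = torch.randperm(K)                      # random generation order
    pos_tokens = POS_OFFSET + order                # one position token per slot
    pairs = torch.stack([pos_tokens, token_ids[order]], dim=1)
    return pairs.reshape(-1), order                # length-2K interleaved sequence

seq, order = randar_sequence(torch.arange(100, 136))   # K = 36 content tokens
```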

Experiments

5.1 Implementation Details

We offer three versions of MergeVQ for different application scenarios: MergeVQ (G) for pure generation, MergeVQ (G+R) for both generation and representation, and MergeVQ (R) for representation learning only. Each version is equipped with an encoder whose architecture and parameter count are tailored to its target task. All models are trained with the AdamW optimizer and the loss functions corresponding to each variant's objectives. The visual generator adopts a LLaMA-based architecture and is trained with dedicated settings for high-quality visual generation.

5.2 Self-supervised Pre-training

To evaluate the self-supervised pre-trained models, we conducted linear probing and end-to-end fine-tuning experiments on ImageNet-1K. Table 1 compares MergeVQ with existing contrastive and MIM pre-training methods.

| Support Tasks | Method | Date | Align. Target | Rec. Target | Epochs | Encoder Type | #Param | #Tokens | Accuracy (Lin.) | Accuracy (FT) |
|---|---|---|---|---|---|---|---|---|---|---|
| Contrastive Pre-training | BYOL [22] | NeurIPS'2020 | MSE | - | 800 | R50-W2 | 94M | 7×7 | 75.6 | - |
| Contrastive Pre-training | MoCoV3 [12] | ICCV'2021 | InfoNCE | - | 300 | ViT-B | 86M | 196 | 76.7 | 83.2 |
| Contrastive Pre-training | DINO‡ [9] | ICCV'2021 | CE | - | 300 | ViT-B | 86M | 196 | 78.2 | 83.6 |
| Contrastive Pre-training | DINOv2‡ [46] | TMLR'2024 | CE | - | 1000 | ViT-B | 86M | 196 | 84.5 | 85.7 |
| MIM Pre-training | BEiT [3] | ICLR'2022 | - | DALLE | 800 | ViT-B | 86M | 196 | 56.7 | 83.2 |
| MIM Pre-training | iBOT‡ [75] | ICLR'2022 | CE | EMA | 800 | ViT-B | 86M | 196 | 76.0 | 84.0 |
| MIM Pre-training | MAE [24] | CVPR'2022 | - | RGB | 1600 | ViT-B | 86M | 196 | 68.0 | 83.6 |
| MIM Pre-training | SimMIM [62] | CVPR'2022 | - | RGB | 800 | ViT-B | 86M | 196 | 67.9 | 83.8 |
| MIM Pre-training | CAE [13] | IJCV'2023 | - | DALLE | 1600 | ViT-B | 86M | 196 | 70.4 | 83.6 |
| MIM Pre-training | PeCo [14] | AAAI'2023 | - | VQVAE | 800 | ViT-B | 86M | 196 | 72.3 | 83.9 |
| Ours | MergeVQ (R) | - | CE | LFQ | 800 | ViT-B | 86M | 196 | 80.1 | 85.1 |
| Ours | MergeVQ (G+R) | - | CE | LFQ | 800 | ViT-B | 86M | 196 | 79.8 | 84.9 |
| Ours | MergeVQ (G) | - | - | LFQ | 800 | ViT-B | 86M | 196 | 79.2 | 84.3 |

5.3 Image Generation

We conducted image generation experiments comparing MergeVQ with several state-of-the-art methods in terms of FID and IS scores. The results in Table 2 highlight the strong performance of MergeVQ in image generation while sustaining higher throughput.

| Method | FID ↓ | IS ↑ | Speed (imgs/s) |
|---|---|---|---|
| StableDiffusion [61] | 11.32 | 12.04 | 0.5 |
| OpenMAGVIT2 [42] | 8.97 | 13.56 | 0.3 |
| MaskGiT [10] | 9.12 | 13.21 | 1.2 |
| ToMe + MaskGiT [6] | 9.34 | 13.12 | 2.5 |
| MergeVQ (G) | 7.89 | 14.23 | 3.0 |
| MergeVQ (G+R) | 7.95 | 14.18 | 2.8 |
[Figure: Visualization of results.]

Contributing

Please feel free to raise issues or submit pull requests to contribute to our codebase.

Citation

    @article{li2025mergevq,
      title={MergeVQ: A Unified Framework for Visual Generation and Representation with Token Merging and Quantization},
      author={Li, Siyuan and Zhang, Luyuan and Wang, Zedong and Tian, Juanxi and Tan, Cheng and Liu, Zicheng and Yu, Chang and Xie, Qingsong and Lu, Haonan and Wang, Haoqian and Lei, Zhen},
      year={2025}
    }