Overview

Token Merging (ToMe) is a performance optimization that accelerates diffusion models by merging similar tokens inside the attention blocks. By combining redundant tokens instead of recomputing them, ToMe achieves a 20-60% speedup with minimal quality impact.

Unlike feature caching (DeepCache, WaveSpeed), ToMe reduces the computational graph itself: fewer tokens mean fewer attention operations, less memory traffic, and faster generation.

This is a training-free, drop-in optimization that works with all Stable Diffusion models (SD1.5, SDXL) and can be combined with other speedup techniques.

How It Works

The Token Redundancy Problem

Diffusion models process images as sequences of tokens (patches):

Input Image (512×512) → Tokenize → 4096 tokens (64×64 grid of 8×8 patches)

At each attention layer, every token attends to every other token:

\[ \text{Attention Cost} = O(N^2 \cdot D) \]

Where:

  • \(N\) = number of tokens (e.g., 4096 for 512×512)
  • \(D\) = embedding dimension (e.g., 768 or 1024)
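
To get a feel for the scale, plug the example values into the formula (illustrative arithmetic only):

# Illustrative attention-cost arithmetic using the example values above.
N, D = 4096, 768                  # tokens for a 512×512 image, embedding dimension
cost = N ** 2 * D                 # O(N^2 · D)
print(f"{cost:.2e}")              # ≈ 1.29e+10 multiply-accumulates per attention layer
print(N ** 2 / (N // 2) ** 2)     # 4.0 → halving the token count cuts the N^2 term ~4x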

Key insight: Many tokens are highly similar (e.g., sky regions, uniform backgrounds, smooth gradients). Computing attention between nearly-identical tokens is redundant.

The ToMe Solution

Token Merging reduces redundancy through bipartite matching:

Step 1: Split tokens into two sets
┌──────────────────────┬──────────────────────┐
│ Destination Set (dst)│ Source Set (src)     │
│ [Token 1, 3, 5, ...] │ [Token 2, 4, 6, ...] │
└──────────────────────┴──────────────────────┘

Step 2: Compute pairwise cosine similarity
   dst[0] ↔ src[0]: 0.92  (highly similar!)
   dst[0] ↔ src[1]: 0.34
   dst[0] ↔ src[2]: 0.18
   ...

Step 3: Merge most similar pairs
   merged_token[0] = (dst[0] + src[0]) / 2

Step 4: Continue with fewer tokens
   4096 tokens → 2048 tokens (50% merge ratio)
   Attention cost reduced by ~4x

Merging is applied independently at each attention layer; which layers participate is controlled by the max_downsample setting described in the Configuration section below.
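
For readers who prefer code, here is a minimal PyTorch sketch of one bipartite merge step, following the simplified alternating split shown above. It is illustrative only and not the tomesd implementation (which, among other things, records token positions so merges can be undone where the network needs the original layout):

# Minimal, self-contained sketch of one bipartite merge step (not the tomesd code).
import torch
import torch.nn.functional as F

def bipartite_merge(tokens: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    """tokens: (N, D). Merge the most redundant src tokens into their nearest dst token."""
    dst, src = tokens[0::2], tokens[1::2]                        # Step 1: alternating split
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).T  # Step 2: cosine similarity, (|src|, |dst|)
    best_sim, best_dst = sim.max(dim=-1)                         # nearest dst for every src token
    r = min(int(ratio * tokens.shape[0]), src.shape[0])          # Step 3: tokens to remove (ratio of the total)
    merge_idx = best_sim.topk(r).indices                         # the r most redundant src tokens
    keep = torch.ones(src.shape[0], dtype=torch.bool)
    keep[merge_idx] = False
    index = best_dst[merge_idx].unsqueeze(-1).expand(-1, dst.shape[-1])
    dst = dst.scatter_reduce(0, index, src[merge_idx],           # Step 4: average merged src into dst
                             reduce="mean", include_self=True)
    return torch.cat([dst, src[keep]], dim=0)                    # fewer tokens out (order not preserved)

print(bipartite_merge(torch.randn(4096, 768), ratio=0.5).shape)  # torch.Size([2048, 768])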

Configuration

Parameters

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| tome_enabled | bool | False | - | Enable Token Merging |
| tome_ratio | float | 0.5 | 0.0-0.9 | Fraction of tokens to merge (higher = faster, lower quality) |
| tome_max_downsample | int | 1 | 1, 2, 4, 8 | Apply ToMe to layers with downsampling ≤ this value |

Choosing tome_max_downsample

Controls which UNet layers apply ToMe:

| Value | Layers Affected | Speed vs Quality |
|---|---|---|
| 1 | Only full-resolution layers (4/15) | Conservative, minimal quality impact |
| 2 | Half-resolution layers (8/15) | Balanced |
| 4 | Quarter-resolution layers (12/15) | Aggressive |
| 8 | All layers (15/15) | Maximum speedup, noticeable quality loss |

Recommendation: Start with max_downsample=1. Only increase if you need more speedup and can tolerate quality reduction.
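
For quick experiments outside this project's pipeline, the same knobs can be exercised with the tomesd library directly. A minimal sketch against a diffusers pipeline (the model ID and prompt are just examples):

# Standalone sketch: patch a diffusers pipeline with tomesd (model ID is an example).
import tomesd
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
tomesd.apply_patch(pipe, ratio=0.5, max_downsample=1)   # 50% merge, full-resolution layers only
image = pipe("a cyberpunk cityscape at night").images[0]
tomesd.remove_patch(pipe)                               # undo the patch when done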

Usage

Streamlit UI

Enable in the 🔀 Token Merging (ToMe) expander:

  1. Check Enable Token Merging
  2. Select a preset:
     • Conservative — 30% merge, max_downsample=2 (minimal impact)
     • Balanced — 50% merge, max_downsample=1 (recommended)
     • Aggressive — 70% merge, max_downsample=1 (maximum speed)
     • Custom — manual slider control
  3. Generate images — the console confirms activation

Visual feedback:

✓ Token Merging ACTIVE: 50% merge ratio, max_downsample=1

REST API

Include in your generation request:

curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "a cyberpunk cityscape at night, neon lights",
        "width": 1024,
        "height": 512,
        "steps": 25,
        "tome_enabled": true,
        "tome_ratio": 0.5,
        "tome_max_downsample": 1
      }'
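
The same request from Python, using the requests package (endpoint and fields exactly as in the curl example above):

# Equivalent request via Python's requests library (same endpoint and fields as above).
import requests

payload = {
    "prompt": "a cyberpunk cityscape at night, neon lights",
    "width": 1024,
    "height": 512,
    "steps": 25,
    "tome_enabled": True,
    "tome_ratio": 0.5,
    "tome_max_downsample": 1,
}
response = requests.post("http://localhost:7861/api/generate", json=payload)
response.raise_for_status()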

Python API

from src.user.pipeline import pipeline

pipeline(
    prompt="a detailed fantasy castle on a cliff",
    w=768,
    h=1024,
    steps=30,
    sampler="dpmpp_sde_cfgpp",
    scheduler="ays",
    tome_enabled=True,
    tome_ratio=0.5,
    tome_max_downsample=1,
    number=4  # Generate multiple images faster
)

Troubleshooting

"No speedup detected"

Possible causes:

  1. tomesd not installed — install with pip install tomesd
  2. Other bottlenecks — enable only ToMe for isolated testing
  3. Very low resolution — ToMe benefits are minimal below 512px

Solutions:

# Check installation
python -c "import tomesd; print('ToMe available')"

# Test in isolation at 1024×512 (ideal resolution for ToMe)
python quick_tome_test.py

"Images look blurry or soft"

Cause: tome_ratio too high (>0.6) or max_downsample too aggressive (>2).

Solutions:

  • Reduce tome_ratio to 0.4-0.5
  • Lower max_downsample to 1
  • Increase steps to 30-35 for better convergence
  • Disable ToMe for final high-quality renders
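
For example, with the Python API shown earlier, a softer configuration looks like this (32 steps is an illustrative value in the suggested 30-35 range):

# Softer ToMe settings for a final render, using the project's Python API shown earlier.
from src.user.pipeline import pipeline

pipeline(
    prompt="a detailed fantasy castle on a cliff",
    w=768,
    h=1024,
    steps=32,                 # 30-35 steps helps convergence when merging is active
    tome_enabled=True,
    tome_ratio=0.4,           # back off from higher ratios if outputs look soft
    tome_max_downsample=1,
)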

"Minimal speedup despite 70% merge"

Cause: The pipeline is already bottlenecked elsewhere (VAE decode, sampling overhead), often because other optimizations (DeepCache, Multi-Scale) have already reduced the UNet's attention cost.

Solutions:

  • Profile with isolated tests (disable all other optimizations)
  • Ensure the GPU isn't memory-bound (reduce batch size)
  • Check system monitoring for CPU/disk bottlenecks

"Model fails to load / tomesd errors"

Cause: Outdated tomesd version or incompatible model architecture.

Solutions:

# Update tomesd
pip install --upgrade tomesd

# Check compatibility (ToMe only works with UNet-based models)
# Flux/Transformer models require different ToMe variant (not yet supported)

Technical Details

Implementation

ToMe is applied via the ModelPatcher class (src/Model/ModelPatcher.py):

def apply_tome(self, ratio: float = 0.5, max_downsample: int = 1) -> bool:
    """Apply Token Merging to the diffusion model."""
    # Remove any existing patch (handles cached models)
    try:
        tomesd.remove_patch(self)
    except Exception:
        # No prior patch to remove; safe to ignore
        pass

    # Apply ToMe patch
    tomesd.apply_patch(
        self,  # ModelPatcher with .model.diffusion_model structure
        ratio=ratio,
        max_downsample=max_downsample
    )
    self.tome_enabled = True
    return True

Cache handling: ToMe patches are removed after each generation and re-applied as needed, ensuring correct behavior with model caching.

Bipartite Matching Algorithm

ToMe uses bipartite soft matching:

  1. Partition tokens: $$ T_{\text{dst}}, T_{\text{src}} = \text{partition}(T, \text{stride}=(2,2)) $$

  2. Compute similarity matrix: $$ S_{ij} = \frac{T_{\text{dst}}[i] \cdot T_{\text{src}}[j]}{||T_{\text{dst}}[i]|| \cdot ||T_{\text{src}}[j]||} $$

  3. Find top-k matches: $$ k = \lfloor \text{ratio} \times |T_{\text{src}}| \rfloor $$

  4. Merge tokens: $$ T'[i] = \frac{T_{\text{dst}}[i] + T_{\text{src}}[\text{match}(i)]}{2} $$

Compatibility

| Feature | Compatible? | Notes |
|---|---|---|
| SD1.5 models | Yes | Full support, tested extensively |
| SDXL models | Yes | Full support, larger speedup |
| Flux models | No | UNet-specific; Transformer variant TBD |
| All samplers | Yes | ToMe patches attention, agnostic to sampler |
| CFG-Free | Yes | No interaction; both apply independently |
| DeepCache | Yes | Excellent combination, speedups multiply |
| Multi-Scale | Yes | Compatible, benefits stack |
| HiRes Fix | Yes | Applied to all upscaling passes |
| ADetailer | Yes | Applied to detail-enhancement passes |
| Stable-Fast | Yes | Can combine for maximum speedup |

Limitations

  1. UNet-only: Transformer architectures (Flux) use different attention patterns — dedicated Transformer-ToMe needed
  2. Detail sensitivity: High-frequency textures (fabric weave, individual hairs) see most quality impact
  3. Diminishing returns: Beyond 60% merge, quality degrades faster than speed improves
  4. One-time patch: Doesn't adapt merge ratio dynamically during generation

Related Optimizations

  • DeepCache: Feature caching — complements ToMe, speedups multiply (~2.8x combined)
  • Multi-Scale Diffusion: Resolution-based optimization — also reduces token count
  • Stable-Fast: Compilation-based speedup — can combine for maximum performance

References & Further Reading

  • Original Paper: Token Merging for Fast Stable Diffusion (Bolya & Hoffman, 2023)
  • tomesd Library: https://github.com/dbolya/tomesd
  • ToMe for Vision Transformers: https://github.com/facebookresearch/ToMe