Overview
WaveSpeed is the project's caching-oriented optimization layer for reusing work across denoising steps. In the current codebase, the integrated path is DeepCache for UNet-based models, and the repository also contains groundwork for a Flux-oriented First Block Cache path.
LightDiffusion-Next contains two WaveSpeed-related implementations:
- DeepCache — Integrated for UNet-based models (SD1.5, SDXL)
- First Block Cache (FBCache) — Flux-oriented cache machinery present in the codebase
Both are training-free. DeepCache is the user-facing path today; First Block Cache is codebase groundwork for a more specialized transformer caching path.
How It Works
Core Insight
Diffusion models denoise images iteratively over 20-50 steps. Researchers observed that:
- High-level features (semantic structure, composition) change slowly across steps
- Low-level features (fine details, textures) require frequent updates
WaveSpeed aims to reduce repeated computation across nearby denoising steps by reusing information from earlier steps where practical.
DeepCache (UNet Models)
DeepCache is the integrated WaveSpeed path for UNet models.
Cache step (every N steps):
1. Run the full denoiser path
2. Store the output for later reuse

Reuse step (intermediate steps):
1. Reuse the cached denoiser output
2. Skip the full model recomputation for that step
Speedup: ~50-70% time saved per reuse step → 2-3x total speedup with interval=3
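The arithmetic behind the speedup claim can be sketched directly: with interval=3, only one step in three runs the full model. A minimal estimate, where reuse_cost (the fraction of a full step that a reuse step still costs) is an assumed parameter:

```python
def estimated_speedup(total_steps: int, interval: int, reuse_cost: float = 0.3) -> float:
    """Rough DeepCache speedup estimate.

    A full forward runs every `interval` steps; the remaining steps
    reuse the cache at `reuse_cost` of a full step's time (an assumption
    in the 50-70%-saved range quoted above).
    """
    full = (total_steps + interval - 1) // interval  # cache-update steps
    reused = total_steps - full                      # cheap reuse steps
    return total_steps / (full + reused * reuse_cost)

# 30 steps at interval=3: 10 full + 20 cheap steps
print(round(estimated_speedup(30, 3), 2))
```

Real-world numbers differ because reuse steps are not free and samplers vary, but this shows why higher intervals trade quality for speed.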
First Block Cache (Flux Models)
Flux uses Transformer blocks instead of UNet convolutions. The repository includes a First Block Cache implementation for this architecture family:
┌─────────────────────────────────────────┐
│ First Transformer Block (always run) │ ← Computes initial features
├─────────────────────────────────────────┤
│ Remaining Blocks (cached if similar) │ ← FBCache caching zone
└─────────────────────────────────────────┘
Cache decision logic:
1. Run the first Transformer block
2. Compare its output to the previous step's output
3. If the difference < threshold: reuse the cached remaining blocks
4. If the difference ≥ threshold: run all blocks and update the cache
In the current project structure, this cache path is implementation groundwork rather than a standard generation toggle like DeepCache.
DeepCache Configuration
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| cache_interval | int | 3 | Steps between cache updates (higher = faster, lower quality) |
| cache_depth | int | 2 | UNet depth for caching (0-12; higher = more aggressive) |
| start_step | int | 0 | Timestep at which caching starts (0-1000) |
| end_step | int | 1000 | Timestep at which caching stops (0-1000) |
Streamlit UI
Enable in the ⚡ DeepCache Acceleration expander:
- Check Enable DeepCache
- Adjust sliders:
- Cache Interval: 1-10 (default: 3)
- Cache Depth: 0-12 (default: 2)
- Start/End Steps: 0-1000 (default: 0/1000)
- Generate images — caching applies transparently
REST API
curl -X POST http://localhost:7861/api/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "a misty forest at twilight",
"width": 768,
"height": 512,
"deepcache_enabled": true,
"deepcache_interval": 3,
"deepcache_depth": 2
}'
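The same request can be issued from Python with only the standard library; the endpoint and field names are taken from the curl example above:

```python
import json
from urllib import request

# Same payload as the curl example.
payload = {
    "prompt": "a misty forest at twilight",
    "width": 768,
    "height": 512,
    "deepcache_enabled": True,
    "deepcache_interval": 3,
    "deepcache_depth": 2,
}

req = request.Request(
    "http://localhost:7861/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# resp = request.urlopen(req)  # uncomment with the server running
```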
Recommended Presets
Balanced (Default)
cache_interval: 3
cache_depth: 2
start_step: 0
end_step: 1000
- Speedup: 2-2.3x
- Quality loss: Very slight (1-2%)
- Use case: Everyday generation
Maximum Speed
cache_interval: 5
cache_depth: 3
start_step: 0
end_step: 1000
- Speedup: 2.5-3x
- Quality loss: Noticeable (5-7%)
- Use case: Rapid prototyping, batch jobs
Maximum Quality
cache_interval: 2
cache_depth: 1
start_step: 0
end_step: 1000
- Speedup: 1.5-2x
- Quality loss: Minimal (<1%)
- Use case: Final renders, client work
Partial Caching (Critical Steps Only)
cache_interval: 3
cache_depth: 2
start_step: 200
end_step: 800
- Speedup: 1.8-2.2x
- Quality loss: Minimal
- Use case: Preserve early structure, late details
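The presets above can be kept as plain dictionaries and turned into API payloads. This helper is illustrative (the names PRESETS and deepcache_payload are not part of the project API); the field names match the REST parameters used elsewhere in this document:

```python
# Hypothetical convenience mapping; parameter values are the presets above.
PRESETS = {
    "balanced":    {"cache_interval": 3, "cache_depth": 2, "start_step": 0,   "end_step": 1000},
    "max_speed":   {"cache_interval": 5, "cache_depth": 3, "start_step": 0,   "end_step": 1000},
    "max_quality": {"cache_interval": 2, "cache_depth": 1, "start_step": 0,   "end_step": 1000},
    "partial":     {"cache_interval": 3, "cache_depth": 2, "start_step": 200, "end_step": 800},
}

def deepcache_payload(preset: str) -> dict:
    """Translate a preset into the deepcache_* API fields."""
    cfg = PRESETS[preset]
    return {
        "deepcache_enabled": True,
        "deepcache_interval": cfg["cache_interval"],
        "deepcache_depth": cfg["cache_depth"],
        "deepcache_start_step": cfg["start_step"],
        "deepcache_end_step": cfg["end_step"],
    }
```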
First Block Cache (FBCache) Configuration
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| residual_diff_threshold | float | 0.05 | Maximum relative feature difference below which cached blocks are reused (0.0-1.0) |
Usage
First Block Cache is not currently exposed as a standard per-generation toggle. The implementation is available in the codebase for specialized integration work:
# In src/user/pipeline.py
from src.WaveSpeed import fbcache_nodes

# Create cache context
cache_context = fbcache_nodes.create_cache_context()

# Apply caching to a Flux-style model
with fbcache_nodes.cache_context(cache_context):
    patched_model = fbcache_nodes.create_patch_flux_forward_orig(
        flux_model,
        residual_diff_threshold=0.05,  # Lower = stricter caching
    )
    # Generate images...
Tuning Threshold
- Lower threshold (0.01-0.03): Stricter caching, recomputes more often, higher quality
- Higher threshold (0.05-0.1): Looser caching, reuses more often, higher speedup
- Recommended: 0.05 (balances quality and speed)
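The threshold decision itself is simple. The sketch below uses scalars for clarity (the real implementation compares tensors, and should_reuse_cache is a hypothetical name, not a project function):

```python
def should_reuse_cache(first_out: float, prev_first_out, threshold: float = 0.05) -> bool:
    """FBCache-style decision: reuse cached blocks when the relative
    change in the first block's output is below `threshold`."""
    if prev_first_out is None:
        return False  # first step: nothing cached yet, must run everything
    rel_change = abs(first_out - prev_first_out) / max(abs(first_out), 1e-8)
    return rel_change < threshold
```

With the default 0.05, a step whose first-block output shifts by 1% reuses the cache, while a 10% shift forces a full recompute.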
Performance
Speedup Guidance
Speedup scales with cache interval and depth:
| Model | Cache Interval | Expected Behavior |
|---|---|---|
| SD1.5 | 2 | Moderate speedup, minimal quality loss |
| SD1.5 | 3 | Good speedup, slight quality loss |
| SD1.5 | 5 | High speedup, noticeable quality loss |
| SDXL | 3 | Good speedup, slight quality loss |
| Flux-style caching paths | implementation-specific | Depends on the integration path |
Performance varies based on:
- GPU architecture
- Model size
- Resolution
- Sampler choice
- Number of steps
Recommendation: Start with interval=3 and adjust based on your quality requirements.
VRAM Impact
Caching increases VRAM usage slightly (50-200MB depending on resolution):
| Model | Baseline VRAM | + DeepCache | Increase |
|---|---|---|---|
| SD1.5 (768×512) | 3.2 GB | 3.4 GB | +200 MB |
| SDXL (1024×1024) | 6.8 GB | 7.0 GB | +200 MB |
| Flux (832×1216) | 12.5 GB | 12.6 GB | +100 MB |
Stacking with Other Optimizations
WaveSpeed is fully compatible with SageAttention, SpargeAttn and Stable-Fast:
DeepCache + SageAttention
deepcache_enabled: true
deepcache_interval: 3
# SageAttention auto-detected
Result: 2.2x (DeepCache) × 1.15 (SageAttention) = ~2.5x total speedup
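The combined figure comes from treating the two speedups as roughly independent and multiplying them. That independence is an assumption (caching also reduces the attention work SageAttention sees), so treat the product as an upper-bound estimate:

```python
# Back-of-envelope composition of the ballpark factors quoted above.
deepcache_speedup = 2.2
sage_speedup = 1.15
combined = deepcache_speedup * sage_speedup
print(f"~{combined:.1f}x total")
```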
DeepCache + SpargeAttn
deepcache_enabled: true
deepcache_interval: 3
# SpargeAttn auto-detected
Result: Enhanced speedup from caching and sparse attention
DeepCache + Stable-Fast + SpargeAttn
stable_fast: true
deepcache_enabled: true
deepcache_interval: 3
# SpargeAttn auto-detected
Result: Maximum combined speedup (all optimizations active, batch operations only)
Compatibility
DeepCache Compatible With
- ✅ Stable Diffusion 1.5
- ✅ Stable Diffusion 2.1
- ✅ SDXL
- ✅ All samplers (Euler, DPM++, etc.)
- ✅ LoRA adapters
- ✅ Textual inversion embeddings
- ✅ HiresFix
- ✅ ADetailer
- ✅ Multi-scale diffusion
- ✅ SageAttention/SpargeAttn
- ✅ Stable-Fast
DeepCache NOT Compatible With
- ❌ Flux models (use FBCache instead)
- ❌ Img2Img mode (can cause artifacts)
FBCache Compatible With
- ✅ Flux models
- ✅ SageAttention/SpargeAttn
- ✅ All Flux-compatible features
FBCache NOT Compatible With
- ❌ SD1.5/SDXL (use DeepCache instead)
- ❌ Stable-Fast (Flux not supported by Stable-Fast)
Troubleshooting
No Speedup Observed
Causes:
1. DeepCache disabled or not applied to the correct model type
2. Cache interval too low (interval=1 provides no caching)
3. Model loaded incorrectly
Fixes:
# Check logs for DeepCache activation
grep -i "deepcache\|cache" logs/server.log
# Verify UI toggle is enabled
# Streamlit: Check "Enable DeepCache" checkbox
# API: Ensure "deepcache_enabled": true in payload
# Try higher interval
deepcache_interval: 3 # Instead of 1 or 2
Quality Degradation
Symptoms:
- Blurry details
- Smoothed textures
- Loss of fine patterns

Causes:
1. Cache interval too high
2. Cache depth too aggressive
3. Wrong model type (a Flux model using DeepCache)
Fixes:
# Reduce cache interval
deepcache_interval: 2 # Down from 5
# Reduce cache depth
deepcache_depth: 1 # Down from 3
# Disable caching for critical phases
deepcache_start_step: 200 # Skip early structure formation
deepcache_end_step: 800 # Skip late detail refinement
Artifacts in Img2Img
Symptom: Visible seams, inconsistent styles when using DeepCache with Img2Img.
Cause: Img2Img starts from a noisy input image, which violates DeepCache's assumptions about feature consistency.
Fix: Disable DeepCache for Img2Img:
deepcache_enabled: false # When img2img_enabled: true
VRAM Increase
Symptom: OOM errors after enabling DeepCache.
Cause: Cached features consume additional VRAM.
Fixes:
1. Reduce batch size
2. Lower resolution
3. Disable other VRAM-heavy features (Stable-Fast CUDA graphs)
4. Use lower cache depth:
deepcache_depth: 1  # Minimal caching
Flux FBCache Not Working
Symptom: No speedup with Flux generation.
Cause: Whether the FBCache path engages depends on the integration and the residual threshold; check the logs for the cache hit rate.
Debugging:
# Enable debug logging
export LD_SERVER_LOGLEVEL=DEBUG
# Check cache statistics
grep "cache" logs/server.log
If no cache hits, try adjusting threshold:
# In pipeline.py
residual_diff_threshold=0.1 # Increase from 0.05 for more cache reuse
Quality Comparison
Visual impact of different cache intervals:
| Interval | Speed | Visual Difference |
|---|---|---|
| Disabled | Baseline | Baseline (100% quality) |
| 2 | Faster | Virtually identical |
| 3 | Much faster | Very subtle smoothing |
| 5 | Very fast | Noticeable detail loss |
| 7+ | Fastest | Obvious quality degradation |
Recommendation: Start with interval=3 and adjust based on visual results.
Technical Details
DeepCache Implementation
Simplified pseudocode:
class DeepCacheWrapper:
    def __init__(self, model, interval, depth):
        self.model = model
        self.interval = interval
        self.depth = depth  # which UNet depth is cached in the real code
        self.cached_output = None
        self.current_step = 0

    def forward(self, x, timestep):
        is_cache_step = (self.current_step % self.interval == 0)
        if is_cache_step:
            # Run the full model and cache the output
            output = self.model(x, timestep)
            self.cached_output = output.clone()
        else:
            # Reuse the cached output, skipping the expensive computation
            output = self.cached_output
        self.current_step += 1
        return output
Actual implementation in src/WaveSpeed/deepcache_nodes.py includes:
- Proper timestep tracking
- Cache invalidation on batch changes
- Error handling and fallback to full forward
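The caching pattern above can be exercised with a stub model to confirm how many full forward passes actually run. CountingModel and run_steps are illustrative names for this sketch, not project code:

```python
class CountingModel:
    """Stub standing in for the UNet: counts full forward calls."""
    def __init__(self):
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        return x  # identity stands in for denoising


def run_steps(steps: int, interval: int) -> int:
    """Run the cache-step / reuse-step loop and report full forwards."""
    model = CountingModel()
    cached = None
    for step in range(steps):
        if step % interval == 0:
            cached = model(0.0, step)  # cache step: full forward
        # otherwise: reuse `cached`; no model call
    return model.calls


# 30 steps at interval=3 run only 10 full forwards
print(run_steps(30, 3))
```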
FBCache Residual Comparison
# Compute the first block's output
first_output = first_transformer_block(hidden_states)

# Compare to the previous step's first-block output
residual = first_output - previous_first_output
residual_norm = residual.abs().mean() / first_output.abs().mean()

if residual_norm < threshold:
    # Feature change is small: reuse the cached residual
    hidden_states = apply_cached_residual(first_output)
else:
    # Feature change is large: recompute all blocks and refresh the cache
    hidden_states = run_remaining_blocks(first_output)
    cache_residual(hidden_states)

previous_first_output = first_output  # remember for the next step
Best Practices
For Everyday Use
- Enable DeepCache with default settings (interval=3, depth=2)
- Stack with SageAttention for 2.5x+ total speedup
- Disable for final client renders if absolute quality is critical
For Batch Processing
- Use aggressive caching (interval=5, depth=3)
- Pre-generate previews at high speed, then re-render winners at full quality
- Disable TAESD previews to avoid overhead (set enable_preview=false)
For Low VRAM
- Use conservative caching (interval=2, depth=1)
- Avoid stacking with Stable-Fast CUDA graphs
- Monitor VRAM via the /api/telemetry endpoint
Citation
If you use WaveSpeed/DeepCache in your work:
@inproceedings{ma2023deepcache,
  title={DeepCache: Accelerating Diffusion Models for Free},
  author={Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
  booktitle={CVPR},
  year={2024}
}
Resources
- DeepCache Paper
- DeepCache Repository
- ComfyUI DeepCache Implementation (reference for LightDiffusion-Next)
- First Block Cache Discussion