Overview
WaveSpeed is a collection of feature caching strategies that exploit temporal redundancy in diffusion processes. By reusing high-level features across multiple denoising steps, WaveSpeed can provide significant speedup with tunable quality trade-offs.
LightDiffusion-Next implements two WaveSpeed variants:
- DeepCache — For UNet-based models (SD1.5, SDXL)
- First Block Cache (FBCache) — For Transformer-based models (Flux)
Both are training-free, work alongside other optimizations, and can be toggled per generation.
How It Works
Core Insight
Diffusion models denoise images iteratively over 20-50 steps. Researchers observed that:
- High-level features (semantic structure, composition) change slowly across steps
- Low-level features (fine details, textures) require frequent updates
WaveSpeed caches the expensive high-level computations and reuses them for several steps, only updating low-level details cheaply.
DeepCache (UNet Models)
DeepCache targets the middle and output blocks of the UNet architecture:
```text
┌─────────────────────────────────────────┐
│ Input Blocks (always computed)          │
├─────────────────────────────────────────┤
│ Middle Blocks (cached every N steps)    │ ← DeepCache caching zone
├─────────────────────────────────────────┤
│ Output Blocks (cached every N steps)    │ ← DeepCache caching zone
└─────────────────────────────────────────┘
```
Cache step (every N steps):
1. Run a full forward pass through all UNet blocks
2. Store the middle/output block activations in the cache

Reuse step (the N-1 intermediate steps):
1. Run only the input blocks
2. Retrieve the cached middle/output activations
3. Skip the expensive middle/output block computation
Speedup: ~50-70% time saved per reuse step → 2-3x total speedup with interval=3
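To make the schedule concrete, here is a tiny standalone sketch (not from the codebase) showing which of 12 sampling steps run the full UNet at interval=3:

```python
# Cache/reuse schedule for cache_interval=3 over 12 sampling steps
interval = 3
for step in range(12):
    if step % interval == 0:
        print(f"step {step}: full pass, cache middle/output activations")
    else:
        print(f"step {step}: input blocks only, reuse cached activations")
```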
First Block Cache (Flux Models)
Flux uses Transformer blocks instead of UNet convolutions. FBCache applies a similar principle:
```text
┌─────────────────────────────────────────┐
│ First Transformer Block (always run)    │ ← Computes initial features
├─────────────────────────────────────────┤
│ Remaining Blocks (cached if similar)    │ ← FBCache caching zone
└─────────────────────────────────────────┘
```
Cache decision logic:
1. Run the first Transformer block
2. Compare its output to the previous step's output
3. If the difference < threshold: reuse the cached remaining blocks
4. If the difference ≥ threshold: run all blocks and update the cache
Adaptive caching: Automatically decides when to cache vs. recompute based on feature similarity.
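The decision reduces to a relative-difference test. A minimal sketch with illustrative names (the real check lives in the WaveSpeed code):

```python
import torch

def should_reuse_cache(first_output: torch.Tensor,
                       previous_output: torch.Tensor,
                       threshold: float = 0.05) -> bool:
    # Relative mean absolute change in the first block's output
    rel_change = (first_output - previous_output).abs().mean() / first_output.abs().mean()
    return rel_change.item() < threshold
```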
DeepCache Configuration
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `cache_interval` | int | 3 | Steps between cache updates (higher = faster, lower quality) |
| `cache_depth` | int | 2 | UNet depth for caching (0-12, higher = more aggressive) |
| `start_step` | int | 0 | Timestep to start caching (0-1000) |
| `end_step` | int | 1000 | Timestep to stop caching (0-1000) |
Streamlit UI
Enable in the ⚡ DeepCache Acceleration expander:
- Check Enable DeepCache
- Adjust the sliders:
  - Cache Interval: 1-10 (default: 3)
  - Cache Depth: 0-12 (default: 2)
  - Start/End Steps: 0-1000 (default: 0/1000)
- Generate images — caching applies transparently
REST API
```bash
curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a misty forest at twilight",
    "width": 768,
    "height": 512,
    "deepcache_enabled": true,
    "deepcache_interval": 3,
    "deepcache_depth": 2
  }'
```
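The same request from Python, using the requests library against the endpoint shown above:

```python
import requests

payload = {
    "prompt": "a misty forest at twilight",
    "width": 768,
    "height": 512,
    "deepcache_enabled": True,
    "deepcache_interval": 3,
    "deepcache_depth": 2,
}
resp = requests.post("http://localhost:7861/api/generate", json=payload)
resp.raise_for_status()
```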
Recommended Presets
Balanced (Default)
```yaml
cache_interval: 3
cache_depth: 2
start_step: 0
end_step: 1000
```
- Speedup: 2-2.3x
- Quality loss: Very slight (1-2%)
- Use case: Everyday generation
Maximum Speed
```yaml
cache_interval: 5
cache_depth: 3
start_step: 0
end_step: 1000
```
- Speedup: 2.5-3x
- Quality loss: Noticeable (5-7%)
- Use case: Rapid prototyping, batch jobs
Maximum Quality
```yaml
cache_interval: 2
cache_depth: 1
start_step: 0
end_step: 1000
```
- Speedup: 1.5-2x
- Quality loss: Minimal (<1%)
- Use case: Final renders, client work
Partial Caching (Critical Steps Only)
```yaml
cache_interval: 3
cache_depth: 2
start_step: 200
end_step: 800
```
- Speedup: 1.8-2.2x
- Quality loss: Minimal
- Use case: Preserve early structure, late details
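To drive this preset over the REST API, here is a hedged sketch; the start/end field names are assumed to match the `deepcache_start_step`/`deepcache_end_step` keys used in the Troubleshooting section:

```python
import requests

# Assumed field names; verify against your server version
payload = {
    "prompt": "a misty forest at twilight",
    "deepcache_enabled": True,
    "deepcache_interval": 3,
    "deepcache_depth": 2,
    "deepcache_start_step": 200,
    "deepcache_end_step": 800,
}
requests.post("http://localhost:7861/api/generate", json=payload)
```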
First Block Cache (FBCache) Configuration
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `residual_diff_threshold` | float | 0.05 | Max feature difference to trigger cache reuse (0.0-1.0) |
Usage
FBCache is applied automatically when generating Flux images. No UI controls yet — configured via pipeline code:
```python
# In src/user/pipeline.py
from src.WaveSpeed import fbcache_nodes

# Create cache context
cache_context = fbcache_nodes.create_cache_context()

# Apply caching to the Flux model
with fbcache_nodes.cache_context(cache_context):
    patched_model = fbcache_nodes.create_patch_flux_forward_orig(
        flux_model,
        residual_diff_threshold=0.05,  # Lower = stricter caching
    )
    # Generate images...
```
Tuning Threshold
- Lower threshold (0.01-0.03): Stricter caching, recomputes more often, higher quality
- Higher threshold (0.05-0.1): Looser caching, reuses more often, higher speedup
- Recommended: 0.05 (balances quality and speed)
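One way to pick a value empirically is to sweep the threshold with a fixed seed, reusing the pipeline API from the snippet above (a sketch only; the generation call is elided):

```python
# Sweep residual_diff_threshold and compare output quality vs. speed
for threshold in (0.01, 0.03, 0.05, 0.1):
    cache_context = fbcache_nodes.create_cache_context()
    with fbcache_nodes.cache_context(cache_context):
        patched_model = fbcache_nodes.create_patch_flux_forward_orig(
            flux_model,
            residual_diff_threshold=threshold,
        )
        # ... generate a fixed-seed test image and record timing ...
```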
Performance
Speedup Guidance
Speedup scales with cache interval and depth:
| Model | Cache Interval | Expected Behavior |
|---|---|---|
| SD1.5 | 2 | Moderate speedup, minimal quality loss |
| SD1.5 | 3 | Good speedup, slight quality loss |
| SD1.5 | 5 | High speedup, noticeable quality loss |
| SDXL | 3 | Good speedup, slight quality loss |
| Flux (FBCache) | auto | Moderate speedup, minimal quality loss |
Performance varies based on:
- GPU architecture
- Model size
- Resolution
- Sampler choice
- Number of steps
Recommendation: Start with interval=3 and adjust based on your quality requirements.

VRAM Impact
Caching increases VRAM usage slightly (50-200MB depending on resolution):
| Model | Baseline VRAM | + DeepCache | Increase |
|---|---|---|---|
| SD1.5 (768×512) | 3.2 GB | 3.4 GB | +200 MB |
| SDXL (1024×1024) | 6.8 GB | 7.0 GB | +200 MB |
| Flux (832×1216) | 12.5 GB | 12.6 GB | +100 MB |
Stacking with Other Optimizations
WaveSpeed is fully compatible with SageAttention, SpargeAttn and Stable-Fast:
DeepCache + SageAttention
```yaml
deepcache_enabled: true
deepcache_interval: 3
# SageAttention auto-detected
```
Result: 2.2x (DeepCache) × 1.15 (SageAttention) = ~2.5x total speedup
DeepCache + SpargeAttn
```yaml
deepcache_enabled: true
deepcache_interval: 3
# SpargeAttn auto-detected
```
Result: Enhanced speedup from caching and sparse attention
DeepCache + Stable-Fast + SpargeAttn
```yaml
stable_fast: true
deepcache_enabled: true
deepcache_interval: 3
# SpargeAttn auto-detected
```
Result: Maximum combined speedup (all optimizations active, batch operations only)
Compatibility
DeepCache Compatible With
- ✅ Stable Diffusion 1.5
- ✅ Stable Diffusion 2.1
- ✅ SDXL
- ✅ All samplers (Euler, DPM++, etc.)
- ✅ LoRA adapters
- ✅ Textual inversion embeddings
- ✅ HiresFix
- ✅ ADetailer
- ✅ Multi-scale diffusion
- ✅ SageAttention/SpargeAttn
- ✅ Stable-Fast
DeepCache NOT Compatible With
- ❌ Flux models (use FBCache instead)
- ❌ Img2Img mode (can cause artifacts)
FBCache Compatible With
- ✅ Flux models
- ✅ SageAttention/SpargeAttn
- ✅ All Flux-compatible features
FBCache NOT Compatible With
- ❌ SD1.5/SDXL (use DeepCache instead)
- ❌ Stable-Fast (Flux not supported by Stable-Fast)
Troubleshooting
No Speedup Observed
Causes:
1. DeepCache disabled or not applied to the correct model type
2. Cache interval too low (`interval=1` provides no caching)
3. Model loaded incorrectly
Fixes:

Check the logs for DeepCache activation:

```bash
grep -i "deepcache\|cache" logs/server.log
```

Verify the toggle is enabled:
- Streamlit: check the Enable DeepCache checkbox
- API: ensure `"deepcache_enabled": true` in the payload

Try a higher interval:

```yaml
deepcache_interval: 3  # Instead of 1 or 2
```
Quality Degradation
Symptoms:
- Blurry details
- Smoothed textures
- Loss of fine patterns

Causes:
1. Cache interval too high
2. Cache depth too aggressive
3. Wrong model type (Flux using DeepCache)
Fixes:

```yaml
# Reduce the cache interval
deepcache_interval: 2  # Down from 5

# Reduce the cache depth
deepcache_depth: 1  # Down from 3

# Disable caching during critical phases
deepcache_start_step: 200  # No caching during early structure formation
deepcache_end_step: 800    # No caching during late detail refinement
```
Artifacts in Img2Img
Symptom: Visible seams, inconsistent styles when using DeepCache with Img2Img.
Cause: Img2Img starts from a noisy input image, which violates DeepCache's assumptions about feature consistency.
Fix: Disable DeepCache for Img2Img:
```yaml
deepcache_enabled: false  # When img2img_enabled: true
```
VRAM Increase
Symptom: OOM errors after enabling DeepCache.
Cause: Cached features consume additional VRAM.
Fixes:
1. Reduce batch size
2. Lower resolution
3. Disable other VRAM-heavy features (Stable-Fast CUDA graphs)
4. Use lower cache depth:
   ```yaml
   deepcache_depth: 1  # Minimal caching
   ```
Flux FBCache Not Working
Symptom: No speedup with Flux generation.
Cause: FBCache only reuses blocks when the step-to-step feature change stays below `residual_diff_threshold`, so generations whose features change quickly produce few cache hits; check the logs for the cache hit rate.
Debugging:
```bash
# Enable debug logging
export LD_SERVER_LOGLEVEL=DEBUG

# Check cache statistics
grep "cache" logs/server.log
```
If no cache hits, try adjusting threshold:
```python
# In pipeline.py
residual_diff_threshold=0.1  # Increase from 0.05 for more cache reuse
```
Quality Comparison
Visual impact of different cache intervals:
| Interval | Speed | Visual Difference |
|---|---|---|
| Disabled | Baseline | Baseline (100% quality) |
| 2 | Faster | Virtually identical |
| 3 | Much faster | Very subtle smoothing |
| 5 | Very fast | Noticeable detail loss |
| 7+ | Fastest | Obvious quality degradation |
Recommendation: Start with interval=3 and adjust based on visual results.
Technical Details
DeepCache Implementation
Simplified pseudocode:
```python
class DeepCacheWrapper:
    """Simplified sketch. The real implementation caches only the
    middle/output block activations and still runs the input blocks on
    reuse steps; here the whole output is cached for brevity."""

    def __init__(self, model, interval, depth):
        self.model = model
        self.interval = interval
        self.depth = depth  # UNet depth to cache at (unused in this sketch)
        self.cached_output = None
        self.current_step = 0

    def forward(self, x, timestep):
        is_cache_step = (self.current_step % self.interval == 0)
        if is_cache_step or self.cached_output is None:
            # Cache step: run the full model and store the result
            output = self.model(x, timestep)
            self.cached_output = output.clone()
        else:
            # Reuse step: skip the expensive computation
            output = self.cached_output
        self.current_step += 1
        return output
```
Actual implementation in src/WaveSpeed/deepcache_nodes.py includes:
- Proper timestep tracking
- Cache invalidation on batch changes
- Error handling and fallback to full forward
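A quick driver for the sketch above, with a stand-in model so the cache/reuse cycle is visible (illustrative only):

```python
import torch

def dummy_model(x, timestep):
    return x * 2  # Stand-in for a UNet forward pass

wrapper = DeepCacheWrapper(dummy_model, interval=3, depth=2)
x = torch.ones(1, 4, 8, 8)
for step in range(6):
    out = wrapper.forward(x, timestep=step)
    # Steps 0 and 3 call dummy_model; steps 1, 2, 4, 5 return the cached output
```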
FBCache Residual Comparison
```python
# Illustrative sketch; the helper names are placeholders for the real routines

# Compute the first block's output
first_output = first_transformer_block(hidden_states)

# Relative change versus the previous step
residual = first_output - previous_first_output
residual_norm = residual.abs().mean() / first_output.abs().mean()

if residual_norm < threshold:
    # Feature change is small: reuse the cached blocks
    hidden_states = apply_cached_residual(first_output)
else:
    # Feature change is large: recompute all blocks and refresh the cache
    hidden_states = run_remaining_blocks(first_output)
    cache_residual(hidden_states)
```
Best Practices
For Everyday Use
- Enable DeepCache with default settings (`interval=3`, `depth=2`)
- Stack with SageAttention for 2.5x+ total speedup
- Disable for final client renders if absolute quality is critical
For Batch Processing
- Use aggressive caching (`interval=5`, `depth=3`)
- Pre-generate previews at high speed, re-render winners at full quality
- Disable TAESD previews to avoid overhead (set `enable_preview=false`)
For Low VRAM
- Use conservative caching (`interval=2`, `depth=1`)
- Avoid stacking with Stable-Fast CUDA graphs
- Monitor VRAM via the `/api/telemetry` endpoint (see the sketch below)
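A minimal polling sketch for that endpoint; the response schema is implementation-defined, so the payload is just printed:

```python
import requests

# Assumes the server from the REST API section is running on port 7861
telemetry = requests.get("http://localhost:7861/api/telemetry").json()
print(telemetry)  # Inspect whatever VRAM fields your server version exposes
```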
Citation
If you use WaveSpeed/DeepCache in your work:
```bibtex
@inproceedings{ma2023deepcache,
  title={DeepCache: Accelerating Diffusion Models for Free},
  author={Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
  booktitle={CVPR},
  year={2024}
}
```
Resources
- DeepCache Paper
- DeepCache Repository
- ComfyUI DeepCache Implementation (reference for LightDiffusion-Next)
- First Block Cache Discussion