Overview
WaveSpeed is the project's caching-oriented optimization layer for reusing work across denoising steps. In the current codebase, the integrated path is DeepCache for UNet-based models, and the repository also contains groundwork for a Flux-oriented First Block Cache path.
LightDiffusion-Next contains two WaveSpeed-related implementations:
- DeepCache — Integrated for UNet-based models (SD1.5, SDXL)
- First Block Cache (FBCache) — Flux-oriented cache machinery present in the codebase
Both are training-free. DeepCache is the user-facing path today; First Block Cache is codebase groundwork for a more specialized transformer caching path.
How It Works
Core Insight
Diffusion models denoise images iteratively over 20-50 steps. Researchers observed that:
- High-level features (semantic structure, composition) change slowly across steps
- Low-level features (fine details, textures) require frequent updates
WaveSpeed aims to reduce repeated computation across nearby denoising steps by reusing information from earlier steps where practical.
DeepCache (UNet Models)
DeepCache is the integrated WaveSpeed path for UNet models.
Cache step (every N steps):
1. Run the full denoiser path
2. Store the output for later reuse

Reuse step (intermediate steps):
1. Reuse the cached denoiser output
2. Skip the full model recomputation for that step
Speedup: ~50-70% time saved per reuse step → 2-3x total speedup with interval=3
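The arithmetic behind the speedup claim can be sketched directly: with interval=3, only one step in three runs the full model. A minimal estimate, where reuse_cost (the fraction of a full step that a reuse step still costs) is an assumed parameter:

```python
def estimated_speedup(total_steps: int, interval: int, reuse_cost: float = 0.3) -> float:
    """Rough DeepCache speedup estimate.

    A full forward runs every `interval` steps; the remaining steps
    reuse the cache at `reuse_cost` of a full step's time (an assumption
    in the 50-70%-saved range quoted above).
    """
    full = (total_steps + interval - 1) // interval  # cache-update steps
    reused = total_steps - full                      # cheap reuse steps
    return total_steps / (full + reused * reuse_cost)

# 30 steps at interval=3: 10 full + 20 cheap steps
print(round(estimated_speedup(30, 3), 2))
```

Real-world numbers differ because reuse steps are not free and samplers vary, but this shows why higher intervals trade quality for speed.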
First Block Cache (Flux Models)
Flux uses Transformer blocks instead of UNet convolutions. The repository includes a First Block Cache implementation for this architecture family:
┌─────────────────────────────────────────┐
│ First Transformer Block (always run) │ ← Computes initial features
├─────────────────────────────────────────┤
│ Remaining Blocks (cached if similar) │ ← FBCache caching zone
└─────────────────────────────────────────┘
Cache decision logic:
1. Run the first Transformer block
2. Compare its output to the previous step's output
3. If the difference < threshold: reuse the cached remaining blocks
4. If the difference ≥ threshold: run all blocks and update the cache
In the current project structure, this cache path is implementation groundwork rather than a standard generation toggle like DeepCache.
DeepCache Configuration
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| cache_interval | int | 3 | Steps between cache updates (higher = faster, lower quality) |
| cache_depth | int | 2 | UNet depth for caching (0-12; higher = more aggressive) |
| start_step | int | 0 | Timestep at which caching starts (0-1000) |
| end_step | int | 1000 | Timestep at which caching stops (0-1000) |
Streamlit UI
Enable in the ⚡ DeepCache Acceleration expander:
- Check Enable DeepCache
- Adjust sliders:
- Cache Interval: 1-10 (default: 3)
- Cache Depth: 0-12 (default: 2)
- Start/End Steps: 0-1000 (default: 0/1000)
- Generate images — caching applies transparently
REST API
curl -X POST http://localhost:7861/api/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "a misty forest at twilight",
"width": 768,
"height": 512,
"deepcache_enabled": true,
"deepcache_interval": 3,
"deepcache_depth": 2
}'
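The same request can be issued from Python with only the standard library; the endpoint and field names are taken from the curl example above:

```python
import json
from urllib import request

# Same payload as the curl example.
payload = {
    "prompt": "a misty forest at twilight",
    "width": 768,
    "height": 512,
    "deepcache_enabled": True,
    "deepcache_interval": 3,
    "deepcache_depth": 2,
}

req = request.Request(
    "http://localhost:7861/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# resp = request.urlopen(req)  # uncomment with the server running
```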
Recommended Presets
Balanced (Default)
cache_interval: 3
cache_depth: 2
start_step: 0
end_step: 1000
- Speedup: 2-2.3x
- Quality loss: Very slight (1-2%)
- Use case: Everyday generation
Maximum Speed
cache_interval: 5
cache_depth: 3
start_step: 0
end_step: 1000
- Speedup: 2.5-3x
- Quality loss: Noticeable (5-7%)
- Use case: Rapid prototyping, batch jobs
Maximum Quality
cache_interval: 2
cache_depth: 1
start_step: 0
end_step: 1000
- Speedup: 1.5-2x
- Quality loss: Minimal (<1%)
- Use case: Final renders, client work
Partial Caching (Critical Steps Only)
cache_interval: 3
cache_depth: 2
start_step: 200
end_step: 800
- Speedup: 1.8-2.2x
- Quality loss: Minimal
- Use case: Preserve early structure, late details
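The presets above can be kept as plain dictionaries and turned into API payloads. This helper is illustrative (the names PRESETS and deepcache_payload are not part of the project API); the field names match the REST parameters used elsewhere in this document:

```python
# Hypothetical convenience mapping; parameter values are the presets above.
PRESETS = {
    "balanced":    {"cache_interval": 3, "cache_depth": 2, "start_step": 0,   "end_step": 1000},
    "max_speed":   {"cache_interval": 5, "cache_depth": 3, "start_step": 0,   "end_step": 1000},
    "max_quality": {"cache_interval": 2, "cache_depth": 1, "start_step": 0,   "end_step": 1000},
    "partial":     {"cache_interval": 3, "cache_depth": 2, "start_step": 200, "end_step": 800},
}

def deepcache_payload(preset: str) -> dict:
    """Translate a preset into the deepcache_* API fields."""
    cfg = PRESETS[preset]
    return {
        "deepcache_enabled": True,
        "deepcache_interval": cfg["cache_interval"],
        "deepcache_depth": cfg["cache_depth"],
        "deepcache_start_step": cfg["start_step"],
        "deepcache_end_step": cfg["end_step"],
    }
```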
First Block Cache (FBCache) Configuration
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| residual_diff_threshold | float | 0.05 | Maximum relative feature difference below which cached blocks are reused (0.0-1.0) |
Usage
First Block Cache is not currently exposed as a standard per-generation toggle. The implementation is available in the codebase for specialized integration work:
# In src/user/pipeline.py
from src.WaveSpeed import fbcache_nodes

# Create cache context
cache_context = fbcache_nodes.create_cache_context()

# Apply caching to a Flux-style model
with fbcache_nodes.cache_context(cache_context):
    patched_model = fbcache_nodes.create_patch_flux_forward_orig(
        flux_model,
        residual_diff_threshold=0.05,  # Lower = stricter caching
    )
    # Generate images...
Tuning Threshold
- Lower threshold (0.01-0.03): Stricter caching, recomputes more often, higher quality
- Higher threshold (0.05-0.1): Looser caching, reuses more often, higher speedup
- Recommended: 0.05 (balances quality and speed)
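The threshold decision itself is simple. The sketch below uses scalars for clarity (the real implementation compares tensors, and should_reuse_cache is a hypothetical name, not a project function):

```python
def should_reuse_cache(first_out: float, prev_first_out, threshold: float = 0.05) -> bool:
    """FBCache-style decision: reuse cached blocks when the relative
    change in the first block's output is below `threshold`."""
    if prev_first_out is None:
        return False  # first step: nothing cached yet, must run everything
    rel_change = abs(first_out - prev_first_out) / max(abs(first_out), 1e-8)
    return rel_change < threshold
```

With the default 0.05, a step whose first-block output shifts by 1% reuses the cache, while a 10% shift forces a full recompute.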
Performance
Speedup Guidance
Speedup scales with cache interval and depth:
| Model | Cache Interval | Expected Behavior |
|---|---|---|
| SD1.5 | 2 | Moderate speedup, minimal quality loss |
| SD1.5 | 3 | Good speedup, slight quality loss |
| SD1.5 | 5 | High speedup, noticeable quality loss |
| SDXL | 3 | Good speedup, slight quality loss |
| Flux-style caching paths | implementation-specific | Depends on the integration path |
Performance varies based on:
- GPU architecture
- Model size
- Resolution
- Sampler choice
- Number of steps
Recommendation: Start with interval=3 and adjust based on your quality requirements.
VRAM Impact
Caching increases VRAM usage slightly (50-200MB depending on resolution):
| Model | Baseline VRAM | + DeepCache | Increase |
|---|---|---|---|
| SD1.5 (768×512) | 3.2 GB | 3.4 GB | +200 MB |
| SDXL (1024×1024) | 6.8 GB | 7.0 GB | +200 MB |
| Flux (832×1216) | 12.5 GB | 12.6 GB | +100 MB |
Stacking with Other Optimizations
WaveSpeed is fully compatible with SageAttention, SpargeAttn and Stable-Fast:
DeepCache + SageAttention
deepcache_enabled: true
deepcache_interval: 3
# SageAttention auto-detected
Result: 2.2x (DeepCache) × 1.15 (SageAttention) = ~2.5x total speedup
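The combined figure comes from treating the two speedups as roughly independent and multiplying them. That independence is an assumption (caching also reduces the attention work SageAttention sees), so treat the product as an upper-bound estimate:

```python
# Back-of-envelope composition of the ballpark factors quoted above.
deepcache_speedup = 2.2
sage_speedup = 1.15
combined = deepcache_speedup * sage_speedup
print(f"~{combined:.1f}x total")
```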
DeepCache + SpargeAttn
deepcache_enabled: true
deepcache_interval: 3
# SpargeAttn auto-detected
Result: Enhanced speedup from caching and sparse attention
DeepCache + Stable-Fast + SpargeAttn
stable_fast: true
deepcache_enabled: true
deepcache_interval: 3
# SpargeAttn auto-detected
Result: Maximum combined speedup (all optimizations active, batch operations only)
Compatibility
DeepCache Compatible With
- ✅ Stable Diffusion 1.5
- ✅ Stable Diffusion 2.1
- ✅ SDXL
- ✅ All samplers (Euler, DPM++, etc.)
- ✅ LoRA adapters
- ✅ Textual inversion embeddings
- ✅ HiresFix
- ✅ ADetailer
- ✅ Multi-scale diffusion
- ✅ SageAttention/SpargeAttn
- ✅ Stable-Fast
DeepCache NOT Compatible With
- ❌ Flux models (use FBCache instead)
- ❌ Img2Img mode (can cause artifacts)
FBCache Compatible With
- ✅ Flux models
- ✅ SageAttention/SpargeAttn
- ✅ All Flux-compatible features
FBCache NOT Compatible With
- ❌ SD1.5/SDXL (use DeepCache instead)
- ❌ Stable-Fast (Flux not supported by Stable-Fast)
Troubleshooting
No Speedup Observed
Causes:
1. DeepCache disabled or not applied to the correct model type
2. Cache interval too low (interval=1 provides no caching)
3. Model loaded incorrectly
Fixes:
# Check logs for DeepCache activation
grep -i "deepcache\|cache" logs/server.log
# Verify UI toggle is enabled
# Streamlit: Check "Enable DeepCache" checkbox
# API: Ensure "deepcache_enabled": true in payload
# Try higher interval
deepcache_interval: 3 # Instead of 1 or 2
Quality Degradation
Symptoms:
- Blurry details
- Smoothed textures
- Loss of fine patterns

Causes:
1. Cache interval too high
2. Cache depth too aggressive
3. Wrong model type (a Flux model using DeepCache)
Fixes:
# Reduce cache interval
deepcache_interval: 2 # Down from 5
# Reduce cache depth
deepcache_depth: 1 # Down from 3
# Disable caching for critical phases
deepcache_start_step: 200 # Skip early structure formation
deepcache_end_step: 800 # Skip late detail refinement
Artifacts in Img2Img
Symptom: Visible seams, inconsistent styles when using DeepCache with Img2Img.
Cause: Img2Img starts from a noisy input image, which violates DeepCache's assumptions about feature consistency.
Fix: Disable DeepCache for Img2Img:
deepcache_enabled: false # When img2img_enabled: true
VRAM Increase
Symptom: OOM errors after enabling DeepCache.
Cause: Cached features consume additional VRAM.
Fixes:
1. Reduce batch size
2. Lower resolution
3. Disable other VRAM-heavy features (Stable-Fast CUDA graphs)
4. Use lower cache depth:
deepcache_depth: 1  # Minimal caching
Flux FBCache Not Working
Symptom: No speedup with Flux generation.
Cause: Whether the FBCache path engages depends on the integration and the residual threshold; check the logs for the cache hit rate.
Debugging:
# Enable debug logging
export LD_SERVER_LOGLEVEL=DEBUG
# Check cache statistics
grep "cache" logs/server.log
If no cache hits, try adjusting threshold:
# In pipeline.py
residual_diff_threshold=0.1 # Increase from 0.05 for more cache reuse
Quality Comparison
Visual impact of different cache intervals:
| Interval | Speed | Visual Difference |
|---|---|---|
| Disabled | Baseline | Baseline (100% quality) |
| 2 | Faster | Virtually identical |
| 3 | Much faster | Very subtle smoothing |
| 5 | Very fast | Noticeable detail loss |
| 7+ | Fastest | Obvious quality degradation |
Recommendation: Start with interval=3 and adjust based on visual results.
Technical Details
DeepCache Implementation
Simplified pseudocode:
class DeepCacheWrapper:
    def __init__(self, model, interval, depth):
        self.model = model
        self.interval = interval
        self.depth = depth  # which UNet depth is cached in the real code
        self.cached_output = None
        self.current_step = 0

    def forward(self, x, timestep):
        is_cache_step = (self.current_step % self.interval == 0)
        if is_cache_step:
            # Run the full model and cache the output
            output = self.model(x, timestep)
            self.cached_output = output.clone()
        else:
            # Reuse the cached output, skipping the expensive computation
            output = self.cached_output
        self.current_step += 1
        return output
Actual implementation in src/WaveSpeed/deepcache_nodes.py includes:
- Proper timestep tracking
- Cache invalidation on batch changes
- Error handling and fallback to full forward
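The caching pattern above can be exercised with a stub model to confirm how many full forward passes actually run. CountingModel and run_steps are illustrative names for this sketch, not project code:

```python
class CountingModel:
    """Stub standing in for the UNet: counts full forward calls."""
    def __init__(self):
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        return x  # identity stands in for denoising


def run_steps(steps: int, interval: int) -> int:
    """Run the cache-step / reuse-step loop and report full forwards."""
    model = CountingModel()
    cached = None
    for step in range(steps):
        if step % interval == 0:
            cached = model(0.0, step)  # cache step: full forward
        # otherwise: reuse `cached`; no model call
    return model.calls


# 30 steps at interval=3 run only 10 full forwards
print(run_steps(30, 3))
```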
FBCache Residual Comparison
# Compute the first block's output
first_output = first_transformer_block(hidden_states)

# Compare to the previous step's first-block output
residual = first_output - previous_first_output
residual_norm = residual.abs().mean() / first_output.abs().mean()

if residual_norm < threshold:
    # Feature change is small: reuse the cached residual
    hidden_states = apply_cached_residual(first_output)
else:
    # Feature change is large: recompute all blocks and refresh the cache
    hidden_states = run_remaining_blocks(first_output)
    cache_residual(hidden_states)

previous_first_output = first_output  # remember for the next step
Best Practices
For Everyday Use
- Enable DeepCache with default settings (interval=3, depth=2)
- Stack with SageAttention for 2.5x+ total speedup
- Disable for final client renders if absolute quality is critical
For Batch Processing
- Use aggressive caching (interval=5, depth=3)
- Pre-generate previews at high speed, then re-render winners at full quality
- Disable TAESD previews to avoid overhead (set enable_preview=false)
For Low VRAM
- Use conservative caching (interval=2, depth=1)
- Avoid stacking with Stable-Fast CUDA graphs
- Monitor VRAM via the /api/telemetry endpoint
Citation
If you use WaveSpeed/DeepCache in your work:
@inproceedings{ma2023deepcache,
  title={DeepCache: Accelerating Diffusion Models for Free},
  author={Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
  booktitle={CVPR},
  year={2024}
}
Resources
- DeepCache Paper
- DeepCache Repository
- ComfyUI DeepCache Implementation (reference for LightDiffusion-Next)
- First Block Cache Discussion