LightDiffusion-Next achieves its industry-leading inference speed through a layered stack of training-free optimizations that can be selectively enabled based on your hardware and quality requirements. This page provides an overview of each acceleration technique and links to detailed guides.
## Optimization Stack Overview
The pipeline orchestrates six primary acceleration paths:
| Technique | Type | Speedup | Quality Impact | Requirements |
|---|---|---|---|---|
| AYS Scheduler | Sampling schedule | ~2x | None/Better | All models |
| Prompt Caching | Embedding cache | 5-15% | None | All models |
| SageAttention | Attention kernel | Moderate | None | All CUDA GPUs |
| SpargeAttn | Sparse attention | Significant | Minimal | Compute 8.0-9.0 |
| Stable-Fast | Graph compilation | Significant* | None | >8GB VRAM, batch jobs |
| WaveSpeed | Feature caching | High | Tunable | All models |
*Speedup depends heavily on batch size and generation count
These optimizations work together — enabling multiple techniques simultaneously can provide substantial cumulative speedup with tunable quality trade-offs.
## Quick Comparison

### AYS Scheduler
What it does: Uses research-backed optimal timestep distributions that allow equivalent quality in approximately half the steps. Instead of uniform sigma spacing, AYS concentrates samples on noise levels that contribute most to image formation.
When to use:
- Always recommended for SD1.5, SDXL, and Flux models
- Txt2Img generation
- Production workflows where speed matters
- Any scenario where you'd normally use 20+ steps
Trade-offs: Images will differ slightly from standard schedulers (different sampling path), but quality is equivalent or better. Not ideal when exact reproduction of old results is required.
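The mechanism can be sketched in a few lines. The function below is a hypothetical illustration (not LightDiffusion-Next's actual scheduler code) of how a research-derived sigma table is resampled to an arbitrary step count in log-sigma space, so the non-uniform spacing survives; the reference values are made up for the example rather than taken from the published AYS tables.

```python
import numpy as np

def resample_schedule(reference_sigmas, num_steps):
    """Log-linearly resample a reference sigma table to num_steps intervals.

    AYS-style schedulers ship a small, research-derived table of noise levels;
    resampling it in log-sigma space keeps its non-uniform shape, so samples
    stay concentrated where they contribute most to image formation.
    """
    ref = np.asarray(reference_sigmas, dtype=np.float64)
    xs_ref = np.linspace(0.0, 1.0, len(ref))
    xs_new = np.linspace(0.0, 1.0, num_steps + 1)
    return np.exp(np.interp(xs_new, xs_ref, np.log(ref)))

# Illustrative reference table: dense at high noise, sparse near zero.
reference = [14.6, 6.3, 3.8, 2.2, 1.3, 0.9, 0.6, 0.4, 0.2, 0.1, 0.03]
print(resample_schedule(reference, 10))
```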
### Prompt Caching
What it does: Caches CLIP text embeddings for prompts that have been encoded before. When generating multiple images with the same or similar prompts, embeddings are retrieved from cache instead of being recomputed.
When to use:
- Batch generation with the same prompt
- Testing different seeds or settings
- Iterative prompt refinement
- Any workflow with repeated prompts
Trade-offs: Practically none: memory overhead is small (~50-200MB), CPU cost is negligible, and the cache is enabled by default.
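Conceptually, the cache is a dictionary keyed by the prompt and the text encoder that produced the embedding. The sketch below uses hypothetical names and is not the project's internal implementation:

```python
import hashlib

class PromptEmbeddingCache:
    """Minimal embedding cache keyed by (encoder, prompt); illustrative only."""

    def __init__(self, max_entries=256):
        self.max_entries = max_entries
        self._store = {}

    def _key(self, prompt, encoder_name):
        return hashlib.sha256(f"{encoder_name}::{prompt}".encode()).hexdigest()

    def get_or_encode(self, prompt, encoder_name, encode_fn):
        key = self._key(prompt, encoder_name)
        if key not in self._store:
            if len(self._store) >= self.max_entries:
                # Evict the oldest entry; dicts preserve insertion order.
                self._store.pop(next(iter(self._store)))
            # Cache miss: run the (expensive) CLIP forward pass once.
            self._store[key] = encode_fn(prompt)
        return self._store[key]
```

Repeated calls with the same prompt and encoder then return the cached tensor instead of re-running the text encoder.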
### SageAttention & SpargeAttn
What it does: Replaces PyTorch's default scaled dot-product attention with highly optimized CUDA kernels. SageAttention quantizes the query and key tensors to INT8 while keeping the value path in FP16. SpargeAttn extends this with dynamic sparsity pruning, skipping redundant attention computations.
When to use:
- Always enable SageAttention if available (no quality loss, pure speed gain)
- SpargeAttn for maximum speed on supported hardware (RTX 30xx/40xx, A100, H100)
- Both work seamlessly with all samplers, LoRAs, and post-processing stages
Trade-offs: None for SageAttention. SpargeAttn may introduce subtle texture variations at very high sparsity thresholds (default is conservative).
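As a rough sketch of what the swap looks like (assuming the upstream `sageattention` package's `sageattn` function; the actual integration in LightDiffusion-Next may differ), the optimized kernel is tried first and PyTorch SDPA serves as the fallback:

```python
import torch.nn.functional as F

try:
    from sageattention import sageattn  # optional dependency
    HAVE_SAGE = True
except ImportError:
    HAVE_SAGE = False

def attention(q, k, v):
    """q, k, v: (batch, heads, seq_len, head_dim) FP16 tensors on CUDA."""
    if HAVE_SAGE:
        try:
            # Quantized attention kernel as a drop-in replacement for SDPA.
            return sageattn(q, k, v, is_causal=False)
        except Exception:
            pass  # e.g. unsupported head dimension; fall back below
    return F.scaled_dot_product_attention(q, k, v)
```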
→ Full SageAttention/SpargeAttn guide
### CFG++ Samplers
CFG++ samplers are advanced sampling algorithms that incorporate Classifier-Free Guidance directly into the sampling update, providing better quality and stability than standard CFG.
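To make the difference concrete, here is a hedged sketch of a single Euler-style step: with standard CFG the guided noise estimate drives both the denoised prediction and the re-noising term, whereas a CFG++-style update uses the guided estimate only for the denoised prediction and re-noises with the unconditional one, with a guidance weight in [0, 1]. This illustrates the idea and is not the project's sampler code.

```python
def euler_cfg_step(x, sigma, sigma_next, eps_cond, eps_uncond, scale):
    """Standard CFG: the guided noise estimate drives the whole update."""
    eps = eps_uncond + scale * (eps_cond - eps_uncond)
    denoised = x - sigma * eps
    return denoised + sigma_next * eps

def euler_cfgpp_step(x, sigma, sigma_next, eps_cond, eps_uncond, lam):
    """CFG++-style step (sketch): guided prediction only for the denoised
    estimate, unconditional prediction for re-noising; lam is in [0, 1]."""
    eps_guided = eps_uncond + lam * (eps_cond - eps_uncond)
    denoised = x - sigma * eps_guided
    return denoised + sigma_next * eps_uncond
```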
### Multi-Scale Diffusion
Multi-Scale Diffusion improves performance by processing the image at multiple resolutions during generation, reducing the amount of computation spent at full resolution.
When to use:
- High-resolution generation (>1024px)
- When memory is limited
- For faster previews
Trade-offs: May reduce detail in fine areas.
Note: In most cases, Multi-Scale Diffusion in quality mode gives better results than standard diffusion while still providing a small speedup; this comes from its upsampling process.
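One common way such a scheme is structured (an illustrative sketch under assumed names, not the exact LightDiffusion-Next algorithm): the early, structure-defining steps run on a downscaled latent, which is then upsampled so the remaining steps refine detail at full resolution.

```python
import torch.nn.functional as F

def multiscale_denoise(latent, sigmas, denoise_step, low_res_fraction=0.5, scale=0.5):
    """Run early steps at reduced resolution, then finish at full resolution.

    denoise_step(x, sigma_from, sigma_to) is a stand-in for one sampler step.
    """
    switch = int((len(sigmas) - 1) * low_res_fraction)
    full_size = latent.shape[-2:]

    # Early steps: coarse structure at reduced resolution (cheaper attention/convs).
    x = F.interpolate(latent, scale_factor=scale, mode="bilinear", align_corners=False)
    for i in range(switch):
        x = denoise_step(x, sigmas[i], sigmas[i + 1])

    # Late steps: upsample and refine detail at the target resolution.
    x = F.interpolate(x, size=full_size, mode="bilinear", align_corners=False)
    for i in range(switch, len(sigmas) - 1):
        x = denoise_step(x, sigmas[i], sigmas[i + 1])
    return x
```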
### Stable-Fast
What it does: JIT-compiles the UNet diffusion model into optimized TorchScript with optional CUDA graphs. The first forward pass traces execution, caches kernel launches and fuses operators for reduced overhead.
When to use:
- Systems with >8GB VRAM (preferably 12GB+)
- Batch jobs or workflows generating 50+ images with identical settings
- Long-running operations where 30-60s compilation amortizes over time
- Fixed resolutions and batch sizes

When NOT to use:
- Normal 20-step single image generation (compilation overhead > speedup gains)
- Systems with <8GB VRAM
- Flux workflows (different architecture)
- Quick prototyping or frequent model/resolution changes
Trade-offs: Compilation time on first run (30-60s), VRAM overhead (~500MB), reduced flexibility for dynamic shapes.
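For reference, the upstream stable-fast (sfast) library is typically invoked roughly as follows on a diffusers-style pipeline; module paths and the exact hook inside LightDiffusion-Next may differ, so treat this as a sketch. The enable_cuda_graph flag mentioned in the hardware tips below lives on this config object.

```python
from sfast.compilers.diffusion_pipeline_compiler import (
    CompilationConfig,
    compile as sfast_compile,
)

def compile_pipeline(pipe, use_cuda_graph=True):
    config = CompilationConfig.Default()
    config.enable_triton = True                # fused Triton kernels, if triton is installed
    config.enable_cuda_graph = use_cuda_graph  # capture kernel launches after warm-up
    # The first call after compilation traces the model (roughly 30-60s);
    # later calls reuse the optimized graph, so keep shapes and batch size fixed.
    return sfast_compile(pipe, config)
```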
### WaveSpeed Caching
What it does: Exploits temporal redundancy in diffusion processes by caching high-level features in the UNet/Transformer architecture and reusing them across multiple denoising steps. Includes two strategies:
- DeepCache — Caches middle/output block activations in UNet models (SD1.5, SDXL)
- First Block Cache (FBCache) — Caches initial Transformer block outputs in Flux models
When to use:
- Any workflow where you can tolerate slight smoothing in exchange for 2-3x speedup
- Combine with conservative cache intervals (2-3) for minimal quality loss
- Works alongside SageAttention and Stable-Fast
Trade-offs: Reduced fine detail if interval is too high, slight VRAM increase for cached tensors.
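The caching logic can be summarized by a small sketch (hypothetical names; see the WaveSpeed guide for the real parameters): the first block always runs, and if its output has barely changed since the previous step, the cached output of the remaining blocks is reused.

```python
class FirstBlockCache:
    """FBCache-style reuse of transformer block outputs across steps (sketch)."""

    def __init__(self, residual_threshold=0.01):
        self.residual_threshold = residual_threshold
        self.prev_first = None   # first-block output from the previous step
        self.cached_rest = None  # output of the remaining blocks

    def forward(self, hidden, first_block, remaining_blocks):
        first = first_block(hidden)
        if self.prev_first is not None and self.cached_rest is not None:
            # Relative change of the first block's output between steps.
            change = (first - self.prev_first).abs().mean() / self.prev_first.abs().mean()
            if change < self.residual_threshold:
                self.prev_first = first
                return self.cached_rest  # reuse: skip the expensive blocks this step
        out = first
        for block in remaining_blocks:
            out = block(out)
        self.prev_first, self.cached_rest = first, out
        return out
```

DeepCache applies the same idea to UNet middle/output block activations, controlled by the interval and depth settings shown in the configurations below.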
## Priority & Fallback System
LightDiffusion-Next automatically selects the best available attention backend at runtime:
SpargeAttn > SageAttention > xformers > PyTorch SDPA
If a kernel fails (e.g., unsupported head dimension), the system gracefully falls back to the next option. You can force PyTorch SDPA by setting `LD_DISABLE_SAGE_ATTENTION=1` for debugging.
Stable-Fast and WaveSpeed are opt-in toggles controlled via the UI or REST API.
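Conceptually, the selection behaves like the helper below (a hypothetical function; the real dispatch lives inside the attention module):

```python
import os

def pick_attention_backends(have_sparge, have_sage, have_xformers):
    """Return candidate backends in priority order, honoring the debug override."""
    if os.environ.get("LD_DISABLE_SAGE_ATTENTION") == "1":
        return ["sdpa"]  # forced PyTorch SDPA for debugging
    order = []
    if have_sparge:
        order.append("spargeattn")
    if have_sage:
        order.append("sageattention")
    if have_xformers:
        order.append("xformers")
    order.append("sdpa")  # PyTorch SDPA is always available as the last resort
    return order
```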
## Recommended Configurations
### Maximum Speed - Batch Jobs (SD1.5, >8GB VRAM, 50+ images)

```yaml
stable_fast: true    # Only for batch operations
sageattention: auto  # or spargeattn if available
deepCache:
  enabled: true
  interval: 3
  depth: 2
```

Expected: Maximum speedup for batch operations, some quality loss.
Note: Disable stable_fast for single 20-step generations.
### Balanced - Quick Generation (SD1.5, any VRAM)

```yaml
scheduler: ays              # NEW: Use AYS for 2x speedup
steps: 10                   # Reduced from 20 (same quality with AYS)
stable_fast: false          # Disabled for normal generations
sageattention: auto
prompt_cache_enabled: true  # Enabled by default
deepcache:
  enabled: true
  interval: 2
  depth: 1
```

Expected: ~2-3x speedup with minimal quality loss.
Note: AYS scheduler provides the main speedup; enable stable_fast only for batch jobs (50+ images).
### Quality-First (Flux)

```yaml
scheduler: ays_flux         # NEW: Optimized for Flux models
steps: 10                   # Reduced from 15 (same quality with AYS)
stable_fast: false          # not supported
sageattention: auto
prompt_cache_enabled: true
fbcache:
  enabled: true
  residual_threshold: 0.01  # strict caching
```

Expected: ~2x speedup with minimal quality impact.
### Production API - High Volume (>8GB VRAM)

```yaml
stable_fast: true   # Only for sustained high-volume APIs
sageattention: auto
deepCache:
  enabled: false    # avoid variability across batch sizes
keep_models_loaded: true
```

Expected: Consistent latency for repeated identical requests.
Note: For low-volume or single-shot APIs, use stable_fast: false.
## Hardware-Specific Tips

### RTX 30xx / 40xx (Ampere/Ada)
- Enable SpargeAttn for best results
- Stable-Fast only for batch jobs (disable for quick 20-step generations)
- Stable-Fast + SpargeAttn + DeepCache stacks well for long operations
- Watch VRAM — Stable-Fast graphs consume ~500MB
### RTX 50xx (Blackwell)
- SageAttention only (SpargeAttn support pending)
- Stable-Fast works but recompiles for new CUDA arch
- DeepCache is your best additional speedup
### A100 / H100 (Datacenter)
- SpargeAttn + Stable-Fast + aggressive WaveSpeed
- Prefer larger batch sizes to amortize kernel overhead
- Use CUDA graphs (`enable_cuda_graph=True` in the Stable-Fast config)
### Low VRAM (<8GB)
- Always disable Stable-Fast (requires >8GB VRAM)
- Use SageAttention (minimal overhead)
- Enable DeepCache with conservative intervals
- Set `vae_on_cpu=True` for HiRes workflows
## Debugging & Profiling
Check which optimizations are active:
```bash
# View startup logs
grep -i "using\|enabled" logs/server.log

# Sample output:
# Using SpargeAttn (Sparse + SageAttention) cross attention
# Using SpargeAttn (Sparse + SageAttention) in VAE
# Stable-Fast compilation enabled
# DeepCache active: interval=3, depth=2
```
Monitor telemetry:
```bash
curl http://localhost:7861/api/telemetry | jq '.vram_usage_mb, .average_latency_ms'
```
Disable individual optimizations to isolate issues:
```bash
export LD_DISABLE_SAGE_ATTENTION=1  # Forces PyTorch SDPA
export LD_DISABLE_STABLE_FAST=1     # Skips compilation
export LD_DISABLE_WAVESPEED=1       # Disables all caching
```
## Further Reading
- AYS Scheduler Deep Dive — Theory, implementation, quality tuning
- Prompt Caching Deep Dive — Implementation details, cache management, performance impact
- SageAttention & SpargeAttn Deep Dive — Installation, technical details, head dimension handling
- Stable-Fast Compilation Guide — Configuration, CUDA graphs, troubleshooting
- WaveSpeed Caching Strategies — DeepCache vs FBCache, tuning parameters, compatibility matrix
- Performance Tuning — VRAM management, slow first runs, recompilation fixes
Armed with this overview, dive into the technique-specific guides or experiment directly in the UI to find your optimal speed/quality balance.