LightDiffusion-Next achieves its industry-leading inference speed through a layered stack of training-free optimizations that can be selectively enabled based on your hardware and quality requirements. This page provides an overview of each acceleration technique and links to detailed guides.

Optimization Stack Overview

The pipeline orchestrates six primary acceleration paths:

| Technique | Type | Speedup | Quality Impact | Requirements |
|---|---|---|---|---|
| AYS Scheduler | Sampling schedule | ~2x | None/Better | All models |
| Prompt Caching | Embedding cache | 5-15% | None | All models |
| SageAttention | Attention kernel | Moderate | None | All CUDA GPUs |
| SpargeAttn | Sparse attention | Significant | Minimal | Compute 8.0-9.0 |
| Stable-Fast | Graph compilation | Significant* | None | >8GB VRAM, batch jobs |
| WaveSpeed | Feature caching | High | Tunable | All models |

*Speedup depends heavily on batch size and generation count

These optimizations work together — enabling multiple techniques simultaneously can provide substantial cumulative speedup with tunable quality trade-offs.

Quick Comparison

AYS Scheduler

What it does: Uses the research-backed optimal timestep distributions from Align Your Steps (AYS), which deliver equivalent quality in roughly half the steps. Instead of uniform sigma spacing, AYS concentrates samples on the noise levels that contribute most to image formation.
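
As a rough illustration of the mechanism (the sigma values below are placeholders, not the published AYS tables), a short list of knot sigmas is resampled to the requested step count by log-linear interpolation instead of being spaced uniformly:

import numpy as np

# Placeholder knot schedule for illustration only; the real AYS tables come
# from the Align Your Steps research and differ per model family.
KNOT_SIGMAS = [14.6, 6.5, 3.9, 2.7, 1.9, 1.4, 1.0, 0.65, 0.40, 0.15, 0.03]

def ays_sigmas(num_steps, knots=KNOT_SIGMAS):
    # Log-linear interpolation of the knot schedule to num_steps sigmas.
    xs = np.linspace(0.0, 1.0, num_steps)
    knot_xs = np.linspace(0.0, 1.0, len(knots))
    return np.exp(np.interp(xs, knot_xs, np.log(knots)))

def uniform_sigmas(num_steps, sigma_max=14.6, sigma_min=0.03):
    # Evenly spaced schedule for comparison.
    return np.linspace(sigma_max, sigma_min, num_steps)

print(ays_sigmas(10).round(2))      # dense where image structure actually forms
print(uniform_sigmas(10).round(2))  # spends steps evenly across all noise levels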

When to use:

  • Always recommended for SD1.5, SDXL, and Flux models
  • Txt2Img generation
  • Production workflows where speed matters
  • Any scenario where you'd normally use 20+ steps

Trade-offs: Images will differ slightly from standard schedulers (different sampling path), but quality is equivalent or better. Not ideal when exact reproduction of old results is required.

→ Full AYS Scheduler guide


Prompt Caching

What it does: Caches CLIP text embeddings for prompts that have been encoded before. When generating multiple images with the same or similar prompts, embeddings are retrieved from cache instead of being recomputed.
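
A minimal sketch of the idea (not LightDiffusion-Next's actual implementation; clip_model.encode is a stand-in for the real text-encoder call):

import hashlib

_embedding_cache = {}

def encode_prompt_cached(prompt, clip_model, model_id="default"):
    # Key on the prompt text plus anything that changes the embedding,
    # e.g. which CLIP/text-encoder checkpoint is currently loaded.
    key = hashlib.sha256(f"{model_id}:{prompt}".encode()).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = clip_model.encode(prompt)  # expensive text-encoder forward pass
    return _embedding_cache[key]  # cache hit: tensors are reused, nothing is recomputed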

When to use:

  • Batch generation with same prompt
  • Testing different seeds or settings
  • Iterative prompt refinement
  • Any workflow with repeated prompts

Trade-offs: Essentially none; memory overhead is small (~50-200MB), CPU cost is negligible, and the cache is enabled by default.

→ Full Prompt Caching guide


SageAttention & SpargeAttn

What it does: Replaces PyTorch's default scaled dot-product attention with highly optimized CUDA kernels. SageAttention uses INT8 quantization for key/value tensors while maintaining FP16 query precision. SpargeAttn extends this with dynamic sparsity pruning, skipping redundant attention computations.
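
If the sageattention package is installed, the kernel is close to a drop-in replacement for PyTorch's SDPA on (batch, heads, seq, head_dim) tensors. A hedged sketch follows; check the SageAttention documentation for the exact signature of your installed version:

import torch
import torch.nn.functional as F
from sageattention import sageattn

q = torch.randn(1, 8, 4096, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 8, 4096, 64, dtype=torch.float16, device="cuda")
v = torch.randn(1, 8, 4096, 64, dtype=torch.float16, device="cuda")

out_default = F.scaled_dot_product_attention(q, k, v)   # stock PyTorch path
out_sage = sageattn(q, k, v, tensor_layout="HND")        # quantized kernel, same output shape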

When to use:

  • Always enable SageAttention if available (no quality loss, pure speed gain)
  • SpargeAttn for maximum speed on supported hardware (RTX 30xx/40xx, A100, H100)
  • Both work seamlessly with all samplers, LoRAs and post-processing stages

Trade-offs: None for SageAttention. SpargeAttn may introduce subtle texture variations at very high sparsity thresholds (default is conservative).

→ Full SageAttention/SpargeAttn guide


CFG++ Samplers

CFG++ Samplers are advanced sampling algorithms that incorporate Classifier-Free Guidance directly into the sampling process, providing better quality and stability compared to standard CFG.
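
One common formulation of the difference, sketched for an Euler step (illustrative only, not the project's exact sampler code): standard CFG derives the step direction from the guided prediction, while CFG++ derives it from the unconditional prediction and re-noises the guided denoised estimate.

def euler_step_cfg(x, denoised_cond, denoised_uncond, sigma, sigma_next, scale):
    # Standard CFG: step direction comes from the guided denoised estimate.
    denoised = denoised_uncond + scale * (denoised_cond - denoised_uncond)
    d = (x - denoised) / sigma
    return x + d * (sigma_next - sigma)

def euler_step_cfg_pp(x, denoised_cond, denoised_uncond, sigma, sigma_next, scale):
    # CFG++-style: guided estimate for the image, unconditional estimate for the noise direction.
    denoised = denoised_uncond + scale * (denoised_cond - denoised_uncond)
    d = (x - denoised_uncond) / sigma
    return denoised + d * sigma_next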


Multi-Scale Diffusion

What it does: Multi-Scale Diffusion processes the latent at multiple resolutions during generation, running part of the denoising at reduced resolution so that less computation is spent at the full output resolution.

When to use:

  • High-resolution generation (>1024px)
  • When memory is limited
  • For faster previews

Trade-offs: May reduce detail in fine areas.

Note: In most cases, Multi-Scale Diffusion in quality mode gives better results than standard diffusion while still providing a small speedup, a side effect of its upsampling process (sketched below).
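
A hedged sketch of the general multi-scale idea (not the project's exact implementation; step stands in for one denoising step of the sampler): run the early steps on a downscaled latent, then upsample and refine at full resolution.

import torch.nn.functional as F

def multiscale_denoise(latent, sigmas, step, low_res_fraction=0.5, scale=0.75):
    n_low = int((len(sigmas) - 1) * low_res_fraction)
    full_size = latent.shape[-2:]
    # Early steps: global structure forms, so a smaller latent is sufficient.
    x = F.interpolate(latent, scale_factor=scale, mode="bilinear")
    for i in range(n_low):
        x = step(x, sigmas[i], sigmas[i + 1])
    # Late steps: upsample and refine fine detail at full resolution.
    x = F.interpolate(x, size=full_size, mode="bilinear")
    for i in range(n_low, len(sigmas) - 1):
        x = step(x, sigmas[i], sigmas[i + 1])
    return x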


Stable-Fast

What it does: JIT-compiles the UNet diffusion model into optimized TorchScript, optionally captured as CUDA graphs. The first forward pass traces execution, fuses operators, and caches kernel launches, reducing per-step overhead on subsequent runs.
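
For reference, this is roughly how the upstream stable-fast library is driven on a diffusers pipeline; LightDiffusion-Next wires the compiler up internally, and the module path and config fields may differ between stable-fast versions, so treat this as a hedged sketch:

import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import CompilationConfig, compile

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/sd15-checkpoint",        # any SD1.5 checkpoint
    torch_dtype=torch.float16,
).to("cuda")

config = CompilationConfig.Default()
config.enable_cuda_graph = True       # capture kernel launches after the initial trace
pipe = compile(pipe, config)

# First call pays the 30-60s tracing cost; subsequent identical-shape calls are fast.
image = pipe("a lighthouse at dusk", num_inference_steps=20).images[0]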

When to use:

  • Systems with >8GB VRAM (preferably 12GB+)
  • Batch jobs or workflows generating 50+ images with identical settings
  • Long-running operations where 30-60s compilation amortizes over time
  • Fixed resolutions and batch sizes

When NOT to use:

  • Normal 20-step single image generation (compilation overhead > speedup gains)
  • Systems with <8GB VRAM
  • Flux workflows (different architecture)
  • Quick prototyping or frequent model/resolution changes

Trade-offs: Compilation time on first run (30-60s), VRAM overhead (~500MB), reduced flexibility for dynamic shapes.

→ Full Stable-Fast guide


WaveSpeed Caching

What it does: Exploits temporal redundancy in diffusion processes by caching high-level features in the UNet/Transformer architecture and reusing them across multiple denoising steps. Includes two strategies:

  1. DeepCache — Caches middle/output block activations in UNet models (SD1.5, SDXL)
  2. First Block Cache (FBCache) — Caches initial Transformer block outputs in Flux models
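
A hedged sketch of the FBCache idea (stand-in callables, not the actual WaveSpeed code): if the first block's output barely changed relative to the previous step, the cached contribution of the deeper blocks is reused instead of recomputed.

class FirstBlockCache:
    def __init__(self, residual_threshold=0.01):
        self.threshold = residual_threshold
        self.prev_first_out = None
        self.cached_residual = None

    def forward(self, hidden, first_block, remaining_blocks):
        first_out = first_block(hidden)
        if self.cached_residual is not None:
            # Relative change of the first block's output since the previous step.
            change = (first_out - self.prev_first_out).abs().mean() / self.prev_first_out.abs().mean()
            if change < self.threshold:
                self.prev_first_out = first_out
                return first_out + self.cached_residual   # cache hit: skip the deep blocks
        out = remaining_blocks(first_out)
        self.cached_residual = out - first_out            # what the deep blocks added
        self.prev_first_out = first_out
        return out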

When to use:

  • Any workflow where you can tolerate slight smoothing in exchange for 2-3x speedup
  • Combine with conservative cache intervals (2-3) for minimal quality loss
  • Works alongside SageAttention and Stable-Fast

Trade-offs: Reduced fine detail if interval is too high, slight VRAM increase for cached tensors.

→ Full WaveSpeed guide


Priority & Fallback System

LightDiffusion-Next automatically selects the best available attention backend at runtime:

SpargeAttn > SageAttention > xformers > PyTorch SDPA

If a kernel fails (e.g., unsupported head dimension), the system gracefully falls back to the next option. You can force PyTorch SDPA by setting LD_DISABLE_SAGE_ATTENTION=1 for debugging.
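
A sketch of how such a fallback chain can be probed (module names here are illustrative assumptions, not the project's actual source; only the LD_DISABLE_SAGE_ATTENTION check mirrors the behavior described above):

import importlib
import os

# Candidate backends in priority order; module names are illustrative assumptions.
_CANDIDATES = [
    ("spas_sage_attn", "SpargeAttn"),
    ("sageattention", "SageAttention"),
    ("xformers", "xformers"),
]

def pick_attention_backend():
    if os.environ.get("LD_DISABLE_SAGE_ATTENTION") == "1":
        return "pytorch-sdpa"                  # forced fallback for debugging
    for module_name, label in _CANDIDATES:
        try:
            importlib.import_module(module_name)
            return label
        except ImportError:
            continue
    return "pytorch-sdpa"                      # nothing faster available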

Stable-Fast and WaveSpeed are opt-in toggles controlled via the UI or REST API.

Recommended Configurations

Maximum Speed - Batch Jobs (SD1.5, >8GB VRAM, 50+ images)

stable_fast: true  # Only for batch operations
sageattention: auto  # or spargeattn if available
deepCache:
  enabled: true
  interval: 3
  depth: 2

Expected: Maximum speedup for batch operations, some quality loss
Note: Disable stable_fast for single 20-step generations

Balanced - Quick Generation (SD1.5, any VRAM)

scheduler: ays  # NEW: Use AYS for 2x speedup
steps: 10  # Reduced from 20 (same quality with AYS)
stable_fast: false  # Disabled for normal generations
sageattention: auto
prompt_cache_enabled: true  # Enabled by default
deepcache:
  enabled: true
  interval: 2
  depth: 1

Expected: ~2-3x speedup with minimal quality loss
Note: AYS scheduler provides the main speedup; enable stable_fast only for batch jobs (50+ images)

Quality-First (Flux)

scheduler: ays_flux  # NEW: Optimized for Flux models
steps: 10  # Reduced from 15 (same quality with AYS)
stable_fast: false  # not supported
sageattention: auto
prompt_cache_enabled: true
fbcache:
  enabled: true
  residual_threshold: 0.01  # strict caching

Expected: ~2x speedup with minimal quality impact

Production API - High Volume (>8GB VRAM)

stable_fast: true  # Only for sustained high-volume APIs
sageattention: auto
deepCache:
  enabled: false  # avoid variability across batch sizes
keep_models_loaded: true

Expected: Consistent latency for repeated identical requests
Note: For low-volume or single-shot APIs, use stable_fast: false

Hardware-Specific Tips

RTX 30xx / 40xx (Ampere/Ada)

  • Enable SpargeAttn for best results
  • Stable-Fast only for batch jobs (disable for quick 20-step generations)
  • Stable-Fast + SpargeAttn + DeepCache stacks well for long operations
  • Watch VRAM — Stable-Fast graphs consume ~500MB

RTX 50xx (Blackwell)

  • SageAttention only (SpargeAttn support pending)
  • Stable-Fast works but recompiles for new CUDA arch
  • DeepCache is your best additional speedup

A100 / H100 (Datacenter)

  • SpargeAttn + Stable-Fast + aggressive WaveSpeed
  • Prefer larger batch sizes to amortize kernel overhead
  • Use CUDA graphs (enable_cuda_graph=True in Stable-Fast config)

Low VRAM (<8GB)

  • Always disable Stable-Fast (requires >8GB VRAM)
  • Use SageAttention (minimal overhead)
  • Enable DeepCache with conservative intervals
  • Set vae_on_cpu=True for HiRes workflows

Debugging & Profiling

Check which optimizations are active:

# View startup logs
cat logs/server.log | grep -i "using\|enabled"

# Sample output:
# Using SpargeAttn (Sparse + SageAttention) cross attention
# Using SpargeAttn (Sparse + SageAttention) in VAE
# Stable-Fast compilation enabled
# DeepCache active: interval=3, depth=2

Monitor telemetry:

curl http://localhost:7861/api/telemetry | jq '.vram_usage_mb, .average_latency_ms'

Disable individual optimizations to isolate issues:

export LD_DISABLE_SAGE_ATTENTION=1      # Forces PyTorch SDPA
export LD_DISABLE_STABLE_FAST=1         # Skips compilation
export LD_DISABLE_WAVESPEED=1           # Disables all caching

Further Reading


Armed with this overview, dive into the technique-specific guides or experiment directly in the UI to find your optimal speed/quality balance.