Overview
Stable-Fast is a JIT compilation framework that optimizes Stable Diffusion UNet models by tracing execution, fusing operators, and optionally capturing CUDA graphs. It can deliver a significant speedup for SD1.5/SDXL batch workflows with negligible impact on output quality.
Unlike runtime attention optimizations (SageAttention, SpargeAttn), Stable-Fast performs ahead-of-time compilation on the first inference pass. The compiled model is cached and reused for subsequent generations with compatible shapes.
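For reference, this is roughly what the underlying stable-fast API looks like when applied to a plain diffusers pipeline (a minimal sketch; the model ID and options are illustrative, and LightDiffusion-Next performs the equivalent step internally in src/StableFast/StableFast.py):

import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import compile, CompilationConfig

# Load a pipeline as usual (example model ID)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Compile once; tracing and fusion happen on the first forward pass
config = CompilationConfig.Default()
config.enable_xformers = True
pipe = compile(pipe, config)

# First call is slow (compilation); later calls with compatible shapes reuse the result
image = pipe("a peaceful garden with cherry blossoms", width=768, height=512).images[0]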
How It Works
Stable-Fast applies three optimization layers:
1. TorchScript Tracing
The first forward pass through the UNet is recorded into a static computational graph:
traced_model = torch.jit.trace(unet, example_inputs)
This eliminates Python interpreter overhead and enables downstream graph optimizations.
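To see what tracing does in isolation, here is a self-contained toy example (a small module rather than the real UNet):

import torch

class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.nn.functional.silu(self.conv(x))

model = TinyBlock().eval()
example = torch.randn(1, 4, 64, 64)

# One forward pass is recorded into a static graph; later calls bypass the Python interpreter
traced = torch.jit.trace(model, example)
traced = torch.jit.freeze(traced)  # bake weights in as constants (analogous to enable_jit_freeze)
print(traced.graph)                # inspect the recorded graph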
2. Operator Fusion
The traced graph undergoes pattern-based fusion:
- Conv + BatchNorm fusion: Merges normalization into convolution weights
- Activation fusion: Fuses ReLU/GELU/SiLU directly into linear/conv ops
- Memory layout optimization: Converts to channels-last format for faster conv execution
- Triton kernels: Replaces PyTorch ops with hand-tuned Triton implementations (if
enable_triton=True)
Example fusion:
# Before:
x = conv(input)
x = batch_norm(x)
x = relu(x)
# After:
x = fused_conv_bn_relu(input) # Single kernel launch
3. CUDA Graph Capture (Optional)
When enable_cuda_graph=True, the entire forward pass is captured as a static CUDA graph:
- Kernel launches are recorded once and replayed on subsequent runs
- Eliminates CPU launch overhead (~10-15% speedup)
- Requires fixed input shapes and batch sizes
Trade-off: Higher VRAM usage (~500MB for graph buffers) and less flexibility.
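The mechanism is the same one exposed by PyTorch's torch.cuda.CUDAGraph API. A minimal sketch with a toy model (not stable-fast's internal code) shows the capture/replay pattern and why shapes must stay fixed:

import torch

model = torch.nn.Linear(64, 64).cuda().eval()
static_input = torch.randn(8, 64, device="cuda")

# Warm up on a side stream before capture (required by PyTorch)
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Record every kernel launch of one forward pass into a graph
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the SAME buffer (same shape), then relaunch all kernels at once
static_input.copy_(torch.randn(8, 64, device="cuda"))
graph.replay()
print(static_output[0, :4])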
Installation
Windows/Linux (Manual)
Follow the official guide:
# Install from PyPI (recommended)
pip install stable-fast
# Or build from source for latest features
git clone https://github.com/chengzeyi/stable-fast
cd stable-fast
pip install -e .
Prerequisites:
- PyTorch 2.0+ with CUDA support
- xformers (optional but recommended)
- Triton (optional, required for Triton kernel fusion)
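A quick sanity check that the install is importable (the sfast import path follows the upstream repository; adjust if it changes):

import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    from sfast.compilers.diffusion_pipeline_compiler import compile, CompilationConfig
    print("stable-fast import OK")
except ImportError as exc:
    print("stable-fast not installed or broken:", exc)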
Docker
Stable-Fast is included in the Docker image when INSTALL_STABLE_FAST=1:
docker-compose build --build-arg INSTALL_STABLE_FAST=1
Default is 0 (disabled) to reduce image size and build time.
Usage
Streamlit UI
Enable in the Performance section of the sidebar:
- Check Stable Fast
- Generate images — the first run compiles the model (30-60s delay)
- Subsequent generations reuse the cached compiled model
Visual indicator: The first generation shows "Compiling model..." in the progress bar.
REST API
Pass stable_fast: true in the request payload:
curl -X POST http://localhost:7861/api/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "a peaceful garden with cherry blossoms",
"width": 768,
"height": 512,
"num_images": 1,
"stable_fast": true
}'
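The same request from Python, for scripted batch runs (a sketch using the requests library; the endpoint and payload mirror the curl example, and the response schema is not shown here):

import requests

payload = {
    "prompt": "a peaceful garden with cherry blossoms",
    "width": 768,
    "height": 512,
    "num_images": 1,
    "stable_fast": True,  # enable Stable-Fast for this request
}

resp = requests.post("http://localhost:7861/api/generate", json=payload, timeout=600)
resp.raise_for_status()
print("response keys:", list(resp.json().keys()))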
Configuration
Stable-Fast behavior is controlled by CompilationConfig:
from sfast.compilers.diffusion_pipeline_compiler import CompilationConfig
config = CompilationConfig.Default()
config.enable_xformers = True # Use xformers attention
config.enable_cuda_graph = False # CUDA graphs (set True for max speed)
config.enable_jit_freeze = True # Freeze traced graph
config.enable_cnn_optimization = True # Conv fusion
config.enable_triton = False # Triton kernels (experimental)
config.memory_format = torch.channels_last # Optimize memory layout
LightDiffusion-Next uses sensible defaults (CUDA graphs disabled by default for flexibility). To override:
# In src/StableFast/StableFast.py
def gen_stable_fast_config(enable_cuda_graph=False):
    config = CompilationConfig.Default()
    config.enable_cuda_graph = enable_cuda_graph  # Pass True for max speed
    # ... rest of config
Performance
Speedup Benchmarks
Stable-Fast provides speedup through:
- JIT compilation: Eliminates Python overhead
- Operator fusion: Reduces kernel launches
- CUDA graphs (optional): Further reduces CPU overhead

Speedup varies significantly based on:
- GPU architecture
- Batch size and generation count
- Model size (SD1.5 vs SDXL)
- Whether CUDA graphs are enabled
Note: Performance benefits are most noticeable for batch operations (50+ images). For single 20-step generations, compilation overhead may exceed speedup gains.
Compilation Time
First-run compilation overhead:
- SD1.5 UNet: ~30s (traced once per resolution/batch size)
- SDXL UNet: ~60s (larger model)
- Subsequent runs: <1s (cached)
Cached compiled models persist in ~/.cache/torch_extensions/. Clear this directory to force recompilation.
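As a rough rule of thumb, you can estimate the break-even point from the numbers above (the per-image time and speedup below are illustrative assumptions; measure on your own hardware):

# Illustrative break-even estimate, not a benchmark
compile_time_s = 45.0        # one-time cost (~30-60s depending on model)
baseline_per_image_s = 4.0   # assumed uncompiled time per image
speedup = 1.7                # assumed Stable-Fast speedup (~70%)

saving_per_image_s = baseline_per_image_s * (1 - 1 / speedup)
break_even = compile_time_s / saving_per_image_s
print(f"Compilation pays for itself after ~{break_even:.0f} images")  # ~27 with these numbers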
Stacking with Other Optimizations
Stable-Fast is fully compatible with SageAttention, SpargeAttn, and WaveSpeed. Because each targets a different part of the pipeline, their speedups compound roughly multiplicatively:
Stable-Fast + SageAttention
stable_fast: true
# SageAttention auto-detected
Result: ~1.7x (Stable-Fast) × ~1.15x (SageAttention) ≈ 2x total speedup
Stable-Fast + SpargeAttn
stable_fast: true
# SpargeAttn auto-detected
Result: ~1.7x (Stable-Fast) × ~1.4x (SpargeAttn) ≈ 2.4x total speedup
Stable-Fast + SpargeAttn + DeepCache
stable_fast: true
deepcache:
  enabled: true
  interval: 3
  depth: 2
# SpargeAttn auto-detected
Result: ~1.7x (Stable-Fast) × ~1.4x (SpargeAttn) × ~2x (DeepCache, lower end of its 2-3x range) ≈ 4-5x total speedup
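These estimates compound multiplicatively, which a quick check makes explicit (the factors are the approximate per-technique speedups quoted above):

# Speedup factors multiply rather than add
stable_fast = 1.70   # ~70% faster
sage_attn   = 1.15   # ~15% faster
sparge_attn = 1.40   # ~40% faster
deepcache   = 2.00   # lower end of DeepCache's 2-3x

print(f"Stable-Fast + SageAttention:          {stable_fast * sage_attn:.1f}x")                # ~2.0x
print(f"Stable-Fast + SpargeAttn:             {stable_fast * sparge_attn:.1f}x")              # ~2.4x
print(f"Stable-Fast + SpargeAttn + DeepCache: {stable_fast * sparge_attn * deepcache:.1f}x")  # ~4.8x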
Compatibility
Compatible With
- ✅ Stable Diffusion 1.5
- ✅ Stable Diffusion 2.1
- ✅ SDXL
- ✅ All samplers (Euler, DPM++, etc.)
- ✅ LoRA adapters
- ✅ Textual inversion embeddings
- ✅ HiresFix
- ✅ ADetailer
- ✅ Img2Img (with fixed denoise strength)
- ✅ SageAttention/SpargeAttn
- ✅ WaveSpeed caching
Not Compatible With
- ❌ Flux models (different architecture, no UNet)
- ❌ Dynamic resolution changes after compilation
- ❌ Dynamic batch size changes after compilation (with CUDA graphs)
- ⚠️ Frequent model switching (recompiles each time)
Troubleshooting
Slow First Run / Repeated Recompilation
Symptom: Every generation triggers compilation, even with identical settings.
Causes:
1. Cache directory not writable
2. System clock incorrect (invalidates timestamps)
3. Different model loaded (each model is cached separately)
Fixes:
# Check cache permissions
ls -la ~/.cache/torch_extensions
# Ensure stable timestamps
date # Should be correct
# Mount cache in Docker to persist across container restarts
docker run -v ~/.cache/torch_extensions:/root/.cache/torch_extensions ...
CUDA Out of Memory During Compilation
Symptom: OOM error on first run but not subsequent runs.
Cause: Compilation allocates temporary buffers for tracing.
Fixes:
1. Disable CUDA graphs: enable_cuda_graph=False (saves ~500MB)
2. Reduce batch size temporarily for first run
3. Clear other VRAM consumers (close other apps, disable model caching)
Compilation Hangs or Crashes
Symptom: Process freezes during "Compiling model..." step.
Causes:
1. Triton compilation error (if enable_triton=True)
2. Driver incompatibility
3. Insufficient CPU RAM for graph analysis
Fixes:
# Disable Triton
# In src/StableFast/StableFast.py:
config.enable_triton = False
# Update NVIDIA driver
nvidia-smi # Check version, upgrade if < 525.x
# Increase Docker memory limit
# In docker-compose.yml:
deploy:
  resources:
    limits:
      memory: 16G  # Increase from default
Error: torch.jit.trace fails
Symptom: RuntimeError: Could not trace model
Cause: Dynamic control flow in model (if/else statements depending on runtime values).
Fix: This is rare with standard SD models. If it occurs:
1. Check for custom LoRA/embeddings with dynamic logic
2. Disable Stable-Fast for that specific generation
3. Report the issue with model details
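For intuition, here is a toy example of the underlying limitation (not the SD UNet): torch.jit.trace records only the branch taken for the example input, so data-dependent control flow either fails or silently specializes to one branch:

import torch

class DynamicBranch(torch.nn.Module):
    def forward(self, x):
        # Data-dependent control flow: the trace captures only one branch
        if x.sum() > 0:
            return x + 1
        return x - 1

m = DynamicBranch()

traced = torch.jit.trace(m, torch.ones(3))   # emits a TracerWarning; the "> 0" branch is frozen in
print(traced(-torch.ones(3)))                # tensor([0., 0., 0.]) -- wrong branch

scripted = torch.jit.script(m)               # scripting preserves the control flow
print(scripted(-torch.ones(3)))              # tensor([-2., -2., -2.]) -- correct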
Model Quality Degradation
Symptom: Compiled model produces different outputs than baseline.
Cause: Numeric precision differences from operator fusion (very rare).
Fixes:
# Disable aggressive optimizations
config.enable_cnn_optimization = False
config.memory_format = None # Use default layout
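If you suspect the compiled model itself, a quick numeric comparison against the uncompiled UNet can confirm whether fusion is actually changing results (a sketch; the call signature is schematic, so adapt it to how the project invokes the UNet, and the tolerance is only indicative):

import torch

@torch.no_grad()
def max_abs_diff(baseline_unet, compiled_unet, latent, timestep, **cond):
    """Run both UNets on identical inputs and report the largest deviation."""
    # NOTE: schematic call; adjust arguments to the actual UNet interface in use
    ref = baseline_unet(latent, timestep, **cond)
    out = compiled_unet(latent, timestep, **cond)
    diff = (ref - out).abs().max().item()
    print(f"max abs diff: {diff:.2e}")  # values around 1e-3 or below in fp16 are normally invisible
    return diff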
If issue persists, disable Stable-Fast and file a bug report.
Advanced Configuration
Custom Compilation Config
Override defaults in src/StableFast/StableFast.py:
def gen_stable_fast_config(enable_cuda_graph=False):
    config = CompilationConfig.Default()

    # Balanced (recommended). Keep exactly one preset block active;
    # otherwise later assignments silently override earlier ones.
    config.enable_cuda_graph = enable_cuda_graph
    config.enable_triton = False
    config.enable_cnn_optimization = True

    # Maximum speed (higher VRAM usage) -- uncomment to use:
    # config.enable_cuda_graph = True
    # config.enable_triton = True
    # config.prefer_lowp_gemm = True  # Use FP16 matrix multiplies

    # Debug (no optimizations) -- uncomment to use:
    # config.enable_cuda_graph = False
    # config.enable_jit_freeze = False
    # config.enable_cnn_optimization = False

    return config
Clear Cached Compilations
# Linux/Mac
rm -rf ~/.cache/torch_extensions
# Windows
rmdir /s /q "%USERPROFILE%\.cache\torch_extensions"
# Docker (mount cache as volume)
docker run -v my_cache:/root/.cache/torch_extensions ...
docker volume rm my_cache # Clear cache
Profile Compilation
# Enable debug logging
export LD_SERVER_LOGLEVEL=DEBUG
# Run generation and check logs
cat logs/server.log | grep "Stable"
Best Practices
Production Deployments
- Pre-compile models during startup with a warm-up request (only for batch/long-running services)
- Mount cache volume to persist compilations across container restarts
- Disable CUDA graphs if serving multiple batch sizes
- Enable CUDA graphs for fixed-resolution APIs with consistent high-volume traffic
- Disable Stable-Fast entirely for single-shot API endpoints (compilation overhead exceeds benefit)
Example warm-up:
# In startup script
import torch

def warmup_stable_fast(model, width=768, height=512):
    """Pre-compile the model by running one dummy forward pass."""
    # Dummy latent at 1/8 of the target resolution (SD latent space)
    dummy_input = torch.randn(1, 4, height // 8, width // 8, device="cuda")
    dummy_timestep = torch.tensor([999], device="cuda")
    with torch.no_grad():
        model(dummy_input, dummy_timestep, c={})
    print("Stable-Fast compilation complete")
Development Workflows
- Disable Stable-Fast when experimenting with new models/LoRAs (avoids repeated recompilation)
- Enable for final testing to verify production performance
- Clear cache after upgrading PyTorch/CUDA drivers
Citation
If you use Stable-Fast in your work:
@misc{stable-fast,
  author    = {Cheng, Zeyi},
  title     = {stable-fast: Fast Inference for Stable Diffusion},
  year      = {2023},
  publisher = {GitHub},
  url       = {https://github.com/chengzeyi/stable-fast}
}