Emilio Melis / aatricks
  • build 2024.12
  • stack · 7 components
  • published 2024-12-13

LightDiffusion-Next

A local diffusion setup focused on speed, model support, and implementation details.

01 Runtime shape
Local server, browser UI, and pipeline core
02 Model support
SD1.5, SDXL, Flux, LoRAs, and enhancement passes
03 Performance focus
Xformers, BFloat16, WaveSpeed, and Stable-Fast
04 Product surface
Queue, history, presets, previews, uploads, and API

why it matters

  • Inference takes about 30% less time than open-source baselines, which earned the project a spot in the Ready Tensor CV Projects Expo 2024.
  • It runs as a browser app, and the underlying pipeline can also be driven directly.
  • The speedups (Xformers, BFloat16, WaveSpeed, Stable-Fast) are wired into the execution path.

engineering notes

LightDiffusion-Next is a local image-generation system built around a pipeline core, a queueing server, and a web interface. The project includes a complete setup guide, API docs, and a breakdown of the architecture and performance optimizations.

The speed work

The first version of this project measured about 30% less inference time than open-source baselines, earning a spot in the Ready Tensor CV Projects Expo 2024. The speedup comes from:

  • Scheduler optimizations: reworking the sampling loop instead of using the stock reference implementation.
  • VRAM tensor management: controlling exactly where and when tensors are allocated in memory instead of delegating it to the framework.

It also wires Xformers, BFloat16, WaveSpeed, and Stable-Fast into the execution path.
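
For context on the 30% figure: a relative reduction like this is computed as 1 - (candidate time / baseline time). The sketch below shows one way such a comparison could be measured. It is not the project's benchmark harness; `median_seconds` and `relative_reduction` are hypothetical helper names, and the choice of a median over a few warmed-up runs is an assumption.

```python
import statistics
import time
from typing import Callable


def median_seconds(fn: Callable[[], object], runs: int = 5, warmup: int = 1) -> float:
    """Median wall-clock time of fn over `runs` timed calls, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # untimed: lets caches, lazy initialization, and compilation settle
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def relative_reduction(baseline_s: float, candidate_s: float) -> float:
    """Fraction of the baseline time that was saved; 0.30 means 30% less time."""
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return 1.0 - candidate_s / baseline_s
```

With a 10 s baseline and a 7 s candidate, `relative_reduction(10.0, 7.0)` comes out at roughly 0.30. When timing GPU work, the function being timed should wait for the device to finish (for example with `torch.cuda.synchronize()`) before returning, otherwise the clock only captures kernel launch time.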

The workflow

It’s built to be used repeatedly, with:

  • prompt and negative prompt, presets, and generation modes
  • enhancement passes: Hires-Fix, ADetailer, prompt enhancement, img2img
  • queue, history, output previews, and uploads
  • a REST API and deployment paths (including a hosted HuggingFace Space)

Architecture

The main pieces:

  • generation settings go through one shared pipeline context, no per-UI branches
  • model families (SD1.5, SDXL, Flux, LoRAs) get assembled from diffusion, encoder, and VAE pieces
  • long jobs are queueable, not blocking calls
  • the frontend is kept separate from the pipeline
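
Two of these pieces, the shared pipeline context and queueable jobs, can be sketched with the standard library alone. This is an illustration under assumptions, not the project's code: `PipelineContext`, `Job`, `JobQueue`, and every field name are hypothetical, and the single worker thread is a simplification.

```python
import queue
import threading
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(frozen=True)
class PipelineContext:
    """Hypothetical settings object shared by every entry point (UI, API, scripts)."""
    prompt: str
    negative_prompt: str = ""
    model_family: str = "SD1.5"
    width: int = 512
    height: int = 512
    steps: int = 20


@dataclass
class Job:
    context: PipelineContext
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "queued"  # queued -> running -> done | failed
    result: object = None


class JobQueue:
    """One background worker that runs submitted jobs in order."""

    def __init__(self, run_pipeline: Callable[[PipelineContext], object]) -> None:
        self._run = run_pipeline
        self._pending: "queue.Queue[Job]" = queue.Queue()
        self.jobs: Dict[str, Job] = {}
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, context: PipelineContext) -> str:
        """Enqueue a job and return its id immediately, without blocking."""
        job = Job(context)
        self.jobs[job.id] = job
        self._pending.put(job)
        return job.id

    def wait_all(self) -> None:
        """Block until every submitted job has finished."""
        self._pending.join()

    def _worker(self) -> None:
        while True:
            job = self._pending.get()
            job.status = "running"
            try:
                job.result = self._run(job.context)
                job.status = "done"
            except Exception as exc:  # keep the worker alive if one job fails
                job.result = exc
                job.status = "failed"
            finally:
                self._pending.task_done()
```

A caller would pass the real generation function as `run_pipeline`, call `submit(PipelineContext(prompt="a cat"))`, and poll `jobs[job_id].status` from the frontend. Because the context is a frozen dataclass, every caller hands the pipeline the same immutable shape, which is what keeps per-UI branches out of the core.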