llmedge is a lightweight toolkit for running LLM inference, vision models, and multimodal utilities on-device (Android/native). It bundles JNI/C++ inference bindings powered by llama.cpp and stable-diffusion.cpp, Kotlin APIs for Android, and comprehensive example applications.

Highlights

Core Features:

  • Native C++ inference via llama.cpp (GGUF model support)
  • Kotlin API for Android with coroutines and Flow support
  • Automatic CPU feature detection (FP16, dotprod, SVE, i8mm)
  • Optional Vulkan acceleration for compatible devices
  • Memory-aware context size capping
  • Optimized inference: KV cache reuse across multi-turn conversations, significantly reducing latency on follow-up prompts
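The memory-aware context capping above can be sketched as a simple budget check: shrink the requested context length so the KV cache fits in a fraction of free RAM. The function name and the constants (≈512 KiB of KV cache per token, half of free RAM) are illustrative assumptions, not llmedge's actual API or values.

```kotlin
// Illustrative sketch: cap the requested context length so the KV cache
// fits within a fraction of the device's available RAM.
fun capContextSize(
    requestedCtx: Int,
    availableRamBytes: Long,
    bytesPerToken: Long = 512L * 1024,   // rough fp16 KV-cache cost per token (7B-class model)
    ramBudgetFraction: Double = 0.5,     // leave headroom for the app and OS
): Int {
    val budget = (availableRamBytes * ramBudgetFraction).toLong()
    val maxTokens = (budget / bytesPerToken).toInt()
    // Never go below a usable minimum context.
    return minOf(requestedCtx, maxTokens).coerceAtLeast(512)
}
```

With 2 GiB free, a 32k-token request would be capped to roughly 2k tokens under these assumed constants, while small requests pass through unchanged.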

Generative AI Capabilities:

  • Image Generation: Stable Diffusion integration for on-device image generation with:

    • EasyCache: Automatically detected and enabled for supported models (DiT architecture) to accelerate generation.
    • LoRA Support: Apply Low-Rank Adaptation models (e.g., for style transfer) with automatic downloading from Hugging Face.
  • Video Generation: Generate short video clips (4-64 frames) from text using Wan models with sequential loading for lower RAM usage.
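The "sequential loading" idea above generalizes to any staged pipeline: load one stage's weights, run it, and free them before loading the next, so peak RAM holds a single stage at a time. The stage names and classes below are hypothetical stand-ins, not llmedge's real types.

```kotlin
// Illustrative sketch of sequential loading: each pipeline stage
// (e.g. text encoder, diffusion model, VAE decoder) is loaded, run,
// and unloaded before the next stage starts.
class Stage(val name: String, val run: (String) -> String) {
    var loaded = false
        private set
    fun load() { loaded = true }      // stand-in for reading weights into RAM
    fun unload() { loaded = false }   // stand-in for freeing those weights
}

fun runSequentially(stages: List<Stage>, input: String): String {
    var x = input
    for (stage in stages) {
        stage.load()
        x = stage.run(x)
        stage.unload()                // release before the next stage loads
    }
    return x
}
```

The trade-off is extra load time per stage in exchange for a much lower peak-memory footprint, which is what makes multi-stage video generation feasible on phones.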

Speech Capabilities:

  • Speech-to-Text (STT): Whisper.cpp integration for audio transcription with:

    • Timestamp support for subtitles
    • Language detection
    • SRT subtitle generation
    • Real-time streaming transcription
    • Runs well on mobile with the tiny and base Whisper models
  • Text-to-Speech (TTS): Bark.cpp integration for neural speech synthesis, with:

    • High-quality voice generation
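SRT generation from timestamped segments is mostly a formatting exercise. The sketch below shows one way to render (startMs, endMs, text) segments, as a Whisper-style transcription might produce, into SRT blocks; the `Segment` type and function names are illustrative, not llmedge's actual API.

```kotlin
// Illustrative sketch: format timestamped transcription segments as SRT.
data class Segment(val startMs: Long, val endMs: Long, val text: String)

// SRT timestamps use the form HH:MM:SS,mmm (comma before milliseconds).
fun msToSrtTime(ms: Long): String {
    val h = ms / 3_600_000
    val m = (ms % 3_600_000) / 60_000
    val s = (ms % 60_000) / 1_000
    val frac = ms % 1_000
    return "%02d:%02d:%02d,%03d".format(h, m, s, frac)
}

fun toSrt(segments: List<Segment>): String =
    segments.mapIndexed { i, seg ->
        "${i + 1}\n${msToSrtTime(seg.startMs)} --> ${msToSrtTime(seg.endMs)}\n${seg.text}\n"
    }.joinToString("\n")
```

Each block carries a 1-based index, a `start --> end` line, and the caption text, separated by blank lines.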

Multimodal Capabilities:

  • OCR: Google ML Kit Text Recognition integration
  • Image processing utilities with orientation handling
  • Vision model interfaces (prepared for LLaVA-style models)

RAG Pipeline:

  • PDF text extraction with PDFBox
  • Sentence embeddings via ONNX Runtime
  • Text chunking with configurable overlap
  • In-memory vector store with JSON persistence
  • Context-aware question answering
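Two of the RAG building blocks above, chunking with overlap and a brute-force in-memory vector store, can be sketched in a few lines. This is a minimal illustration of the ideas, not llmedge's actual classes, and it chunks by characters where a real pipeline would likely split on sentence or token boundaries.

```kotlin
// Illustrative sketch: fixed-size chunking with configurable overlap.
fun chunk(text: String, size: Int, overlap: Int): List<String> {
    require(overlap < size) { "overlap must be smaller than chunk size" }
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < text.length) {
        val end = minOf(start + size, text.length)
        chunks += text.substring(start, end)
        if (end == text.length) break
        start = end - overlap   // next chunk re-reads `overlap` characters
    }
    return chunks
}

fun cosine(a: FloatArray, b: FloatArray): Double {
    var dot = 0.0; var na = 0.0; var nb = 0.0
    for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
    return dot / (Math.sqrt(na) * Math.sqrt(nb))
}

// Illustrative in-memory store: linear scan ranked by cosine similarity.
class VectorStore {
    private val entries = mutableListOf<Pair<String, FloatArray>>()
    fun add(text: String, embedding: FloatArray) { entries += text to embedding }
    fun topK(query: FloatArray, k: Int): List<String> =
        entries.sortedByDescending { cosine(query, it.second) }.take(k).map { it.first }
}
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk; a linear cosine scan is typically fast enough at on-device corpus sizes.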

Hugging Face Integration:

  • Direct model downloads from HF Hub
  • Smart quantization selection
  • Private repository support with tokens
  • Large file handling via Android DownloadManager
  • Automatic caching and mirror resolution
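"Smart quantization selection" can be thought of as: given the GGUF files available in a repo and a RAM budget, pick the highest-quality quantization that still fits. The preference order and matching-by-filename below are assumptions for illustration, not llmedge's actual selection logic.

```kotlin
// Illustrative sketch: choose the best-quality GGUF quant that fits a RAM budget.
// Ordered from highest quality (largest) to lowest.
val quantPreference = listOf("Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K")

// `files` maps filename -> size in bytes, as a repo listing might provide.
fun pickQuant(files: Map<String, Long>, ramBudgetBytes: Long): String? =
    quantPreference.firstNotNullOfOrNull { quant ->
        files.entries.firstOrNull { (name, size) ->
            name.contains(quant) && size <= ramBudgetBytes
        }?.key
    }
```

For example, with a 6 GB budget and Q8_0 (8 GB), Q4_K_M (4.2 GB) files available, the Q4_K_M file would be selected as the best fit.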

Developer Experience:

  • Comprehensive example apps demonstrating all features
  • Built-in memory metrics and performance monitoring
  • Reasoning control API (thinking mode)
  • Streaming and blocking generation modes
  • Detailed documentation and troubleshooting guides
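The difference between the two generation modes is whether tokens are consumed as they are decoded or collected into one result. llmedge exposes streaming via coroutines/Flow; the dependency-free sketch below uses Kotlin's `Sequence` as a stand-in, and the token output is hard-coded for illustration.

```kotlin
// Illustrative sketch: streaming vs. blocking generation.
// A Sequence stands in for a coroutine Flow of decoded tokens.
fun generateStream(prompt: String): Sequence<String> = sequence {
    // Stand-in for token-by-token decoding of `prompt`.
    for (token in listOf("Hello", ",", " world", "!")) yield(token)
}

// Blocking mode simply drains the same stream into one string.
fun generateBlocking(prompt: String): String =
    generateStream(prompt).joinToString("")
```

Streaming lets a UI render partial output as soon as the first token arrives; blocking is simpler when the caller only needs the final string.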

Getting Started

Begin with the Installation section, then explore the Usage guide for API details. Check out llmedge-examples for complete working applications.