Home
llmedge is a lightweight toolkit for running LLM inference, vision models, and multimodal utilities on-device (Android/native). It bundles JNI/C++ inference bindings powered by llama.cpp and stable-diffusion.cpp, Kotlin APIs for Android, and comprehensive example applications.
Highlights
Core Features:
- Native C++ inference via llama.cpp (GGUF model support)
- Kotlin API for Android with coroutines and Flow support
- Automatic CPU feature detection (FP16, dotprod, SVE, i8mm)
- Optional Vulkan acceleration for compatible devices
- Memory-aware context size capping
- KV cache reuse for multi-turn conversations, significantly reducing latency on subsequent prompts
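Memory-aware context capping can be sketched as a simple budget check: shrink the requested context length until the KV cache fits in a fraction of free RAM. This is an illustrative sketch only — the function name, per-token cost, and headroom fraction are assumptions, not llmedge's actual implementation.

```kotlin
// Illustrative sketch of memory-aware context capping (hypothetical helper,
// not llmedge's actual code). Caps the requested context length so the KV
// cache fits within a fraction of the device's free RAM.
fun capContextSize(
    requestedCtx: Int,
    freeRamBytes: Long,
    bytesPerToken: Long = 512L * 1024 / 4, // rough KV-cache cost per token; model-dependent
    ramFraction: Double = 0.5,             // leave headroom for the rest of the app
): Int {
    val budget = (freeRamBytes * ramFraction).toLong()
    val maxTokens = (budget / bytesPerToken).toInt()
    // Never go below a usable floor, never exceed the caller's request.
    return minOf(requestedCtx, maxTokens).coerceAtLeast(512)
}
```

On Android, `freeRamBytes` would typically come from `ActivityManager.MemoryInfo.availMem`; the right `bytesPerToken` depends on the model's layer count, head dimensions, and KV cache precision.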
Generative AI Capabilities:
- Image Generation: Stable Diffusion integration for on-device image generation with:
  - EasyCache: automatically detected and enabled for supported models (DiT architecture) to accelerate generation
  - LoRA Support: apply Low-Rank Adaptation models (e.g., for style transfer) with automatic downloading from Hugging Face
- Video Generation: generate short video clips (4-64 frames) from text using Wan models, with sequential loading for lower RAM usage
Speech Capabilities:
- Speech-to-Text (STT): Whisper.cpp integration for audio transcription with:
  - Timestamp support for subtitles
  - Language detection
  - SRT subtitle generation
  - Real-time streaming transcription
  - Good mobile performance with tiny/base models
- Text-to-Speech (TTS): Bark.cpp integration for high-quality neural speech synthesis
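SRT subtitle generation from timestamped transcription segments is straightforward to sketch: each segment becomes a numbered cue with `HH:MM:SS,mmm` start/end times. The `Segment` type and function names below are hypothetical, shown only to illustrate the format.

```kotlin
// Illustrative sketch: converting timestamped transcription segments into the
// SubRip (SRT) subtitle format. The Segment type is hypothetical.
data class Segment(val startMs: Long, val endMs: Long, val text: String)

// SRT timestamps use the form HH:MM:SS,mmm (comma before milliseconds).
fun formatSrtTime(ms: Long): String {
    val h = ms / 3_600_000
    val m = ms / 60_000 % 60
    val s = ms / 1_000 % 60
    val millis = ms % 1_000
    return "%02d:%02d:%02d,%03d".format(h, m, s, millis)
}

fun toSrt(segments: List<Segment>): String =
    segments.mapIndexed { i, seg ->
        "${i + 1}\n${formatSrtTime(seg.startMs)} --> ${formatSrtTime(seg.endMs)}\n${seg.text}\n"
    }.joinToString("\n")
```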
Multimodal Capabilities:
- OCR: Google ML Kit Text Recognition integration
- Image processing utilities with orientation handling
- Vision model interfaces (prepared for LLaVA-style models)
RAG Pipeline:
- PDF text extraction with PDFBox
- Sentence embeddings via ONNX Runtime
- Text chunking with configurable overlap
- In-memory vector store with JSON persistence
- Context-aware question answering
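Chunking with configurable overlap, as used in RAG pipelines, can be sketched as a sliding window that advances by `chunkSize - overlap` so neighboring chunks share context. This character-based version is a simplification — a real pipeline would usually split on sentence or token boundaries — and the function name is an assumption, not llmedge's API.

```kotlin
// Illustrative sketch of fixed-size text chunking with configurable overlap
// (character-based for simplicity; hypothetical helper, not llmedge's API).
fun chunkText(text: String, chunkSize: Int, overlap: Int): List<String> {
    require(overlap < chunkSize) { "overlap must be smaller than chunkSize" }
    val chunks = mutableListOf<String>()
    val step = chunkSize - overlap // how far the window advances each iteration
    var start = 0
    while (start < text.length) {
        chunks.add(text.substring(start, minOf(start + chunkSize, text.length)))
        start += step
    }
    return chunks
}
```

Overlap trades storage and embedding cost for recall: a sentence that straddles a chunk boundary still appears intact in at least one chunk.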
Hugging Face Integration:
- Direct model downloads from HF Hub
- Smart quantization selection
- Private repository support with tokens
- Large file handling via Android DownloadManager
- Automatic caching and mirror resolution
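"Smart" quantization selection generally means matching a GGUF quantization level to the device's available memory. The thresholds and selection policy below are illustrative assumptions — not llmedge's actual logic — though the quant names follow standard GGUF conventions.

```kotlin
// Illustrative sketch of quantization selection keyed on available RAM.
// Thresholds are assumptions; quant names follow GGUF conventions.
fun pickQuant(freeRamGb: Double): String = when {
    freeRamGb >= 8.0 -> "Q8_0"   // near-lossless, largest files
    freeRamGb >= 6.0 -> "Q6_K"
    freeRamGb >= 4.0 -> "Q4_K_M" // common quality/size balance for mid-range phones
    else             -> "Q2_K"   // smallest, with noticeable quality loss
}
```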
Developer Experience:
- Comprehensive example apps demonstrating all features
- Built-in memory metrics and performance monitoring
- Reasoning control API (thinking mode)
- Streaming and blocking generation modes
- Detailed documentation and troubleshooting guides
Quick links
- Installation — Setup and build instructions
- Usage — API guide and code patterns
- Examples — Sample applications and snippets
- Architecture — System design and flow diagrams
- Quirks & Troubleshooting — Common issues and solutions
- FAQ — Frequently asked questions
- Contributing — Development guidelines
Getting Started
Get started by reading the Installation section, then explore the Usage guide for API details. Check out llmedge-examples for complete working applications.