- 01 Runtime target: Android-native multimodal inference
- 02 Backends: OpenCL first, Vulkan fallback, CPU last
- 03 Modalities: LLM, vision, speech, image, and video utilities
- 04 API shape: Kotlin-first facade over JNI/C++
why it matters
- Backends fall back across OpenCL, Vulkan, and CPU instead of betting on one path working.
- It reuses runtimes and keeps context around, so short jobs don't pay native startup every time. That cost adds up on phones.
- On-device RAG, PDF reading, speech, and image generation all live behind one Android toolkit instead of separate demos.
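The runtime reuse mentioned above can be sketched as a small keyed pool. This is only an illustration under stated assumptions: `RuntimePool`, `withRuntime`, and `maxCached` are hypothetical names, not llmedge's API, and the real library may manage lifetimes differently.

```kotlin
// Hypothetical sketch: these names are illustrative, not llmedge's API.

/**
 * Keeps up to [maxCached] runtimes alive, keyed by model path, so repeated
 * short jobs on the same model skip the expensive native initialization.
 */
class RuntimePool<R : AutoCloseable>(
    private val maxCached: Int,
    private val create: (String) -> R, // stands in for costly native setup
) : AutoCloseable {

    // Access-ordered map: iteration starts at the least recently used entry.
    private val cached = LinkedHashMap<String, R>(16, 0.75f, true)

    /** Runs [block] against a cached runtime, creating one only on a miss. */
    @Synchronized
    fun <T> withRuntime(modelPath: String, block: (R) -> T): T {
        val runtime = cached.getOrPut(modelPath) { create(modelPath) }
        evictOldest(keep = modelPath)
        return block(runtime)
    }

    // Closes least recently used runtimes until the pool fits its budget.
    private fun evictOldest(keep: String) {
        val entries = cached.entries.iterator()
        while (cached.size > maxCached && entries.hasNext()) {
            val entry = entries.next()
            if (entry.key != keep) {
                entry.value.close()
                entries.remove()
            }
        }
    }

    /** Releases every cached runtime. */
    @Synchronized
    override fun close() {
        cached.values.forEach { it.close() }
        cached.clear()
    }
}
```

The sketch holds its lock for the whole call, which serializes jobs; that keeps it short, but a production pool would likely want finer-grained locking.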
engineering notes
llmedge is an Android toolkit for running AI models on the device: LLMs, image generation, speech, and embeddings, all behind one Kotlin API, with the native engines and their failure modes kept out of app code.
The problem
Running local AI on Android is mostly a systems problem:
- devices vary a lot between vendors
- GPU paths break differently on different phones
- model files are big and annoying to move around
- native runtimes are expensive to start up over and over
- and apps still want a clean Kotlin API
Wrapping a single backend doesn’t cover that. The runtime has to handle the model files, device checks, and fallback itself.
What I decided
- Kotlin handles the app-facing API, while JNI wraps the C++ code so the UI layer never touches native details. Coroutines and Flow keep the public API idiomatic even though the internals are heavily native.
- Backends fall back in a set order: OpenCL if the device can do it, Vulkan if not, CPU as the safe last resort. The coordinator checks what works and stops retrying paths that already failed on that phone.
- Runtimes get pooled and reused, so a session doesn’t re-init native code on every call. On phones that startup cost can account for most of a short job’s runtime.
- Model files are the library’s job. Downloading, resuming, validating, and caching all happen in ModelRepository instead of every app redoing it.
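The ordered fallback with failure memory described in the list above might look roughly like the following. This is a minimal sketch under assumptions: `Backend`, `BackendCoordinator`, and `runWithFallback` are hypothetical names, and failures are remembered in memory only, whereas a real coordinator would presumably persist them per device.

```kotlin
// Hypothetical sketch: these names are illustrative, not llmedge's API.

enum class Backend { OPENCL, VULKAN, CPU }

/**
 * Tries backends in a fixed order and remembers which ones failed,
 * so later calls skip them instead of retrying a broken path.
 */
class BackendCoordinator(
    private val order: List<Backend> =
        listOf(Backend.OPENCL, Backend.VULKAN, Backend.CPU),
) {
    private val failed = mutableSetOf<Backend>()

    /** Snapshot of the backends currently being skipped. */
    val failedBackends: Set<Backend> get() = failed.toSet()

    /** Runs [attempt] on the first backend that does not throw. */
    fun <T> runWithFallback(attempt: (Backend) -> T): T {
        var lastError: Exception? = null
        for (backend in order) {
            if (backend in failed) continue
            try {
                return attempt(backend)
            } catch (e: Exception) {
                // Assumption: the last backend is never blacklisted, so a
                // transient failure there cannot lock the app out entirely.
                if (backend != order.last()) failed += backend
                lastError = e
            }
        }
        throw IllegalStateException("No usable backend; tried $order", lastError)
    }
}
```

A caller would pass the actual work as the lambda, for example `coordinator.runWithFallback { backend -> loadModel(path, backend) }`, where `loadModel` is again a placeholder for whatever native entry point the app uses.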
What it can do
- LLM inference (GGUF via llama.cpp), with KV-cache reuse for multi-turn chat
- image generation through stable-diffusion.cpp
- speech-to-text and text-to-speech (whisper.cpp, bark.cpp)
- embeddings, RAG, and PDF reading via ONNX utilities
- example Android apps that exercise the whole API
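KV-cache reuse for multi-turn chat, listed above, usually comes down to evaluating only the part of the prompt that the cache has not already seen. The sketch below shows that bookkeeping in isolation. `ChatSession`, `feed`, and the `evaluate` callback are hypothetical names that stand in for the native call; this is not how llmedge or llama.cpp expose it.

```kotlin
// Hypothetical sketch: these names are illustrative, not llmedge's API.

/** Length of the shared leading run of two token sequences. */
fun commonPrefixLength(a: List<Int>, b: List<Int>): Int {
    var i = 0
    while (i < a.size && i < b.size && a[i] == b[i]) i++
    return i
}

/**
 * Tracks which tokens the (imaginary) native KV cache already holds and
 * sends only the new suffix to [evaluate] on each turn.
 */
class ChatSession(
    // Stand-in for the native decode call: new tokens plus their start position.
    private val evaluate: (tokens: List<Int>, startPos: Int) -> Unit,
) {
    private var cached: List<Int> = emptyList()

    /** Evaluates only the uncached tail and returns how many tokens that was. */
    fun feed(prompt: List<Int>): Int {
        val keep = commonPrefixLength(cached, prompt)
        val fresh = prompt.subList(keep, prompt.size)
        // A real runtime would also drop cache entries past `keep` here,
        // since they belong to a branch of the conversation that diverged.
        if (fresh.isNotEmpty()) evaluate(fresh, keep)
        cached = prompt.toList()
        return fresh.size
    }
}
```

In a chat, each turn's prompt is the previous transcript plus the new message, so the shared prefix is long and only the latest message needs evaluating.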
Where it runs
EasyReader ships llmedge in production for on-device chapter summaries, so the user’s text never leaves the phone. The llmedge-examples repo has each feature as its own small Android app.