This page explains how to use llmedge's Kotlin API. The library offers two layers of abstraction:
- High-Level API (`LLMEdgeManager`): Recommended for most use cases. Handles loading, caching, threading, and memory management automatically.
- Low-Level API (`SmolLM`, `StableDiffusion`): For advanced users who need fine-grained control over model lifecycle and parameters.
Examples reference the llmedge-examples repo.
High-Level API (LLMEdgeManager)
The LLMEdgeManager singleton simplifies interactions by managing model instances and resources for you.
Text Generation
val response = LLMEdgeManager.generateText(
context = context,
params = LLMEdgeManager.TextGenerationParams(
prompt = "Write a haiku about Kotlin.",
modelId = "HuggingFaceTB/SmolLM-135M-Instruct-GGUF", // Optional: defaults to this model
modelFilename = "smollm-135m-instruct.q4_k_m.gguf"
)
)
Image Generation
Automatically handles model downloading (MeinaMix), caching, and memory-safe loading.
val bitmap = LLMEdgeManager.generateImage(
context = context,
params = LLMEdgeManager.ImageGenerationParams(
prompt = "A cyberpunk city street at night, neon lights <lora:detail_tweaker_lora_sd15:1.0>",
width = 512,
height = 512,
steps = 20,
// Optional: Specify LoRA model details.
// The example app automatically downloads 'imagepipeline/Detail-Tweaker-LoRA-SD1.5'
// and appends the appropriate LoRA tag to the prompt when the toggle is enabled.
loraModelDir = getExternalFilesDir("loras")?.absolutePath + "/detail-tweaker-lora-sd15",
loraApplyMode = StableDiffusion.LoraApplyMode.AUTO
)
)
Key Optimizations for Image Generation:
- EasyCache: Automatically detected and enabled by `LLMEdgeManager` for supported Diffusion Transformer (DiT) models (e.g., Flux, Wan), significantly accelerating generation by reusing intermediate diffusion states. It is disabled for UNet-based models (like SD 1.5).
- LoRA Support: `LLMEdgeManager` can be configured with `loraModelDir` and `loraApplyMode` to use Low-Rank Adaptation (LoRA) models for fine-tuning outputs.
- Flash Attention: Automatically enabled for compatible image dimensions.
Video Generation (Wan 2.1)
Handles the complex multi-model loading (Diffusion, VAE, T5) and sequential processing required for video generation on mobile.
val frames = LLMEdgeManager.generateVideo(
context = context,
params = LLMEdgeManager.VideoGenerationParams(
prompt = "A robot dancing in the rain",
videoFrames = 16,
width = 512,
height = 512,
steps = 20,
cfgScale = 7.0f,
flowShift = 3.0f,
forceSequentialLoad = true // Recommended for devices with <12GB RAM
)
) { status, currentFrame, totalFrames ->
// Optional progress callback
Log.d("Video", "$status")
}
Vision Analysis
Analyze images using a Vision Language Model (VLM).
val description = LLMEdgeManager.analyzeImage(
context = context,
params = LLMEdgeManager.VisionAnalysisParams(
image = bitmap,
prompt = "What is in this image?"
)
)
OCR (Text Extraction)
Extract text using ML Kit.
val text = LLMEdgeManager.extractText(context, bitmap)
Speech-to-Text (Whisper)
Transcribe audio using the high-level API:
import io.aatricks.llmedge.LLMEdgeManager
// Simple transcription to text
val text = LLMEdgeManager.transcribeAudioToText(
context = context,
audioSamples = audioSamples // 16kHz mono PCM float32
)
// Full transcription with segments and timing
val segments = LLMEdgeManager.transcribeAudio(
context = context,
params = LLMEdgeManager.TranscriptionParams(
audioSamples = audioSamples,
language = "en", // null for auto-detect
translate = false // set true to translate to English
)
) { progress ->
Log.d("Whisper", "Progress: $progress%")
}
// Language detection
val lang = LLMEdgeManager.detectLanguage(context, audioSamples)
// Generate SRT subtitles
val srt = LLMEdgeManager.transcribeToSrt(context, audioSamples)
Streaming Transcription (Real-time Captioning)
For live transcription from a microphone or audio stream, use the streaming API:
import io.aatricks.llmedge.LLMEdgeManager
import kotlinx.coroutines.launch
// Create a streaming transcriber with sliding window approach
val transcriber = LLMEdgeManager.createStreamingTranscriber(
context = context,
params = LLMEdgeManager.StreamingTranscriptionParams(
stepMs = 3000, // Run transcription every 3 seconds
lengthMs = 10000, // Use 10-second audio windows
keepMs = 200, // Keep 200ms overlap for context
language = "en", // null for auto-detect
useVad = true // Skip silent audio
)
)
// Collect real-time transcription results
launch {
transcriber.start().collect { segment ->
runOnUiThread {
textView.append("${segment.text}\n")
}
}
}
// Feed audio samples from microphone (16kHz mono PCM float32)
audioRecorder.setOnAudioDataListener { samples ->
lifecycleScope.launch {
transcriber.feedAudio(samples)
}
}
// Stop when done
fun stopTranscription() {
transcriber.stop()
LLMEdgeManager.stopStreamingTranscription()
}
Streaming Parameters Explained:
| Parameter | Default | Description |
|---|---|---|
| `stepMs` | `3000` | How often transcription runs (lower = faster updates) |
| `lengthMs` | `10000` | Audio window size (longer = more accurate) |
| `keepMs` | `200` | Overlap with previous window for context |
| `vadThreshold` | `0.6` | Voice activity threshold (0.0-1.0) |
| `useVad` | `true` | Skip transcription during silence |
Preset Configurations:
- Fast captioning: `stepMs = 1000, lengthMs = 5000` (quick updates, lower accuracy)
- Balanced (default): `stepMs = 3000, lengthMs = 10000` (good tradeoff)
- High accuracy: `stepMs = 5000, lengthMs = 15000` (better accuracy, more delay)
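A minimal sketch of the fast-captioning preset, built only from the parameters documented in the table above:

```kotlin
// Sketch: the "Fast captioning" preset expressed as StreamingTranscriptionParams.
// Swap in the balanced or high-accuracy values from the list above as needed.
val fastCaptioning = LLMEdgeManager.StreamingTranscriptionParams(
    stepMs = 1000,    // transcribe every second for quick caption updates
    lengthMs = 5000,  // shorter window: lower latency, lower accuracy
    keepMs = 200,     // small overlap so words at window edges keep context
    language = null,  // auto-detect
    useVad = true     // skip silent stretches
)

val fastTranscriber = LLMEdgeManager.createStreamingTranscriber(
    context = context,
    params = fastCaptioning
)
```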
Text-to-Speech (Bark)
Generate speech using the high-level API:
import io.aatricks.llmedge.LLMEdgeManager
// Generate speech (model auto-downloads on first use)
val audio = LLMEdgeManager.synthesizeSpeech(
context = context,
params = LLMEdgeManager.SpeechSynthesisParams(
text = "Hello, world!",
nThreads = 8 // Use more threads for faster generation
)
) { step, progress ->
Log.d("Bark", "${step.name}: $progress%")
}
// Save directly to file
LLMEdgeManager.synthesizeSpeechToFile(
context = context,
text = "Hello, world!",
outputFile = File(cacheDir, "output.wav")
)
// Unload when done to free memory
LLMEdgeManager.unloadSpeechModels()
Low-Level API
Direct usage of SmolLM and StableDiffusion classes. Use this if you need to manage the model lifecycle manually (e.g., keeping a model loaded across multiple disparate activities) or require specific configuration not exposed by the Manager.
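For example, a minimal sketch of keeping one instance alive across activities; the `ModelHolder` object below is illustrative and not part of llmedge:

```kotlin
// Illustrative only: ModelHolder is not part of llmedge.
// It parks a loaded SmolLM instance in an application-scoped object so that
// several activities can share it, and frees native memory explicitly.
object ModelHolder {
    var smol: SmolLM? = null
        private set

    fun attach(instance: SmolLM) {
        smol = instance
    }

    fun release() {
        smol?.close()   // free native memory once no screen needs the model
        smol = null
    }
}

// Activity A: load as shown below, then ModelHolder.attach(smol)
// Activity B: ModelHolder.smol?.getResponse("...")
// App shutdown / low memory: ModelHolder.release()
```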
Core components
- `SmolLM` — Kotlin front-end class that wraps native inference calls.
- `GGUFReader` — C++/JNI reader for GGUF model files.
- `Whisper` — Speech-to-text via whisper.cpp (JNI bindings).
- `BarkTTS` — Text-to-speech via bark.cpp (JNI bindings).
- Vision helpers — `ImageUnderstanding`, `OcrEngine` (with `MlKitOcrEngine` implementation).
- RAG helpers — `RAGEngine`, `VectorStore`, `PDFReader`, `EmbeddingProvider`.
Basic LLM Inference
Load a GGUF model and run inference:
val smol = SmolLM()
smol.load(modelPath, InferenceParams(numThreads = 4, contextSize = 4096L))
val reply = smol.getResponse("Your prompt here")
smol.close() // Free native memory when done
See Examples for complete working code with coroutines and streaming.
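For streaming output, a short sketch using `getResponseAsFlow` from the API reference at the end of this page (`lifecycleScope` and `textView` assume an Activity, as in the other examples):

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.flowOn
import kotlinx.coroutines.launch

// Sketch: streaming generation with getResponseAsFlow (see the API reference below).
// Native decoding runs on Dispatchers.Default; partial text is appended on Main.
lifecycleScope.launch {
    smol.getResponseAsFlow("Summarize coroutines in one sentence.")
        .flowOn(Dispatchers.Default)   // keep token generation off the main thread
        .collect { piece ->
            textView.append(piece)     // lifecycleScope collects on Main by default
        }
}
```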
Downloading Models from Hugging Face
Download and load models directly from Hugging Face Hub:
val download = smol.loadFromHuggingFace(
context = context,
modelId = "unsloth/Qwen3-0.6B-GGUF",
filename = "Qwen3-0.6B-Q4_K_M.gguf",
params = InferenceParams(contextSize = 4096L),
preferSystemDownloader = true,
onProgress = { downloaded, total -> /* update UI */ }
)
For Wan video models (multi-asset: diffusion, VAE and encoder), use:
val sdWan = StableDiffusion.loadFromHuggingFace(
context = context,
modelId = "wan/Wan2.1-T2V-1.3B",
preferSystemDownloader = true,
onProgress = { name, downloaded, total -> /* update progress */ }
)
Key features:
- Downloads are cached automatically
- Supports private repositories via the `token` parameter (see the sketch below)
- Uses Android DownloadManager for large files to avoid heap pressure
- Auto-resolves model aliases and mirrors
- Context size is auto-capped based on device heap (override via `InferenceParams`)
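For a private repository, a hedged sketch passing the `token` parameter mentioned above; the repo id, filename, and the token's exact position in the signature are placeholders, not a definitive API:

```kotlin
// Sketch: loading from a private Hugging Face repository.
// The token parameter is documented above; repo id, filename, and the exact
// argument order here are placeholders.
val privateDownload = smol.loadFromHuggingFace(
    context = context,
    modelId = "your-org/your-private-model-GGUF",   // hypothetical private repo
    filename = "model.Q4_K_M.gguf",                 // hypothetical filename
    params = InferenceParams(contextSize = 4096L),
    token = "hf_xxxxxxxxxxxx",                      // Hugging Face access token
    preferSystemDownloader = true,
    onProgress = { downloaded, total -> /* update UI */ }
)
```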
See HuggingFaceDemoActivity example for a complete implementation with progress updates and error handling.
Reasoning Controls
Control "thinking" traces in reasoning-aware models:
// Disable thinking at load time
val params = InferenceParams(
thinkingMode = ThinkingMode.DISABLED,
reasoningBudget = 0
)
smol.load(modelPath, params)
// Toggle at runtime
smol.setThinkingEnabled(false) // disable
smol.setReasoningBudget(-1) // unrestricted
- `reasoningBudget = 0`: thinking disabled
- `reasoningBudget = -1`: unrestricted (default)
- The library auto-injects `/no_think` tags when thinking is disabled
Image Text Extraction (OCR)
Extract text from images using Google ML Kit:
val mlKitEngine = MlKitOcrEngine(context)
val result = mlKitEngine.extractText(ImageSource.FileSource(imageFile))
println("Extracted: ${result.text}")
Vision modes:
- `AUTO_PREFER_OCR`: Try OCR first, fall back to vision
- `AUTO_PREFER_VISION`: Try vision first, fall back to OCR
- `FORCE_MLKIT`: ML Kit only
- `FORCE_VISION`: Vision model only
Use ImageUnderstanding to orchestrate between OCR and vision models with automatic fallback.
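A sketch of that orchestration, using the `process(image, mode, prompt)` signature from the API reference below; how `ImageUnderstanding` is constructed and wired to its engines is an assumption here:

```kotlin
// Sketch: OCR-first analysis with automatic fallback to a vision model.
// The process(image, mode, prompt) signature comes from the API reference below;
// the ImageUnderstanding constructor arguments are an assumption.
val understanding = ImageUnderstanding(context)
val result = understanding.process(
    image = ImageSource.FileSource(imageFile),
    mode = VisionMode.AUTO_PREFER_OCR,   // try ML Kit OCR first, fall back to vision
    prompt = "Describe any text or objects in this image."
)
println(result)
```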
See ImageToTextActivity example for complete implementation including camera capture.
Vision Models (Low-Level)
The library has interfaces for vision-capable LLMs (LLaVA-style models):
interface VisionModelAnalyzer {
suspend fun analyze(image: ImageSource, prompt: String): VisionResult
fun hasVisionCapabilities(): Boolean
}
Status: Architecture is prepared, but native vision support from llama.cpp is still being integrated for Android. Currently use OCR for text extraction. See LlavaVisionActivity example for the prepared integration pattern.
Speech-to-Text (Whisper Low-Level)
Use Whisper directly for fine-grained control:
import io.aatricks.llmedge.Whisper
// Load model with options
val whisper = Whisper.load(
modelPath = "/path/to/ggml-base.bin",
useGpu = false
)
// Configure transcription parameters
val params = Whisper.TranscriptionParams(
language = "en", // null for auto-detect
translate = false, // translate to English
maxTokens = 0, // 0 = no limit
speedUp = false, // experimental 2x speedup
audioCtx = 0 // audio context size
)
// Transcribe (16kHz mono PCM float32)
val segments = whisper.transcribe(audioSamples, params)
segments.forEach { segment ->
println("[${segment.startTimeMs}-${segment.endTimeMs}ms] ${segment.text}")
}
// Utility functions
val srt = whisper.transcribeToSrt(audioSamples)
val lang = whisper.detectLanguage(audioSamples)
val isMultilingual = whisper.isMultilingual()
val modelType = whisper.getModelType()
whisper.close()
Model sources:
- HuggingFace: `ggerganov/whisper.cpp` (ggml-tiny.bin, ggml-base.bin, ggml-small.bin)
- Sizes: tiny (~75MB), base (~142MB), small (~466MB)
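`Whisper.loadFromHuggingFace(...)` (listed in the API reference below) can fetch one of these files directly; the parameter names here mirror the `SmolLM` variant and are assumptions:

```kotlin
// Sketch: downloading a Whisper model from the ggerganov/whisper.cpp repo.
// Parameter names mirror SmolLM.loadFromHuggingFace and are assumptions.
val whisper = Whisper.loadFromHuggingFace(
    context = context,
    modelId = "ggerganov/whisper.cpp",
    filename = "ggml-base.bin",
    preferSystemDownloader = true,
    onProgress = { downloaded, total -> /* update UI */ }
)
```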
Text-to-Speech (Bark Low-Level)
Use BarkTTS directly:
import io.aatricks.llmedge.BarkTTS
// Load model
val tts = BarkTTS.load(
modelPath = "/path/to/bark-small_weights-f16.bin",
useGpu = false
)
// Generate audio with progress tracking
val audio = tts.generate(
text = "Hello, world!",
onProgress = { step, progress ->
// step: SEMANTIC, COARSE, or FINE
// progress: 0.0 to 1.0
Log.d("Bark", "${step.name}: ${(progress * 100).toInt()}%")
}
)
// AudioResult contains:
// - samples: FloatArray (32-bit PCM)
// - sampleRate: Int (typically 24000)
// - durationSeconds: Float
// Save as WAV
tts.saveAsWav(audio, File("/path/to/output.wav"))
tts.close()
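To play the result, the standard Android `AudioTrack` API works with the `samples` and `sampleRate` fields shown above (this is plain Android code, not part of llmedge):

```kotlin
import android.media.AudioAttributes
import android.media.AudioFormat
import android.media.AudioTrack

// Plays Bark output through Android's AudioTrack (not part of llmedge).
// Expects 32-bit float PCM samples and their sample rate, as documented above.
fun playAudio(samples: FloatArray, sampleRate: Int) {
    val track = AudioTrack.Builder()
        .setAudioAttributes(
            AudioAttributes.Builder()
                .setUsage(AudioAttributes.USAGE_MEDIA)
                .setContentType(AudioAttributes.CONTENT_TYPE_SPEECH)
                .build()
        )
        .setAudioFormat(
            AudioFormat.Builder()
                .setEncoding(AudioFormat.ENCODING_PCM_FLOAT)
                .setSampleRate(sampleRate)
                .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
                .build()
        )
        .setTransferMode(AudioTrack.MODE_STATIC)                 // whole clip fits in memory
        .setBufferSizeInBytes(samples.size * Float.SIZE_BYTES)   // bytes, not samples
        .build()

    track.write(samples, 0, samples.size, AudioTrack.WRITE_BLOCKING)
    track.play()
}

// Usage: playAudio(audio.samples, audio.sampleRate)
```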
Model sources:
- HuggingFace: `Green-Sky/bark-ggml` (bark-small_weights-f16.bin, bark_weights-f16.bin)
- Sizes: small (~843MB), full (~2.2GB)
Stable Diffusion (Image & Video Generation)
Generate images and video on-device using Stable Diffusion and Wan models:
Image Generation:
val sd = StableDiffusion.load(
context = context,
modelId = "Meina/MeinaMix",
offloadToCpu = true,
keepClipOnCpu = true,
// Optional: Load with LoRA
loraModelDir = "/path/to/your/lora/files", // Directory containing .safetensors
loraApplyMode = StableDiffusion.LoraApplyMode.AUTO
)
val bitmap = sd.txt2img(
GenerateParams(
prompt = "a cute cat <lora:your_lora_name:1.0>", // LoRA tag in prompt
width = 256, height = 256,
steps = 20, cfgScale = 7.0f,
// Optional: EasyCache parameters
easyCacheParams = StableDiffusion.EasyCacheParams(enabled = true, reuseThreshold = 0.2f)
)
)
sd.close()
Video Generation (Wan 2.1):
// Load Wan model (loads diffusion, VAE, and T5 encoder)
val sd = StableDiffusion.loadFromHuggingFace(
context = context,
modelId = "wan/Wan2.1-T2V-1.3B",
preferSystemDownloader = true
)
val frames = sd.txt2vid(
VideoGenerateParams(
prompt = "A cinematic shot of a robot walking",
width = 480, height = 480,
videoFrames = 16,
steps = 20
)
)
sd.close()
Memory management:
- Use small resolutions (128x128 or 256x256) on constrained devices
- Enable CPU offloading flags to reduce native memory pressure
- Always use `preferSystemDownloader = true` for model downloads
- Monitor with `MemoryMetrics` to avoid OOM
See StableDiffusionActivity example for complete implementation with error recovery and adaptive resolution.
Best Practices
Threading:
- For heavy CPU-bound native inference (StableDiffusion CPU generation, large LLM decoding), prefer `Dispatchers.Default` so work is scheduled onto a thread pool sized to the device's CPU cores.
- For blocking calls that wait on I/O (downloads, filesystem access, or JNI calls that block on network/disk), prefer `Dispatchers.IO`.
- Update UI only via `withContext(Dispatchers.Main)`.
- Call `.close()` in `onDestroy()` to free native memory.
Note: Choosing the correct dispatcher depends on the workload. JNI calls that use CPU-bound native kernels (like txt2img) should use Default; calls that perform blocking I/O should use IO.
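A short sketch applying those rules to the low-level image generation call shown earlier (`sd` and `imageView` assume an Activity with a loaded model):

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext

// Applies the dispatcher guidance above: CPU-bound native work on Default,
// UI updates on Main. Uses the StableDiffusion txt2img call shown earlier.
lifecycleScope.launch {
    val bitmap = withContext(Dispatchers.Default) {
        // txt2img runs CPU-bound native kernels: keep it off Main and off IO
        sd.txt2img(
            GenerateParams(
                prompt = "a cute cat",
                width = 256, height = 256,
                steps = 20, cfgScale = 7.0f
            )
        )
    }
    withContext(Dispatchers.Main) {
        imageView.setImageBitmap(bitmap)   // UI work only on the main dispatcher
    }
}
```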
Optimization Strategies:
- Use quantized models (Q4_K_M) for lower memory footprint
- Enable CPU offloading for large models
- Close model instances when not in use
- Process images/video in batches with intermediate cleanup
- KV Cache: For LLMs, ensure `storeChats = true` to leverage the KV cache for faster multi-turn conversations (see the sketch below).
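A sketch of the KV-cache point, assuming `storeChats` is an `InferenceParams` field as the bullet above implies:

```kotlin
// Sketch: multi-turn chat that benefits from the KV cache.
// Assumes storeChats is an InferenceParams field, as the bullet above implies.
val smol = SmolLM()
smol.load(modelPath, InferenceParams(contextSize = 4096L, storeChats = true))
smol.addSystemPrompt("You are a concise assistant.")
val first = smol.getResponse("What is a KV cache?")
val followUp = smol.getResponse("How does it speed up multi-turn chat?") // reuses cached context
smol.close()
```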
See also:
- Architecture for system design and flow diagrams
- Quirks & Troubleshooting for detailed JNI notes and debugging
- Examples for complete working code
API reference
Key methods:
- `LLMEdgeManager.generateText(...)` — High-level text generation
- `LLMEdgeManager.generateImage(...)` — High-level image generation
- `LLMEdgeManager.generateVideo(...)` — High-level video generation
- `SmolLM.load(modelPath: String, params: InferenceParams)` — loads a GGUF model from a path
- `SmolLM.loadFromHuggingFace(...)` — downloads and loads a model from Hugging Face
- `SmolLM.getResponse(query: String): String` — runs blocking generation and returns the complete text
- `SmolLM.getResponseAsFlow(query: String): Flow<String>` — runs streaming generation
- `SmolLM.addSystemPrompt(prompt: String)` — adds a system prompt to the chat history
- `SmolLM.addUserMessage(message: String)` — adds a user message to the chat history
- `SmolLM.close()` — releases native resources
High-Level Speech API (via LLMEdgeManager):
- LLMEdgeManager.transcribeAudioToText(context, audioSamples, language?) — simple audio transcription
- LLMEdgeManager.transcribeAudio(context, params, onProgress?) — full transcription with segments
- LLMEdgeManager.detectLanguage(context, audioSamples) — detect spoken language
- LLMEdgeManager.transcribeToSrt(context, audioSamples, language?) — generate SRT subtitles
- LLMEdgeManager.synthesizeSpeech(context, params, onProgress?) — generate speech from text
- LLMEdgeManager.synthesizeSpeechToFile(context, text, outputFile, onProgress?) — generate and save WAV
- LLMEdgeManager.unloadSpeechModels() — unload all speech models
Low-Level Speech API:
- Whisper.load(modelPath: String, useGpu: Boolean) — loads a Whisper model
- Whisper.loadFromHuggingFace(...) — downloads and loads Whisper from HuggingFace
- Whisper.transcribe(samples: FloatArray, params: TranscriptionParams) — transcribes audio
- Whisper.detectLanguage(samples: FloatArray) — detects spoken language
- Whisper.close() — releases native resources
- BarkTTS.load(modelPath: String, ...) — loads a Bark TTS model
- BarkTTS.loadFromHuggingFace(...) — downloads and loads Bark from HuggingFace
- BarkTTS.generate(text: String, params: GenerateParams) — generates audio from text
- BarkTTS.saveAsWav(audio: AudioResult, filePath: String) — saves audio to WAV file
- BarkTTS.close() — releases native resources
Vision & OCR:
- OcrEngine.extractText(image: ImageSource, params: OcrParams): OcrResult — extracts text from image
- ImageUnderstanding.process(image: ImageSource, mode: VisionMode, prompt: String?) — processes image with vision/OCR
Image & Video:
- StableDiffusion.txt2img(params: GenerateParams): Bitmap — generates an image
- StableDiffusion.txt2vid(params: VideoGenerateParams): List<Bitmap> — generates video frames
Refer to the llmedge-examples activities for complete, working code samples.