This page explains how to use llmedge's Kotlin API. The library offers two layers of abstraction:
- High-Level API (`LLMEdge`): Recommended for most use cases. It exposes instance-scoped clients for text, speech, image, vision, and RAG while keeping model resolution and cleanup explicit.
- Low-Level API (`SmolLM`, `StableDiffusion`): For advanced users who need fine-grained control over model lifecycle and parameters.
Examples reference the llmedge-examples repo.
High-Level API (LLMEdge)
Create an LLMEdge instance from an Android-aware coroutine scope, then use the domain clients exposed by the facade.
Text Generation
```kotlin
val edge = LLMEdge.create(context, viewModelScope)

val response = edge.text.generate(
    prompt = "Write a haiku about Kotlin.",
    model = ModelSpec.huggingFace(
        repoId = "HuggingFaceTB/SmolLM-135M-Instruct-GGUF",
        filename = "smollm-135m-instruct.q4_k_m.gguf",
    ),
)
```
The high-level text client defaults to batched blocking generation to reduce JNI overhead. Override it when needed:
```kotlin
val response = edge.text.generate(
    prompt = "Summarize the latest release notes.",
    maxTokens = 256,
    batchSize = 12,
    options = TextModelOptions(
        numThreads = 8,        // prompt/batch processing
        generationThreads = 3, // single-token generation
    ),
)
```
Streaming uses smaller batched native chunks by default, which keeps UI updates responsive without crossing the JNI boundary once per token:
```kotlin
edge.text.stream(
    prompt = "List the key takeaways.",
    batchSize = 6,
    options = TextModelOptions(
        numThreads = 6,
        generationThreads = 2,
    ),
).collect { event ->
    if (event is TextStreamEvent.Chunk) {
        appendToUi(event.value)
    }
}
```
Default batch sizes are currently 8 for blocking generation and 4 for streaming. Passing `batchSize = 0` uses the configured default for the relevant path.
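For example, an explicit `batchSize = 0` behaves the same as omitting the parameter entirely (a sketch; model resolution works as in the earlier examples):

```kotlin
// batchSize = 0 resolves to the configured default for this path
// (currently 8 for blocking generation).
val response = edge.text.generate(
    prompt = "Explain coroutines in one sentence.",
    batchSize = 0, // equivalent to leaving batchSize unset
)
```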
Image Generation
The `edge.image` client handles model resolution and memory-safe loading.
```kotlin
val edge = LLMEdge.create(context, viewModelScope)

val bitmap = edge.image.generate(
    ImageGenerationRequest(
        prompt = "A cyberpunk city street at night, neon lights <lora:detail_tweaker_lora_sd15:1.0>",
        width = 512,
        height = 512,
        steps = 20,
        loraModelDir = getExternalFilesDir("loras")?.absolutePath + "/detail-tweaker-lora-sd15",
        loraApplyMode = StableDiffusion.LoraApplyMode.AUTO,
    ),
)
```
Key Optimizations for Image Generation:
- EasyCache: Automatically enabled by `edge.image` for supported Diffusion Transformer (DiT) models such as Flux, SD3, Wan, Qwen Image, and Z-Image. It remains disabled for classic UNet-based pipelines such as SD 1.5/SDXL.
- LoRA Support: `ImageGenerationRequest` accepts `loraModelDir` and `loraApplyMode` for on-the-fly fine-tuning.
- Flash Attention: Automatically enabled for compatible image dimensions.
Video Generation (Wan 2.1)
The same client handles the complex multi-model loading (diffusion, VAE, T5 encoder) and sequential processing required for video generation on mobile.
```kotlin
val edge = LLMEdge.create(context, viewModelScope)

val request = VideoGenerationRequest(
    prompt = "A robot dancing in the rain",
    videoFrames = 16,
    width = 512,
    height = 512,
    steps = 20,
    cfgScale = 7.0f,
    flowShift = 3.0f,
    forceSequentialLoad = true,
)

viewModelScope.launch {
    edge.image.generateVideo(request).collect { event ->
        when (event) {
            is GenerationStreamEvent.Progress -> Log.d("Video", event.update.message)
            is GenerationStreamEvent.Completed -> previewImageView.setImageBitmap(event.frames.first())
        }
    }
}
```
Vision Analysis
Analyze images using a Vision Language Model (VLM).
```kotlin
val edge = LLMEdge.create(context, viewModelScope)
val description = edge.vision.analyze(bitmap, "What is in this image?")
```
Vision analysis also exposes separate prompt and generation thread counts for the underlying SmolLM runtime:
```kotlin
val description = edge.vision.analyze(
    image = bitmap,
    prompt = "What is in this image?",
    numThreads = 4,
    generationThreads = 2,
)
```
OCR (Text Extraction)
Extract text using ML Kit.
```kotlin
val edge = LLMEdge.create(context, viewModelScope)
val text = edge.vision.extractText(bitmap)
```
Speech-to-Text (Whisper)
Transcribe audio using the high-level API:
```kotlin
val edge = LLMEdge.create(context, viewModelScope)

val text = edge.speech.transcribeToText(audioSamples)

val segments = edge.speech.transcribe(
    audioSamples = audioSamples,
    params = Whisper.TranscribeParams(language = "en", translate = false),
)

val lang = edge.speech.detectLanguage(audioSamples)
```
Streaming Transcription (Real-time Captioning)
For live transcription from a microphone or audio stream, use the streaming API:
```kotlin
import kotlinx.coroutines.launch

val edge = LLMEdge.create(context, lifecycleScope)

val transcriber = edge.speech.createStreamingSession(
    params = Whisper.StreamingParams(
        stepMs = 3000,    // Run transcription every 3 seconds
        lengthMs = 10000, // Use 10-second audio windows
        keepMs = 200,     // Keep 200ms overlap for context
        language = "en",  // null for auto-detect
        useVad = true,    // Skip silent audio
    ),
)

// Collect real-time transcription results
launch {
    transcriber.events().collect { segment ->
        runOnUiThread {
            textView.append("${segment.text}\n")
        }
    }
}

// Feed audio samples from microphone (16kHz mono PCM float32)
audioRecorder.setOnAudioDataListener { samples ->
    lifecycleScope.launch {
        transcriber.feedAudio(samples)
    }
}

// Stop when done
fun stopTranscription() {
    transcriber.stop()
}
```
Streaming Parameters Explained:

| Parameter | Default | Description |
|---|---|---|
| `stepMs` | 3000 | How often transcription runs (lower = faster updates) |
| `lengthMs` | 10000 | Audio window size (longer = more accurate) |
| `keepMs` | 200 | Overlap with previous window for context |
| `vadThreshold` | 0.6 | Voice activity threshold (0.0-1.0) |
| `useVad` | true | Skip transcription during silence |
Preset Configurations:
- Fast captioning (`stepMs = 1000, lengthMs = 5000`): quick updates, lower accuracy
- Balanced, the default (`stepMs = 3000, lengthMs = 10000`): good tradeoff
- High accuracy (`stepMs = 5000, lengthMs = 15000`): better accuracy, more delay
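The presets map directly onto `Whisper.StreamingParams`. As a sketch (the `captioningPreset` helper is hypothetical, not part of the library):

```kotlin
// Hypothetical helper mapping the presets above onto StreamingParams.
fun captioningPreset(fast: Boolean): Whisper.StreamingParams =
    if (fast) {
        Whisper.StreamingParams(stepMs = 1000, lengthMs = 5000)  // quick updates
    } else {
        Whisper.StreamingParams(stepMs = 5000, lengthMs = 15000) // high accuracy
    }

val transcriber = edge.speech.createStreamingSession(params = captioningPreset(fast = true))
```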
Text-to-Speech (Bark)
Generate speech using the high-level API:
```kotlin
val edge = LLMEdge.create(context, viewModelScope)
val audio = edge.speech.synthesize("Hello, world!")
audioPlayer.play(audio.samples, audio.sampleRate)
```
Low-Level API
Direct usage of SmolLM and StableDiffusion classes. Use this if you need to manage the model lifecycle manually (e.g., keeping a model loaded across multiple disparate activities) or require configuration not exposed by LLMEdge.
Core components
- `SmolLM` — Kotlin front-end class that wraps native inference calls.
- `GGUFReader` — C++/JNI reader for GGUF model files.
- `Whisper` — Speech-to-text via whisper.cpp (JNI bindings).
- `BarkTTS` — Text-to-speech via bark.cpp (JNI bindings).
- Vision helpers — `ImageUnderstanding`, `OcrEngine` (with `MlKitOcrEngine` implementation).
- RAG helpers — `RAGEngine`, `VectorStore`, `PDFReader`, `EmbeddingProvider`.
Basic LLM Inference
Load a GGUF model and run inference:
```kotlin
val smol = SmolLM()
smol.load(modelPath, InferenceParams(numThreads = 4, contextSize = 4096L))
val reply = smol.getResponse("Your prompt here")
smol.close() // Free native memory when done
```
Managed chat history with edge.text.session(...)
Use a Kotlin-managed chat session when you want bounded multi-turn history without relying on the native KV cache to retain earlier turns:
```kotlin
runBlocking {
    val edge = LLMEdge.create(context, this)
    val session = edge.text.session(
        model = ModelSpec.localFile(modelPath),
        memory = ConversationWindow(maxTurns = 6, maxTokens = 4096, stripThinkTags = true),
        systemPrompt = "You are a concise assistant.",
    )
    session.prepare()

    val firstReply = session.reply("Explain KV cache in one paragraph.")

    session.stream("Now summarize that in 3 bullets.").collect { event ->
        if (event is TextStreamEvent.Chunk) {
            print(event.value)
        }
    }

    edge.close()
}
```
`edge.text.session(...)` keeps the transcript in Kotlin memory, replays only the active sliding window, and strips older `<think>...</think>` traces before replaying assistant messages.
Use it when:
- reasoning-enabled models emit large `<think>...</think>` blocks that would otherwise bloat native chat history
- you need a bounded sliding window (`ConversationWindow`) for long-running chats
- you want streaming via `stream()` while still persisting the completed assistant reply in Kotlin memory
Prefer plain `SmolLM` with `storeChats = true` only for tightly scoped native-KV-cache flows where you explicitly want the model runtime to own all chat history.
See Examples for a focused session snippet, or LocalAssetDemoActivity for a complete app-level example.
Downloading Models from Hugging Face
Download and load models directly from Hugging Face Hub:
```kotlin
val download = smol.loadFromHuggingFace(
    context = context,
    modelId = "unsloth/Qwen3-0.6B-GGUF",
    filename = "Qwen3-0.6B-Q4_K_M.gguf",
    params = InferenceParams(contextSize = 4096L),
    preferSystemDownloader = true,
    onProgress = { downloaded, total -> /* update UI */ },
)
```
For Wan video models (multi-asset: diffusion, VAE and encoder), use:
```kotlin
val sdWan = StableDiffusion.loadFromHuggingFace(
    context = context,
    modelId = "wan/Wan2.1-T2V-1.3B",
    preferSystemDownloader = true,
    onProgress = { name, downloaded, total -> /* update progress */ },
)
```
Key features:
- Downloads are cached automatically
- Supports private repositories via the `token` parameter
- Uses Android DownloadManager for large files to avoid heap pressure
- Auto-resolves model aliases and mirrors
- Context size auto-caps based on device heap (override via `InferenceParams`)
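For a gated or private repository, pass a Hugging Face access token. A sketch, assuming `token` is accepted alongside the arguments shown above (the repo id, filename, and token source below are placeholders):

```kotlin
// Sketch: pass a Hugging Face access token for a private/gated repo.
val download = smol.loadFromHuggingFace(
    context = context,
    modelId = "your-org/private-model-GGUF", // placeholder repo id
    filename = "model.q4_k_m.gguf",          // placeholder filename
    token = BuildConfig.HF_TOKEN,            // never hard-code real tokens
    preferSystemDownloader = true,
)
```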
See HuggingFaceDemoActivity example for a complete implementation with progress updates and error handling.
Reasoning Controls
Control "thinking" traces in reasoning-aware models:
```kotlin
// Disable thinking at load time
val params = InferenceParams(
    thinkingMode = ThinkingMode.DISABLED,
    reasoningBudget = 0,
)
smol.load(modelPath, params)

// Toggle at runtime
smol.setThinkingEnabled(false) // disable
smol.setReasoningBudget(-1)    // unrestricted
```
- `reasoningBudget = 0`: thinking disabled
- `reasoningBudget = -1`: unrestricted (default)
- The library auto-injects `/no_think` tags when thinking is disabled
Image Text Extraction (OCR)
Extract text from images using Google ML Kit:
```kotlin
val mlKitEngine = MlKitOcrEngine(context)
val result = mlKitEngine.extractText(ImageSource.FileSource(imageFile))
println("Extracted: ${result.text}")
```
Vision modes:
- `AUTO_PREFER_OCR`: Try OCR first, fall back to vision
- `AUTO_PREFER_VISION`: Try vision first, fall back to OCR
- `FORCE_MLKIT`: ML Kit only
- `FORCE_VISION`: Vision model only
Use ImageUnderstanding to orchestrate between OCR and vision models with automatic fallback.
See ImageToTextActivity example for complete implementation including camera capture.
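A minimal orchestration sketch: `process` follows the signature listed in the API reference on this page, but the constructor arguments shown here are assumptions, so treat this as illustrative rather than exact:

```kotlin
// Sketch (constructor shape is an assumption; process() matches the
// ImageUnderstanding.process signature listed in the API reference).
val understanding = ImageUnderstanding(ocrEngine = MlKitOcrEngine(context))

val result = understanding.process(
    image = ImageSource.FileSource(imageFile),
    mode = VisionMode.AUTO_PREFER_OCR, // try OCR first, fall back to vision
    prompt = "Describe this image.",   // used only if the vision path runs
)
```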
Vision Models (Low-Level)
The library has interfaces for vision-capable LLMs (LLaVA-style models):
```kotlin
interface VisionModelAnalyzer {
    suspend fun analyze(image: ImageSource, prompt: String): VisionResult
    fun hasVisionCapabilities(): Boolean
}
```
Status: Architecture is prepared, but native vision support from llama.cpp is still being integrated for Android. Currently use OCR for text extraction. See LlavaVisionActivity example for the prepared integration pattern.
Speech-to-Text (Whisper Low-Level)
Use Whisper directly for fine-grained control:
```kotlin
import io.aatricks.llmedge.Whisper

// Load model with options
val whisper = Whisper.load(
    modelPath = "/path/to/ggml-base.bin",
    useGpu = false,
)

// Configure transcription parameters
val params = Whisper.TranscribeParams(
    language = "en",        // null for auto-detect
    translate = false,      // translate to English
    tokenTimestamps = true,
    beamSize = 1,
)

// Transcribe (16kHz mono PCM float32)
val segments = whisper.transcribe(audioSamples, params)
segments.forEach { segment ->
    println("[${segment.startTimeMs}-${segment.endTimeMs}ms] ${segment.text}")
}

// Utility functions
val srt = whisper.generateSrt(segments)
val lang = whisper.detectLanguage(audioSamples)
val isMultilingual = whisper.isMultilingual()
val modelType = whisper.getModelType()

whisper.close()
```
Model sources:
- HuggingFace: `ggerganov/whisper.cpp` (ggml-tiny.bin, ggml-base.bin, ggml-small.bin)
- Sizes: tiny (~75MB), base (~142MB), small (~466MB)
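`Whisper.loadFromHuggingFace(...)` can fetch one of these files directly. The parameter names below are assumed by analogy with `SmolLM.loadFromHuggingFace`, so treat this as a sketch rather than the exact signature:

```kotlin
// Sketch: parameter names assumed, by analogy with SmolLM.loadFromHuggingFace.
val whisper = Whisper.loadFromHuggingFace(
    context = context,
    modelId = "ggerganov/whisper.cpp",
    filename = "ggml-base.bin",
    preferSystemDownloader = true,
)
```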
Text-to-Speech (Bark Low-Level)
Use BarkTTS directly:
```kotlin
import io.aatricks.llmedge.BarkTTS

// Load model
val tts = BarkTTS.load(
    modelPath = "/path/to/bark-small_weights-f16.bin",
    temperature = 0.7f,
    fineTemperature = 0.5f,
)

tts.setProgressCallback { step, progress ->
    Log.d("Bark", "${step.name}: $progress%")
}

val audio = tts.generate("Hello, world!", BarkTTS.GenerateParams(nThreads = 4))

// AudioResult contains:
// - samples: FloatArray (32-bit PCM)
// - sampleRate: Int (typically 24000)
// - durationSeconds: Float

// Save as WAV
tts.saveAsWav(audio, File("/path/to/output.wav"))

tts.close()
```
Model sources:
- HuggingFace: `Green-Sky/bark-ggml` (bark-small_weights-f16.bin, bark_weights-f16.bin)
- Sizes: small (~843MB), full (~2.2GB)
Stable Diffusion (Image & Video Generation)
Generate images and video on-device using Stable Diffusion and Wan models:
Image Generation:
```kotlin
val sd = StableDiffusion.load(
    context = context,
    modelId = "Meina/MeinaMix",
    offloadToCpu = true,
    keepClipOnCpu = true,
    // Optional: Load with LoRA
    loraModelDir = "/path/to/your/lora/files", // Directory containing .safetensors
    loraApplyMode = StableDiffusion.LoraApplyMode.AUTO,
)

val bitmap = sd.txt2img(
    GenerateParams(
        prompt = "a cute cat <lora:your_lora_name:1.0>", // LoRA tag in prompt
        width = 256, height = 256,
        steps = 20, cfgScale = 7.0f,
        // Optional: EasyCache parameters
        easyCacheParams = StableDiffusion.EasyCacheParams(enabled = true, reuseThreshold = 0.2f),
    )
)

sd.close()
```
Video Generation (Wan 2.1):
```kotlin
// Load Wan model (loads diffusion, VAE, and T5 encoder)
val sd = StableDiffusion.loadFromHuggingFace(
    context = context,
    modelId = "wan/Wan2.1-T2V-1.3B",
    preferSystemDownloader = true,
)

val frames = sd.txt2vid(
    VideoGenerateParams(
        prompt = "A cinematic shot of a robot walking",
        width = 480, height = 480,
        videoFrames = 16,
        steps = 20,
    )
)

sd.close()
```
Memory management:
- Use small resolutions (128x128 or 256x256) on constrained devices
- Enable CPU offloading flags to reduce native memory pressure
- Always use `preferSystemDownloader = true` for model downloads
- Monitor with `MemoryMetrics` to avoid OOM
See StableDiffusionActivity example for complete implementation with error recovery and adaptive resolution.
Best Practices
Threading:
- Route blocking JNI/native work through `Dispatchers.IO` (or the library inference dispatcher used by `LLMEdge`).
- Reserve `Dispatchers.Default` for pure Kotlin/Java CPU work such as post-processing that does not block on JNI calls.
- Update UI only via `withContext(Dispatchers.Main)`.
- Call `.close()` in `onDestroy()` to free native memory.
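The dispatcher rules above can be sketched as follows (a minimal ViewModel-style sketch; `smol` and `showReply` are placeholders for your model instance and UI hook):

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext

fun askModel(prompt: String) {
    viewModelScope.launch {
        // Blocking JNI inference stays off the main thread.
        val reply = withContext(Dispatchers.IO) {
            smol.getResponse(prompt)
        }
        // Pure Kotlin post-processing can run on Default.
        val trimmed = withContext(Dispatchers.Default) {
            reply.trim()
        }
        // UI updates happen on Main.
        withContext(Dispatchers.Main) {
            showReply(trimmed) // placeholder UI hook
        }
    }
}
```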
Optimization Strategies:
- Use quantized models (Q4_K_M) for lower memory footprint
- Enable CPU offloading for large models
- Close model instances when not in use
- Process images/video in batches with intermediate cleanup
- Prefer batched text generation (`batchSize > 1`) for blocking calls that do not need token-level UI updates
- Use different thread counts for prompt/batch work and single-token generation when tuning for big.LITTLE devices
- Text-model cache sizing is now refreshed from the native model/state footprint, so `textCacheMemoryMb` is a meaningful guardrail instead of just a file-size hint
- LLM chat memory: `storeChats` is deprecated but still available for tightly scoped low-level compatibility flows that intentionally keep chat state inside one native runtime
- Use `edge.text.session(...)` when you need bounded history replay or want to strip older reasoning traces before replay
See also:
- Architecture for system design and flow diagrams
- Quirks & Troubleshooting for detailed JNI notes and debugging
- Examples for complete working code
API reference
Key methods:
- `LLMEdge.create(...)` — creates the instance-based high-level facade
- `edge.text.generate(...)` — high-level text generation
- `edge.text.stream(...)` — high-level text streaming
- `edge.text.session(...)` — creates a Kotlin-managed multi-turn chat session
- `TextGenerationRequest.batchSize` — blocking generation batch size (`0` = use configured default)
- `edge.text.stream(..., batchSize = ...)` / `text.ChatSession.stream(..., batchSize = ...)` — streaming batch size override (`0` = use configured default)
- `TextModelOptions.numThreads` / `generationThreads` — prompt/batch vs single-token thread counts
- `edge.image.generate(...)` — high-level image generation
- `edge.image.generateVideo(...)` — high-level video generation
- `edge.speech.transcribe(...)` — high-level speech-to-text
- `edge.speech.synthesize(...)` — high-level text-to-speech
- `SmolLM.load(modelPath: String, params: InferenceParams)` — loads a GGUF model from a path
- `SmolLM.loadFromHuggingFace(...)` — downloads and loads a model from Hugging Face
- `SmolLM.getResponse(query: String): String` — runs blocking generation and returns complete text
- `SmolLM.getResponseAsFlow(query: String): Flow<String>` — runs streaming generation
- `SmolLM.getEstimatedNativeMemoryBytes()` / `getEstimatedStateMemoryBytes()` — expose native model/state memory estimates
- `SmolLM.addSystemPrompt(prompt: String)` — adds system prompt to chat history
- `SmolLM.addUserMessage(message: String)` — adds user message to chat history
- `text.ChatSession.reply(message: String): String` — runs bounded multi-turn chat with Kotlin-managed history
- `text.ChatSession.stream(message: String): Flow<TextStreamEvent>` — streams a bounded reply while persisting the final assistant turn
- `ConversationWindow(...)` — configures sliding-window size, token budget, and reasoning stripping
- `SmolLM.close()` — releases native resources
High-Level Speech API (via LLMEdge):
- edge.speech.transcribeToText(audioSamples, model?, params?, loadOptions?) — simple audio transcription
- edge.speech.transcribe(audioSamples, model?, params?, loadOptions?) — full transcription with segments
- edge.speech.detectLanguage(audioSamples, model?, loadOptions?) — detect spoken language
- edge.speech.createStreamingSession(model?, params?, loadOptions?) — create a reusable streaming transcriber
- edge.speech.synthesize(text, model?, params?, loadOptions?) — generate speech from text
- edge.speech.synthesizeStream(text, model?, params?, loadOptions?) — stream speech generation events
Low-Level Speech API:
- Whisper.load(modelPath: String, useGpu: Boolean, flashAttn: Boolean = true, gpuDevice: Int = 0) — loads a Whisper model
- Whisper.loadFromHuggingFace(...) — downloads and loads Whisper from HuggingFace
- Whisper.transcribe(samples: FloatArray, params: TranscribeParams) — transcribes audio
- Whisper.detectLanguage(samples: FloatArray) — detects spoken language
- Whisper.close() — releases native resources
- BarkTTS.load(modelPath: String, ...) — loads a Bark TTS model
- BarkTTS.loadFromHuggingFace(...) — downloads and loads Bark from HuggingFace
- BarkTTS.generate(text: String, params: GenerateParams) — generates audio from text
- BarkTTS.saveAsWav(audio: AudioResult, filePath: String) — saves audio to WAV file
- BarkTTS.close() — releases native resources
Vision & OCR:
- OcrEngine.extractText(image: ImageSource, params: OcrParams): OcrResult — extracts text from image
- ImageUnderstanding.process(image: ImageSource, mode: VisionMode, prompt: String?) — processes image with vision/OCR
Image & Video:
- StableDiffusion.txt2img(params: GenerateParams): Bitmap — generates an image
- StableDiffusion.txt2vid(params: VideoGenerateParams): List<Bitmap> — generates video frames
Refer to the llmedge-examples activities for complete, working code samples.