Q: Which models work with llmedge?
A: Any GGUF model supported by llama.cpp should work. Prefer quantized models (Q4_K_M, Q5_K_M, Q8_0) for on-device use. Models like SmolLM, TinyLlama, Qwen, Phi, and Gemma work well. Check model size against your device's available RAM.
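As a rough sanity check before loading, you can compare the model file's size against the device's available memory. A minimal sketch using standard Android APIs; the 25% headroom factor is an illustrative assumption, not a library rule:

```kotlin
import android.app.ActivityManager
import android.content.Context
import java.io.File

// Rough heuristic: the mapped model plus KV cache should fit in available RAM.
fun canFitModel(context: Context, modelPath: String): Boolean {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val info = ActivityManager.MemoryInfo().also { am.getMemoryInfo(it) }
    val modelBytes = File(modelPath).length()
    // Leave ~25% headroom for the KV cache and the rest of the app (illustrative).
    return modelBytes < info.availMem * 0.75
}
```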
Q: Can I run this on iOS or desktop?
A: The core native code uses llama.cpp (portable C++) but the Kotlin APIs target Android specifically. iOS ports would require Swift/Objective-C bindings. Desktop Java/Kotlin apps could potentially use JNI, but this hasn't been tested.
Q: How do I reduce memory usage?
A: Multiple strategies (see the sketch after this list):
- Use smaller, quantized models (Q4_K_M instead of Q8_0 or FP16)
- Reduce `contextSize` in `InferenceParams` (e.g., 2048 instead of 8192)
- Lower the `numThreads` parameter
- Call `SmolLM.close()` when done to free native memory
- For Stable Diffusion, enable `offloadToCpu` and reduce image dimensions
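Putting the LLM-side strategies together, a minimal sketch. The `loadModel` name and the exact `InferenceParams` constructor arguments are assumptions; check the library's API reference for the real signatures:

```kotlin
// Hypothetical usage sketch: loadModel() and these InferenceParams fields
// are assumptions, not confirmed signatures.
fun runWithLowMemory(modelPath: String) {
    val smol = SmolLM()
    try {
        smol.loadModel(
            modelPath,
            InferenceParams(
                contextSize = 2048, // smaller context -> smaller KV cache
                numThreads = 2,     // fewer threads -> less working memory
            ),
        )
        // ... generate ...
    } finally {
        smol.close() // release native memory as soon as you're done
    }
}
```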
Q: Why is inference slow on my device?
A: Several factors affect speed:
- Mobile CPUs are resource-constrained compared to desktop GPUs
- Use quantized models (Q4_K_M is faster than FP16)
- Ensure you're on the arm64-v8a architecture (check with `Build.SUPPORTED_ABIS[0]`; see the snippet after this list); the library automatically selects optimized native libs based on CPU features
- Lowering `temperature` and `maxTokens` can speed up generation
- Vulkan acceleration may help on supported devices
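For example, you can log the device's primary ABI at startup to confirm you're getting the optimized arm64 path (standard Android APIs only):

```kotlin
import android.os.Build
import android.util.Log

// Devices report their preferred ABI first; arm64-v8a devices get the
// optimized native libraries the library selects automatically.
fun logPrimaryAbi() {
    val abi = Build.SUPPORTED_ABIS.firstOrNull()
    Log.i("llmedge", "Primary ABI: $abi")
    if (abi != "arm64-v8a") {
        Log.w("llmedge", "Non-arm64 device: expect slower inference")
    }
}
```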
Q: How do I enable Vulkan acceleration?
A: Create the instance with `SmolLM(useVulkan = true)` (the default). Check whether it's enabled with `smol.isVulkanEnabled()`. Your device needs Android 11+ and Vulkan 1.2 support. See the Building section in the main README for build configuration.
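For example (the constructor and the check come from the answer above; the fallback handling is just an illustrative pattern):

```kotlin
fun createEngine(): SmolLM {
    val smol = SmolLM(useVulkan = true) // true is already the default
    if (!smol.isVulkanEnabled()) {
        // Device lacks Android 11+ or Vulkan 1.2 support;
        // generation will run on the CPU instead.
    }
    return smol
}
```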
Q: How do vision models work?
A: The library has interfaces for vision-capable LLMs (VisionModelAnalyzer), but full vision support is pending llama.cpp's multimodal integration for Android. Currently, use OCR (MlKitOcrEngine) for text extraction from images. The architecture is prepared for LLaVA-style models when available.
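Until multimodal support lands, here is a sketch of image-to-text using ML Kit's text recognition API directly, which is presumably what `MlKitOcrEngine` wraps (the wrapper's own method names aren't documented here, so this uses the underlying ML Kit calls):

```kotlin
import android.graphics.Bitmap
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.text.TextRecognition
import com.google.mlkit.vision.text.latin.TextRecognizerOptions

// Extract text from a bitmap; the result can then be fed to the LLM as context.
fun extractText(bitmap: Bitmap, onResult: (String) -> Unit) {
    val recognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)
    recognizer.process(InputImage.fromBitmap(bitmap, 0))
        .addOnSuccessListener { result -> onResult(result.text) }
        .addOnFailureListener { onResult("") } // handle/report the error in real code
}
```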
Q: Can I use custom embedding models for RAG?
A: Yes. Configure `EmbeddingConfig` when creating `RAGEngine`. Place your ONNX model and tokenizer in assets. The library uses the sentence-embeddings library, which supports various models from Hugging Face.
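A hedged sketch: `EmbeddingConfig` and `RAGEngine` are named above, but the field and parameter names here are illustrative assumptions, as are the asset paths:

```kotlin
import android.content.Context

// Illustrative only: these field names and the RAGEngine signature
// are assumptions; consult the API reference for the real ones.
fun buildRag(context: Context): RAGEngine {
    val config = EmbeddingConfig(
        modelAssetPath = "embeddings/model.onnx",         // hypothetical asset path
        tokenizerAssetPath = "embeddings/tokenizer.json", // hypothetical asset path
    )
    return RAGEngine(context, config)
}
```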
Q: How do I contribute?
A: See the Contributing page for development and PR guidelines.
If your question isn't answered, please open an issue with logs and steps to reproduce.