Q: Which models work with llmedge?

A: Any GGUF model supported by llama.cpp should work. Prefer quantized models (Q4_K_M, Q5_K_M, Q8_0) for on-device use. Models like SmolLM, TinyLlama, Qwen, Phi, and Gemma work well. Check model size against your device's available RAM.
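For a quick sanity check, you can compare the model file's size against the memory the device reports as available. A minimal sketch using Android's standard ActivityManager API; the 1.5x headroom factor is an assumption for the KV cache and the rest of the app, not a value defined by llmedge:

```kotlin
import android.app.ActivityManager
import android.content.Context

// Rough heuristic: the quantized model plus its KV cache should fit
// comfortably in available RAM. The 1.5x headroom factor is an assumption.
fun fitsInMemory(context: Context, modelFileSizeBytes: Long): Boolean {
    val activityManager = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memoryInfo = ActivityManager.MemoryInfo()
    activityManager.getMemoryInfo(memoryInfo)
    return memoryInfo.availMem > (modelFileSizeBytes * 1.5).toLong()
}
```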

Q: Can I run this on iOS or desktop?

A: The core native code uses llama.cpp (portable C++), but the Kotlin APIs target Android specifically. An iOS port would require Swift/Objective-C bindings. Desktop Java/Kotlin apps could potentially load the native code via JNI, but this hasn't been tested.

Q: How do I reduce memory usage?

A: Multiple strategies help (combined in the sketch after this list):

  • Use smaller, quantized models (Q4_K_M instead of Q8_0 or FP16)
  • Reduce contextSize in InferenceParams (e.g., 2048 instead of 8192)
  • Lower the numThreads parameter
  • Call SmolLM.close() when done to free native memory
  • For Stable Diffusion, enable offloadToCpu and reduce image dimensions
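
Combining a few of these, a sketch only: contextSize, numThreads, and close() are mentioned above, while the loadModel() call and its exact parameters are assumptions about the API, not the documented signature.

```kotlin
// Sketch: contextSize, numThreads, and close() come from this FAQ;
// loadModel() and its parameter names are assumptions.
val smol = SmolLM()
try {
    smol.loadModel(
        "/data/local/tmp/model-q4_k_m.gguf",
        InferenceParams(
            contextSize = 2048,  // smaller KV cache than e.g. 8192
            numThreads = 2       // fewer worker threads, lower peak memory
        )
    )
    // ... run inference ...
} finally {
    smol.close()  // free native llama.cpp memory when done
}
```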

Q: Why is inference slow on my device?

A: Several factors affect speed:

  • Mobile CPUs are resource-constrained compared to desktop GPUs
  • Use quantized models (Q4_K_M is faster than FP16)
  • Ensure you're on the arm64-v8a architecture (check with Build.SUPPORTED_ABIS[0], as sketched after this list)
  • The library automatically selects optimized native libs based on CPU features
  • Lowering maxTokens caps response length and therefore total generation time; a lower temperature may also help slightly
  • Vulkan acceleration may help on supported devices
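
For the architecture check mentioned above, a minimal sketch using Android's standard Build API:

```kotlin
import android.os.Build
import android.util.Log

// Log the device's primary ABI; arm64-v8a is the architecture the
// optimized native libraries target.
fun logPrimaryAbi() {
    val primaryAbi = Build.SUPPORTED_ABIS.firstOrNull() ?: "unknown"
    Log.i("llmedge", "Primary ABI: $primaryAbi")
}
```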

Q: How do I enable Vulkan acceleration?

A: Create SmolLM with SmolLM(useVulkan = true), which is the default. Check whether it is active with smol.isVulkanEnabled(). Your device needs Android 11+ and Vulkan 1.2 support. See the Building section in the main README for build configuration.
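
For example, a minimal sketch: the constructor flag and isVulkanEnabled() come from this answer, the log messages are illustrative.

```kotlin
import android.util.Log

val smol = SmolLM(useVulkan = true)  // true is the default

// Vulkan is only used when the device offers Android 11+ and Vulkan 1.2,
// so confirm at runtime rather than assuming GPU acceleration.
if (smol.isVulkanEnabled()) {
    Log.i("llmedge", "Vulkan acceleration active")
} else {
    Log.i("llmedge", "Vulkan not available; running on CPU")
}
```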

Q: How do vision models work?

A: The library has interfaces for vision-capable LLMs (VisionModelAnalyzer), but full vision support is pending llama.cpp's multimodal integration for Android. Currently, use OCR (MlKitOcrEngine) for text extraction from images. The architecture is prepared for LLaVA-style models when available.
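
Judging by its name, MlKitOcrEngine is backed by Google's ML Kit text recognition. For orientation, here is what plain ML Kit OCR looks like without the llmedge wrapper; this uses ML Kit's own API directly and makes no assumptions about MlKitOcrEngine's methods.

```kotlin
import android.graphics.Bitmap
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.text.TextRecognition
import com.google.mlkit.vision.text.latin.TextRecognizerOptions

// Plain ML Kit text recognition (Latin script), independent of llmedge.
fun recognizeText(bitmap: Bitmap, onResult: (String) -> Unit) {
    val recognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)
    val image = InputImage.fromBitmap(bitmap, 0)
    recognizer.process(image)
        .addOnSuccessListener { visionText -> onResult(visionText.text) }
        .addOnFailureListener { onResult("") }
}
```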

Q: Can I use custom embedding models for RAG?

A: Yes. Configure EmbeddingConfig when creating RAGEngine and place your ONNX model and tokenizer in assets. The library uses the sentence-embeddings library, which supports various models from Hugging Face.
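
A rough sketch of the wiring. Hypothetical only: the EmbeddingConfig and RAGEngine names and the assets-based ONNX model/tokenizer layout come from this answer, but every field and parameter name below is an illustrative assumption; check the actual class definitions.

```kotlin
// Hypothetical field and parameter names; consult EmbeddingConfig's
// and RAGEngine's real definitions before using.
val embeddingConfig = EmbeddingConfig(
    modelAssetPath = "embeddings/all-minilm-l6-v2.onnx",  // ONNX model bundled in assets/
    tokenizerAssetPath = "embeddings/tokenizer.json"      // matching tokenizer in assets/
)
val ragEngine = RAGEngine(context, embeddingConfig)       // assumed constructor shape
```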

Q: How do I contribute?

A: See the Contributing page for development and PR guidelines.

If your question isn't answered, please open an issue with logs and steps to reproduce.