Develop an on-device RAG system powered by Gemma models

Google's EmbeddingGemma 300M model generates text embeddings directly on-device (e.g., on a phone or laptop), enabling semantic search, retrieval, classification and clustering across 100+ languages. A previous blog post explains both the high-level Python workflow (via SentenceTransformers) and the lower-level Android implementation (using LiteRT, tokenization, TFLite, etc.), showing how developers can load the model asset, tokenize the input, run inference, and compute cosine similarity on the resulting vectors, all without any server dependency.

This post provides a step-by-step walkthrough for loading a PDF file, extracting and chunking its text, performing similarity matching, and using a Gemma 3 model to generate context-aware answers to user queries about the document.

Step 1 - Extract the text from the PDF file:
The iText Core library can be used directly on mobile to extract text from a PDF file placed in the assets folder, as shown in the snippet below:

context.assets.open(assetFileName).use { inputStream ->
    val pdfReader = PdfReader(inputStream)
    val pdfDocument = PdfDocument(pdfReader)
    val text = StringBuilder()
    val numberOfPages = pdfDocument.numberOfPages

    // Extract text from all pages (limit to the first n pages to avoid overwhelming)
    val pagesToProcess = minOf(numberOfPages, 100)
    for (page in 1..pagesToProcess) {
        val pageText = PdfTextExtractor.getTextFromPage(pdfDocument.getPage(page))
        if (pageText.isNotBlank()) {
            text.append("## Page $page\n")
            text.append(pageText.trim())
            text.append("\n\n")
        }
    }
    pdfDocument.close()

    val result = text.toString().trim()
    result.ifBlank { "[PDF Document: $assetFileName - No readable text content found]" }
}

Step 2 - Use the tokenizer to split the text into chunks measured in tokens (the small units of text that AI models read and write). Tokenization is performed with the Deep Java Library (DJL) API, an open-source, high-level, engine-agnostic Java framework for deep learning. Each text chunk must be no larger than the maximum input size supported by the EmbeddingGemma model:

private fun loadTokenizer() {
    try {
        tokenizer = HuggingFaceTokenizer.newInstance(
            Paths.get("/data/local/tmp/tokenizer_embedding_300m.json")
        )
        Log.d("GemmaTokenizer", "Tokenizer loaded successfully.")
    } catch (e: Exception) {
        Log.e("GemmaTokenizer", "Failed to load tokenizer", e)
    }
}

.....

val chunker = ChunkerHelper.RecursiveTextChunker(
    tokenizer = tokenizerAdapter,
    maxChunkTokens = 256, // Your target chunk size in TOKENS
    overlapTokens = 40,   // Your target overlap in TOKENS
    separators = listOf("\n\n", "\n", ". ", " ", "")
)

// Create the chunks
val chunks = chunker.createChunks(fileTextContent ?: "")

.....

fun createChunks(text: String): List<String> {
    return chunkTextRecursively(text.trim(), separators)
}

/**
 * Recursively splits text into chunks of a desired size.
 */
private fun chunkTextRecursively(text: String, separators: List<String>): List<String> {
    val finalChunks = mutableListOf<String>()

    // 1. Base case: if the text is small enough, return it as a single chunk.
    if (tokenizer.countTokens(text) <= maxChunkTokens) {
        return listOf(text)
    }

    // 2. Recursive step: try to split by the next available separator.
    val currentSeparator = separators.firstOrNull()
    if (currentSeparator == null) {
        // If no more separators, do a hard split. This is the final fallback.
        val hardChunks = mutableListOf<String>()
        for (i in 0 until text.length step maxChunkTokens) {
            hardChunks.add(text.substring(i, minOf(i + maxChunkTokens, text.length)))
        }
        return hardChunks
    }

    ......
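The remainder of chunkTextRecursively is elided in the post. A minimal sketch of how the recursive step could continue, assuming the same tokenizer.countTokens helper and ignoring the overlapTokens handling for brevity (everything beyond the names shown above is an assumption):

    // Hypothetical continuation of chunkTextRecursively (not from the original post):
    // split on the current separator, pack pieces greedily up to maxChunkTokens,
    // and recurse with the remaining separators for any piece that is still too large.
    val pieces = text.split(currentSeparator).filter { it.isNotBlank() }
    val remainingSeparators = separators.drop(1)
    val buffer = StringBuilder()
    for (piece in pieces) {
        val candidate =
            if (buffer.isEmpty()) piece else buffer.toString() + currentSeparator + piece
        if (tokenizer.countTokens(candidate) <= maxChunkTokens) {
            // The piece still fits into the current chunk.
            buffer.setLength(0)
            buffer.append(candidate)
        } else {
            // Flush the current chunk before handling the piece that did not fit.
            if (buffer.isNotEmpty()) finalChunks.add(buffer.toString())
            buffer.setLength(0)
            if (tokenizer.countTokens(piece) > maxChunkTokens) {
                // The single piece is itself too large; recurse with finer separators.
                finalChunks.addAll(chunkTextRecursively(piece, remainingSeparators))
            } else {
                buffer.append(piece)
            }
        }
    }
    if (buffer.isNotEmpty()) finalChunks.add(buffer.toString())
    return finalChunks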

Step 3 - Run the embedding model on each text chunk created in the previous step and store the resulting vectors in a file or a vector database:

val embeddingsMap = HashMap<String, FloatArray>()

chunks.forEach { sentence ->
    val embedding = runEmbedding(sentence)
    if (embedding.isNotEmpty()) {
        embeddingsMap[sentence] = embedding
        Log.d(
            "EmbeddingLog",
            "Computed embedding for '$sentence': [${embedding.take(10).joinToString(", ")}...]"
        )
    }
}

.....

ObjectOutputStream(FileOutputStream(embeddingsFile)).use { stream ->
    stream.writeObject(embeddingsMap)
}

....

This step only needs to be performed once per PDF file, as it can be time-consuming. When the user opens the application again, the precomputed embeddings are already available for use.
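On subsequent launches the serialized map can simply be read back from disk. A minimal sketch, assuming the same embeddingsFile and Java serialization used in Step 3 (the function name is an assumption):

import java.io.File
import java.io.FileInputStream
import java.io.ObjectInputStream

// Hypothetical loader for the embeddings map serialized in Step 3.
@Suppress("UNCHECKED_CAST")
fun loadEmbeddings(embeddingsFile: File): HashMap<String, FloatArray> {
    if (!embeddingsFile.exists()) return hashMapOf()
    return ObjectInputStream(FileInputStream(embeddingsFile)).use { stream ->
        stream.readObject() as HashMap<String, FloatArray>
    }
}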

Step 4 - Run the embedding model on each user query. EmbeddingGemma is used as a .tflite model with the LiteRT framework:

private fun runEmbedding(query: String): FloatArray {
    if (tokenizer == null || interpreter == null) {
        Log.e("EmbeddingError", "Tokenizer or Interpreter not initialized.")
        return floatArrayOf()
    }

    val prompt = "task: search result | query: "
    val fullInput = prompt + query

    val encoding = tokenizer!!.encode(fullInput)
    val currentIds = encoding.ids

    val sequenceLength = 256
    val truncatedIds = if (currentIds.size > sequenceLength) {
        currentIds.take(sequenceLength)
    } else {
        currentIds.toList()
    }

    val paddedIds = IntArray(sequenceLength) { 0 }
    for (i in truncatedIds.indices) {
        paddedIds[i] = truncatedIds[i].toInt()
    }

    val inputArray = arrayOf(paddedIds)
    val outputBuffer = TensorBuffer.createFixedSize(intArrayOf(1, 768), DataType.FLOAT32)

    try {
        interpreter?.run(inputArray, outputBuffer.buffer)
        return outputBuffer.floatArray
    } catch (e: Exception) {
        Log.e("EmbeddingError", "Failed to run TFLite interpreter", e)
        return floatArrayOf()
    }
}
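The interpreter used above is not initialized in the snippets shown. A minimal sketch of how it might be set up, assuming the EmbeddingGemma .tflite file has been pushed to /data/local/tmp and using the standard TensorFlow Lite Interpreter API that LiteRT remains compatible with (the file name and thread count are assumptions):

import android.util.Log
import org.tensorflow.lite.Interpreter
import java.io.File

private var interpreter: Interpreter? = null

private fun loadEmbeddingModel() {
    try {
        // Hypothetical path; point this at wherever the .tflite model actually lives.
        val modelFile = File("/data/local/tmp/embeddinggemma_300m.tflite")
        val options = Interpreter.Options().apply { setNumThreads(4) }
        interpreter = Interpreter(modelFile, options)
        Log.d("EmbeddingModel", "Interpreter loaded successfully.")
    } catch (e: Exception) {
        Log.e("EmbeddingModel", "Failed to load embedding model", e)
    }
}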

Step 5 - Compute the similarity between the user's query embedding and each of the embeddings stored in Step 3:

/**
 * Computes cosine similarity between two vectors.
 */
fun cosineSimilarity(vectorA: FloatArray, vectorB: FloatArray): Float {
    if (vectorA.size != vectorB.size) {
        throw IllegalArgumentException("Vectors must be of the same size")
    }

    var dotProduct = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in vectorA.indices) {
        dotProduct += vectorA[i] * vectorB[i]
        normA += vectorA[i] * vectorA[i]
        normB += vectorB[i] * vectorB[i]
    }

    val magnitudeA = sqrt(normA)
    val magnitudeB = sqrt(normB)
    if (magnitudeA == 0.0 || magnitudeB == 0.0) {
        return 0.0f
    }
    return (dotProduct / (magnitudeA * magnitudeB)).toFloat()
}

Step 6 - Find the chunks with the highest similarity scores and fetch the corresponding text:

// Sort the entire list by similarity in descending order and take the top 3.
// This is efficient for moderately sized lists and easy to change (e.g., .take(3)).
val topThreeMatches = allMatches.sortedByDescending { it.similarity }.take(3)

// Prepare the final variables that you will use.
val bestMatches: String

// Handle the results based on how many matches were found.
if (topThreeMatches.isNotEmpty()) {
    // We found at least one match.
    // Combine the text of the top matches into a single context string.
    // We add a separator to help the LLM distinguish between the different sources.
    bestMatches = topThreeMatches.joinToString(separator = "\n\n---\n\n") { it.text }

.....
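The allMatches list above is assumed to pair each chunk's text with its similarity to the query. A minimal sketch of how it could be built from the stored embeddings map and the helpers from Steps 4 and 5 (the Match class and variable names are hypothetical):

// Hypothetical container and construction of allMatches (not from the original post).
data class Match(val text: String, val similarity: Float)

val queryEmbedding = runEmbedding(userQuery)
val allMatches = embeddingsMap.map { (chunkText, chunkEmbedding) ->
    Match(text = chunkText, similarity = cosineSimilarity(queryEmbedding, chunkEmbedding))
}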

Step 7 - Prepare the input text that will be passed to the LLM so it can generate an answer grounded in the retrieved context. The Gemma 3 1B model is used for better results:

private fun loadLLM() {
    val taskOptions = LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/Gemma3-1B-IT_seq128_q8_ekv1280.task")
        .setPreferredBackend(Backend.CPU)
        .setMaxTokens(MAX_TOKENS) // 1280
        .build()
    llmInference = LlmInference.createFromOptions(this, taskOptions)
}

.....

val inputPrompt = "You are a helpful assistant that responds to user query: ${query}, " +
    "based ONLY on the context: ${bestMatches}. " +
    "Use only text from the context. DO NOT offer any other help."
Log.v("EmbeddingMatches", inputPrompt)

var stringBuilder = ""
llmInference?.generateResponseAsync(inputPrompt.take(MAX_TOKENS)) { partialResult, done ->
    if (partialResult != ".\\" && partialResult != "\\" && partialResult != "n" && partialResult != "\n") {
        stringBuilder += partialResult
    }
    Log.v("finished_res", stringBuilder)
    onResult(stringBuilder)
}

Below is a video of the full application running offline:

Video demonstrating the implementation.
