Android Journey Tests with Gemini: CI Setup & 11-Week Review

Android User Interface (UI) testing has a reputation problem. Engineers care about quality, but the traditional approach to automated UI testing creates problems of its own. Tests that refer to specific view IDs break when a designer moves a button or a developer refactors a component. Writing one test takes hours. Maintaining a suite of tests becomes someone’s full-time job.

When Android Journeys powered by Gemini was announced, most Android teams hadn’t heard of it. It shipped in a pre-release canary feature drop, not the stable Android Studio channel, and the documentation was sparse. Android Journeys is Google’s AI UI testing tool, built into Android Studio, that lets you write tests in plain English instead of code. We wanted to know whether it actually worked for us. Not in a controlled demo environment. In a real production codebase, Ignite, REA Group’s customer-facing Android app, with real Continuous Integration (CI) infrastructure and a real team that has other things to do.

Ignite is the self-service platform real estate agents, property managers, and principals use to manage listings, enquiries, inspections, and performance insights on the go. The experiment let us test whether AI-driven UI testing could speed up how we ship features and catch regressions earlier, enabling faster, more reliable delivery and fewer bugs reaching production.

We ran an 11-week experiment to find out. Google’s own error messages misled us, Firebase Test Lab couldn’t support it, CI setup consumed eight weeks, and manual QA hours didn’t drop. The team rated the experience 8.75 out of 10 and every person voted to keep going. This is why.

Before you read on

✍️ Writing journey tests in plain English is genuinely faster – what took half a day in Espresso takes under ten minutes
🚫 Firebase Test Lab, AWS Device Farm, and Docker-based emulators don’t support it – you’ll need direct ADB access to a cloud device
📊 10% flakiness (trending down), one real bug caught, manual QA hours unchanged – yet the whole team voted to continue

What Android Journeys Powered by Gemini actually does

Android Journeys is built into Android Studio Labs. Instead of writing

onView(withId(R.id.btn_login)).perform(click())

you write “tap the Sign In button and wait for the home screen.” Gemini reads a screenshot, figures out what to do, sends an ADB command, and repeats until the test passes or fails.
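The loop described above can be sketched roughly as follows. This is an illustration only, not the Journeys engine: `Step`, `run_journey`, and the `capture`, `decide`, and `execute` callables are hypothetical names, and the step limit is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """One decision from the model (hypothetical shape)."""
    adb_command: Optional[str] = None  # next command to run, if any
    verdict: Optional[str] = None      # "pass" or "fail" once decided


def run_journey(
    instruction: str,
    capture: Callable[[], bytes],
    decide: Callable[[str, bytes], Step],
    execute: Callable[[str], None],
    max_steps: int = 20,  # assumed safety limit, not a documented value
) -> str:
    """Observe the screen, ask the model, act, and repeat until a verdict."""
    for _ in range(max_steps):
        screenshot = capture()
        step = decide(instruction, screenshot)
        if step.verdict is not None:
            return step.verdict
        if step.adb_command is not None:
            execute(step.adb_command)
    # No verdict within the step budget counts as a failure in this sketch.
    return "fail"
```

Because the model is passed in as a plain callable, the loop can be exercised with a fake `decide` that returns canned steps, which is how the sketch is meant to be read.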

Two things make it different from any selector-based framework:

Resilience
The AI reads the screen visually, like a human would, so it doesn’t care if you renamed a component or redesigned a layout. We changed bottom tab labels, redesigned screens, and updated UI components throughout the experiment. The journey tests mostly kept passing.

Self-healing
If Gemini tries something and it doesn’t immediately work, it retries, scrolling, waiting, or approaching from a different angle. Our QA engineer put it well in the post-experiment survey: “I loved that it’s smart enough to go back and retry when steps fail the first time.” That’s not something any selector-based framework does.

On security: only screenshots and the UI hierarchy are sent to Google Gemini Enterprise; no source code leaves your repo. Our security team reviewed the data flow before we started. It cleared. A full product overview is in the Android Journeys documentation.

The first week taught us things nobody put in the docs

We onboarded the team within a few hours; the first non-obvious problem appeared almost immediately.

A journey test was supposed to verify that a tab was selected. The AI kept passing it even when the assertion was wrong. It was interpreting “verify the tab is selected” as “make the tab selected,” which is technically helpful but completely wrong from a testing perspective. Several attempts at rewording didn’t fix it. The phrase that eventually worked:

Verify that the Enquiries tab is selected without trying to select it.

Five words. Completely changed the behaviour.

The second lesson was more fundamental. With unit tests, a green result is trustworthy. With Android Journeys, ambiguous instructions return green anyway. Before you trust any passing result, you need to intentionally break the test first. Change something that should fail. Confirm it fails. Then trust your greens.

We called this red-green testing for journeys, and it became a team standard. It’s not obvious until you’ve been burned.
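One way to make the red-green rule explicit is to record, for each journey, whether it has ever been seen failing against a deliberately broken build. A minimal sketch, assuming hypothetical names (`Trust`, `classify`) and that the two results are recorded by hand or by CI:

```python
from enum import Enum
from typing import Optional


class Trust(Enum):
    TRUSTED = "trusted"    # green normally, red when deliberately broken
    UNPROVEN = "unproven"  # green normally, never tried against a break
    VACUOUS = "vacuous"    # green even when the app is deliberately broken
    FAILING = "failing"    # red on the unmodified app


def classify(passed_normally: bool, passed_when_broken: Optional[bool]) -> Trust:
    """Decide how much a green result is worth.

    passed_when_broken is None when nobody has run the intentional break yet.
    """
    if not passed_normally:
        return Trust.FAILING
    if passed_when_broken is None:
        return Trust.UNPROVEN
    return Trust.VACUOUS if passed_when_broken else Trust.TRUSTED


print(classify(True, False).value)  # green, and it went red when broken
print(classify(True, True).value)   # green either way: rewrite the journey
```

A journey that lands in the vacuous bucket is the “verify the tab is selected” case from earlier: the instruction is ambiguous enough that the AI makes it pass.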

The underlying worry with any language model is that it invents results. That’s real. But it turned out to be manageable: the answer is careful test writing, not avoiding the tool altogether.

Writing the tests: what we learned by doing it

Over the following weeks the team built coverage across the major flows of the app.

Speed was genuinely surprising
A test that would have taken half a day in a traditional framework took under ten minutes to write. At one point a team member wrote a journey for adding and removing members from a team management screen, noted in the PR that it was generated by AI in one go, and it passed review with only minor tweaks. That’s not a typical testing experience.

But fast to write doesn’t mean easy to write well. PR reviews kept surfacing the same class of problems: tests that assumed consistent account data when the app’s content depends on what’s in the test account, assertions that would only work for one variant of a screen that has multiple states, and instructions specific enough to break if a label changed. Our QA engineer’s review comments got very good at catching these. It became a skill: reading a journey description and spotting where the AI would go wrong on a bad day.

Mock data made the difference
Tests that depended on real API responses were fragile and annoying to debug. We added a developer menu trigger accessible from within the app, which let journey tests navigate into mock mode and set consistent data states. Seven different test states for one simple screen alone. That kind of coverage would have been impractical to write and maintain in a code-based framework.

An unexpected use case also emerged partway through
One engineer wanted to validate that analytics events were firing correctly, something the team had wanted for a long time with no practical way to build. The approach: write an Android Journey that navigates through a flow, opens the developer analytics event log via the developer menu, and verifies that expected events appear. It worked. A genuine regression would show up as events missing from the log.

If you take three things from this section, let them be:

Fast to write is not the same as well written. It takes weeks, not hours, to become good at writing journey descriptions. Expect your PR reviews to surface the same observations repeatedly early on. That’s the learning curve, not a flaw in the tool.
Manage your test data or it will manage you. Journey tests relying on actual API responses will waste your time. Invest early in mock data states; the investment pays for itself almost immediately.
Tests will find uses you didn’t plan for. Once the team had a low-friction way to script app interactions in plain English, they started solving problems well outside the original goal. Budget for it.

CI was where things got genuinely hard

Journeys running locally is useful. Android Journeys running on CI automatically is the goal. Getting there took eight weeks. Here are the problems we didn’t expect.

⚠️ The authentication error that lied to us

Running the Gradle test command from a terminal outside Android Studio produced:

JourneyExecutionException: Failed to obtain credentials for establishing connection with backend. Make sure you are logged in to Android Studio and are connected to a network before re-trying. [Reason=AUTHENTICATION_FAILED]

The error was wrong
The credentials were valid and the GCP token was logging fine on the terminal, so the problem wasn’t authentication. It was one missing API: iamcredentials.googleapis.com wasn’t enabled on the GCP project. Enable it, and everything works.

After we filed a detailed bug report with Google IssueTracker, they confirmed the error message was a bug in their code, fixed it in the next Journeys engine release, and updated their documentation to mention the API requirement. Our report prompted that change.
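A pre-flight check can catch the missing API before the misleading error appears. The sketch below assumes the list of enabled services has been captured as one name per line, for example from `gcloud services list --enabled --format="value(config.name)"`. The helper names and the project ID are hypothetical; the three API names are the ones this article mentions.

```python
# APIs named in this article as required for headless journey runs.
REQUIRED_APIS = (
    "iamcredentials.googleapis.com",
    "cloudaicompanion.googleapis.com",
    "aiplatform.googleapis.com",
)


def missing_apis(enabled_output: str) -> list[str]:
    """Return required APIs that are absent from the enabled-services text."""
    enabled = {line.strip() for line in enabled_output.splitlines() if line.strip()}
    return [api for api in REQUIRED_APIS if api not in enabled]


def enable_command(project_id: str, apis: list[str]) -> list[str]:
    """Build the gcloud invocation that enables the given APIs."""
    return ["gcloud", "services", "enable", *apis, "--project", project_id]


sample = "aiplatform.googleapis.com\ncloudaicompanion.googleapis.com\n"
todo = missing_apis(sample)
if todo:
    # "my-project" is a placeholder project ID.
    print(" ".join(enable_command("my-project", todo)))
```

Failing the CI job early with the list of missing APIs gives a clearer message than the AUTHENTICATION_FAILED error quoted above.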

🚫 No Firebase Test Lab

Option	Result
Firebase Test Lab	❌ Doesn’t support Android Journeys: the engine needs direct ADB access, which the managed runner doesn’t expose
AWS Device Farm	❌ Same constraint
Gradle Managed Devices / Docker	❌ Needs KVM hardware virtualisation, which isn’t available inside containers
Genymotion Cloud on AWS	✅ SSH-accessible virtual device with direct ADB access; this worked

Not where we expected to land, but it worked.

🔧 Building the pipeline

The final setup, simplified:

Buildkite → Docker container → SSH to Genymotion Cloud device
→ Enable ADB on device
→ Disable lock screen and animations
→ SSH tunnel: local port 7777 → device port 5555
→ ADB connect through tunnel
→ Install APK → Run journeys → Capture logcat
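The device-facing steps of that pipeline can be sketched as a list of commands. This is an assumption-laden outline, not the actual Buildkite script: the function names, SSH target, and APK path are hypothetical, the lock screen and animation settings are issued through adb after connecting (a simplification of the order shown above), and the command that runs the journeys is omitted because the article does not state it.

```python
def tunnel_command(ssh_target: str, local_port: int = 7777, device_port: int = 5555) -> list[str]:
    """SSH port forward: local_port on the runner -> device_port on the device."""
    return ["ssh", "-N", "-L", f"{local_port}:localhost:{device_port}", ssh_target]


def device_prep_commands(apk_path: str, local_port: int = 7777) -> list[list[str]]:
    """adb commands to run once the tunnel is up."""
    serial = f"localhost:{local_port}"
    adb = ["adb", "-s", serial]
    commands = [["adb", "connect", serial]]
    commands.append(adb + ["shell", "locksettings", "set-disabled", "true"])
    # Setting the three global animation scales to 0 disables animations.
    for key in ("window_animation_scale", "transition_animation_scale", "animator_duration_scale"):
        commands.append(adb + ["shell", "settings", "put", "global", key, "0"])
    commands.append(adb + ["install", "-r", apk_path])
    return commands


def logcat_dump_command(local_port: int = 7777) -> list[str]:
    """Dump the current log buffer and exit, for attaching to the CI run."""
    return ["adb", "-s", f"localhost:{local_port}", "logcat", "-d"]


print(" ".join(tunnel_command("shell@device.example")))  # placeholder target
for command in device_prep_commands("app-debug.apk"):
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(..., check=True)` so that a failing step stops the pipeline instead of letting journeys run against a half-prepared device.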

Why port 7777?

Binding the local end of the tunnel to port 5555 triggers ADB’s emulator auto-detection heuristic: a phantom emulator-5554 device shows up alongside your real device, and Android Journeys requires exactly one connected device. A non-standard local port such as 7777 avoids the heuristic. It took a while to figure out what was happening.
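Since Journeys requires exactly one device, it can help to check the output of `adb devices` before starting a run. A minimal sketch that parses that output; the function names are hypothetical and the sample text is made up to show the phantom-emulator case.

```python
def connected_devices(adb_devices_output: str) -> list[tuple[str, str]]:
    """Parse `adb devices` text into (serial, state) pairs."""
    devices = []
    for line in adb_devices_output.splitlines():
        line = line.strip()
        # Skip blanks, the header line, and adb daemon notices ("* daemon ...").
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices.append((parts[0], parts[1]))
    return devices


def require_single_device(adb_devices_output: str) -> str:
    """Return the only device's serial, or raise if there are zero or several."""
    devices = connected_devices(adb_devices_output)
    if len(devices) != 1:
        raise RuntimeError(f"expected exactly one device, found {devices}")
    return devices[0][0]


phantom = "List of devices attached\nlocalhost:5555\tdevice\nemulator-5554\toffline\n"
print(connected_devices(phantom))
```

Run against the sample above, `require_single_device` raises, which is the situation the non-standard port avoids.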

🔒 The lock screen

After all of that infra work, journeys still weren’t running. The Genymotion cloud device had its lock screen enabled by default. Gemini would start a test and find a locked screen instead of the app.

Fix: one SSH command.

locksettings set-disabled true

When the first CI run came through successfully, a real Android Journey, running on a real cloud device, triggered from Buildkite, the message to the team had genuine excitement in it. It had taken weeks to get there.

Login

Journeys running without authentication is a proof of concept. Journeys running with a real user logged in is the actual goal. On a headless CI runner, this needed a custom Python script and a shell script to drive the login flow via ADB commands. By the end of the experiment that was about 90% working.

Service accounts

Google Cloud’s Workload Identity Federation credentials didn’t work with the Android Journeys engine. The correct setup is:

Service account with roles/aiplatform.user
cloudaicompanion.googleapis.com and aiplatform.googleapis.com APIs enabled
No Gemini Code Assist Enterprise subscription required; that licensing model is for human users

We have raised the documentation gap with Google.
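Granting the role listed above can be scripted with `gcloud projects add-iam-policy-binding`. A small sketch that only builds the command; the helper name, project ID, and service account email are placeholders.

```python
def grant_role_command(
    project_id: str,
    service_account_email: str,
    role: str = "roles/aiplatform.user",
) -> list[str]:
    """Build the gcloud command that binds a role to a service account."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member=serviceAccount:{service_account_email}",
        f"--role={role}",
    ]


# Placeholder project and account names.
print(" ".join(grant_role_command("my-project", "journeys-ci@my-project.iam.gserviceaccount.com")))
```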

Making the output readable

The raw CI output isn’t useful to a human. Android Journeys produces JUnit XML for pass/fail status, protobuf logs containing Gemini’s reasoning per step, and a folder of screenshots, all in separate places, none of it connected.

We made a Python script to fix that. It generates a three-layer interactive HTML report:

Top layer: each test and its pass/fail status
Expand a test: each task Gemini attempted to complete
Expand a task: each individual attempt with the screenshot, the action Gemini took, its reasoning, and the raw ADB command it executed

The whole report, HTML, screenshots, and logs, bundles into a ZIP for CI artifact storage. Before we had it, we had to trawl through logs to understand what went wrong. Now we can see exactly what Gemini was doing when it made a mistake.
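The top layer of such a report can be sketched from the JUnit XML alone. The example below is not the script described above: it covers only test-level pass/fail, uses collapsible `<details>` elements as one possible way to get expandable rows, and leaves out the protobuf logs and screenshots because their format is not described here. All function names are hypothetical.

```python
import html
import io
import xml.etree.ElementTree as ET
import zipfile


def parse_junit(xml_text: str) -> list[dict]:
    """Extract name, status, and failure message for every <testcase>."""
    results = []
    for case in ET.fromstring(xml_text).iter("testcase"):
        problem = case.find("failure")
        if problem is None:
            problem = case.find("error")
        if problem is not None:
            status = "failed"
            message = problem.get("message") or (problem.text or "").strip()
        elif case.find("skipped") is not None:
            status, message = "skipped", ""
        else:
            status, message = "passed", ""
        results.append({"name": case.get("name", ""), "status": status, "message": message})
    return results


def render_report(results: list[dict]) -> str:
    """Render one collapsible row per test."""
    rows = []
    for result in results:
        summary = f"{result['status'].upper()}: {html.escape(result['name'])}"
        detail = html.escape(result["message"]) or "No failure message."
        rows.append(f"<details><summary>{summary}</summary><pre>{detail}</pre></details>")
    body = "\n".join(rows)
    return f"<!DOCTYPE html>\n<html><body><h1>Journey results</h1>\n{body}\n</body></html>"


def bundle_report(report_html: str, attachments: dict[str, bytes]) -> bytes:
    """Zip the HTML plus any extra files (e.g. screenshots) into one artifact."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("report.html", report_html)
        for name, data in attachments.items():
            archive.writestr(name, data)
    return buffer.getvalue()


sample = """<testsuite>
  <testcase name="login journey"/>
  <testcase name="enquiries tab"><failure message="tab not selected"/></testcase>
</testsuite>"""
print(render_report(parse_junit(sample)))
```

The deeper layers (tasks and attempts) would nest further `<details>` elements inside each row once the per-step data is available.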

The numbers

Metric	Result
Tests written and merged	13, across 5 app areas
Team satisfaction	8.75/10, all 4 respondents would continue
Maintenance per sprint	✅ < 1 hour
Test flakiness	⚠️ ~10% (trending down)
Real bugs caught	1 (analytics event firing twice)
Manual QA hours reduced	0 so far

Manual regression hours didn’t change. Our QA engineer kept executing the full manual suite throughout the experiment. That won’t shift until two things are true: journey tests cover all the flows manual testing currently covers, which they don’t yet, and those tests have a track record of catching regressions reliably on CI. Coverage first, then confidence. We’re building both.

What we’d tell a team starting this today

Enable this API first: iamcredentials.googleapis.com. You’ll get a misleading auth error that implies you’re not logged in. You are. The missing API is the actual problem.
Service account role for CI: roles/aiplatform.user is enough. No Gemini Code Assist Enterprise subscription needed for headless execution — that licensing model is for human users.
Firebase Test Lab doesn’t support Android Journeys. Plan for a remote device pool with direct ADB access. We landed on Genymotion Cloud on AWS.
Don’t bind the local end of the ADB tunnel to port 5555. Use 7777 or similar; 5555 creates a phantom second device and Journeys requires exactly one.
Disable the device lock screen before you start. Run locksettings set-disabled true via SSH. Gemini will find a locked screen and go nowhere otherwise.
Always make a new test fail first. A passing result only means something once you’ve confirmed the test can fail. Break it intentionally, confirm it breaks, then trust your greens.
Describe intent, not steps. “Navigate to the property details screen” survives a UI redesign. “Tap the listing card in the third row” doesn’t.
Flakiness improves with practice. It’s not the tool; it’s the learning curve. Don’t judge it in week two.
On adopt now vs. wait: If you have innovation budget or a dedicated spike, start now. The CI infrastructure you’ll build is largely tool-agnostic: device setup, tunnelling, and report generation would all transfer. If this is going into delivery sprints, give it six months.

What’s next and our read on this

The immediate priorities are closing the login automation gap, expanding test coverage, and shipping a reliable nightly CI run. Once we have that track record, manual regression scope can start reducing.

The bigger picture: the industry is moving away from selector-based tests that break on every refactor and toward intent-based descriptions that survive UI change. Android Journeys is currently the most mature implementation of that pattern in any mobile IDE. The core advantage, describing what a user does rather than navigating a view hierarchy, doesn’t erode as the tooling matures. The friction does.

One survey response stuck with us: AI writes the code, writes the journey test, runs it, iterates on failures, before a human reviews the result. We’re a few steps from that. The foundation is already there. Eleven weeks in, we believe it.

This experiment was run through FlowLab, REA Group’s experimentation hub where we get hands-on with emerging technologies to find new ways to boost productivity across the business. If you’re outside REA and working through the same Android Journeys CI setup challenges, Firebase compatibility, Genymotion tunnelling, service account configuration, we’re happy to compare notes.

Swapnil Gupta

Swapnil is a Staff Software Engineer (Mobile) at REA Group, where he shapes Android platform strategy and large-scale mobile architecture. He has been with REA Group for more than four years. With 12+ years of experience in mobile engineering, he combines deep technical expertise with a strong focus on business impact, building scalable platforms, engineering standards, and high-performing teams. His work focuses on Android, Kotlin, Jetpack Compose, and AI-driven developer workflows that improve engineering excellence.