Give an LLM an API and It'll Thrive. Give It a Touchscreen and It'll Miss.

The moment this project became interesting was not when a model succeeded. It was when I watched one look straight at the airplane mode toggle, explain that it was about to toggle it, miss, read the screen again on the wrong dialog, then confidently state it succeeded. I've spent years debugging exactly that kind of problem at Google, Google X, and Toyota Research Institute, so this felt natural to me. The fix is always the same: stop guessing and write a deterministic test. So that's what I did.

I wanted a narrower answer to a practical question:

If you give a model only the actions a person has on a phone (look at the screen, tap, swipe, long-press, press a physical button), can it reliably complete small real tasks?

So I built a small Android harness to test exactly that.

Across 1,700+ runs on four stable tasks, I got two useful things out of it. First, clean pass rates on basic phone tasks that (in theory) could be used for reinforcement learning. Second, a result I care much more about than a leaderboard, a potential regression: in my Codex runs, gpt-5.3-codex beat gpt-5.4 on three of the four stable tasks.

I do not work on model training or model evals at a lab, so I am not presenting this as an industry prescription. This is just a concrete system I built, the methodology I used, and the observations that fell out of it.

The model finds "Airplane mode", sees it's disabled, then mistakenly taps "SIMs" instead of the "Airplane mode" toggle:

Results

Comparing models head-to-head was not the goal. My testing practices were not very rigorous, and I lack the diverse set of tasks and task variations a proper comparison would need.

The most interesting finding was that gpt-5.3-codex outperformed gpt-5.4 on three of four tasks. This could be due to something I'm not accounting for: the time of day the tests were run, gpt-5.4 timing out more often, or other factors. These aren't the same model; the Codex variants are fine-tuned for agentic tool use, while GPT-5.4 is general-purpose. Whether that's a regression or a tradeoff, I can't say. But it's exactly the kind of thing task-level tests are supposed to catch.

Model Comparison
Lower is better for all metrics. Whiskers show the 95% CI on fail rate.
How is this calculated?
Test Period
All runs were conducted between March 28 and April 2, 2026, using each provider's publicly available API.
Plans Used
Tests were run using the following subscription plans:
  • Anthropic — Max 20x plan
  • Google — AI Ultra plan
  • OpenAI — Pro plan
Cost Calculation
Cost = (input tokens × input price) + (output tokens × output price) per successful run. Costs are based on each provider's published API token pricing (input / output per million tokens), not subscription fees. Sources:
  • Anthropic — Opus: $5 / $25, Sonnet: $3 / $15 (source)
  • Google — Gemini 2.5 Pro: $1.25 / $10, Flash: $0.30 / $2.50, Flash Lite: $0.10 / $0.40 (source)
  • OpenAI — GPT-5.4: $2.50 / $15, GPT-5.2/5.3 Codex: $1.75 / $14, GPT-5.1 Codex: $1.25 / $10 (source)
Confidence Intervals
Error bars use the Wilson score interval at 95% confidence.
Timeouts
Each test has a 10-minute timeout. Runs that exceed this are marked as "Timed Out" (shown in the darker segment of the fail bar). Timeouts count as failures in the pass rate calculation.
Low Sample Threshold
Models with fewer than 50% of the max run count are separated into a "Low sample" group.
Duration & Token Metrics
Duration and token bars only average successful runs.
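For reference, the Wilson interval used for those error bars is straightforward to compute directly. A minimal TypeScript sketch (z = 1.96 for 95% confidence):

```typescript
// Wilson score interval for a binomial proportion (e.g. a fail rate).
// k = number of failures, n = number of runs, z = 1.96 for 95% confidence.
function wilsonInterval(k: number, n: number, z = 1.96): { lo: number; hi: number } {
  if (n === 0) return { lo: 0, hi: 1 };
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const margin = (z / denom) * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return { lo: Math.max(0, center - margin), hi: Math.min(1, center + margin) };
}
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when a model only has a few dozen runs, which is exactly the regime these tests live in.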

The rule: no superpowers

The core constraint was simple: the model should interact with the phone the way a person would.

That means the agent side of the system only gets a very small action surface (via an MCP bridge):

  • screenshots
  • taps
  • swipes
  • long-presses
  • a few physical button events
  • short waits for UI animations to settle

Under the hood, the bridge is backed by adb (Android Debug Bridge), but that is for the harness; the model itself does not have access to adb or any other superpowers a human would not have.

The hardware button tool is intentionally narrow. On the device side I modeled the capabilities of a Pixel 8a. Some Android devices have physical "home", "back", and similar buttons, and adb can emulate those presses even on devices where the buttons don't physically exist, but a Pixel 8a user doesn't have them, so I only exposed volume up/down and power. This means that if the model wants to go home on a gesture-based Android device, it has to figure out the swipe.

That sounds like a small detail, but it matters. The whole point was to test interaction under something closer to human constraints, not whether the model could drive APIs.
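As a concrete sketch, the bridge's translation from agent actions into adb input events can look like this. The action type, tool shape, and keycode table are my illustrative assumptions, not the actual harness code:

```typescript
// Hypothetical sketch of an MCP bridge translating agent actions into adb
// arguments. Names and shapes are illustrative, not the actual harness.
type AgentAction =
  | { kind: "tap"; x: number; y: number }
  | { kind: "swipe"; x1: number; y1: number; x2: number; y2: number; ms: number }
  | { kind: "long_press"; x: number; y: number }
  | { kind: "key"; name: "volume_up" | "volume_down" | "power" };

// Only the Pixel 8a's physical buttons are exposed --
// no KEYCODE_HOME or KEYCODE_BACK shortcuts.
const KEYCODES = { volume_up: 24, volume_down: 25, power: 26 } as const;

function toAdbArgs(a: AgentAction): string[] {
  switch (a.kind) {
    case "tap":
      return ["shell", "input", "tap", String(a.x), String(a.y)];
    case "swipe":
      return ["shell", "input", "swipe",
        String(a.x1), String(a.y1), String(a.x2), String(a.y2), String(a.ms)];
    case "long_press":
      // A long-press is a swipe that stays in place for ~600ms.
      return ["shell", "input", "swipe",
        String(a.x), String(a.y), String(a.x), String(a.y), "600"];
    case "key":
      return ["shell", "input", "keyevent", String(KEYCODES[a.name])];
  }
}
```

The interesting property is what the type does not contain: no "open app", no "go home", no accessibility-tree queries. Everything funnels through coordinates on the screen.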

I also considered using AndroidWorld, and briefly AndroidEnv. I chose to write the entire stack myself: AndroidWorld is much larger than what I needed, and it is designed to give models special permissions, like "open app by name", along with access to the UI accessibility tree. I wanted to keep it to only what a human can see and do. Finally, AndroidWorld does not use the MCP interface; it uses the more legacy approach of injecting instructions telling the model to respond in JSON.

The harness is intentionally thin

The architecture ended up being much simpler than people often assume these systems need to be:

  1. The harness creates a device session.
  2. It restores a clean baseline snapshot.
  3. It runs task-specific setup.
  4. It starts a screen recording.
  5. It hands the task prompt and session ID to the model.
  6. The model interacts through the MCP bridge.
  7. The harness stops recording and verifies the final device state with adb.
  8. The session is cleaned up.
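Those eight steps can be sketched as one run function. The `DeviceSession` interface and names here are my illustration of the flow, not the harness's actual API:

```typescript
// Illustrative sketch of the run lifecycle; the interface and names are
// assumptions, not the real harness code.
interface DeviceSession {
  restoreSnapshot(name: string): Promise<void>;
  runSetup(task: string): Promise<void>;
  startRecording(): Promise<void>;
  stopRecording(): Promise<void>;
  verify(task: string): Promise<boolean>; // deterministic adb check
  close(): Promise<void>;
}

async function runTask(
  session: DeviceSession,
  task: string,
  runAgent: (task: string) => Promise<void>, // model acts via the MCP bridge
): Promise<boolean> {
  try {
    await session.restoreSnapshot("clean-baseline");
    await session.runSetup(task);
    await session.startRecording();
    await runAgent(task);              // model's turn: screenshots in, taps out
    await session.stopRecording();
    return await session.verify(task); // binary pass/fail, no partial credit
  } finally {
    await session.close();             // cleanup even if the agent throws
  }
}
```

The agent only ever appears as that one opaque callback; everything before and after it is the admin surface.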

That separation mattered.

The admin side exists to make tests deterministic: snapshot loading, setup commands, file installs, emulator console access, screen recording, and final verification.

The agent side exists to stay constrained: screenshots in, UI actions out.

I originally had much bigger plans for the whole project. The architecture supports running across multiple Android versions, varied starting states, and even swapping between models on a single emulator using snapshots. I even built a whole snapshotting system where the emulator would save and restore state as different models took turns on the same device. I implemented it, debugged it, and it functionally worked. But then I realized I didn't actually need it: I could run one model at a time and get clean results faster.

That is a pattern I fall into often: I scope toward the full destination, build the pieces, then pull back to what is actually needed right now. I personally believe this helps build good architecture and abstraction layers.

Architecture overview: the test harness orchestrates everything. On the agent surface, the LLM talks to the MCP bridge (screenshot, tap, swipe, long-press, key event, sleep). On the admin surface, the harness uses the Admin API (snapshots, ADB shell, recording, verification). Both sides drive the Android emulator through ADB.

Why I cared so much about binary pass/fail

I did not want a fuzzy score: no R² or similarity metrics, no second model grading the first. I wanted tests that end with a boring, deterministic answer.

  • Is airplane mode on? - Check settings get global airplane_mode_on.
  • Was the alarm created? - Check dumpsys alarm for a pending DeskClock alarm at 17:00.
  • Was Firefox Focus removed? - Check pm list packages and make sure org.mozilla.focus is gone.
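The checks themselves reduce to parsing a line or two of adb output. A sketch of the first and third check (the parsing is illustrative; the commands are the ones listed above):

```typescript
// Deterministic verification: parse raw adb output, return a hard yes/no.
// The parsing details are illustrative, not the actual harness code.

// `adb shell settings get global airplane_mode_on` prints "1" or "0".
function isAirplaneModeOn(settingsOutput: string): boolean {
  return settingsOutput.trim() === "1";
}

// `adb shell pm list packages` prints lines like "package:org.mozilla.focus".
function isPackageInstalled(pmListOutput: string, pkg: string): boolean {
  return pmListOutput
    .split("\n")
    .some((line) => line.trim() === `package:${pkg}`);
}
```

The alarm check is the same idea with messier input: grep `dumpsys alarm` output for a pending DeskClock alarm at the target time. In every case the verdict comes from the device's own state, never from the model's claim about what it did.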

That is the design philosophy in one sentence: if the alarm is not set, the task failed. There is no partial credit.

This matters because of how I think about tests generally. My personal philosophy is that a flaky test is a bad test. If a test sometimes passes and sometimes fails on the same code, it's not telling you anything useful; it's measuring non-determinism. In an ideal world, every test in your system runs the same way every time, and a failure means something actually broke.

In practice, that's not realistic for every layer of a complex system. In any large engineering organization, you end up with a testing hierarchy. At the lowest level, individual function tests, you push for near-perfect determinism. As you move up through module tests, integration tests, and end-to-end tests, the tolerance for variance increases because there are more moving parts, more developers touching code they didn't write, and more surface area for subtle environmental differences. You go from expecting six nines of reliability down to maybe three or two. That's not ideal, it's just practical.

The tests I built sit at the integration level. The model is a black box. It might take a different path to the same result each time. So I don't expect a single run to be deterministic, I expect the pass rate over many runs to be stable and meaningful. If a model passes a task 70% of the time over dozens of runs, that's a real signal.

But the verification itself has to be deterministic. It's a direct query to the device: is this setting on or off, is this package installed or not. The model's behavior can vary, the scoring cannot.

In practice I often wrote the verification path first, then worked backward into the task setup. If I could not verify a task cleanly, I picked a different task.

The tasks are small on purpose

I started with turning on airplane mode. Swipe down from the top of the screen to open quick settings, find the airplane mode toggle, tap it. One or two gestures. Once I could see the model actually pulling down the notification shade and hitting the right icon, I knew the basic loop worked and I could try something harder.

Turning airplane mode off is a separate test and subtly different: when airplane mode is active the icon looks different, so the model has to recognize whether the toggle is enabled before acting.

From there I moved to setting an alarm for 5:00 PM, which is a much more complex interaction. The model has to find and open the clock app, navigate to the alarms tab, interact with a time picker, set the correct time, and save it. The time picker alone is interesting because it varies across Android versions (although the stats in this article come from a single Android version, I did see the model interacting with different time pickers on other Android versions). Some show a clock face you tap on, others let you type the time directly, and the model has to deal with whichever one it gets.

The last task was uninstalling an app. This one was a bit more complex on the harness side, because the Android emulator doesn't let you uninstall any of the default apps, like YouTube and Chrome. I also didn't want to compile an APK myself; I wanted an open-source APK I could sideload and then ask the model to remove. Firefox Focus fit the bill. What I didn't anticipate was that the Firefox Focus icon is purple, not the orange-red most people associate with Firefox. To keep the test focused on the uninstall flow rather than grading the model on finding an icon that even a human might struggle to identify, I modified the prompt to tell the model the icon is purple. Even so, it still has to scan the screen, locate the right app among others, and match the description; it just doesn't have to guess as much on the app icon itself.

I originally wanted to push the model to its limits, but realized it was not very good at even simple tasks. This led me to think about how these simple tasks could be used to train models and/or identify weaknesses in model capabilities.

The model realizes it is a clock and it correctly clicks on the "5" on the clock wheel, presses "ok", then validates it was set:

The test I excluded, and why it matters

There was a fifth test that didn't make the cut. I set up a simulated text message containing a six-digit verification code, and asked the model to open the Messages app, find the code, and tell me what it was. Conceptually, it's a great test: it combines notification awareness, app navigation, reading comprehension, and reporting information back.

When I first started running it, the results looked good. Models were passing at reasonable rates, and I moved on to running more iterations and adding more models. But once I was maybe 20-30% through the full run and started reviewing replays through the visualizer, I noticed something. On some runs, a notification banner would pop up at the top of the screen with the message preview just after the test started, and if the model was fast enough, it could read the code directly from that banner or tap it just before it disappeared. On other runs, the banner was already gone by the time the model took its first screenshot, and it had to open the Messages app the long way. Both paths led to a correct answer, but the pass rate was partly measuring whether the model happened to take a screenshot while the notification was still visible. That's not what I was trying to test.

Once I saw this, I decided to exclude the test and move on. Rerunning all the data I'd already collected with a fixed version would have taken another eight to twelve hours of emulator time, and I didn't feel it would meaningfully change what I was trying to show. The other four tests were clean.

The reason I bring this up is not to explain away missing data. It's because it highlights the importance of building debug tooling rather than relying on raw metrics alone. The numbers looked fine; the test was passing at a reasonable rate. Without being able to watch, action by action, what the model was actually doing, I would have published a flawed test with good-looking numbers and never known.

A bad test: the model reads the verification code from the text-message banner before it fades away. This biased results toward faster models, which was not the intended goal:

Scores tell you that something regressed. Replays show you how.

Early on I built a replay visualizer. It started as a debugging tool for myself: I needed to verify that my MCP server was actually sending the right screenshots to the model. The screen is large, the crop regions are dynamic, and there's a scaling step involved. Without being able to see exactly what pixels the model received, I was working blind.

The visualizer shows two views side by side. On the left, the actual device screen with overlays showing where the model tapped, swiped, or long-pressed. On the right, what the model actually saw: the cropped, scaled screenshot regions it requested, layered on top of each other over time. Below both, a timeline of the model's messages and reasoning, plus token usage over time.

This turned out to be one of the most important things I built. Not because it looks good in a demo, though it does work well as an interactive component, but because it kept catching things that the pass/fail numbers never would have.

A more recent example was even more subtle. While running tests on a new model, I noticed through the visualizer that the screenshots it was receiving looked wrong, stretched in ways that didn't match the device screen. The other models I'd tested had stayed within the advertised MCP numerical limits, so I'd never hit this code path. The server was clamping the requested dimensions down, but the scaling function was then expanding the image back up to fill the originally requested size, producing a stretched, distorted image. The model was receiving data that did not match the screen.

If that bug had gone unnoticed and this system were being used for training, the model would have been learning from corrupted visual data, associating actions with screenshots that didn't accurately represent the device state. That's the kind of silent failure that can poison an entire pipeline, and the only reason I caught it was because I could look at what the model was actually seeing and compare it to what was on the device.
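One way to avoid that class of bug is to clamp with a single uniform scale factor, so the aspect ratio of the device screen is always preserved. A sketch of the idea (illustrative, not the actual server code):

```typescript
// When a requested screenshot size exceeds the model's limits, pick ONE
// uniform scale factor instead of clamping each axis independently.
// Clamping axes separately and then resizing to the requested size is
// exactly the stretch bug described above.
interface Size { w: number; h: number }

function fitWithinLimits(requested: Size, max: Size): Size {
  // Never upscale past the requested size; shrink both axes by the same factor.
  const scale = Math.min(1, max.w / requested.w, max.h / requested.h);
  return { w: Math.round(requested.w * scale), h: Math.round(requested.h * scale) };
}
```

With a uniform factor, a 2000x1000 request against a 1000x1000 limit comes back as 1000x500, same shape as the screen, rather than a square-stretched 1000x1000.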

A final pass rate can tell you that something changed. A replay can tell you where to look. That is why I keep coming back to the same point: you need to be able to inspect individual runs. Not for every run, but when something looks off, being able to open a replay and watch what happened is the difference between understanding your system and guessing about it.

Closing

What surprised me most about this project wasn't any single result. It was how much I learned from a small number of well-defined tests, once I had the tooling to actually see what was happening.

The pass rates told me whether something worked. The replays told me how it worked, or why it didn't. The binary scoring kept me honest, no partial credit, no subjective judgment. And running the same tests across different models and versions surfaced differences I wouldn't have noticed from the outside.

I think this kind of work, small, grounded, reproducible evaluation with good visibility into what the model is actually doing, is undervalued relative to how useful it is. It doesn't require a large team or a research budget. The system I built is a few hundred lines of TypeScript, an Android emulator, and an MCP server. The tasks are each about thirty lines. It took hours to get a proof of concept running, not weeks.

I do not need another impressive demo. I need a growing pile of small, deterministic, replayable regression tests.

The full source code is available here.