Give an LLM an API and It'll Thrive. Give It a Touchscreen and It Struggles

The moment this project became interesting was not when a model succeeded. It was when I watched one look straight at the airplane mode toggle, explain that it was about to toggle it, miss, read the screen again on the wrong dialog, then confidently state it succeeded. I've spent years debugging exactly that kind of problem at Google, Google X, and Toyota Research Institute, so this felt natural to me. The fix is always the same: stop guessing and write a deterministic test. So that's what I did.

I wanted a narrower answer to a practical question:

If you give a model only the actions a person has on a phone (look at the screen, tap, swipe, long-press, press a physical button), can it reliably complete small real tasks?

So I built a small Android harness to test exactly that.

Across 1,700+ runs on four stable tasks, I saw mostly clean pass rates on basic phone tasks, plus a few things I didn't expect: one case where an older model outperformed a newer one, and a much better understanding of why this kind of evaluation is worth doing carefully.

The model finds "Airplane mode", sees it's disabled, then mistakenly taps "SIMs" instead of the "Airplane mode" toggle:

Learnings TL;DR

The results are below and you'll probably scroll to them first, but the thing I actually care about is the methodology, because I think it generalizes well beyond this project:

  • Tools are more important than the code. - The numbers on one of my tests looked fine. The replay tool showed me the test was actually flawed. Without visibility into what the model was doing, I would have published bad data with good-looking numbers.

  • If you can't verify it deterministically, don't test it. - The model's behavior can vary. The scoring cannot.

  • A flaky test is a bad test. - I don't expect a single run to be deterministic, but I expect the pass rate over many runs to be stable and meaningful.

Results

The most interesting finding was that gpt-5.3-codex outperformed gpt-5.4 on three of four tasks (note: I'm not accounting for server-side load). These also aren't the same model: the Codex variants are fine-tuned for agentic tool use, while GPT-5.4 is general-purpose. Whether that's a regression or a tradeoff, I can't say. But it's exactly the kind of thing task-level tests are supposed to catch.

Model Comparison
Lower is better for all metrics. Whiskers show 95% CI on fail rate.
Test Period
All runs were conducted between March 28 – April 2, 2026 using each provider's publicly available API.
Plans Used
Tests were run using the following subscription plans:
  • Anthropic — Max 20x plan
  • Google — AI Ultra plan
  • OpenAI — Pro plan
Cost Calculation
Cost = (input tokens × input price) + (output tokens × output price) per successful run. Costs are based on each provider's published API token pricing (input / output per million tokens), not subscription fees. Sources:
  • Anthropic — Opus: $5 / $25, Sonnet: $3 / $15 (source)
  • Google — Gemini 2.5 Pro: $1.25 / $10, Flash: $0.30 / $2.50, Flash Lite: $0.10 / $0.40 (source)
  • OpenAI — GPT-5.4: $2.50 / $15, GPT-5.2/5.3 Codex: $1.75 / $14, GPT-5.1 Codex: $1.25 / $10 (source)
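As a sketch, the cost formula above can be written out directly. The price table mirrors the published per-million-token rates listed here; the token counts in the example are made up.

```python
# USD per million tokens (input, output), from the pricing listed above.
PRICES = {
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.3-codex": (1.75, 14.00),
    "gemini-2.5-pro": (1.25, 10.00),
    "claude-opus": (5.00, 25.00),
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single successful run."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# e.g. a hypothetical run with 120k input and 8k output tokens on GPT-5.4:
print(run_cost("gpt-5.4", 120_000, 8_000))  # → 0.42
```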
Confidence Intervals
Error bars use the Wilson score interval at 95% confidence.
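For reference, the Wilson interval is straightforward to compute; a minimal sketch (z = 1.96 for 95% confidence):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% CI by default)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)
```

Unlike the simpler normal approximation, it behaves sensibly at 0% and 100% pass rates, which matters at small run counts.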
Timeouts
Each test has a 10-minute timeout. Runs that exceed this are marked as "Timed Out" (shown in the darker segment of the fail bar). Timeouts count as failures in the pass rate calculation.
Low Sample Threshold
Models with fewer than 50% of the max run count are separated into a "Low sample" group.
Duration & Token Metrics
Duration and token bars only average successful runs.

How it works

The core constraint: the model interacts with the phone the way a person would.

That means the agent side of the system only gets a very small action surface (via an MCP bridge):

  • screenshots
  • taps
  • swipes
  • long-presses
  • a few physical button events
  • short waits for UI animations to settle

Under the hood, the bridge is backed by adb (Android Debug Bridge), but that access belongs to the harness; the model itself does not get adb or any other superpowers a human would not have.
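To make that concrete, here is my guess at how each agent action might map onto a single adb invocation. The real bridge's internals aren't shown in this post, so treat the mapping as illustrative.

```python
def adb_args(action: str, **kw) -> list[str]:
    """Translate an agent action into the adb arguments that could back it."""
    if action == "screenshot":
        return ["exec-out", "screencap", "-p"]
    if action == "tap":
        return ["shell", "input", "tap", str(kw["x"]), str(kw["y"])]
    if action == "swipe":
        return ["shell", "input", "swipe", str(kw["x1"]), str(kw["y1"]),
                str(kw["x2"]), str(kw["y2"]), str(kw.get("ms", 300))]
    if action == "long_press":
        # a long press is just a swipe that never leaves its starting point
        return ["shell", "input", "swipe", str(kw["x"]), str(kw["y"]),
                str(kw["x"]), str(kw["y"]), "800"]
    if action == "key":
        return ["shell", "input", "keyevent", str(kw["keycode"])]  # e.g. 26 = power
    raise ValueError(f"unknown action: {action}")
```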

I modeled the capabilities of a Pixel 8a, so the model only gets volume up/down and power as hardware buttons. If it wants to go home, it has to figure out the swipe.

I am not arguing that agents should navigate by pixel. If an API or accessibility tree exists, an agent should use it. But if a model can reliably interact with a UI using only what a human sees, that tells you something about its spatial reasoning, its ability to plan multi-step interactions, and its robustness to visual variation. Those capabilities matter because APIs don't cover everything, and the long tail of real-world tasks will always include interfaces that weren't designed for automation.

Related work worth reading: AndroidWorld gives models special permissions like launching apps by name and access to the UI accessibility tree, which is a reasonable design for evaluating practical agents but tests a different question than I was asking. MobileWorld from Tongyi Lab evaluates frontier models across many more tasks with more complex workflows; their work and mine measure different things despite landing in a similar ballpark of pass rates.

The Architecture

  1. The harness creates a device session.
  2. It restores a clean baseline snapshot.
  3. It runs task-specific setup.
  4. It starts a screen recording.
  5. It hands the task prompt and session ID to the model.
  6. The model interacts through the MCP bridge.
  7. The harness stops recording and verifies the final device state with adb.
  8. The session is cleaned up.
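The eight steps above can be sketched as a single trial function. Every name here is invented for illustration; the point is that setup, verification, and cleanup live on the harness side, and the model only ever acts through the bridge.

```python
def run_trial(session, task, agent, timeout_s=600):
    """One trial: setup and verification are harness-side, the agent only acts."""
    try:
        session.restore_snapshot("clean-baseline")     # 2. known-good device state
        task.setup(session)                            # 3. task-specific setup
        session.start_recording()                      # 4. screen recording
        agent.run(task.prompt, session.id, timeout_s)  # 5-6. model drives UI via MCP
        session.stop_recording()                       # 7. stop recording...
        return task.verify(session)                    #    ...and verify state via adb
    finally:
        session.cleanup()                              # 8. teardown, even on failure
```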

The admin side makes tests deterministic: snapshot loading, setup commands, file installs, emulator console access, screen recording, and final verification.

The agent side stays constrained: screenshots in, UI actions out.

[Architecture diagram: the test harness orchestrates everything. On the agent surface, the LLM talks to the MCP bridge (screenshot, tap, swipe, long-press, key event, sleep). On the admin surface, the admin API (snapshots, ADB shell, recording, verification) controls the Android emulator; both sides invoke ADB.]

Measuring the results

I did not want a fuzzy score: no R², no similarity metrics, no second model grading the first. I wanted tests that end with a boring, deterministic answer.

  • Is airplane mode on? - Check settings get global airplane_mode_on.
  • Was the alarm created? - Check dumpsys alarm for a pending DeskClock alarm at 17:00.
  • Was Firefox Focus removed? - Check pm list packages and make sure org.mozilla.focus is gone.

If the alarm is not set, the task failed. There is no partial credit. The model's behavior can vary, the scoring cannot.
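A sketch of what those checks could look like as predicates over raw adb output. I'm passing the command output in rather than shelling out, and the alarm match is a simplification of real dumpsys parsing.

```python
# Each predicate takes the raw text printed by the adb command named in
# its docstring; actually running adb is left to the harness.
def airplane_mode_on(settings_out: str) -> bool:
    """Output of: adb shell settings get global airplane_mode_on"""
    return settings_out.strip() == "1"

def focus_uninstalled(pm_out: str) -> bool:
    """Output of: adb shell pm list packages"""
    return "package:org.mozilla.focus" not in pm_out

def alarm_set_for_17(dumpsys_out: str) -> bool:
    """Output of: adb shell dumpsys alarm (real parsing is messier than this)."""
    return "deskclock" in dumpsys_out.lower() and "17:00" in dumpsys_out
```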

What makes a good test

I started with turning on airplane mode. Swipe down from the top of the screen to open quick settings, find the airplane mode toggle, tap it. One or two gestures. Once I could see the model actually pulling down the notification shade and hitting the right icon, I knew the basic loop worked, so I moved on.

Turning airplane mode off is a separate test and subtly different. When airplane mode is active the icon looks different, so the model needs to recognize whether the toggle is already enabled.

Setting an alarm for 5:00 PM is a much more complex interaction. The model has to find and open the clock app, navigate to the alarms tab, interact with a time picker, set the correct time, and save it. The time picker alone is interesting because it varies across Android versions (the stats in this article were captured on a single Android version, but I did see the model interacting with different time pickers on other versions). Some show a clock face you tap on, others let you type the time directly, and the model has to deal with whichever one it gets.

The last task was uninstalling Firefox Focus, which I sideloaded onto the emulator. The Firefox Focus icon is purple, not the orange-red most people associate with Firefox, so I told the model the icon color to keep the test focused on the uninstall flow rather than icon recognition.

These simple tasks were enough to surface real differences between models, and they got me thinking about how tasks like these could be used to identify weaknesses in model capabilities or even contribute to training.

The model realizes it is a clock and it correctly clicks on the "5" on the clock wheel, presses "ok", then validates it was set:

How I almost published a bad test

There was a fifth test that didn't make the cut. I set up a simulated text message containing a six-digit verification code, and asked the model to open the Messages app, find the code, and tell me what it was.

The results looked good, but once I started reviewing replays I noticed something. On some runs, a notification banner would pop up at the top of the screen with the message preview just after the test started. If the model was fast enough, it could read the code directly from that banner before it disappeared. On other runs, the banner was already gone, and the model had to open the Messages app the long way. The pass rate was partly measuring whether the model happened to screenshot while the notification was visible. That's not what I was trying to test, so I excluded it.

The reason I bring this up is not to explain away missing data. It's to highlight the importance of writing debug tooling instead of relying on raw printed metrics. The numbers looked fine; the test was passing at a reasonable rate. Without being able to watch, action by action, what the model was actually doing, I would have published a flawed test with good-looking numbers and never known.

A bad test: the model reads the verification code from the text message banner before it fades away. This biased the results toward faster models, which was not the intended goal:

Write tools, not just code

Early on I built a replay visualizer. It shows three panes side by side. On the left, the actual device screen, with overlays showing where the model tapped, swiped, or long-pressed. In the center, what the model actually saw: the cropped, scaled screenshot regions it requested, layered on top of each other over time. On the right, a timeline of the model's messages and reasoning.

It started as a debugging tool, but it kept catching things the pass/fail numbers never would have.

While running tests on a new model, I noticed through the visualizer that the screenshots it was receiving looked stretched. The server was clamping the dimensions down, but the scaling function then expanded the image back up to fill the originally requested size, producing a distorted image. The model was receiving data that did not match the screen.
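A distilled version of the bug, with invented function names: the fix is to return the clamped size and never re-expand toward the originally requested one.

```python
def clamp_size(w: int, h: int, max_w: int, max_h: int) -> tuple[int, int]:
    """Scale (w, h) down to fit within (max_w, max_h), preserving aspect ratio."""
    scale = min(max_w / w, max_h / h, 1.0)
    return round(w * scale), round(h * scale)

# A 1080x2400 portrait screenshot requested at 800x800:
print(clamp_size(1080, 2400, 800, 800))  # → (360, 800)
# The buggy path stretched the 360x800 result back up to 800x800,
# so the model saw a distorted screen.
```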

If that bug had gone unnoticed and this system were being used for training, the model would have been learning from corrupted visual data, associating actions with screenshots that didn't accurately represent the device state. That's the kind of silent failure that can poison an entire pipeline, and the only reason I caught it was because I could look at what the model was actually seeing and compare it to what was on the device.

Closing

What surprised me most about this project wasn't any single result. It was how much I learned from a small number of well-defined tests, once I had the tooling to actually see what was happening.

The full source code is available here.