djfa.ai

Clawdroid

Giving OpenClaw a Phone

TL;DR -- OpenClaw and Hermes can drive real Android apps inside Waydroid on Linux, so agents can use Amazon, Uber Eats, Instacart, etc. without the CAPTCHAs that block the web equivalents. Clawdroid snapshots the UI into short-lived refs (a1, a2, ...), taps them through ADB, and falls back to a vision model when the accessibility tree gets weird.

By Jeff Nash · May 23, 2026 · 8 min read
The Question

Can a Guy Use Hermes to Order Shampoo Online Without Getting Treated Like a DDOSer?

I've been cooking up a little plugin for OpenClaw (and Hermes) that allows them to interact with apps via an Android emulator, and I'm (somewhat predictably) calling it Clawdroid. It consists of a Python/FastAPI daemon + APK + OpenClaw/Hermes plugin + setup scripts that work together to make this surprisingly complex task possible. These four pieces let you spin up a Waydroid emulator on Linux, install real Android apps on it, and then let an agent tap around inside those apps from whatever conversation you're already having with it. So you could be chatting with Hermes, ask it to order you shampoo from Amazon, and it actually opens the Amazon app, searches for shampoo, and adds it to your cart, just as it would normally try to do in a browser. Same thing with Uber Eats, Instacart, CNN, or whatever else you choose to install manually or using the agent. The repo is here on GitHub if you want to GTFO of this article and just get it working.

Let's get the obvious question out of the way before you call me an agentic hipster and close this article: why not just use Playwright or Puppeteer or Selenium, the tools everyone reaches for the second they hear "agent" and "browser" in the same sentence? It's what's built into OpenClaw, and Chrome MCP is a thing, so why bother? If you want to test your own website after vibe coding it in 20 minutes, these are certainly the tools for the job. The problem is that those tools spent the last decade getting used for scraping and credential stuffing, so every major website now has bot detection specifically trained to spot them when you actually try to use them in the wild. The detection doesn't care what you're trying to do with the tool (up to and including actually trying to literally give the site owners your money); it just notices you're probably not a human, flags you, and blocks you. So your agent fires up Playwright to check flights or add something to a cart (again, things that will put money in the site owners' pockets), and it gets treated like someone brute-forcing a password. This problem has not gone unnoticed by these site owners and the stewards of the web in general (i.e. Google and no, I didn't mean e.g.). Accordingly, there are several approaches in the works to solve it. This doesn't help you today, unfortunately. In the meantime, while the industry is arguing about MCP vs CLI like it's HD-DVD vs Blu-ray for the right to burn tokens on ordering toilet paper, the actual problem just sits there, much like you on the toilet because your agent couldn't order toilet paper.

After thinking about it for a while, I realized that Android apps do all the same stuff as their web counterparts, but nobody's training bot detection models against people tapping around in the Amazon app on their phone; you're just assumed to be a user at that point. So why not just let agents use a service's app instead of their website? I figured I'd try building something and see how far I could get before something broke and frustrated me horribly to the point of giving up. Spoiler alert: I was pleasantly surprised!

The Problem

The CAPTCHA Wall

If you've used the OpenClaw browser tool, or really any of the popular browser-using agent setups, you already know exactly how this story ends. You fire up the agent, ask it to do something dead simple like grab a headline or check a price on some product page, and within about two clicks it slams face-first into a CAPTCHA or a Cloudflare interstitial. Sometimes, if the stars align, the agent actually solves the puzzle, but usually it doesn't, and either way you've burned tokens, time, and a small amount of dignity watching a computer fail to do something you could have done yourself in four seconds with one hand on your coffee.

A CAPTCHA, if you're a human, is a minor inconvenience: you click some traffic lights, you mutter something unkind about Google, and you move on with your life. It takes like five seconds max (except for those really annoying puzzle piece ones where you're sliding a jigsaw around with your thumb) and then you're done, but the agent can't do any of that, so the task just dies right there, and you're left staring at a log full of failed browser actions wondering why you bothered.

1. browser_navigate: "Verify you're human" interstitial.
2. browser_click: reCAPTCHA challenge, "select all squares with traffic lights."
3. browser_click: "Unusual activity detected, please try again later."
4. browser_navigate: soft-blocked. Agent gives up and apologizes.

A real session, lightly anonymized. Three of the four browser actions did nothing.

Now, the obvious next thought is to just brute-force your way through the problem. You can buy residential proxies, or pay a CAPTCHA-solving service that launders your automation through a click farm somewhere, and to be fair, those approaches do work for a while. Then they stop working, because the side with the bigger budget is the side that's trying to detect you, and I promise you they have a lot more budget than I do. I can't out-spend the ad-tech industry on a weekend side project, and I know a losing game when I see one.

Rather than continue to bang my head against this particular wall, I found a new wall, which happened to have "complex, 4-layered solution with multiple modes of failure" written on it with permanent marker.

The Insight

Mobile Is the Unfortified Flank

If you list the things people most often want agents to do for them, you'll notice a common theme: the services they interact with all have working mobile apps. Amazon, Uber Eats, Instacart, DoorDash, my bank, my airline: all of them have apps, and those apps almost never throw CAPTCHAs at me. I just tap and it does the thing, no "click every picture with a traffic light" nonsense. The app trusts that I am whoever the OS says I am, because that's the deal the app signed when it shipped to the Play Store, and if the platform says you're legit, you're legit.

As the Cloudflare CEO would proudly proclaim on an earnings call before laying off a huge chunk of their workforce "because AI," there's a whole industry built around catching automated web traffic, and all of it exists for one reason: anyone can hit any URL with anything, and the server just has to guess whether to trust it. In 2026 you'll be hard pressed to find a server that doesn't play bouncer to a good old curl request that isn't expected. Mobile apps don't have that problem; the app is signed, the platform already vouches for it, and every mobile API just assumes you're a real person on a real device. Now, bot detection on mobile isn't zero, I don't want to oversell this, but it's a fraction of what the web gets, and the rare time you do get challenged it's usually at login, not mid-checkout when you're trying to actually get something done.

The Web
  • × Bot detection on every request
  • × CAPTCHAs gate common actions
  • × Hundreds of DOM elements per page
  • × Playwright walks in smelling like a scraper
A Mobile App
  • + Platform vouches for you, CAPTCHAs almost never fire
  • + Sparse screens, couple dozen interactive nodes
  • + OS hands you a structured accessibility tree for free

It also turns out that mobile apps are just easier for an agent to drive than websites, and this one I sort of stumbled into sideways. Think about how you use a mobile app: you look at one screen, you tap one or two things, and you get to the next screen, and that's exactly the shape an agentic loop wants. Compare that to a web page, which might have hundreds of DOM nodes sprawled across a layout that was designed for a 27-inch monitor. A mobile screen has a couple dozen interactive elements on it at any given time, and the OS maintains a structured accessibility tree for screen readers that happens to be about the right size for an LLM to chew on without blowing up your context. So you get the CAPTCHA dodge and the small context window for free, in the same move.

[Figure: side-by-side comparison. Left, a web page (store.example.com) with a 728×90 ad, category menus nested four levels deep, and a cookie banner: ~400 nodes, half decorative, half nested 4 levels deep. Right, a mobile screen with Search (a1), a $1.67 item with Add (a2), Sort by Price (a3), and Continue to checkout (a4): ~12 nodes, all things you can tap with a thumb.]

I mean, you probably do this yourself anyways, reaching for your phone to add stuff to your Amazon cart instead of opening a laptop, because the phone is right there and the app just works. The agent gets that same benefit, and the fact that CAPTCHAs almost never fire is just gravy.

The Host Stack

On the 'agents using Android' front, I'd seen a few plugins floating around that remotely controlled physical Android devices or used them as bridges to send texts. But that seemed brittle, expensive, and unnecessary: our use case didn't need Android for its ability to connect to cell towers. Emulation was the logical next step, since Android is built to run on a litany of devices, unlike iOS. I remembered messing around with BlueStacks back in the day, along with Windows Subsystem for Android before Microsoft killed it, but both seemed geared towards active app development and therefore ill-suited to what I was actually trying to do. Turns out, there is a solution...

The host runtime is Waydroid, and it runs a full LineageOS-based Android image in a Wayland-native container that shares the host kernel. If you've worked with Linux containers before, it's the same idea, except the userspace is Android: binder, init, the full app runtime, all of it. I picked Waydroid over something like Android Studio's AVD because it's not doing QEMU-style emulation, so the apps actually feel snappy. If you've ever tried to use the AVD for anything beyond a quick layout check, you know the pain I'm talking about.

We don't just let an Android emulator for nerds loose on the internet and assume no one will notice. The first thing Clawdroid does is ensure Android pretends it's a Samsung Galaxy S24 Ultra. In reality it's a container sitting next to your other Linux processes, but the apps don't know that, and you'd be surprised how many of them decide what to render based on the device-model string alone. I picked the S24 Ultra specifically: if you go with some obscure profile that no app has ever seen before, that's a quiet way to get yourself bot-flagged before you've even done anything interesting. You can override this with --device-profile if you have a reason to, but I'd leave it unless you know what you're doing.

There is a catch, though: most apps still ship ARM-only binaries, so if you're on x86_64 you need an ARM translation layer to make them work. In practice, most Android apps are simple enough that you can't even feel the overhead, and if you happen to be on an ARM Linux host you wouldn't need the layer at all, though I haven't personally tried that yet. Clawdroid installs a translation layer by default, and the repo covers how to pick a different one if you have an opinion on the matter.

Waydroid wants a real Wayland session to run in. If you're already on Wayland, whether that's GNOME, KDE, sway, whatever, you're good to go and don't need to think about this at all. If you're still on X11, though, Clawdroid handles it for you by spinning up a nested Weston compositor inside your X session and pointing Waydroid at that. The agent has no idea any of this compositor juggling is happening. From your perspective you just see a little Android window pop up that you can ignore, minimize, or watch if you're curious.
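The Wayland-or-X11 decision described above can be sketched in a few lines. This is only an illustration of the check, not Clawdroid's actual setup code: the function name `needs_nested_compositor` is hypothetical, and the only things it relies on are the standard `WAYLAND_DISPLAY` and `XDG_SESSION_TYPE` environment variables that desktop sessions normally export.

```python
import os


def needs_nested_compositor(env=None):
    """Return True when no Wayland session is visible in the environment.

    Hypothetical helper: a True result is the case where a nested
    compositor (such as Weston) would have to be started first.
    """
    env = os.environ if env is None else env
    # A running Wayland compositor normally exports WAYLAND_DISPLAY.
    if env.get("WAYLAND_DISPLAY"):
        return False
    # Otherwise fall back to the session type reported by the login manager.
    return env.get("XDG_SESSION_TYPE", "").lower() != "wayland"


if __name__ == "__main__":
    print("nested compositor needed:", needs_nested_compositor())
```

Passing the environment in as a dict keeps the check easy to test without touching the real session.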

The stack
top: agent · bottom: kernel

Agent surface · the part you talk to
  • 01 Agent: Hermes or OpenClaw conversation
  • 02 Daemon: Python process on the host, owns all logic and policy
Android container · inside Waydroid
  • 03 Bridge APK: Accessibility service + in-Android HTTP server
  • 04 Android OS: LineageOS image, ARM-flagship device profile
Host · your actual Linux machine
  • 05 Waydroid: Container runtime, binder + cgroups
  • 06 Compositor: Wayland native, or nested Weston on X11
  • 07 Linux kernel: Everything else sits on this

Request path: agent action → daemon → bridge (in Android) → app

One decision you have to make (and this will separate the LineageOS purists from the ones who just want shit to work) is whether to bring in any Google framework at all. A lot of Android apps quietly depend on Google Play Services for things like push notifications and location, so you can't just skip it entirely and expect everything to work. My solution was to go with MicroG by default, an open-source reimplementation of Google Play Services that handles the stuff most apps actually need: push notifications, location stubs, attestation in a permissive mode, all without dragging the real Google framework onto your machine. I'd rather avoid having the full Google stack running if I can help it, and so far MicroG has been enough. If you end up needing real GApps for something specific, the setup script can install them, but I personally haven't hit that wall yet.

The Accessibility APK

Quick aside on UIAutomator, since I know someone's going to ask. It's the tool most people think of immediately when they say Android and automation in the same sentence. Not only is it the standard test-automation API for Android, it's certainly more powerful than what I built. Why reinvent the wheel? The issue is this: UIAutomator's start sequence is liable to do things like unbind a running accessibility service which, in this case, would kill my custom APK that relies on this service. The powerful tool elbows the steady one right out mid-session, and then my whole accessibility bridge goes down. So Clawdroid runs on the accessibility tree for all normal operation, and UIAutomator only comes out for admin-level recovery where stability is already shot and I have nothing left to lose.

If you've done any work with Android accessibility, you know the OS maintains a tree of every interactive element on screen. Screen readers use it, test frameworks use it, it's Android handing you a structured description of what's on the phone for free. Now, screenshots-plus-vision is what everyone tries first since it feels more intuitive, and Clawdroid does fall back to it when the tree is useless or missing, but starting there is dumb: you pay image tokens, your actions become guesses at pixel coordinates, and the structured data Android was already handing you for free goes in the bin, so why would you do that to yourself.

Where the bridge lives
two processes, two address spaces
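[Figure: on the host (Linux), the daemon (Python, owns logic) speaks HTTP on localhost. Inside Waydroid (Android), the bridge APK (AccessibilityService + in-Android HTTP server) reads the a11y tree of the target app (Amazon, Uber, Instacart, or whatever the agent is driving). Actions go from daemon to bridge and trees come back, both over adb forward.]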
The bridge is a small APK with an HTTP server running inside Android. The daemon, on the host, reaches it through an ADB-forwarded localhost socket. From the daemon's perspective it's just curl; from the APK's perspective it's a normal Android service.

Clawdroid's bridge into the accessibility tree is a small companion APK that I've kept deliberately minimal. It registers an AccessibilityService so Android feeds it a stream of UI-tree updates, and alongside that it runs a tiny HTTP server inside Android so the daemon on the host can ask "what's on screen right now?" and "tap that thing." There's also a one-screen setup activity, since the first time you install it you need to flip the accessibility-service toggle in Settings, and that's not something I can do programmatically, annoyingly enough. I was tempted more than once to put more logic on the Android side, but every time I tried, it made debugging way harder: now you've got state split across two processes on two different OSes and you're playing detective across an adb bridge. So the APK stays dumb on purpose, and all the actual decision-making happens in the daemon on the host side where I can just read logs like a normal person.
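To make the "it's just curl" idea concrete, here is a minimal sketch of how a host-side daemon could talk to an in-Android HTTP server through an ADB port forward. `adb forward tcp:<host> tcp:<device>` is a real ADB command and `/node_action` is the endpoint named in the dispatch diagram, but the port number, the JSON payload shape, and the function names are assumptions made for illustration, not the bridge's documented API.

```python
import json
import urllib.request

BRIDGE_PORT = 8765  # hypothetical port; the real bridge may use another


def adb_forward_cmd(port=BRIDGE_PORT):
    """Command that maps a host TCP port onto the same port inside Android."""
    return ["adb", "forward", f"tcp:{port}", f"tcp:{port}"]


def build_node_action_request(node_key, action, port=BRIDGE_PORT):
    """Build (but do not send) a POST to the bridge's /node_action endpoint.

    The payload fields are assumed for this sketch.
    """
    body = json.dumps({"node_key": node_key, "action": action}).encode("utf-8")
    return urllib.request.Request(
        f"http://127.0.0.1:{port}/node_action",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    print(" ".join(adb_forward_cmd()))
    req = build_node_action_request("atc_button", "CLICK")
    print(req.get_method(), req.full_url, req.data.decode())
    # Sending would be: urllib.request.urlopen(req, timeout=2.0)
```

Once the forward is in place, everything on the host side is ordinary localhost HTTP, which is why the logs stay readable.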

If you're coming from web development, and I'm going to guess most people reading this are, you need to unlearn some assumptions about how "the page tree" works. The Android UI tree is not the DOM, and the differences will bite you. The DOM never had two views claiming the same screen coordinates and both insisting they're on top, but Android does that constantly. I've started thinking of it as less a tree and more a hedge. You've got these overlapping subtrees with their own coordinate systems and strong opinions about which one is on top, and half the time you're arguing with Android about what's even visible right now.

DOM vs. Android UI tree

On the web, the node is the tree. On Android, the node is a serialized projection of the tree.

DOM
Element / HTMLElement
el.tagName
// "BUTTON"

A live handle into the document. Mutating it mutates the page.

Android
AccessibilityNodeInfo
node.getClassName()
// "android.widget.Button"

A serialized copy sent over Binder IPC. The real View lives in the foreign app's process.

Six reasons it's a hedge
  • WebViews are opaque. One Android node from the outside, a lazy virtual tree from Chromium on the inside. Hybrid apps (Cordova, Capacitor, older Flutter) often look like a blank rectangle to your agent.
  • RecyclerView recycles ViewHolders. Off-screen rows are destroyed, not hidden. The accessibility tree only contains visible-plus-prefetch rows; scrolling changes which nodes exist.
  • isImportantForAccessibility() lies. Third-party apps regularly mark interactive nodes as NO and decorative nodes as YES. The flag is a hint, not a contract.
  • resource-id is not unique. Devs are encouraged to keep them unique per screen, but the platform doesn't enforce it. Identical rows in a list share the same id.
  • Custom canvas drawing is invisible. Anything drawn directly to Canvas (games, charts, some custom widgets) is a single rectangle to your agent. No text, no children, no semantics.
  • Compose breaks resource-id. A Compose app you don't control may show one big AndroidComposeView plus a flat merged semantics tree with zero resource-ids.

A node in this "hedge" looks nothing like what you're used to if you're a web dev. When the daemon asks the bridge for the current screen, the bridge walks the active window's accessibility tree and spits out a list of node dicts. I made them dense on purpose, with each one carrying about twenty fields, and that might seem like a lot, but I want the daemon to be able to make most decisions from the tree alone without having to turn around and ask follow-up questions about individual fields. The round-trip cost of going back for more info adds up fast when you're doing it hundreds of times per task.

one node, snapshot of an Amazon product page
{
  "ref": "a7",
  "role": "Button",
  "text": "Add to cart",
  "content_desc": null,
  "hint_text": null,
  "class_name": "android.widget.Button",
  "resource_id": "com.amazon.mShop.android.shopping:id/atc_button",
  "bounds": [120, 1840, 960, 1976],
  "actions": ["CLICK", "LONG_CLICK"],
  "editable": false,
  "scrollable": false,
  "checked": false,
  "package": "com.amazon.mShop.android.shopping",
  "window_rank": 0,
  "semantic_label": "Add to cart",
  "container_key": "checkout-cta-row",
  "section_key": "product-detail"
}

Every field in that node dict is straight outta Android except the ref. The daemon assigns it itself, walking the node list and handing each one a short ref (a1, a2, a3, and so on). Refs are short on purpose so the agent can write "a7" in a tool call instead of a full resource-id or bounds rect, but they're also short-lived, and this is the part that trips people up. A ref only lives as long as the snapshot it came from and gets dropped the moment the screen changes. An app open, an intent launch, a URL navigation, whatever: any window transition kills them. You can't cache "the search field" across screens, and if you try, you'll get stale refs pointing at the wrong buttons and a very confused agent.
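Here is a minimal sketch of that ref bookkeeping, assuming nodes arrive as plain dicts. The names `make_snapshot`, `invalidate`, and `resolve` are hypothetical and the snapshot layout is invented for the example; the only behavior taken from the text is that refs are numbered a1, a2, ... per snapshot and stop resolving once the screen changes.

```python
import itertools

_snapshot_ids = itertools.count(1)


def make_snapshot(nodes):
    """Hand every node a short ref (a1, a2, ...) scoped to one snapshot."""
    refs = {}
    for index, node in enumerate(nodes, start=1):
        ref = f"a{index}"
        refs[ref] = dict(node, ref=ref)  # copy so the input stays untouched
    return {"id": next(_snapshot_ids), "refs": refs, "stale": False}


def invalidate(snapshot):
    """Mark a snapshot dead, e.g. after a window transition."""
    snapshot["stale"] = True


def resolve(snapshot, ref):
    """Look a ref up, refusing to answer from a stale snapshot."""
    if snapshot["stale"]:
        raise LookupError("snapshot stale, re-snapshot first")
    return snapshot["refs"][ref]


if __name__ == "__main__":
    snap = make_snapshot([{"text": "Search"}, {"text": "Add to cart"}])
    print(resolve(snap, "a2")["text"])
    invalidate(snap)  # the screen changed
    try:
        resolve(snap, "a2")
    except LookupError as err:
        print("refused:", err)
```

Because refs restart at a1 for every snapshot, the same ref string means different nodes on different screens, which is exactly why lookups go through the snapshot rather than a global table.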

The Action Layer

Seeing is one thing, but the agent also needs to actually touch things. The action layer has two paths it can take, and which one fires depends on how the element was originally discovered. When the agent calls act(snapshot_id, ref="a7", op="click"), the daemon goes and looks up the handle for that ref in the snapshot. If the handle came from the accessibility tree, which it usually does, the bridge uses AccessibilityNodeInfo.performAction(ACTION_CLICK) to fire a real, semantic click. This is actually the same mechanism TalkBack uses when it taps something on behalf of a blind user, so at least the approach has precedent. If the handle didn't come from the accessibility tree, or the bridge isn't reachable, it falls back to ADB and just taps the pixel coordinates instead.

The ADB fallback is as boring as the name "Android Debug Bridge" would imply: it pulls the bounding rectangle of the element, computes the center point, and fires a tap at those coordinates. It works, but it's not smart: the OS has no idea what was tapped, only where on the screen a finger landed. Swipes, key events, text entry all go through the same dumb-but-reliable pipe.
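The coordinate fallback is simple enough to show in full. This sketch assumes bounds arrive as `[left, top, right, bottom]`, matching the node dict earlier in the article; `adb shell input tap X Y` is the real ADB command, while the helper names are made up for the example.

```python
def center_of(bounds):
    """Center point of a [left, top, right, bottom] rectangle, in pixels."""
    left, top, right, bottom = bounds
    return (left + right) // 2, (top + bottom) // 2


def adb_tap_command(bounds):
    """Build the argv for a coordinate tap at the center of the bounds."""
    x, y = center_of(bounds)
    return ["adb", "shell", "input", "tap", str(x), str(y)]


if __name__ == "__main__":
    # Bounds of the "Add to cart" node from the example snapshot.
    print(" ".join(adb_tap_command([120, 1840, 960, 1976])))
    # To actually run it: subprocess.run(adb_tap_command(bounds), check=True)
```

For the "Add to cart" node this works out to a tap at (540, 1908), the same coordinates shown in the dispatch diagram.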

Action dispatch
Fast path: bridge
act(ref="a7", op="click")
POST /node_action
AccessibilityNodeInfo.performAction(ACTION_CLICK)

Semantic. The app sees a real, dispatched click event. Works through scroll containers, dialogs, and dynamic views.

Fallback: coordinate ADB
act(ref="a7", op="click")
center := bounds.center()
adb shell input tap 540 1908

Pixel-accurate but semantically blind. The OS doesn't know what got clicked, just where.

Like a real human transfixed by their phone, the bridge has roughly two seconds to respond until we give up and drop down to ADB. Two seconds sounds like a lot, but on a cheaper phone running a heavy app, accessibility calls really can take that long. I tried shorter timeouts early on and kept getting spurious fallbacks that weren't necessary.
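The bridge-first, ADB-second policy might look something like the sketch below. The function `dispatch_click` and its two callables are hypothetical stand-ins: `bridge_click` is assumed to be whatever performs the HTTP call and to raise an `OSError` subclass (a timeout or connection error) when the bridge is slow or unreachable, and `adb_tap` is assumed to perform the coordinate tap.

```python
BRIDGE_TIMEOUT_S = 2.0  # the roughly-two-second budget described above


def dispatch_click(bridge_click, adb_tap, timeout=BRIDGE_TIMEOUT_S):
    """Try the semantic bridge click first; drop to a coordinate tap on failure.

    Returns the name of the path that handled the click.
    """
    try:
        bridge_click(timeout)
        return "bridge"
    except OSError:
        # TimeoutError and ConnectionError are both OSError subclasses, so
        # this covers a slow bridge as well as an unreachable one.
        adb_tap()
        return "adb"


if __name__ == "__main__":
    def slow_bridge(timeout):
        raise TimeoutError(f"no reply within {timeout}s")

    print(dispatch_click(lambda timeout: None, lambda: None))
    print(dispatch_click(slow_bridge, lambda: print("tapping via adb")))
```

Keeping the timeout as a parameter makes it easy to experiment with shorter budgets and see how often the fallback fires.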

Because the bridge path targets the node key directly, not a screen coordinate, it tends to hit the right thing even when the snapshot has gone slightly stale. The UI might have shifted a few pixels since the agent last looked, maybe a notification banner slid in, maybe a list scrolled a hair, and the bridge doesn't care, it's talking to the node, not the pixel. This ends up being sort of critical in practice because the agent is not fast; by the time it decides what to click, the screen has often drifted.

And then there are things the daemon just refuses to do. Before every click, the protected-action guard looks at whatever the ref's label says and checks it against a list of high-stakes phrases: things like "Place your order" or "Confirm purchase." If there's a match and the daemon has approval mode turned on, and it does by default, please don't turn it off, the action gets kicked back as "needs approval" and the agent has to ask again, this time with a token that proves a human actually said yes. I added this guard the same afternoon a demo ended with my home address being read back to me on camera.

The protected-action guard
checks every click against a blocklist
[Figure: two example clicks passing through the guard. act(ref="a3") on "Place your order" is a high-stakes click: the label matches the blocklist, so it is BLOCKED with needs_approval = true and the agent must reissue with a token. act(ref="a7") on "Add to cart" is an ordinary click: no match, so it is EXECUTED, ACTION_CLICK fires, and no approval is needed.]
Blocklist (high-stakes phrases that require explicit approval): "place your order", "buy now", "submit order", "sign in", "confirm purchase", "send"
The guard runs before every click. If the target ref's label matches a high-stakes phrase, the action is rejected with needs_approval, and the agent has to explicitly reissue with an approval token. The blocklist above is what ships by default; the daemon's config lets you add to it.
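A guard like this can be sketched in a dozen lines. The phrase list is the default blocklist shown above; everything else is an assumption for illustration, including the function name `check_click`, the result dict, the whole-word matching rule (chosen here so that "send" does not fire on "Resend"), and the `is_valid_token` callback standing in for however approval tokens are really verified.

```python
import re

PROTECTED_PHRASES = (
    "place your order", "buy now", "submit order",
    "sign in", "confirm purchase", "send",
)
# Whole-word, case-insensitive match against any protected phrase.
_PROTECTED = re.compile(
    "|".join(rf"\b{re.escape(p)}\b" for p in PROTECTED_PHRASES), re.IGNORECASE
)


def check_click(label, approval_mode=True, approval_token=None,
                is_valid_token=lambda token: False):
    """Decide whether a click on a node with this label may proceed."""
    match = _PROTECTED.search(label or "")
    approved = approval_token is not None and is_valid_token(approval_token)
    if match and approval_mode and not approved:
        return {"allowed": False, "needs_approval": True}
    return {"allowed": True, "needs_approval": False}


if __name__ == "__main__":
    print(check_click("Add to cart"))
    print(check_click("Place your order"))
    print(check_click("Place your order", approval_token="t1",
                      is_valid_token=lambda token: token == "t1"))
```

The token check is deliberately a callback, since the important property is that the approval comes from outside the agent's own reasoning.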

The Agent Loop

Your first impulse when you build something like this is to carry state forward between iterations. You've got these nice refs from the last screen, so why throw them away? I built that version first, obviously, and it took about five minutes to shit the bed. Refs from a previous screen would collide with refs on the current screen, the agent would try to tap something that wasn't there anymore, and the whole thing would spiral. So the loop drops everything every single iteration, and refs from the last screen disappear the moment the screen changes, and it feels wasteful until you watch the alternative eat itself.

[Figure: the loop. Step 1 task_route, step 2 snapshot, step 3 decide_next, step 4 act, step 5 snapshot, then repeat until done.]

snapshot actually has a few different modes, and which one you reach for depends on what you're trying to accomplish in a given step. The one I use almost all the time is interactive, the default, and it gives you the bridge tree filtered down to only the nodes a user could actually tap on. This keeps the context small, and that matters more than you'd think when you're feeding this stuff to a model on every single iteration. Then there's hybrid, and that bolts an ADB screenshot onto the tree alongside the structured data. You want this when you're about to fall back to vision and need actual pixels, not just a node hierarchy. And then there's full, returning everything including invisible nodes. That one sounds like it should be the most useful, but the context bill is enormous. I almost never feed it to the model unless I'm properly stuck trying to figure out why some element isn't showing up in the other modes.
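As a rough sketch of the tree side of those modes, assuming nodes shaped like the example dict earlier: the rule used here for "a user could actually tap on it" (a non-empty `actions` list, or `editable`, or `scrollable`) is a guess for illustration, and the function name `filter_nodes` is hypothetical. The screenshot that hybrid adds is outside the scope of this snippet.

```python
def is_interactive(node):
    """Assumed rule: the node exposes an action, or is editable or scrollable."""
    return bool(node.get("actions") or node.get("editable") or node.get("scrollable"))


def filter_nodes(nodes, mode="interactive"):
    """Return the tree portion of a snapshot for the given mode."""
    if mode == "full":
        return list(nodes)  # everything, decorative and invisible nodes included
    if mode in ("interactive", "hybrid"):
        # hybrid uses the same filtered tree; the screenshot is attached elsewhere
        return [n for n in nodes if is_interactive(n)]
    raise ValueError(f"unknown snapshot mode: {mode}")


if __name__ == "__main__":
    nodes = [
        {"text": "Add to cart", "actions": ["CLICK"]},
        {"text": "Sponsored", "actions": []},
        {"text": "", "actions": [], "editable": True},
    ]
    print(len(filter_nodes(nodes)), "of", len(filter_nodes(nodes, "full")))
```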

Why refs die
one tap, all refs invalidated
[Figure: snapshot @ t=1 shows the Amazon product page ($1.67) with "Add to cart" at a7 alongside a1 through a4. Tapping a7 invalidates the snapshot (a1 through a7) and fresh refs are minted. Snapshot @ t=2 shows the Amazon cart page (Whole Blends Shampoo, 3 fl oz, $1.67), where a7 is now "Continue shopping" and a3 is "Place your order".]
Same ref number, completely different node. a7 was "Add to cart" on the product page; one tap later it's "Continue shopping" on the cart page. The daemon throws away the entire snapshot the moment the screen changes, so the agent is never tempted to reuse a stale ref.
If the agent tries it anyway
Strict (default)
act(snap_t1, ref="a7")
error: snapshot stale
snapshot() // re-fetch
act(snap_t2, ref=...)

Refuses to operate on a stale snapshot. The agent has to re-snapshot and pick fresh refs. Safe but talky.

Forgiving (opt-in)
act(snap_t1, ref="a7")
match cached handle
id=atc_button · text="Add to cart" · class=Button
unambiguous match → proceed
no match → reject anyway

Tries to find the same logical node on the current screen by resource-id, text, and class. Fewer round-trips, but only proceeds if the match is unambiguous.

Ref lifetimes took me a while to get right, and the whole design ended up being "throw everything away," and it felt wrong at first. Refs cannot survive screen transitions, because the moment the screen changes, every ref from the previous snapshot is garbage. So the daemon nukes the whole snapshot whenever it detects a screen change, doesn't matter if it's a new app, a URL opening in Chrome, whatever, if the bridge fires a window-state-changed event the refs are dead. So what happens when the agent tries to use a ref that's already dead? By default the daemon just tells the agent "hey, your snapshot is stale, go re-snapshot," the safe option. But if you've turned on forgiving mode, it tries to match the cached handle to whatever's on the current screen using the node's signature, basically the resource-id plus text plus class. If there's an unambiguous match it proceeds, and if not, it still tells the agent to re-snapshot anyway. This ended up being really important for flows where screens change but the element the agent cares about is actually still there, just in a slightly different tree.
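The forgiving-mode match can be sketched as a signature comparison. The signature fields (resource-id, text, class) come from the description above; the function names and the choice to return `None` for "tell the agent to re-snapshot" are assumptions for the example.

```python
def signature(node):
    """Identity used to re-find a node across snapshots."""
    return (node.get("resource_id"), node.get("text"), node.get("class_name"))


def rematch(cached_node, current_nodes):
    """Return the single current node with the same signature, else None.

    None covers both "no match" and "ambiguous match"; in either case the
    caller should fall back to the strict behavior and ask for a re-snapshot.
    """
    wanted = signature(cached_node)
    matches = [node for node in current_nodes if signature(node) == wanted]
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    cached = {"resource_id": "atc_button", "text": "Add to cart",
              "class_name": "android.widget.Button"}
    screen = [dict(cached, bounds=[120, 1900, 960, 2036]),
              {"resource_id": "buy_now", "text": "Buy now",
               "class_name": "android.widget.Button"}]
    print(rematch(cached, screen))
```

Insisting on exactly one match is what keeps this from tapping the wrong row in a list where several nodes share a resource-id.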

As it turns out, agents refer back to what they saw a few steps ago way more often than you'd guess, so I have the daemon cache the last ten snapshots. "I saw a button earlier that said X, let me go back to that," that kind of thing. Without the cache, the only way to support that would be keeping the entire tree in the model's context for every previous step, and the token count would blow up almost immediately. With the cache, the agent can just ask for snapshot number seven or whatever and get it back cheaply.
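A bounded cache like that is a few lines with the standard library. This is a sketch only: the class name `SnapshotCache` and its methods are hypothetical, and the only detail taken from the text is the capacity of ten.

```python
from collections import OrderedDict


class SnapshotCache:
    """Keep the most recent snapshots, evicting the oldest past capacity."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self._items = OrderedDict()

    def put(self, snapshot_id, snapshot):
        self._items[snapshot_id] = snapshot
        self._items.move_to_end(snapshot_id)  # newest goes to the end
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)   # drop the oldest entry

    def get(self, snapshot_id):
        """Return the cached snapshot, or None if it has been evicted."""
        return self._items.get(snapshot_id)


if __name__ == "__main__":
    cache = SnapshotCache()
    for n in range(1, 13):
        cache.put(n, {"id": n})
    print(cache.get(2), cache.get(7))
```

The agent only ever sends a snapshot id, so looking back at an earlier screen costs a dictionary lookup instead of another tree dump in the model's context.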

How decide_next decides
cheap first, escalate as needed
[Figure: decide_next(snapshot, goal) as a three-tier cascade in auto (default) mode.]
  • deterministic: rule-based picker, tree only. Instant, cost: free. Confident enough? Yes → act, no → escalate.
  • llm_text: text-only LLM, tree alone. ~1s, cost: $. Tree got us there? Yes → act, no → escalate.
  • llm_vision: multimodal LLM, tree + screenshot. ~2s, cost: $$. Then act.
auto (the default) walks this cascade top-to-bottom, stopping at the first tier that's confident enough to answer. You can also pin to a single mode if you know which one you want -- a deterministic-only setup is free and predictable; a vision-only setup is the most powerful and the most expensive.
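The cascade itself is a short loop over tiers. In this sketch each tier is a hypothetical callable returning a `(decision, confidence)` pair, and the 0.8 threshold is an invented number; the source only says each tier has to be "confident enough," so treat both the interface and the threshold as assumptions.

```python
def decide_next(snapshot, goal, tiers, threshold=0.8):
    """Walk the tiers cheapest-first and stop at the first confident answer.

    `tiers` is an ordered list of (name, picker) pairs, where picker(snapshot,
    goal) returns (decision, confidence). Pinning to one mode is just passing
    a single-element list.
    """
    for name, picker in tiers:
        decision, confidence = picker(snapshot, goal)
        if decision is not None and confidence >= threshold:
            return name, decision
    return None, None  # nobody was confident; caller decides what happens next


if __name__ == "__main__":
    tiers = [
        ("deterministic", lambda snap, goal: (None, 0.0)),
        ("llm_text", lambda snap, goal: ({"op": "click", "ref": "a7"}, 0.9)),
        ("llm_vision", lambda snap, goal: ({"op": "click", "ref": "a7"}, 1.0)),
    ]
    print(decide_next({}, "add shampoo to cart", tiers))
```

Because the loop returns early, the expensive vision tier is never even called when a cheaper tier is confident.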

The Vision Fallback

So now you can see the accessibility forest for the accessibility trees and you're feeling pretty good about life, and then you open an app where the button you need simply isn't there. The button isn't hidden or mislabeled, it's just completely absent from the tree. This happens more than you'd think: someone drew the button directly to a Canvas, or they stuck it inside a WebView that's dressed up to look like native UI, or they built the whole thing in Compose and never bothered opting into test tags. The accessibility tree gives you back a perfectly valid response, which just doesn't contain the thing you actually came looking for.

This infuriating lack of consistency also shows up in subtler forms, like hidden nodes and off-screen RecyclerView rows that the tree technically knows about but that nobody can actually interact with since they haven't been laid out yet. The tree isn't wrong, exactly, it's just not telling you the whole story.

When this happens, the daemon punts to a vision model. I spent a while trying to figure out when exactly the fallback should kick in, and I deliberately kept the trigger fuzzy because, honestly, there's no clean heuristic for "the tree is lying to you," and it's more of a vibe than a rule. It fires when the top refs all look low-confidence, or when the goal names something that's clearly nowhere in the tree. It also fires when the last attempt failed in a "right idea, wrong target" kind of way, and that's the daemon's polite way of saying "I tapped the wrong thing." The agent can also just ask for vision directly if it already knows the tree is going to be useless for a particular screen. In any of those cases, the daemon grabs a screenshot and ships it off to a multimodal model to sort out.
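Since the real trigger is described as intentionally fuzzy, the sketch below should be read as one possible way to combine the four signals, not as the daemon's rule. The function name, the parameter names, the `"wrong_target"` failure label, and the 0.5 confidence cutoff are all invented for illustration.

```python
def should_use_vision(top_confidences, goal_in_tree, last_failure=None,
                      requested=False, min_confidence=0.5):
    """Combine the four signals from the text into a yes/no decision."""
    if requested:                       # the agent asked for vision directly
        return True
    if last_failure == "wrong_target":  # "right idea, wrong target" last time
        return True
    if not goal_in_tree:                # the goal names something not in the tree
        return True
    # All of the top candidate refs look low-confidence.
    return bool(top_confidences) and max(top_confidences) < min_confidence


if __name__ == "__main__":
    print(should_use_vision([0.9, 0.4], goal_in_tree=True))
    print(should_use_vision([0.2, 0.1], goal_in_tree=True))
```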

I went with UI-TARS for this, a small open model from ByteDance that does exactly one thing: it looks at an Android-ish screen and tells you where to tap. You might be wondering why I'm using a years-old model (though it was a pretty cutting-edge sleeper model at the time) when there are genuinely impressive multimodal models that can understand entire app flows and reason about what to do next. But by the time vision fires, the daemon has already figured out what the agent wants to do, so all I need is something that can point at where the thing actually is, and that's a way easier job than understanding the whole UI; while it's not winning any leaderboards, it doesn't need to.

~/.config/openclaw-android-waydroid/llm.json
{
  "default_provider": "openrouter",
  "default_model": "bytedance/ui-tars-1.5-7b",
  "providers": {
    "openrouter": {
      "base_url": "https://openrouter.ai/api/v1",
      "api": "openai-completions",
      "api_key_env": "OPENROUTER_API_KEY",
      "models": [
        {
          "id": "bytedance/ui-tars-1.5-7b",
          "name": "UI-TARS 1.5 (7B)",
          "input": ["text", "image"]
        }
      ]
    },
    "local": {
      "base_url": "http://127.0.0.1:8000/v1",
      "api": "openai-completions",
      "models": [{ "id": "ui-tars-1.5-7b", "input": ["text", "image"] }]
    }
  }
}
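
To make the shape of that file concrete, here's a minimal sketch of how a reader could resolve the default provider from it. This is my guess at a sensible loader, not the daemon's own: the function names are hypothetical, and the only things taken from the config above are its keys (`default_provider`, `default_model`, `providers`, `base_url`, `api_key_env`).

```python
import json
import os
from pathlib import Path

# Path from the config listing above.
CONFIG_PATH = Path.home() / ".config" / "openclaw-android-waydroid" / "llm.json"


def load_config(path: Path = CONFIG_PATH) -> dict:
    """Read and parse llm.json."""
    return json.loads(path.read_text(encoding="utf-8"))


def resolve_vision_model(config: dict, env=os.environ) -> dict:
    """Pick the default provider and model, and look up the API key if one is named."""
    provider = config["providers"][config["default_provider"]]
    key_env = provider.get("api_key_env")  # the "local" provider has none
    return {
        "base_url": provider["base_url"],
        "model": config["default_model"],
        "api_key": env.get(key_env) if key_env else None,
    }
```

The key itself never lives in the file, only the name of the environment variable that holds it, which is why the lookup takes an `env` mapping.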

For just three pennies a day (cue Sarah McLachlan's "In the Arms of the Agent"), your poor agent, struggling to find the "Add to Cart" button on a Compose screen with no resource IDs, can get the vision fallback it so desperately needs. UI-TARS at this size goes for fractions of a cent per call on OpenRouter. I've run agents doing a few hundred snapshots a day where vision fires on maybe a fifth of them, and the bill barely registers. Barely as in I had to scroll down on the billing page to even find it. You can also point the config at a local vLLM or Ollama endpoint if you'd rather not pay at all, and that's what I do for development.
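
If you want to sanity-check the three-pennies claim, the arithmetic is short. The 300 is my stand-in for "a few hundred", and the per-call price is the rough $0.0005 figure from the diagram below, so treat the result as a ballpark.

```python
def daily_vision_cost(snapshots_per_day: int, vision_rate: float, usd_per_call: float) -> float:
    """Estimated daily spend: snapshots x share that need vision x price per call."""
    return snapshots_per_day * vision_rate * usd_per_call


# 300 snapshots, vision on a fifth of them, about $0.0005 per call (all approximate).
estimate = daily_vision_cost(300, 0.2, 0.0005)
print(f"${estimate:.2f} per day")  # prints "$0.03 per day"
```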

What a vision call looks like
screenshot + refs in · ref or coords out
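[Figure: the input is a screenshot of the Amazon cart with the target and refs a1 through a8 overlaid. The model is UI-TARS (multimodal, 7B, trained on UI clicks, roughly $0.0005 per call). The common output, when the target matched a known ref, is {"decision": "click", "ref": "a5"}. The fallback output, when the target wasn't in the tree, is {"decision": "tap", "x": 540, "y": 1820}.]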
The daemon hands UI-TARS a screenshot with the current refs overlaid, plus the goal. The model answers with one of two shapes: a known ref (the common case, because the snapshot already contained the answer) or raw pixel coordinates (the fallback, when the target genuinely wasn't in the accessibility tree). The daemon normalizes both into one internal vocabulary.
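
A minimal sketch of that normalization step, assuming the two reply shapes shown in the figure. The `Action` type and the function name are hypothetical, and the daemon's real internal vocabulary may look nothing like this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    kind: str                  # always "tap" in this sketch
    ref: Optional[str] = None  # set when the model matched a known ref
    x: Optional[int] = None    # set when the model fell back to pixels
    y: Optional[int] = None


def normalize_vision_reply(reply: dict, known_refs: set) -> Action:
    """Turn either reply shape into one internal Action."""
    decision = reply.get("decision")
    if decision == "click":
        ref = reply.get("ref")
        # Refs are short-lived, so reject anything not in the current snapshot.
        if ref not in known_refs:
            raise ValueError(f"model returned unknown ref: {ref!r}")
        return Action(kind="tap", ref=ref)
    if decision == "tap":
        return Action(kind="tap", x=int(reply["x"]), y=int(reply["y"]))
    raise ValueError(f"unrecognized decision: {decision!r}")
```

Checking the ref against the current snapshot matters because a model can happily hallucinate an `a9` that was never on screen.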

Obviously, the vision fallback is a worst-case-scenario hack that I tearfully told Codex to implement after trying literally everything else. I'd love it if the bridge alone were enough, but the reality of Android in 2026 is that apps are full of Compose screens with no resource IDs and WebViews that are pretending to be native UI, and you can't just opt out of that reality because you'd prefer a cleaner architecture. I tried, and the agent would get stuck on real apps like Amazon while breezing through my well-behaved test apps. So a cheap vision model sits behind the bridge as a safety net, and now the system actually works on the apps people care about. If you never bother setting up a vision model, the system just doesn't use one, and skills won't ask for screenshots they can't do anything with. You'll hit the occasional dead end on tricky screens, but nothing breaks.

Hermes & OpenClaw

You could wrap the whole daemon in a single tool and call it a day, and one tool for all the actions sure does sound simple and nice. The problem is that an agent allowed to install apps should not necessarily be allowed to place an order inside one of those apps without someone explicitly saying "yes, go ahead," since those are very different levels of trust. So there are two tools registered with Hermes and OpenClaw instead of one, and that split is the whole point.

Under the hood the daemon is just a regular HTTP server. You could absolutely drive it from a shell script, and I did exactly that during early testing. It works fine for fire-and-forget commands, but a shell script has no way to hand the model structured tool descriptions, and it certainly can't enforce approval gates before dangerous stuff runs, so that's where the plugins come in.

These two tools are exposed in the Hermes and OpenClaw plugins that are packaged in the Clawdroid repo. android is the runtime surface, the one that handles the bread and butter of actually doing things on the phone: taking snapshots, performing actions, figuring out what to do next. Then there's android_admin, the one that installs APKs, wipes containers, and will generally ruin your afternoon if you call it wrong. Both of them are thin wrappers, really just forwarding JSON to POST /v1/agent/dispatch and POST /v1/admin/dispatch respectively.
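
Since both tools are just JSON forwarders, a wrapper can be sketched in a few lines of standard-library Python. The two routes come from the paragraph above; the daemon address and the payload shape are assumptions of mine, so check the repo for the real ones.

```python
import json
import urllib.request

DAEMON_URL = "http://127.0.0.1:8765"  # assumed address, not taken from the article

ROUTES = {
    "android": "/v1/agent/dispatch",
    "android_admin": "/v1/admin/dispatch",
}


def build_dispatch_request(tool: str, payload: dict, base_url: str = DAEMON_URL) -> urllib.request.Request:
    """Build the POST request for a tool call without sending it."""
    return urllib.request.Request(
        url=base_url + ROUTES[tool],
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def dispatch(tool: str, payload: dict, base_url: str = DAEMON_URL, timeout: float = 30.0) -> dict:
    """Send the request to a running daemon and decode its JSON reply."""
    request = build_dispatch_request(tool, payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a daemon running, something like `dispatch("android", {"action": "snapshot"})` would be the whole call, assuming the daemon accepts a payload of that shape.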

The two-tool split
android (runtime)
  • status
  • task_route
  • app_open / url_open
  • snapshot
  • act
  • decide_next
  • screenshot
  • wait
android_admin (opt-in)
  • install_apk · needs approval
  • store_install · needs approval
  • app_remove · needs approval
  • extras_install · needs approval
  • waydroid_start / waydroid_stop
  • recovery (soft / hard)
  • profile_switch
  • allowlist_update
The two-guard stack
install gate · then action gate
[Figure: an agent tool call (android.act, android_admin.install_apk, ...) passes through two gates before Android executes it. Gate 1, plugin approval, fires for host-mutating actions (install_apk, extras_install, recovery, profile_switch); with no approval the call is rejected. Gate 2, the protected-action guard, fires for device-mutating clicks matched by label ("place your order", "buy now", "submit order", "sign in"); without approval it returns needs_approval and the call is rejected.]
Two gates, in series, each at a different layer. The plugin gate catches anything that would mutate the host (install an APK, swap an extras pack, drop into recovery). The protected-action guard catches anything that would mutate the device in a real-money or real-identity way. An agent allowed through the first cannot, without further approval, get through the second.
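
The two gates in series can be sketched as one small function. This is an illustration under my own assumptions, not the plugin's code: the action names and labels are the examples from the figure, while the function signature, the substring match, and the return strings are made up.

```python
# Examples from the figure; the real lists may be longer.
HOST_MUTATING = {"install_apk", "extras_install", "recovery", "profile_switch"}
PROTECTED_LABELS = ("place your order", "buy now", "submit order", "sign in")


def check_gates(
    tool: str,
    action: str,
    target_label: str = "",
    plugin_approved: bool = False,
    action_approved: bool = False,
) -> str:
    """Run a tool call through both gates and say what should happen to it."""
    # Gate 1: plugin approval for anything that mutates the host.
    if tool == "android_admin" and action in HOST_MUTATING and not plugin_approved:
        return "rejected"

    # Gate 2: protected-action guard, matched on the label being tapped.
    label = target_label.casefold()
    if tool == "android" and action == "act":
        if any(phrase in label for phrase in PROTECTED_LABELS) and not action_approved:
            return "needs_approval"

    return "execute"
```

Note that `plugin_approved` does nothing for the second gate: being cleared to install an app never implies being cleared to buy something inside it.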
Live Demo

Sixty Seconds in the Amazon App

What you're looking at below is the whole interaction, start to finish. I gave the agent two instructions, each a single sentence, with zero additional context. The first one asks it to pull up CNN in the Android browser and tell me what's in the news, and the second one asks it to go buy shampoo on Amazon.

hermes -t clawdroid

The shampoo it picked was Whole Blends Honey Treasures Repairing Shampoo, three fluid ounces, $1.67. I used to use this exact brand in college back when I had hair, and I am choosing to read that coincidence as karmic. The agent added it to the cart and then, in true agent fashion, helpfully read my full home address back to me on camera. I had left my real Amazon account logged in on the emulator because of course I had, so that was fun.

After this little incident I went and built a guard for it. The agent can't hit "Place your order" on its own anymore; it has to stop and check with you first, because no, I do not want an AI impulse-buying on my behalf. Having my address show up on camera was embarrassing enough, but if it had actually checked out too, I'd be sitting here trying to explain to myself why I own a $1.67 travel shampoo I never ordered.

The Verdict

You're probably up for the challenge because you made it to the end of this article, but setup is kind of a PITA. You're looking at Waydroid, ARM translation, a custom accessibility APK, a daemon, and a vision model, all of which need to be working at the same time before the thing does anything useful. That sounds like a lot, and it is, but it's a one-time cost. You install it, you run the doctor script, and if something is broken the script tells you which piece it is. Once you're past that initial hill, what you get is an agent that has access to Amazon, Uber Eats, Instacart, your bank app, CNN, pretty much every app you actually use. Compare that to the open web, where half the interesting stuff has been CAPTCHA'd behind walls that browser automation just bounces off of.

I built this for the dumb errands, the stuff that isn't hard, just tedious: watching an Instacart cart for a price drop on the olive oil my wife likes, reading CNN's headlines every morning so I don't have to, filling an Amazon cart and stopping right before checkout so I can review it later, DM'ing someone on a platform that actively hates Playwright and will flag you the second you try. None of that had a clean path before (and I looked). Through Android it does, though, and the reason is almost stupidly simple: the OS already vouches for you. As far as the app is concerned you're just another smartphone-addicted member of the general public, scrolling around.

This probably won't last forever; eventually one of the apps is going to figure out how to lock me out. There are blockers like Play Integrity and attestation and secure-flag windows just sitting right there waiting for someone at one of these companies to flip the switch, but the window is open right now, and I intend to use it for as long as I can. (RIP GitHub Copilot agent header unlimited token unlock.)

For now, though, it's probably the least-janky way I've found to get an agent doing the actual mundane stuff that eats up your afternoon, and whether you should run it yourself is your call because there are definitely rough edges. The repo is on GitHub with the full README, the doctor script, and the smoke tests. There's also a docs folder that covers headless server setup if you're the kind of person who wants to run this in a tmux session on a box you SSH into, which, honestly, is how I run it most of the time. If you try it, let me know what breaks, and if it actually works for you, I'd love to hear what you used it for.

— Jeff