韩师傅就是我

Test Development Engineer

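Why a Large Language Model (LLM) Can Control a Phone

Preface

DroidRun, a project that has been getting a lot of attention lately, was recently open-sourced. Its highlight is controlling phone apps in natural language through an LLM, roughly like this: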

Prompt: “Go to Amazon, search for headphones and write the top 3 products to my colleague on WhatsApp.”

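I previously wrote an article analyzing why a Python script can operate a phone. Today we use DroidRun to look at how an LLM controls one.

The ReAct Paradigm

To get large language models (LLMs) to produce increasingly reliable answers in increasingly complex scenarios, people have put a lot of effort into how instructions are given to the model. These techniques gradually grew into a discipline called prompt engineering. ReAct (Reasoning and Acting), proposed around 2022, is one prompt-engineering paradigm. Its core idea is to let the model "learn by doing": it reasons and acts in alternation. The ReAct workflow is a loop of think, act, and obs (observe), which is why programs built on ReAct almost always contain these three components. The figure below is taken from the ReAct paper.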

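Back to DroidRun

DroidRun is one implementation of the ReAct paradigm: it completes a task through a loop of Reasoning and Acting.

The core component of DroidRun is a ReAct-driven AI agent (the ReAct Agent). The ReAct Agent passes the user's natural-language input to the LLM API as a user prompt, then uses the model's reasoning output to operate the phone. The steps are roughly as follows.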

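Step 1. The user tells the ReAct Agent the goal in natural language (ReAct is short for Reasoning and Acting).

Step 2. The ReAct Agent sends the goal, the history, and the available tools to the LLM and asks it to reason about how to reach the goal.

Step 3. After reasoning, the LLM tells the ReAct Agent that to reach goal A it must first take action B.

Step 4. The ReAct Agent executes action B and saves both that round's reasoning and the result of action B to the history.

Step 5. The ReAct Agent sends the goal (unchanged), the history (updated), and the available tools (unchanged) to the LLM again, and the LLM gives the next instruction.

This repeats until, after some action finishes, the LLM analyzes the result and finds that the goal has been reached. At that point the whole process ends. (A real reasoning trace is shown in the hands-on part at the end of this article.)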
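The five steps can be sketched as a small loop. This is only an illustration under stated assumptions, not DroidRun's actual code: `run_react`, `scripted_reasoner`, the `complete` action name, and the two toy tools are all hypothetical, and the "LLM" is replaced by a scripted function so the example runs offline.

```python
# Hypothetical sketch of the think -> act -> observe loop.
# None of these names come from DroidRun itself.

def run_react(goal, reasoner, tools, max_steps=10):
    """Ask the reasoner for a step, run it, record it, and repeat."""
    history = []
    for _ in range(max_steps):
        # Steps 2 and 5: send goal, history, and tool names to the model.
        step = reasoner(goal, history, sorted(tools))
        if step["action"] == "complete":  # assumed "goal reached" signal
            history.append({"thought": step["thought"], "action": "complete"})
            break
        # Step 4: execute the action, then store the thought and the result.
        observation = tools[step["action"]](**step["parameters"])
        history.append({
            "thought": step["thought"],
            "action": step["action"],
            "observation": observation,
        })
    return history


def scripted_reasoner(goal, history, tool_names):
    """Stand-in for the LLM: chooses a step from how many steps have run."""
    if not history:
        return {"thought": "Look at the screen first.",
                "action": "get_clickables", "parameters": {}}
    if len(history) == 1:
        return {"thought": "Tap the target.",
                "action": "tap", "parameters": {"x": 119, "y": 401}}
    return {"thought": "Goal reached.", "action": "complete", "parameters": {}}


# Toy tools that return canned values instead of touching a device.
tools = {
    "get_clickables": lambda: [{"text": "Login", "bounds": "39,302,199,501"}],
    "tap": lambda x, y: f"tapped {x},{y}",
}

history = run_react("Tap the login button", scripted_reasoner, tools)
for entry in history:
    print(entry["action"])
```

The `max_steps` cap is a design choice added here so that a model that never reports completion cannot loop forever.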

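DroidRun operates Android phones through adb commands

Android is an open operating system. It exposes adb, which provides a set of commands that operate the phone by screen coordinates. (Apple provides no equivalent, so to keep the environment lightweight and the architecture simple, many automation tools support Android first.) For example, the adb command to tap a point on the screen is: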

adb shell input tap x y

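The advantage of adb is that it is full-featured and very lightweight: it ships with the Android SDK and needs no extra dependencies. The drawback is that it can only operate the phone by screen coordinates, which are hard to obtain and fragile (on a different phone the coordinates will most likely change).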
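As a rough sketch of how a Python program might issue the tap command above, the helper below builds the argument list and, by default, only prints it. The function names `tap_command` and `run` are made up for this example. It assumes `adb` is on the PATH when `dry_run` is turned off.

```python
import shlex
import subprocess


def tap_command(x, y, serial=None):
    """Build the argv for `adb shell input tap x y`."""
    cmd = ["adb"]
    if serial:
        cmd += ["-s", serial]  # -s selects one device when several are attached
    cmd += ["shell", "input", "tap", str(x), str(y)]
    return cmd


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


run(tap_command(360, 1500))
```

Keeping command construction separate from execution makes the coordinate logic testable without a phone attached.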

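How DroidRun gets the screen coordinates of UI elements

As noted above, adb commands can only operate the phone by screen coordinates. So when the user enters "tap the Login (登录) button", how does DroidRun find the button's coordinates?

It ships a companion app (com.droidrun.portal) that is installed on the phone, and it talks to this portal app by sending broadcasts through adb. For example, adb shell am broadcast -a com.droidrun.portal.GET_ELEMENTS retrieves the clickable elements on the current page together with their attributes, and those attributes include each element's coordinates. An actual result looks like this: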

[{'text': '登录', 'className': 'TextView', 'index': 1, 'bounds': '39,302,199,501', 'resourceId': '', 'type': 'clickable', 'isParent': True}]
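To turn such an element into a tap target, one plausible approach is to tap the center of its bounds. The sketch below assumes the `bounds` string is ordered left,top,right,bottom, which the source does not state explicitly. `center_of` and `find_by_text` are hypothetical helpers, not DroidRun functions.

```python
def center_of(bounds):
    """Return the integer center of a 'left,top,right,bottom' bounds string."""
    left, top, right, bottom = (int(v) for v in bounds.split(","))
    return (left + right) // 2, (top + bottom) // 2


def find_by_text(elements, text):
    """Return the first element whose text matches exactly, or None."""
    return next((e for e in elements if e.get("text") == text), None)


# The element list shown above, used here as sample data.
elements = [{"text": "登录", "className": "TextView", "index": 1,
             "bounds": "39,302,199,501", "resourceId": "",
             "type": "clickable", "isParent": True}]

target = find_by_text(elements, "登录")
x, y = center_of(target["bounds"])
print(x, y)  # 119 401
```

The resulting pair could then be passed to adb shell input tap.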

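With the high-level view covered, here are some technical details.

ReActAgent to LLMReasoner

  • The figure below shows the reasoning process (note LLMReasoner's input parameters and the data it returns)
  • The LLM Provider API is the API of a specific model, such as the DeepSeek API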
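The replies in the appendix arrive as JSON wrapped in a json code fence, with thought, action, and parameters keys. A minimal sketch of how such a reply could be parsed is shown below. `parse_step` is a hypothetical helper and not necessarily how LLMReasoner does it.

```python
import json
import re

TICKS = "`" * 3  # a code-fence marker, built this way to keep the example readable
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)\s*" + TICKS, re.DOTALL)


def parse_step(raw):
    """Extract the thought/action/parameters dict from a model reply."""
    match = FENCE.search(raw)
    payload = match.group(1) if match else raw  # tolerate unfenced JSON too
    step = json.loads(payload)
    missing = {"thought", "action"} - set(step)
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    step.setdefault("parameters", {})
    return step


reply = (TICKS + "json\n"
         '{"thought": "Check the screen.", "action": "get_clickables", "parameters": {}}\n'
         + TICKS)
step = parse_step(reply)
print(step["action"])
```

Validating the keys up front lets the agent reject a malformed reply before it tries to run a tool.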

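How ReActAgent calls tools and stores the results

How the page layout is obtained through the portal app

Finally, what other ways are there to operate a phone besides adb commands?

This article covers how to operate Android and Apple phones with native and derived tools: 《不用手指操作手机的N中方法》 (N Ways to Operate a Phone Without Your Fingers).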
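One common way to wire this up, sketched here under assumptions rather than taken from DroidRun's source, is a name-to-function registry. The agent looks up the tool named in the model's reply, calls it with the given parameters, and appends the outcome to the history, including failures so the model can react to them. `execute`, `registry`, and the toy `swipe` tool are hypothetical.

```python
def execute(step, registry, history):
    """Run the tool named by step["action"] and append the outcome to history."""
    name, params = step["action"], step.get("parameters", {})
    tool = registry.get(name)
    if tool is None:
        result = {"ok": False, "error": f"unknown tool: {name}"}
    else:
        try:
            result = {"ok": True, "value": tool(**params)}
        except Exception as exc:  # failures become observations for the model
            result = {"ok": False, "error": str(exc)}
    history.append({"thought": step.get("thought", ""), "action": name,
                    "parameters": params, "result": result})
    return result


# Toy tool that returns a canned value instead of touching a device.
registry = {
    "swipe": lambda start_x, start_y, end_x, end_y, duration_ms=300: "swiped",
}

history = []
execute({"thought": "Open the app drawer.", "action": "swipe",
         "parameters": {"start_x": 360, "start_y": 1500,
                        "end_x": 360, "end_y": 800, "duration_ms": 300}},
        registry, history)
execute({"thought": "Try something else.", "action": "fly", "parameters": {}},
        registry, history)
print([entry["result"]["ok"] for entry in history])
```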

Appendix

When you ask DroidRun to open the calculator app on the phone using DeepSeek R1, the actual reasoning process looks like this.

PC:~/python_venv/droidrun/bin$ droidrun "Open the calculator app" --provider deepseek

INFO:droidrun:=================================
===== Debug Response =====
```json
{
  "thought": "I need to determine the current screen state to locate the Calculator app. First, I'll use get_clickables to retrieve all interactive UI elements. If there's an element labeled 'Calculator' or with a package name like com.android.calculator2, I can tap it directly. If not, I may need to open the app drawer or use start_app with the correct package name.",
  "action": "get_clickables",
  "parameters": {}
}
```
=================================

INFO:droidrun:=================================
===== Debug Response =====
```json
{
  "thought": "The current screen shows the home screen with several apps like 'SoloPi', 'Appium Settings', etc., but the Calculator app is not visible. To open the Calculator, I need to access the app drawer. Since there is no direct 'Apps' or 'Menu' button visible, a common way to open the app drawer is by swiping up from the bottom middle of the screen. This should reveal all installed apps, including the Calculator.",
  "action": "swipe",
  "parameters": {
    "start_x": 360,
    "start_y": 1500,
    "end_x": 360,
    "end_y": 800,
    "duration_ms": 300
  }
}
```
=================================

INFO:droidrun:=================================
===== Debug Response =====
```json
{
  "thought": "After swiping up to open the app drawer, I need to check if the Calculator app is now visible. I'll use get_clickables again to inspect the current UI elements. If Calculator appears in the list, I'll tap it. If not, I may need to scroll further or check different app categories.",
  "action": "get_clickables",
  "parameters": {}
}
```
=================================

INFO:droidrun:=================================
===== Debug Response =====
```json
{
  "thought": "After swiping up, the app drawer does not seem to have opened as the same UI elements are visible. The Calculator app is still not present. Since swiping up did not reveal additional apps, an alternative approach is needed. I'll check installed packages via list_packages to find the Calculator's package name, then launch it directly via start_app.",
  "action": "list_packages",
  "parameters": {
    "include_system_apps": true
  }
}
```
=================================

INFO:droidrun:=================================
===== Debug Response =====
```json
{
  "thought": "The app drawer swipe attempt didn't reveal Calculator, but list_packages shows 'com.android.bbkcalculator' which appears to be the calculator app. Since direct navigation failed, I'll launch it directly using its package name via start_app.",
  "action": "start_app",
  "parameters": {
    "package": "com.android.bbkcalculator"
  }
}
```
=================================
