The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
Siyuan Hu, Mingyu Ouyang, Difei Gao, Mike Zheng Shou
Show Lab, National University of Singapore
Abstract
The recently released model, Claude 3.5 Computer Use, stands out
as the first frontier AI model to offer computer use in public beta as a
graphical user interface (GUI) agent. As an early beta, its capability
in the real-world complex environment remains unknown. In this case
study to explore Claude 3.5 Computer Use, we curate and organize a
collection of carefully designed tasks spanning a variety of domains
and software. Observations from these cases demonstrate Claude 3.5
Computer Use’s unprecedented ability in end-to-end language to desk-
top actions. Along with this study, we provide an out-of-the-box agent
framework for deploying API-based GUI automation models with easy
implementation. Our case studies aim to showcase a groundwork of
capabilities and limitations of Claude 3.5 Computer Use with detailed
analyses and bring to the fore questions about planning, action, and
critic which must be considered for future improvement. We hope this
preliminary exploration will inspire future research into the GUI agent
community. All the test cases in the paper can be tried through the
project: https://github.com/showlab/computer_use_ootb.
Figure 1: Overview of representative evaluation tasks (left), categorized by Web Search,
Productivity, Workflow, and Entertainment. Our Computer Use Out-of-the-Box framework
(right) provides an easy implementation to execute these tasks in the user’s OS.
Preprint. Under review.
1
Introduction
Automating desktop tasks has become an increasingly popular area of research, driven
by the need to enhance users’ productivity and accessibility across various application
environments. From web navigation to professional software and even video games,
users frequently encounter repetitive tasks that could benefit from automation. While
large language models like GPT-4 and Qwen-2-VL have demonstrated their potential for
automating tasks through general GUI interaction, their capabilities still fall well short
of what practical desktop task automation requires.
Recent studies in GUI automation agents have leveraged general-purpose LLMs to
interact with graphical user interfaces (GUIs) by understanding the GUI state and
generating actions. However, the release of Claude 3.5 Computer Use by Anthropic
marks a significant advancement in this domain, introducing the first frontier AI model
to offer computer use in public beta. Unlike previous models, Claude 3.5 Computer
Use offers an end-to-end solution through API calls: actions are generated from the user
instruction and the purely visual observed GUI state, without requiring further external
knowledge such as a reference plan or GUI parsing.
Despite this advancement, the community needs a comprehensive analysis that evaluates
the performance of API-based GUI automation models in depth. To take the first steps
to explore the capacities and limitations of such models, we propose a comprehensive
case study based on real-world desktop environments, encompassing a diverse range
of software domains, including web navigation, professional tools, and games. The
selected cases are designed to reflect the needs of various user groups, ensuring that
the evaluation covers a broad spectrum of desktop automation tasks.
To isolate specific aspects of the model’s capability, we evaluate the performance of
API-based GUI automation models rigorously across three dimensions:
• Planning: Assessing the model’s ability to generate an executable plan from the
user’s query. The plan should have a correct flow, allowing the overall successful
operations of the software, with each step being clear and executable.
• Action: Evaluating whether the model can accurately ground the interactable GUI
elements and execute the action step-by-step from the derived plan.
• Critic: Measuring the model’s awareness of the changing environment, includ-
ing its ability to adapt to the outcomes of its actions, such as retrying tasks if
unsuccessful or terminating execution when the task is completed.
To our best knowledge, this is the first comprehensive case study on Claude 3.5
Computer Use and API-based GUI automation models. We hope that our research
provides the community with valuable insights into the capacities and limitations of
these models. Our case study aims to lay the foundation for the continued exploration and
benchmarking of API-based GUI automation. Additionally, to help the community explore
and benchmark the newly released model, we also release an out-of-the-box universal
framework, namely Computer Use OOTB, which provides a seamless solution for users
and researchers to deploy these models in local environments without complex setup or
configuration, improving the accessibility of GUI automation research.
Our contributions in this report are summarized as follows.
• We present a comprehensive case study for Claude 3.5 Computer Use on desktop
task automation, covering domains such as web search, professional software, and
games, designed to reflect the needs of various user groups.
• We introduce an out-of-the-box, cross-platform agent framework for deploying API-
based GUI automation models, offering a universal solution for easy implementation
and benchmarking.
• We conduct extensive human evaluations and provide in-depth analyses, demon-
strating both the advancements and limitations of the newly released API-based
GUI automation model.
2
Related Work
Large Vision-Language Models Recent research has invested tremendous effort in
constructing LVLMs capable of jointly processing image and text [1, 2, 3, 4], integrating
vision encoders with LLMs through connecting layers, inheriting LLMs’ linguistic and
reasoning skills to perform vision-language tasks. A series of studies focused on ground-
ing with LVLMs [5, 6, 7], such as providing bounding boxes for objects when generating
responses [8, 9].
GUI Agents Autonomous agents powered by large language models (LLMs), referred
to as language agents [10, 11], have gained significant attention due to their interactive
capabilities [12, 13, 14, 15]. Recent efforts have enabled these agents to interact with
operating systems through programs [16] or API calls [17, 18]. However, the closed-
source nature of most commercial software imposes significant limitations, as agents
often lack access to internal APIs or code. Consequently, research has shifted toward GUI-
based agents that interact with digital devices through human-like mouse and keyboard
actions [19, 20, 21]. Models like WebGPT [22], Agent-Lumos [23], CogAgent [20],
AutoWebGLM [24], Auto-GUI [25], AppAgent [26], ScreenAgent [27], and AssistGUI [28]
have demonstrated improved performance across various tasks, expanding from web
navigation to general GUI automation.
To enhance the effectiveness of these GUI agents, researchers have focused on devel-
oping systems that can interpret human intentions and predict actions in the form of
function calls [29, 30, 31, 32]. Nonetheless, progress is hindered by the limited quantity
and vast diversity of available agent data [33, 34]. Specifically, GUI agents remain
underexplored, with only a few attempts made to train models that effectively ground
GUI interactions [19, 20, 35].
Additionally, SearchAgent [36] introduces an inference-time search algorithm to enhance
multi-step reasoning and planning in interactive web environments. Collectively, these
advancements contribute to the development of more sophisticated and capable GUI
agents, pushing the boundaries of automated task completion across various digital
platforms.
3
Claude Computer Use Revealed
To establish a robust and in-depth analysis of Claude’s Computer Use, we will thoroughly
explore the model design and present a framework for the community to replicate. Our
analysis will draw on various perspectives, emphasizing both the underlying model and
its tools.
3.1
Model Design
The main task of Claude Computer Use can be formulated as follows: given a user
instruction X_instr in natural language, the agent must complete a series of actions on
the desktop to fulfill the instruction. The agent-environment interaction from the initial
to the final state spans multiple steps. At each time step t, the agent observes the GUI
state I_t, decides the next action from its action space, and performs the action with the
corresponding tool; afterwards, the model reflects on the action outcome to inform its
future planning. Below, we delve into the detailed design of Claude Computer Use.
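To make this formulation concrete, the per-step loop can be sketched as follows. This is a minimal illustrative sketch: `StubModel`, `run_episode`, and all function names are our own stand-ins for exposition, not Anthropic's actual interface.

```python
# Illustrative sketch of the observe-act-reflect loop described above.
# StubModel stands in for the real Claude 3.5 Sonnet API; every name here
# is an assumption made for illustration, not Anthropic's interface.

def run_episode(model, instruction, observe, execute, max_steps=10):
    """Run one episode: observe the GUI state, pick an action, act, reflect."""
    history = []                      # retained (state, action, outcome) triples
    for t in range(max_steps):
        state = observe()             # e.g. a screenshot at step t
        action = model.decide(instruction, state, history)
        if action["type"] == "done":  # model judges the task complete
            return history
        outcome = execute(action)     # perform the action with a tool
        history.append((state, action, outcome))  # reflect for future planning
    return history

class StubModel:
    """Scripted stand-in that clicks once, then declares the task done."""
    def __init__(self):
        self.step = 0

    def decide(self, instruction, state, history):
        self.step += 1
        if self.step == 1:
            return {"type": "left_click", "coordinate": [100, 150]}
        return {"type": "done"}
```

A scripted stub like this is also how a framework can be smoke-tested without driving a real desktop.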
3.1.1
System Prompt
Below is the system prompt of Claude Computer Use, where environment-specific
variables will be denoted in full capital letters and enclosed in square brackets.
System Prompt
System Overview:
You have access to a set of functions that allow you to interact with a
sandboxed computing environment.
You do NOT have access to external resources, except through the functions
provided below.
You can invoke one or more functions by writing a <antml:function_calls>
block like this:
<antml:function_calls>
<antml:invoke name="$FUNCTION_NAME">
<antml:parameter
name="$PARAMETER_NAME">$PARAMETER_VALUE</antml:parameter>
...
</antml:invoke>
<antml:invoke name="$FUNCTION_NAME2">
...
</antml:invoke>
</antml:function_calls>
String and scalar parameters should be passed as is. Lists and objects
should be passed in JSON format.
The output or any errors will appear in a subsequent <function_results>
block. You can then respond to the user based on the results or make
further function calls.
If a <function_results> block does NOT appear, your function call was
likely malformatted.
Available Functions:
1. Computer Interaction (GUI):
- Description:
Use a mouse and keyboard to interact with the computer and take
screenshots.
You can only interact with the desktop GUI (no terminal or
application menu access).
- Actions include:
- key: Press a key or key-combination.
- type: Type a string of text.
- mouse_move: Move the cursor to specified coordinates.
- left_click, right_click, middle_click, double_click: Perform mouse
clicks.
- left_click_drag: Click and drag the cursor.
- screenshot: Take a screenshot of the screen.
- Important Notes:
- The screen resolution is [SCREEN_RESOLUTION, e.g., 1024x768].
- Always check the coordinates of elements via screenshots before
moving the cursor.
- If a click fails, adjust your cursor position and retry.
- Parameters:
- action (required): The action to perform, such as key, type,
mouse_move, etc.
- coordinate: The (x, y) coordinates for mouse-related actions.
- text: The text to type or key to press for type and key actions.
2. Bash Shell Commands:
- Description: Run commands in a bash shell.
- Parameters:
- command (required): The bash command to run.
- restart: If true, restarts the tool.
3. File Editing Tool:
- Description: View, create, and edit files.
- Commands:
- view: Displays a file or lists directory contents.
- create: Creates a new file (fails if the file already exists).
- str_replace: Replaces a specific string in a file.
- insert: Inserts a string after a specified line.
- undo_edit: Reverts the last edit made to the file.
- Parameters:
- path (required): The absolute path to the file or directory.
- file_text: The content for creating a file.
- new_str, old_str: Strings for replacing or inserting content.
- insert_line: Line number for inserting content.
- view_range: Specify a range of lines to view.
System Capabilities:
You are using an Ubuntu virtual machine with aarch64 architecture.
You can install applications using apt or pip.
Firefox is installed (use the firefox-esr version).
GUI applications can be started from the bash shell using DISPLAY=:1.
The current date is [DATETIME, e.g., Wednesday, October 23, 2024].
Important Notes:
- Firefox Wizard: If the startup wizard appears, ignore it. Do not click
"skip this step." Instead, click on the address bar and enter the
appropriate URL or search term.
- PDF Handling: If a PDF appears, it may be better to download it using
curl and convert it to text using pdftotext for easier reading.
Summary of How to Use the Tools:
- Function Invocation: To interact with the environment, use the
<antml:function_calls> block.
- Error Handling: If no <function_results> appear, check for malformatted
calls.
- Multiple Calls: Where possible, chain multiple function calls to
optimize workflow.
3.1.2
State Observation
Claude Computer Use observes the environment solely through visual information
obtained from real-time screenshots, without relying on metadata or HTML. These
screenshots are captured during task operation, enabling the model to effectively imitate
human desktop interactions. This capability is crucial for adapting to the highly dynamic
nature of the GUI environment. By embracing the "vision-only" approach, Claude
Computer Use achieves general computer use without relying on software APIs to
perceive the environmental information, particularly for closed-source software.
3.1.3
Reasoning Paradigm
Claude Computer Use employs a reasoning-acting paradigm for its reasoning process,
generating more reliable actions in the highly dynamic GUI environment. Similar to
traditional ReAct [37], Claude Computer Use observes the environment before deciding
on an action, ensuring that the action is appropriate for the current GUI state. Fur-
thermore, Claude Computer Use exhibits the capacity to efficiently identify when user
requirements are fulfilled, enabling it to take decisive actions without engaging in unnec-
essary steps. Interestingly, beyond traditional ReAct paradigm, which typically involves
continuous observation of the environment at each step, Claude Computer Use adopts
a more selective observation strategy. It monitors the GUI state only when necessary,
according to its reasoning. This approach effectively reduces costs and accelerates the
overall process by avoiding superfluous observations.
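The cost difference between the two observation policies can be illustrated with a small sketch. The plan and its `needs_check` flags below are hypothetical examples of our own, not traces from Claude.

```python
# Minimal sketch contrasting the two observation policies described above.
# Each plan entry is (action, needs_check): the selective policy takes a
# screenshot only when needs_check is True, while strict ReAct observes
# before every step. The plan itself is a hypothetical example.

def observation_cost(plan, selective=True):
    """Return how many screenshots each policy would take for a plan."""
    if selective:
        return sum(1 for _, needs_check in plan if needs_check)
    return len(plan)  # strict ReAct: one observation per step

plan = [("left_click", True),    # opening a page: verify it loaded
        ("type", False),         # typing into a focused box: no check needed
        ("key Enter", False),
        ("left_click", True)]    # final click: verify the outcome
```

On this four-step plan the selective policy takes half as many screenshots, which is the cost and latency saving the paragraph above describes.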
3.1.4
Tool Use
Currently, Claude Computer Use is provided with three Anthropic-defined tools: Com-
puter Tools, Text Editor Tools, and Bash Tools. Below are detailed descriptions of
each tool:
Computer Tools. Computer tools help Claude Computer Use operate a mouse and
keyboard to interact with a computer, and take screenshots.
Below is the description of Computer Tools:
• This is an interface to a desktop GUI. You do not have access to a terminal or
applications menu. You must click on desktop icons to start applications.
• Some applications may take time to start or process actions, so you may need to
wait and take successive screenshots to see the results of your actions. E.g. if
you click on Firefox and a window doesn’t open, try taking another screenshot.
• The screen’s resolution is {display_width_px}x{display_height_px}.
• The display number is {display_number}.
• Whenever you intend to move the cursor to click on an element like an icon, you
should consult a screenshot to determine the coordinates of the element before
moving the cursor.
• If you tried clicking on a program or link but it failed to load, even after waiting,
try adjusting your cursor position so that the tip of the cursor visually falls on
the element that you want to click.
• Make sure to click any buttons, links, icons, etc. with the cursor tip in the center
of the element. Don’t click boxes on their edges unless asked.
Below is the tool schema of Computer Tools:
Computer Tool Schema

{
    "properties": {
        "action": {
            "description": """The action to perform. The available actions are:
                * `key`: Press a key or key-combination on the keyboard.
                    - This supports xdotool's `key` syntax.
                    - Examples: "a", "Return", "alt+Tab", "ctrl+s", "Up",
                      "KP_0" (for the numpad 0 key).
                * `type`: Type a string of text on the keyboard.
                * `cursor_position`: Get the current (x, y) pixel coordinate
                  of the cursor on the screen.
                * `mouse_move`: Move the cursor to a specified (x, y) pixel
                  coordinate on the screen.
                * `left_click`: Click the left mouse button.
                * `left_click_drag`: Click and drag the cursor to a specified
                  (x, y) pixel coordinate on the screen.
                * `right_click`: Click the right mouse button.
                * `middle_click`: Click the middle mouse button.
                * `double_click`: Double-click the left mouse button.
                * `screenshot`: Take a screenshot of the screen.""",
            "enum": ["key", "type", "mouse_move", "left_click",
                     "left_click_drag", "right_click", "middle_click",
                     "double_click", "screenshot", "cursor_position"],
            "type": "string"
        },
        "coordinate": {
            "description": "(x, y): The x (pixels from the left edge) and
                y (pixels from the top edge) coordinates to move the mouse
                to. Required only by `action=mouse_move` and
                `action=left_click_drag`.",
            "type": "array"
        },
        "text": {
            "description": "Required only by `action=type` and
                `action=key`.",
            "type": "string"
        }
    },
    "required": ["action"],
    "type": "object"
}
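To illustrate how a client might check a model-emitted action against this schema before dispatching it, here is a minimal hand-rolled validator. This is our own sketch for exposition; the Anthropic API performs its own validation.

```python
# Minimal validator for actions against the Computer Tool schema above.
# Hand-rolled sketch for illustration only; not Anthropic's implementation.

ACTIONS = {"key", "type", "mouse_move", "left_click", "left_click_drag",
           "right_click", "middle_click", "double_click", "screenshot",
           "cursor_position"}

def validate_action(call):
    """Return a list of schema violations for one tool call (empty if valid)."""
    errors = []
    action = call.get("action")
    if action not in ACTIONS:
        errors.append(f"unknown action: {action!r}")
    # coordinate is required only by mouse_move and left_click_drag
    if action in {"mouse_move", "left_click_drag"} and "coordinate" not in call:
        errors.append(f"{action} requires a coordinate")
    # text is required only by type and key
    if action in {"type", "key"} and "text" not in call:
        errors.append(f"{action} requires text")
    return errors
```

Rejecting malformed calls client-side avoids wasting a round trip on an action the tool would refuse anyway.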
Editor Tools. Editor tools provide Claude Computer Use with a custom editing tool
for viewing, creating, and editing files.
Below is the description of Editor Tools:
• State is persistent across command calls and discussions with the user.
• If path is a file, view displays the result of applying cat -n. If path is a directory,
view lists non-hidden files and directories up to 2 levels deep.
• The create command cannot be used if the specified path already exists as a
file.
• If a command generates a long output, it will be truncated and marked with
<response clipped>.
• The undo_edit command will revert the last edit made to the file at path.
Notes for using the str_replace command:
• The old_str parameter should match EXACTLY one or more consecutive lines
from the original file. Be mindful of whitespaces!
• If the old_str parameter is not unique in the file, the replacement will not be
performed. Make sure to include enough context in old_str to make it unique.
• The new_str parameter should contain the edited lines that should replace the
old_str.
Below is the tool schema of Editor Tools:
Editor Tool Schema
{
    "properties": {
        "command": {
            "description": "The commands to run. Allowed options are:
                `view`, `create`, `str_replace`, `insert`, `undo_edit`.",
            "enum": ["view", "create", "str_replace", "insert",
                     "undo_edit"],
            "type": "string"
        },
        "file_text": {
            "description": "Required parameter of `create` command, with
                the content of the file to be created.",
            "type": "string"
        },
        "insert_line": {
            "description": "Required parameter of `insert` command. The
                `new_str` will be inserted AFTER the line `insert_line` of
                `path`.",
            "type": "integer"
        },
        "new_str": {
            "description": "Optional parameter of `str_replace` command
                containing the new string (if not given, no string will be
                added). Required parameter of `insert` command containing
                the string to insert.",
            "type": "string"
        },
        "old_str": {
            "description": "Required parameter of `str_replace` command
                containing the string in `path` to replace.",
            "type": "string"
        },
        "path": {
            "description": "Absolute path to file or directory, e.g.
                `/repo/file.py` or `/repo`.",
            "type": "string"
        },
        "view_range": {
            "description": "Optional parameter of `view` command when
                `path` points to a file. If none is given, the full file
                is shown. If provided, the file will be shown in the
                indicated line number range, e.g. [11, 12] will show lines
                11 and 12. Indexing at 1 to start. Setting `[start_line,
                -1]` shows all lines from `start_line` to the end of the
                file.",
            "items": {"type": "integer"},
            "type": "array"
        }
    },
    "required": ["command", "path"],
    "type": "object"
}
Bash Tools. Bash tools let Claude Computer Use run commands in a bash shell.
Below is the description of Bash Tools:
• When invoking this tool, the contents of the command parameter does NOT need
to be XML-escaped.
• You have access to a mirror of common Linux and Python packages via apt and
pip.
• State is persistent across command calls and discussions with the user.
• To inspect a particular line range of a file, e.g. lines 10-25, try sed -n 10,25p
/path/to/the/file.
• Please avoid commands that may produce a very large amount of output.
• Please run long-lived commands in the background, e.g. sleep 10 & or start a
server in the background.
Below is the tool schema of Bash Tools:
Bash Tool Schema
{
    "properties": {
        "command": {
            "description": "The bash command to run. Required unless the
                tool is being restarted.",
            "type": "string"
        },
        "restart": {
            "description": "Specifying true will restart this tool.
                Otherwise, leave this unspecified.",
            "type": "boolean"
        }
    }
}
3.1.5
GUI Action Space
The GUI action space of Claude Computer Use consists of all the raw mouse and keyboard
actions, including mouse-move, left-click, right-click, middle-click, double-click, drag,
type, keystrokes, and combinations of keys for shortcuts, among others. Coordinate-
related operations also specify the target position in the pixel space of the observed
screenshot. Therefore, an action can be denoted by the syntax action_type(arguments).
Here are some examples of actions that are supported in our case study:
• Mouse Movement: Move the mouse cursor to a specific position on the screen.
Example: mouse_move(100, 150)
• Mouse Clicks: Perform mouse clicks at a specified location.
Example: left_click()
• Typing and Sending Keystrokes: Simulate typing text or pressing keys.
Example: type(’Hello, world!’)
• Keyboard Hotkey Combinations: Press and release keyboard shortcuts or
hotkeys.
Example: key(’ctrl + c’)
• Drag and Drop: Perform drag and drop actions.
Example: left_click_drag(100, 200, duration=2)
• Taking Screenshots: Take a screenshot of the computer to observe the current state.
Example: screenshot()
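Strings in this action_type(arguments) syntax can be parsed mechanically before dispatch. The parser below is an illustrative sketch of how framework-side tooling could handle such strings locally; it is not part of Anthropic's implementation.

```python
import ast
import re

# Illustrative parser for the action_type(arguments) syntax shown above,
# e.g. "mouse_move(100, 150)" or "type('Hello, world!')". A sketch for
# local dispatch, not part of Anthropic's tooling.

def parse_action(s):
    """Split an action string into (name, positional args, keyword args)."""
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", s, re.DOTALL)
    if not match:
        raise ValueError(f"not an action string: {s!r}")
    name, arg_src = match.groups()
    # Reuse Python's own parser for the argument list, then evaluate each
    # argument as a literal (numbers, strings, etc.).
    call = ast.parse(f"f({arg_src})", mode="eval").body
    args = [ast.literal_eval(a) for a in call.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return name, args, kwargs
```

The resulting `(name, args, kwargs)` triple can then be routed to the matching mouse or keyboard primitive.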
3.1.6
History Visual Context Maintenance
Claude Computer Use maintains an extensive context of history screenshots, which
accumulate through the ongoing task operations. Specifically, at each time step, the
retained screenshots are utilized to assist the action generation process as follows:
    Y_action^t = Θ_model(X_instr, I_t, I_history^{t-1})        (1)

    I_history^{t-1} = I_{t-1} ∪ I_history^{t-2}                (2)

where Y_action^t is the action to take at the current step t, and I_history^{t-1} represents
the retained historical screenshots. Here, Θ_model is the parameterized Claude 3.5 Sonnet
model. In this way, the full visual information along the trace of history is preserved,
enhancing the model's ability to make informed decisions as an episode unfolds.
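The recursion in Equations (1) and (2) can be mirrored in a few lines of Python. The optional `cap` parameter is our own assumption for illustration; the report does not state whether Claude truncates its screenshot history.

```python
# Sketch of the history accumulation in Eqs. (1)-(2): the model acts on the
# current frame plus the retained history, then the frame is folded into
# the history for the next step. The `cap` parameter is our assumption;
# the report does not say whether Claude truncates this context.

def step(model, instruction, screenshot, history, cap=None):
    """One agent step: decide an action, then update the retained history."""
    action = model(instruction, screenshot, history)   # Eq. (1)
    history = history + [screenshot]                   # Eq. (2)
    if cap is not None:
        history = history[-cap:]                       # optional truncation
    return action, history
```

Running three steps accumulates three frames, each decision conditioned on all frames before it.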
3.2
Agent Implementation
3.2.1
Out-of-the-Box Agent Framework
Recognizing that the demonstration codebase from Anthropic only supports a Docker
Linux environment, which is far from enough for benchmarking GUI automation models
in real-world environments, we have developed a cross-platform, Docker-free GUI
agent framework called Computer Use Out-of-the-Box. This framework enables the
deployment of a GUI agent locally on both Windows and macOS. By utilizing PyAutoGUI,
we ensure that the operations are compatible across both operating systems, allowing
universal remote control of the software by the API-based model through specific action
commands.
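One practical detail such a wrapper must handle is coordinate mapping: the model emits pixel coordinates in the resolution of the screenshot it was shown, which may differ from the native display. The helper below is our own minimal sketch, not code from Computer Use OOTB.

```python
# Sketch of mapping a model-emitted coordinate from the model's screenshot
# space to the native screen space. Our own illustration, not code from
# the Computer Use OOTB framework.

def scale_coordinate(x, y, model_res, screen_res):
    """Map (x, y) from the model's screenshot resolution to screen pixels."""
    mw, mh = model_res
    sw, sh = screen_res
    return round(x * sw / mw), round(y * sh / mh)
```

For example, a click at (683, 384) in a 1366x768 screenshot lands at the same relative position on a 2732x1536 Retina display.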
4
Computer Use Ability Evaluation
4.1
Setup Details
System Config. The evaluation is conducted on both Windows and macOS via the pro-
posed Computer Use Out-of-the-Box platform. As suggested by Anthropic Computer
use API document [38], the resolution is set to (1366, 768) and (1344, 756) for Windows
and macOS, respectively.
Human Review and Evaluation. Computer use introduces risks that differ significantly
from those of standard conversational APIs or interfaces, especially when interacting
with the internet or potentially manipulating users' sensitive information. Thus, we use
human evaluation to continuously monitor and review the process. We also manually
inspect the final state of each task upon completion and determine the outcome as
"Success" or "Failed".
Case Study Scope. As shown in Figure 1 (left), we carefully collected a set of user
queries and initial states in the following widely used domains to cover a broad
spectrum of desktop tasks across operating systems. Specifically, this report includes
20 tasks across 12 software applications or websites in four domains: Web Search,
Workflow, Office Productivity, and Video Games.
Table 1 gives an overview of the case studies evaluated in this section, categorized by
domain and indicating whether each task succeeded or failed. For quick navigation,
readers can click on a task to jump to the scenario of interest.
Table 1: Summary of case studies in the report. Click on tasks to navigate to corresponding sections.
Domain               Site / Software        Task                                                        Outcome
Web Search           Amazon                 Find ANC Headphones Under Budget $100 on Amazon             Success
Web Search           Apple Official Site    Browse Apple Official Site for Display with Accessories     Success
Web Search           Fox Sports             Fox Sports Subscription                                     Failed
Workflow             Apple Music            Find Latest & Local Trending Music and Add to Playlist      Success
Workflow             Amazon & Excel         Search for Products on Amazon and Record Prices in Excel    Success
Workflow             Google Sheet & Excel   Export and Download Online Document to Open Locally         Success
Workflow             App Store              Install App from App Store and Report Storage Usage         Success
Office Productivity  Outlook                Forward a Specific Email and CC Another Recipient           Success
Office Productivity  Word                   Change Document Layout to A3 in Landscape Orientation       Success
Office Productivity  Word                   Two Columns Document                                        Success
Office Productivity  Word                   Update Name and Phone Number on Resume Template             Failed
Office Productivity  PowerPoint             Gradient Fill Background                                    Success
Office Productivity  PowerPoint             Modify Slide Title and Draw a Triangle                      Success
Office Productivity  PowerPoint             Insert Numbering Symbol                                     Failed
Office Productivity  Excel                  Find and Replacement in Worksheet                           Success
Office Productivity  Excel                  Insert a Sum Equation over Cells                            Failed
Video Games          Hearthstone            Create and Rename a New Deck for Battle                     Success
Video Games          Hearthstone            Hero Power                                                  Success
Video Games          Honkai: Star Rail      Warp Automation                                             Success
Video Games          Honkai: Star Rail      Daily Mission Clean up Automation                           Success
4.2
Outcome
Case Study: Web Search
The World Wide Web (WWW) is a vast, open-domain interactive environment consisting
of interconnected pages with natural text, images, and numerous interactive elements.
The dynamic nature of web pages means that web search tasks place significant demands
on a model’s planning capabilities, as the model cannot simply follow a pre-defined path.
Additionally, the large number of interactive elements requires robust grounding abilities
to identify and interact with the correct elements accurately. Furthermore, given the
interconnected structure of web pages, the model should be capable of deciding when
to navigate forward or backward through the history trace, depending on the execution
status of each planned step. Therefore, the model must possess the ability to critique
and adjust its plan accordingly.
In the subsequent case studies, we evaluate the model’s performance on complex web
search tasks that reflect real-world scenarios. These tasks are designed to assess the
model’s planning, grounding, and adaptive abilities when navigating through intricate
web interfaces. By examining how the model interacts with actual websites, we aim to
demonstrate its proficiency in handling dynamic content, executing multi-step plans, and
adjusting its strategies in response to unexpected interface behaviors. This evaluation
provides insights into both the capabilities and limitations of the model within the web
search domain.
User Query: Add a set of wireless headphones to your cart with a budget of $100 or less, that has an active noise-cancelling feature.

Step 1
Plan: I see we're on Amazon. I'll move the cursor to the search bar and search for them.
Action: 'left_click': '[928, 113]'
Critic: I've successfully opened the search bar.

Step 2
Plan: Let me search for wireless headphones with ANC.
Action: 'type': 'wireless headphones active noise-cancelling under 100', 'left_click': '[1058, 110]'
Critic: I've successfully entered the text and clicked on the search button.

Step 3
Plan: I've found an Active Noise Cancelling Headphone available for $25.99 that meets all your requirements. Let me add this to the cart for you.
Action: 'left_click': '[762, 582]'
Critic: I've successfully added the headphones to your cart.
Figure 2: Representative task in Web Search. We categorize and highlight the response
of the model into: Planning , Action , and Critic . Please check Section 4.2.1 for detailed
discussions. Zoom in for the best view.
4.2.1
Find ANC Headphones Under Budget $100 on Amazon
Planning. The blue parts in Figure 2 illustrate the planning process of the model. The
model is aware of its initial state on the Amazon main page, and then forms a plan:
search for the target headphones through the search box using a query extracted from
the user instruction, and filter the resulting items by the budget condition.
Action. The brown parts in the figures show the actions generated by the model. The
model accurately locates the center of the search box and performs a click. After typing
in the search box, it further clicks the search icon to start the search.
Critic. After Claude clicks the "Add to Cart" button, the model takes another screenshot
to verify that it has successfully added the desired item to the cart. This indicates that
the model can observe the outcome of its actions to dynamically decide whether to retry
or terminate execution.
4.2.2
Browse Apple Official Site for Display with Accessories
Planning. The blue parts in Figure 3 illustrate the planning process of the model. The
model observes its initial state on the main page of Apple's official site. Notably, finding
the desired item by navigating through menus and sub-menus would be time-consuming
and require multiple difficult operations such as cursor hovering and scrolling. Instead,
the model plans to find the target item via the search function, showcasing the efficiency
of its generated plan. Moreover, when the AppleCare window pops up, the model
registers the change and plans further according to the user instruction and the options
in the pop-up window: since the user wants all accessories, it is reasonable to add the
three-year AppleCare+, which has the longer coverage, to the cart.
Action. As shown in the figure, the model interacts with different types of elements,
including text, buttons, and even hyperlinked images. This demonstrates the strong
grounding capacity of Claude Computer Use, as it utilizes purely visual information only,
without HTML metadata.
Critic. Claude Computer Use has demonstrated its strong critic ability through its
reflection after key steps, as shown in the green parts. It can be observed that the
model determines that it has successfully added the nano-texture option, and continues
to the tasks left. Moreover, after all the accessories have been added to the cart,
[Figure 3 shows the full execution trace. User query: "Add Pro Display XDR for Mac with nano texture to bag with all the accessories and check total." Across ten steps, the model searches Apple's site for the Pro Display XDR, opens the product page, selects nano-texture glass, adds the Pro Stand, VESA Mount Adapter, and three-year AppleCare+, reports the itemized total of $7,697.00 (with an Apple Card installment option of $599.74/mo. for 12 months), clicks "Add to Bag," and takes a final screenshot to confirm.]
Figure 3: Representative task in Web Search. We categorize and highlight the response
of the model into: Planning, Action, and Critic. Please check Section 4.2.2 for more
discussions. Zoom in for the best view.
Claude Computer Use takes a final screenshot to perform a final check and lists all the
accessories it has added. These critical observations greatly improve the model's ability
to complete long and dynamic tasks such as this one.
4.2.3 Fox Sports Subscription (Failed)
[Figure 4 shows the execution trace. User query: "Add formula 1 to my followed sports in Fox Sports." The model clicks the "MORE" button in the sports navigation bar, sees NASCAR but not Formula 1, then clicks the "ACCOUNT" option in the left menu and reports that signing in is required before followed sports can be managed.]
Figure 4: Representative failure task in Web Search. We categorize and highlight the
response of the model into: Planning, Action, and Critic. The Error here is as follows:
scrolling down the navigation panel is required to find Formula 1 in Step 2; however,
the model gives an incorrect plan to go to the "Account" tab. Please check Section 4.2.3
for more discussions. Zoom in for the best view.
Planning. The blue captions in Figure 4 represent the model's planning process. In
this case, the model recognizes the Fox Sports website and decides to look for Formula
1 within the available sports categories. It initially formulates a plan to explore the
sports options by selecting the "MORE" button in the navigation menu for more sports
categories. When the model does not immediately find Formula 1 in the initial sports list,
it alters its approach, deciding to access the "ACCOUNT" menu, with the expectation
that this section might allow the user to manage followed sports.
Action. In Step 1, the model accurately identifies the location of the "MORE" button
in the sports navigation panel and clicks on it, which should expand the list to show
additional sports. In Step 2, after failing to find Formula 1 through this initial method,
the model adapts its strategy and proceeds to click on the "ACCOUNT" tab in the left-side
menu. This transition reflects the model’s flexibility in following alternative paths to
achieve the user’s intended outcome. The sequence of actions demonstrates the model’s
ability to interact with multiple sections of the interface as it attempts to locate the
desired content.
Critic. The green captions depict the model's feedback and self-assessment process
following its actions. After accessing the "MORE" tab, the model identifies one of the
related sites and re-emphasizes its target sport. Although the final result is incorrect,
this critic sequence still reflects the model's attempt to achieve the user's goal by
exploring both direct navigation and re-planning alternative routes. This critic phase
demonstrates the model's capacity to adjust its plan dynamically based on the
current interface state, and also shows its situational awareness when faced with
authentication barriers.
Error. The error, highlighted in red in the caption, reveals a significant oversight
in the model’s planning. The model initially attempts to locate Formula 1 within the
expanded sports categories under the "MORE" button but does not succeed. Instead
of continuing to explore the navigation panel through scrolling, the model erroneously
shifts its strategy to the "ACCOUNT" tab, mistakenly assuming that account settings
might provide the desired sport. This results in an unnecessary detour, as accessing the
"ACCOUNT" tab prompts a login requirement, which ultimately derails task completion
and adds unnecessary complexity for the user.
This error highlights the importance of contextually aware navigation, particularly when
the model fails to locate an item on the initial interface view — a very common real-world
scenario. Instead of prematurely altering its plan, the model should prioritize further
scrolling within navigation panels to continue its search. Although this is a brief task,
it provides insights into the model’s limitations with scrolling-based navigation, and
underscores areas for enhancement. Specifically, refining the model's approach to visual
search while maintaining continuity within the interface may strengthen its performance
in future versions.
4.3 Case Study: Workflow
Workflow tasks involve multi-application interactions or multi-targeting user queries that
require the model to seamlessly navigate and manage data across different software
platforms. In real-world scenarios, users often need to coordinate actions between
web browsers, productivity tools, and specialized applications to accomplish complex
objectives. These tasks test the model’s ability to maintain context across different
environments, handle data transfer accurately, and execute multi-step processes without
losing track of intermediate states.
The complexity of workflow tasks lies in the need for robust planning capabilities
to coordinate operations across diverse interfaces. The model must possess strong
grounding abilities to interpret and interact with varying user interface elements, which
may differ significantly or appear deceptively similar across applications. Additionally, it
should be capable of dynamically adjusting its plan in response to unexpected behaviors or errors
that may occur when switching contexts. In the following case studies, we assess
the model’s performance on complex workflow tasks that reflect some practical use
cases. By examining how the model integrates actions and maintains consistency in
multi-application environments or multi-targeting tasks, we aim to demonstrate its
capabilities, and identify areas for improvement within the workflow domain.
4.3.1 Find Latest & Local Trending Music and Add to Playlist
Planning. The blue captions in Figure 5 illustrate the model’s planning sequence for
locating trending music and adding specific songs to a designated playlist within Apple
Music. Initially, the model recognizes that it needs to locate the "New" tab within Apple
Music to begin the search. Once in the "New" tab, the model plans to find the first
song listed under the "Latest Songs" section and to add this song to a pre-existing
"Sonnet’s Selection" playlist. Following this, the model continues the secondary plan
to scroll through the "New" tab to locate the "Singapore Top 100" collection, where
it will select the top song and similarly add it to "Sonnet’s Selection". This planning
phase demonstrates the model’s understanding of a multi-step objective involving tab
navigation, section identification, and song selection for playlist addition, guided by the
user’s instructions.
Action. In Step 2, the model initiates navigation by clicking on the "New" tab to
transition away from the Home tab. After reaching the "New" tab, the model proceeds to
locate the first song under "Latest Songs" and selects the song options menu by clicking
on the three-dot icon next to it. Through this menu, the model selects "Add to Playlist"
and, upon the display of playlist options, identifies and clicks on "Sonnet’s Selection"
to add the song as instructed. Following this, the model initiates the second part of
the task by scrolling down to locate "Singapore Top 100." The model uses the
Page Down key to navigate, repeatedly taking screenshots to verify its
position. Upon locating the "Singapore Top 100" section, the model identifies the first
song and repeats the same adding process.
[Figure 5 shows the execution trace. User query: "Open Apple Music. Navigate to the 'New' tab. Find the first song under 'Latest Songs' and add it to the playlist named 'Sonnet's Selection.' Scroll down to find 'Singapore Top 100' collection under the same tab. Add the 1st top song from 'Singapore Top 100' to the 'Sonnet's Selection' playlist." Across eleven steps, the model navigates to the "New" tab, adds "We Good" by Lil Uzi Vert to the playlist via the three-dot menu, presses Page Down repeatedly (taking screenshots in between) to locate "Singapore Top 100," and adds "APT." by ROSÉ and Bruno Mars to the same playlist.]
Figure 5: Representative task in Workflow. We categorize and highlight the response
of the model into: Planning, Action, and Critic. Please check Section 4.3.1 for more
discussions. Zoom in for the best view.
Critic. After navigating to the "New" tab, the model confirms that it has successfully
reached the appropriate section before proceeding to locate "Latest Songs." This verification
process is repeated upon opening the options menu for the first song and after
accessing the playlist options. When adding songs to "Sonnet’s Selection," the model
confirms each action to ensure it aligns with the user's specified goal. Besides, in the
scrolling phase, the model validates each step by periodically capturing screenshots
to assess its current position. This iterative verification is essential for ensuring the
target is not missed. Once the model reaches the top song in "Singapore
Top 100" and adds it to the playlist, it provides a final confirmation of task completion.
This feedback loop illustrates the model's capacity for continuous monitoring through
repeated visual confirmation, which can be critical for tasks involving multi-page
navigation.
4.3.2 Search for Products on Amazon and Record Prices in Excel
[Figure 6 shows the execution trace. User query: "Search for 27 inch, 165hz gaming monitors on Amazon, record the first 2 returned product name and prices into Excel worksheet." Across eight steps, the model searches Amazon, opens Excel from the taskbar, enters "Product" and "Price" headers in row 1, records the AOC and Acer monitors with their $129.99 prices, and takes a final screenshot to confirm.]
Figure 6: Representative task in Workflow. We categorize and highlight the response
of the model into: Planning, Action, and Critic. Please check Section 4.3.2 for more
discussions. Zoom in for the best view.
Planning. The blue captions in Figure 6 illustrate the model’s planning process in
carrying out a multi-application task. The model formulates a sequential plan that
involves searching for specific monitors on Amazon and then recording the data in an Excel
worksheet. The first step of the plan requires the model to open Amazon and search for
"27 inch 165hz gaming monitor." Following this search, the model plans to switch to
Excel and input the product names and prices of the first two returned search results
into designated cells. This plan demonstrates the model's ability to coordinate multiple
pieces of software while maintaining coherence with the user's specified query.
Action. In Step 2, the model initiates a left-click on the Amazon search bar, types in the
search query "27 inch 165hz gaming monitor," and presses "Return" to generate search
results. Following the successful display of results, the model opens Excel by locating
and clicking on the Excel icon on the bottom taskbar in Step 3. Upon confirming that
Excel opened, the model proceeds to click on cell A1 and types in the header "Product,"
followed by pressing the "Tab" key to move to cell B1, where it enters the header "Price."
Once the headers are established, the model navigates to cell A2 to enter the details of
the first search result. It types "AOC 27G15 27 inch Gaming Monitor, Full HD 1920x1080,
165hz 1ms" and moves to the adjacent cell B2 to type in the corresponding price of
"$129.99." In Step 7, the model repeats this process for the second product, entering
"Acer Nitro 27’ Full HD 1920 x 1080 1500R Curve PC Gaming Monitor" in cell A3 and
"$129.99" in cell B3. Each action is specifically directed at either a cell or an interface
component. This sequence of actions reflects a high level of accuracy in both data entry
and interface navigation across the user's OS.
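The Type/Tab/Enter pattern the model uses for cell-by-cell entry can be generated mechanically. The sketch below emits an event sequence for a small table; the event tuples are a hypothetical encoding, assuming Tab moves one cell right and Enter returns to the next row's first column, as in Excel's default behavior:

```python
def table_entry_events(headers, rows):
    """Build the keystroke plan for entering a header row plus data
    rows into a spreadsheet, one cell at a time."""
    events = []
    for row in [headers, *rows]:
        for i, cell in enumerate(row):
            events.append(("type", str(cell)))
            # Tab within the row; Enter to drop to the next row
            events.append(("key", "Tab" if i < len(row) - 1 else "Enter"))
    return events
```

Such a plan could then be fed to the same kind of executor that replays the model's click and type actions.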
Critic. After opening Excel, the model takes a screenshot to confirm that the application
is ready for data entry, reflecting an awareness of potential delays in loading time. This
extra check ensures that no data entry attempts occur before Excel is fully functional.
Additionally, after entering the product data, the model captures another screenshot in
Step 8 to verify that both products have been recorded correctly with headers applied.
This final confirmation step indicates the model’s ability to validate its own output by
re-assessing the Excel sheet visually, which enhances task reliability and minimizes the
chance of data entry errors. This feedback mechanism reinforces the model’s accuracy
and attention to detail in transferring data across software.
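The wait-then-verify behavior (pausing for Excel to finish loading before typing) amounts to polling an observation with a timeout. A minimal sketch, where `check` is a hypothetical predicate over a fresh screenshot:

```python
import time

def wait_until_ready(check, timeout: float = 10.0,
                     interval: float = 0.5) -> bool:
    """Poll until the observation predicate holds (e.g. the Excel grid
    is visible in a screenshot) or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True       # safe to start data entry
        time.sleep(interval)  # pause between screenshots
    return False              # app never became ready; report failure
```

Gating data entry on this check is what prevents keystrokes from being lost while the application is still loading.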
4.3.3 Export and Download Online Document to Open Locally
Planning. The blue captions in Figure 7 represent the model’s planning phase as it
prepares to download and open a Google Spreadsheet locally in Microsoft Excel. Initially,
the model forms a plan to access the File menu within the Google Spreadsheet, locate
the Download option, and select the correct ".xlsx" format for export. After starting the
download, the model plans to switch to Excel by opening the downloaded file. Note that
Excel is opened automatically here since it is the default software for the ".xlsx" extension.
This sequence demonstrates the model's understanding of a cross-application workflow
that begins in a web-based environment and culminates in local software, ensuring
compatibility between the Google Spreadsheet and Excel interactions.
Action. In Step 1, the model clicks on the File menu in the Google Spreadsheet,
anticipating that this will show options for exporting the document. Following this, the
model navigates through the menu to locate and click on the Download option in Step
2. Once the download menu appears, the model selects the "Microsoft Excel (.xlsx)"
format in Step 3, triggering the file download. The model then observes the download
notification in Firefox and clicks on the downloaded file to open it in Excel. This set of
actions demonstrates the model’s proficiency in navigating menu hierarchies, along with
its universal ability to operate across browser and desktop environments.
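Navigating a menu hierarchy such as File > Download > "Microsoft Excel (.xlsx)" can be modeled as resolving a path through a tree of labels, clicking each level in turn. In this sketch the nested dict `menu_tree` is a hypothetical stand-in for the menu structure the model perceives from screenshots:

```python
def resolve_menu_path(menu_tree: dict, path):
    """Validate and return the sequence of menu labels to click,
    raising early if any level of the hierarchy is missing."""
    clicks, node = [], menu_tree
    for label in path:
        if label not in node:
            raise KeyError(f"menu item not found: {label}")
        clicks.append(label)                 # click this level
        child = node[label]
        node = child if isinstance(child, dict) else {}
    return clicks
```

In practice the agent rebuilds each level from a fresh screenshot, since submenus only become visible after the parent is clicked.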
Critic. After selecting the Download option in Step 2, the model captures a screenshot
to confirm that the correct menu options are visible. It then takes another screenshot in
Step 4 to verify that the downloaded file has appeared in Firefox’s download bar, ensuring
the download process has completed successfully. Also, after switching to Excel in Step
5, the model confirms that Excel is opening the file. It also reasons, from its own
knowledge, that due to Excel's Protected View mode it should prepare to enable editing
[Figure 7 shows the execution trace. User query: "Export and download the google spreadsheet and open locally." Across six steps, the model opens the File menu in the Google Spreadsheet, clicks Download, selects the "Microsoft Excel (.xlsx)" format, clicks the download notification in Firefox, waits for Excel to open the file in Protected View, and confirms the exported file is open.]
Figure 7: Representative task in Workflow. We categorize and highlight the response
of the model into: Planning, Action, and Critic. Please check Section 4.3.3 for more
discussions. Zoom in for the best view.
to allow full access to the document’s contents. The final verification step confirms
that the document has been successfully exported and opened in Excel. The model's
self-assessment throughout each phase demonstrates a high level of precision, even
confirming that the file is ready for local editing. This case highlights the model's
capability to transition seamlessly between cloud and local environments.
4.3.4 Install App from App Store and Report Storage Usage
Planning. The blue captions in Figure 8 outline the model’s planning for the task
of searching, installing, and reporting on the storage usage of the "GoodNotes" app
from the App Store. Initially, the model plans to search for "GoodNotes" using the App
Store’s search function. Upon locating the app in the search results, the model intends
to open its detail page to confirm the app size. Finally, the model plans to
proceed with the installation by clicking the "Get" and "Install" buttons, and to report
the app’s storage size as displayed. This comprehensive plan demonstrates the model’s
capability to execute a multi-step installation process, from initial search to the final size
verification based on the user’s request.
Action. In Step 2, the model clicks on the search box within the App Store and types
"GoodNotes," pressing the "Return" key to generate search results. Upon locating
"GoodNotes 6" in the search results in Step 3, the model precisely clicks on the center
of the app icon to access its details. The model then proceeds to initiate the installation by
clicking on the "Get" button and waiting for the "Install" button to appear. In Step 5, the
model clicks the "Install" button to begin the download.
[Figure 8 shows the execution trace. User query: "Search for 'GoodNotes' in the App store. Click get and install to download it to my device. Check and report its size." Across six steps, the model searches the App Store for GoodNotes, opens the GoodNotes 6 detail page, reports its 1.39 GB size, clicks Get and then Install, and stops when an authentication prompt appears, summarizing its progress.]
Figure 8: Representative task in Workflow. We categorize and highlight the response
of the model into: Planning, Action, and Critic. Please check Section 4.3.4 for more
discussions. Zoom in for the best view.
Critic. After searching for "GoodNotes," the model waits for a moment to confirm that
the correct app appears in the search results, and it takes a screenshot for verification.
In Step 4, the model reports that the app size is 1.39 GB, confirming that it has accessed
the correct information as requested. The final confirmation highlights the need for user
authentication before the installation can proceed, summarizing the task's completion:
the app was found, its size was noted, and the installation steps were initiated, awaiting
the user's authentication. Although the task is not fully executed, the model understands
the exact state at which to stop, as the user's intervention is required. This feedback loop
showcases the model's attention to detail, ensuring each stage of the process is
documented and verified.
4.4 Case Study: Office Productivity Software
Office productivity software is among the most popular and widely used tools in modern
workplaces and educational environments. These applications are central to a wide array
of tasks, from drafting documents and analyzing data to creating impactful presentations.
Automating tasks within these applications can significantly enhance efficiency, reduce
repetitive workload, and minimize human errors, making it a key area for GUI automation
models to address.
However, unlike web environments that often provide APIs or structured HTML for
automation, Office productivity applications typically lack such programmatic interfaces
for file manipulation. Therefore, the GUI automation model must interact directly with
the application's visual interface, as a human user would. This involves grounding its
actions in visual elements such as menus, buttons, text fields, and table cells. The
model must accurately perceive and interpret these interface components to navigate
through complex menus, execute commands, and manipulate content within documents
or spreadsheets. This visual interaction approach introduces unique challenges: (i)
The interfaces of Office applications are often intricate and densely populated with
features, requiring the model to have robust visual grounding capabilities to identify
and locate the correct elements reliably. (ii) Precise action execution is essential to
interact effectively with these elements, as even minor inaccuracies can lead to incorrect
outcomes or unintended changes. (iii) Additionally, the model must handle variations
in interface layouts and themes, which can differ based on software versions or user
customizations.
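Challenge (i), visual grounding, reduces to producing a reliable click coordinate from pixels alone. A common heuristic, consistent with the centered clicks seen in the transcripts, is to aim at the center of a detected element's bounding box. In this sketch `elements` is a hypothetical list of (label, x0, y0, x1, y1) detections, not the output of any real detector:

```python
def ground_click(elements, label):
    """Return the center of the labeled element's bounding box,
    or None when the element is not visible (a grounding failure the
    agent must notice, e.g. by scrolling and re-observing)."""
    for name, x0, y0, x1, y1 in elements:
        if name == label:
            return ((x0 + x1) // 2, (y0 + y1) // 2)
    return None
```

Precise action execution, challenge (ii), then depends on the box being tight: a loose or misdetected box shifts the center onto a neighboring control.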
In the following case studies, we examine the model’s performance in automating
tasks within Office productivity software, focusing on its ability to plan, execute, and
adapt actions effectively. These tasks are built to simulate common real-world scenarios
encountered by users, providing insights into the model's potential to enhance
productivity within word processing, spreadsheet, and presentation applications. By
analyzing the model’s interactions with these interfaces, we can gain a preliminary
understanding of how it can support or augment human productivity in widely used
software environments.
4.4.1 Forward a Specific Email and CC Another Recipient
[Figure 9 shows the execution trace. User query: "Forward the most email from Anthropic in Outlook. Please carefully Locate the 'Forward' option in the top right corner. Then forward the email to 'siyuanhu@nus.edu.sg', and cc 'ouyangmingyu04@u.nus.sg'." Across six steps, the model opens the Anthropic email, clicks the Forward button, enters the recipient and CC addresses, and clicks Send, then lists the completed steps.]
Figure 9: Representative task in Office Productivity Software. We categorize and
highlight the response of the model into: Planning, Action, and Critic. Please check
Section 4.4.1 for more discussions. Zoom in for the best view.
Planning. The blue captions in Figure 9 outline the model’s planning for forwarding a
specific email in Outlook. The task requires the model to locate the latest email from
Anthropic in the inbox, open it, and use the Forward option located in the top right
corner. The email is to be forwarded to the primary recipient, “siyuanhu@nus.edu.sg,”
and CC’d to “ouyangmingyu04@u.nus.sg.” This planning phase demonstrates the model’s
understanding of email workflow management, integrating email selection, forwarding
operations, and address entry step by step.
Action. In Step 2, the model clicks to open the Anthropic email from the inbox, confirm-
ing that the message is displayed in the reading pane. In Step 3, the model identifies
and clicks on the Forward button, represented by a right-facing arrow in the top right
corner of the interface. Following this, the model clicks on the recipient field and types
in "siyuanhu@nus.edu.sg" as the main recipient, pressing "Enter" to confirm. In Step
5, the model clicks on the CC field and adds "ouyangmingyu04@u.nus.sg" as the CC
recipient. Once both addresses are in place, the model completes the process by clicking
the Send button in the final step.
Critic. First, the model identifies the user-specified Anthropic email in the Outlook
interface. After opening the Anthropic email, the model confirms that the message is
visible, ensuring the correct email is being processed. Following each action, such as
entering recipient addresses and confirming the CC field, the model captures screenshots
to verify that the inputs are correctly applied. In the final step, the model confirms
that all required actions—opening the email, forwarding it, adding the recipient and
CC addresses, and clicking Send—have been successfully completed. This systematic
verification of each action highlights its capacity for self-monitoring and task
accuracy. This successful example demonstrates the model’s proficiency in handling
email tasks in a professional context.
4.4.2 Change Document Layout to A3 Size in Landscape Orientation
User Query: Change the document layout to A3 size paper, landscape orientation in Microsoft Word.
Step 1: Here's how we'll do it: 1. Click on the "Layout" tab. 2. Click on "Size" to change the paper size. 3. Click on "Orientation" to change to landscape. 'left_click': '[306, 63]', 'left_click': '[223, 128]'. Now that we've changed the layout.
Step 2: Now let's change the orientation to landscape. I'll click on the Orientation dropdown, and let me click on Landscape. 'left_click': '[165, 130]', 'left_click': '[272, 415]'
Step 3: The document has now been changed to A3 size in landscape orientation. The changes should be automatically applied to your document. I've successfully changed the orientation to landscape.
Figure 10: Representative task in Office Productivity Software. We categorize and
highlight the response of the model into: Planning , Action , and Critic . Please check
Section 4.4.2 for more discussions. Zoom in for the best view.
Planning. The blue captions in Figure 10 illustrate the model’s planning phase for
adjusting the document layout in Microsoft Word. The model begins by identifying the
necessary actions to achieve the requested layout: opening the "Layout" tab, selecting
"Size" to change the paper dimensions to A3, and finally setting the orientation to
"Landscape." This structured plan highlights the model’s familiarity with Word’s layout
controls and a clear sequential approach to making the specific page layout adjustments
the user requested.
Action. The brown captions detail the model’s actions to execute the adjustment. In
Step 1, the model clicks on the "Layout" tab and selects "Size," choosing A3 from the
dropdown options to apply the new paper size. In Step 2, the model clicks on the
"Orientation" dropdown and selects "Landscape" to adjust the document’s orientation.
Critic. The green captions represent the model’s feedback on layout adjustment comple-
tion. After applying both the A3 size and landscape orientation, the model confirms that
the document has been successfully adjusted to the specified layout. This verification
assures that the requested adjustments have been applied, providing final confirmation
that the document formatting now aligns with the user’s requirements.
4.4.3 Two Columns Document
User Query: Change document layout from one column to two columns in Microsoft Word.
Step 1: I see Microsoft Word is already open. To change the document from one column to two columns, we need to: 1. Click on the Layout tab. 2. Find the Columns option. 3. Select Two columns. Let me do this step by step:
Step 2: Now let's go to the Layout tab. 'left_click': '[276, 71]'
Step 3: Now that we're in the Layout tab, let's look for the Columns button. 'left_click': '[386, 128]'. A dropdown menu should appear.
Step 4: Let's select the Two columns option in the dropdown menu. 'left_click': '[386, 163]'
Step 5: The document should now be formatted in two columns. Let's verify the change. Perfect! The document has now been successfully changed from one column to two columns. The text is now flowing in a two-column layout.
Figure 11: Representative task in Office Productivity Software. We categorize and
highlight the response of the model into: Planning , Action , and Critic . Please check
Section 4.4.3 for more discussions. Zoom in for the best view.
Planning. The blue captions in Figure 11 outline the model’s planning process to
convert the document layout in Microsoft Word from a single column to a two-column
format. The model begins by identifying that it needs to access the “Layout” tab, where
the "Columns" option is located. The plan is to select the “Two columns” setting from
this menu, thereby reformatting the text into two columns as per the user’s instruction.
This structured approach demonstrates the model’s understanding of how to access
layout features in Word to alter document structure.
Action. In Step 2, the model clicks on the “Layout” tab to reveal the layout options.
Upon entering the Layout tab, the model locates and clicks on the “Columns” button
in Step 3. This action opens a dropdown menu where the model selects the “Two”
option to reformat the document into a two-column layout. Each action showcases the
model’s accuracy in interacting with Word’s interface, specifically targeting the layout
modification options.
Critic. After selecting the two-column layout, the model confirms that the document’s
text has been successfully reorganized into two columns. This confirmation ensures
the layout change has been applied as intended. The model’s final feedback provides
assurance that the document now reflects the user’s requested format, indicating the
successful completion of the task.
4.4.4 Update Name and Phone Number on Resume Template (Failed)
User Query: In this resume template in Microsoft Word, change the name to 'Sonnet', and update my phone number with "7355608".
Step 1: I can see the Microsoft Word document is already open with a resume template. I'll help you make the requested changes: First, let's change the name from "Janna Gardner" to "Sonnet". I'll select the name text and replace it: 'double_click': '[673, 305]', 'type': 'Sonnet'
Step 2: Now, let's update the phone number. I'll select the current phone number and replace it with "7355608": 'left_click': '[98, 639]'. I've successfully entered the account options.
Step 3: I've made both requested changes to the resume: 1. Changed the name to "Sonnet". 2. Updated the phone number to "7355608".
Figure 12: Representative failure task in Office Productivity Software. We categorize
and highlight the response of the model into: Planning , Action , and Critic . The Error
here is as follows: In Step 1, though the model gives a correct plan to replace “Janna
Gardner” with “Sonnet”, it only selects the last name (via a ‘double_click‘)
to change. In Step 2, the model also fails to select the whole phone number. In
Step 3, the model gives an incorrect critic that assumes the task was successfully completed.
Please check Section 4.4.4 for more discussions. Zoom in for the best view.
Planning. The blue captions in Figure 12 illustrate the model’s planning for updating
the name and phone number in a Microsoft Word resume template. Initially, the model
formulates a plan to locate and replace "Janna Gardner" with "Sonnet" and to update
the existing phone number to "7355608." Though the execution ultimately fails, this
demonstrates how the model attempts to locate specific text fields within the document
and edit them based on the user’s instructions.
Action. The brown captions outline the specific actions performed by the model to
implement the planned changes. In Step 1, the model selects the name "Janna Gardner"
by double-clicking on it and types "Sonnet" as the replacement. In Step 2, the model
proceeds to update the phone number by selecting it and entering "7355608" as the new
value.
Critic. The green captions provide the model’s feedback on task completion, confirming
that both changes were successfully applied. However, this confirmation stems from
the model’s hallucination and its failure to inspect the resulting interface. Specifically,
after replacing the name and updating the phone number, the model summarizes the
intended modifications but incorrectly assumes that both replacements were executed as
planned.
Error. In Step 1, an error occurs in the name replacement process. Although the model
correctly identifies the target to replace "Janna Gardner" with "Sonnet," it only selects
the last name, "Gardner," when performing the replacement. This happens because the
model performed a "double_click" instead of a drag selection, so the first name "Janna"
remains unchanged. Furthermore, in Step 2, the model encounters a similar issue when
updating the phone number. Instead of selecting the entire number, it selects only a
portion of it, resulting in an incomplete update of the contact information. More
problematically, in Step 3, the model incorrectly assumes that the task has been completed
successfully, providing a critic that overlooks the partial updates. This incorrect feedback
suggests a limitation in the model’s text selection accuracy, highlighting the need for
improved selection capabilities and more accurate validation feedback.
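The double-click failure can be illustrated with a toy model of editor selection semantics. This simplified sketch (not Word’s actual behavior) shows why replacing a double-click selection rewrites only one word:

```python
# A double-click selects only the word under the cursor (delimited here by
# whitespace), so replacing the selection rewrites one word, not the field.
def double_click_selection(text: str, offset: int) -> tuple:
    """Return (start, end) of the word containing `offset`."""
    start = offset
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    end = offset
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end

def replace_selection(text: str, span: tuple, replacement: str) -> str:
    start, end = span
    return text[:start] + replacement + text[end:]

name = "Janna Gardner"
span = double_click_selection(name, 8)   # cursor lands inside "Gardner"
result = replace_selection(name, span, "Sonnet")
# Only the last name is replaced, reproducing the observed failure:
# result == "Janna Sonnet"
```

A drag selection spanning the whole field (or a select-all within the text box) would avoid this failure mode.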
User Query: Replace the default background with a gradient fill background in the new blank slide in PowerPoint.
Step 1: I'll help you create a new blank slide with a gradient fill background in PowerPoint. Let me break this down into steps: 1. First, let's open PowerPoint. 2. Create a new blank slide. 3. Apply a gradient fill background. Let me start by taking a screenshot to see the desktop and then open PowerPoint. I see PowerPoint is already open.
Step 2: Now let's apply a gradient background to this slide. Here's how we'll do it: First, let's right-click on the slide to access the background options. 'right_click': '[500, 400]'
Step 3: Let me take another screenshot to see the Format Background pane. I can't see the Format Background pane. I need to access the Design tab to get to the background formatting options.
Step 4: I need to access the Design tab to get to the background formatting options. Let me click on that: 'left_click': '[247, 63]'
Step 5: Now I'll click on the Format Background button, and select the gradient fill option. 'left_click': '[1196, 143]'. I've successfully opened the Format Background pane. Let's take another screenshot to see the Format Background pane.
Step 6: Now I'll select the gradient fill option. 'left_click': '[1122, 340]'. Perfect! A gradient fill has been applied to the slide.
Figure 13: Representative task in Office Productivity Software. We categorize and
highlight the response of the model into: Planning , Action , and Critic . Please check
Section 4.4.5 for more discussions. Zoom in for the best view.
4.4.5 Gradient Fill Background
Planning. The blue captions in Figure 13 illustrate the model’s planning phase for
applying a gradient fill background to a new blank slide in PowerPoint. The model
outlines the steps it will take: first, to open PowerPoint and create a new blank slide, and
then to apply a gradient background by accessing the background formatting options. In
Step 3, the model takes a screenshot but does not find the desired Format Background
pane, so it re-plans the actions in Step 4 to access the Design tab. With this alternative
plan, the model still completes the task successfully. This showcases the model’s
familiarity with basic PowerPoint operations and its ability to reach the desired functions
in multiple ways.
Action. In Step 2, the model right-clicks on the slide, initially aiming to access the
background formatting options. However, in this execution the model actually right-
clicked on the title textbox, so the background formatting option was not visible there.
When the expected "Format Background" pane does not appear as anticipated, the
model reassesses the approach in Step 4 and decides to access the Design tab directly.
From here, the model successfully locates the "Format Background" button. Then in the
Format Background pane, the model clicks on the "Gradient Fill" option in Step 6, which
applies the gradient fill to the slide.
Critic. After accessing the Format Background pane and selecting the gradient fill
option, the model confirms that the gradient fill has been successfully applied to the
slide in Step 6. This final check ensures that the requested background modification
was completed as specified. More interestingly, in Step 3, when the model cannot find
the desired Format Background pane, it quickly adjusts its plan to access the Design
tab. This sequence of critic observations highlights the model’s ability to adapt its
approach when the initial method does not yield the desired outcome.
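This observe-and-replan behavior can be condensed into a simple fallback loop. In the sketch below, the screen checks are stubbed with an in-memory list rather than real screenshots, and all function names are illustrative:

```python
# A minimal sketch of the observe-replan pattern: try the primary route to a
# UI pane, check the resulting screen state, and fall back to an alternative
# route if the expected element did not appear.
def reach_pane(routes, pane_visible):
    """Try each route in order until `pane_visible` reports success."""
    for route in routes:
        route()                      # perform the clicks for this route
        if pane_visible():           # critic step: verify the outcome
            return route.__name__
    return None

opened = []

def right_click_slide():             # primary plan (fails in the case study)
    opened.append("context_menu")

def design_tab_button():             # fallback plan via the Design tab
    opened.append("format_background")

result = reach_pane(
    [right_click_slide, design_tab_button],
    lambda: "format_background" in opened,
)
# result == "design_tab_button": the fallback route succeeded
```

In a deployed agent, the `pane_visible` predicate would be answered by the model inspecting a fresh screenshot, exactly as in Step 3 of the transcript.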
4.4.6 Modify Slide Title and Draw a Triangle
User Query: Add a title "Triangle" on the slide, and insert a corresponding shape in my Mac's PowerPoint.
Step 1: I'll help you add a title and a triangle shape in PowerPoint. I'll need to take several steps: First, let's see what's on the screen. Open PowerPoint if it's not already open. Add the title "Triangle". Insert a triangle shape. Let me start by taking a screenshot to see the current state: I see PowerPoint is already open.
Step 2: Let's add the title first. I'll click on the "Click to add title" box and type "Triangle": 'left_click': '[500, 400]', 'type': 'Triangle'
Step 3: Now, let's insert a triangle shape. I'll need to: 1. Click on the "Shapes" button. 2. Select a triangle from the shapes menu. 3. Draw it on the slide. 'left_click': '[485, 136]'
Step 4: I see the shapes menu is now open. Let me select the triangle shape from the Basic Shapes section. Now I'll select a triangle shape from the shapes menu and draw it on the slide. Let me click in the content area and drag to create the triangle. 'left_click_drag': '[779, 500]' to '[879, 600]'
Step 5: Let me take a final screenshot to show you the result: I've completed both tasks: 1. Added the title "Triangle" to the slide. 2. Inserted a triangle shape in the content area. The slide now shows the title "Triangle" and has a blue triangle shape below it.
Figure 14: Representative task in Office Productivity Software. We categorize and
highlight the response of the model into: Planning , Action , and Critic . Please check
Section 4.4.6 for more discussions. Zoom in for the best view.
Planning. The blue captions in Figure 14 describe the model’s plan to modify a slide in
PowerPoint by adding the title "Triangle" and inserting a triangle shape below it. The
model breaks down the task into several steps: first, to add the title by typing "Triangle"
in the title box, and then to insert a triangle shape by selecting it from the shapes menu
and drawing it on the slide. This structured plan indicates the model’s understanding of
the PowerPoint interface interaction and the sequence required to complete the task.
Action. In Step 2, the model clicks on the "Click to add title" box and types "Triangle"
as the new title. Moving to Step 4, the model accesses the "Shapes" button, opens the
shapes menu, and accurately locates the triangle shape within the Basic Shapes section.
It then selects the triangle shape and performs a click-and-drag action to draw the shape
on the slide below the title. Each action of this task is specifically targeted to ensure
that the modifications appear correctly on the slide.
Critic. The green captions provide the model’s feedback and confirm that both requested
elements were successfully added. After adding the title and inserting the triangle, the
model verifies that the slide now displays the title "Triangle" with a blue triangle shape
positioned below it. This final confirmation step reassures that the visual modifications
were completed. This example demonstrates the model’s effective control over more
diverse slide functionality than just layout adjustments.
4.4.7 Insert Numbering Symbol (Failed)
User Query: Add numbering symbols to the text of the second slide.
Step 1: I can see PowerPoint is already open with a presentation. Let me help you add numbering symbols to the second slide. I'll first click on the second slide in the left panel, then add the numbering. 'left_click': '[103, 308]'
Step 2: Now let's click on the bullet/numbering button in the Home ribbon. I see the numbering options in the toolbar. 'left_click': '[98, 639]'. Let me take a screenshot to confirm the changes have been applied. I've successfully applied bullet points to the text on the second slide.
Step 3: I've successfully applied bullet points to the text on the second slide. However, since you specifically asked for numbering symbols, let me change it from bullets to numbers. I'll click the numbering option instead. 'left_click': '[618, 106]'
Step 4: I notice that I need to select the text first before applying the numbering. Let me do that.
Step 5: I need to select the text first before applying the numbering. Let me do that: 'left_click_drag': '[464, 466]' to '[464, 566]'
Step 6: Now with the text selected, let me apply the numbering: 'left_click': '[647, 106]'. Let me take a final screenshot to confirm the numbering has been applied. I've successfully added numbering to the text on the second slide. The list should now show numbers (1, 2, 3, 4) instead of bullet points for each project.
Figure 15: Representative failure task in Office Productivity Software. We categorize
and highlight the response of the model into: Planning , Action , and Critic . The Error
here is as follows: In Step 2, the model erroneously applied bullet points instead
of the requested numbering symbols. In Step 3, the model identified this discrepancy
and attempted to adjust its plan accordingly. However, the model encountered further
challenges in failing to select all text using the ‘left_click_drag‘, and also in locating
the numbering button. Finally, in Step 6, the model did not correctly evaluate the
final outcome to verify the success of the applied numbering format. Please check
Section 4.4.7 for more discussions. Zoom in for the best view.
Planning. The blue captions in Figure 15 indicate the model’s planning phase, where
it formulates a step-by-step approach to adding numbering symbols to the text on the
second slide in PowerPoint. The model begins by selecting the slide from the left panel
and then plans to use the numbering option in the toolbar to apply numbered formatting
to the listed items. This initial plan demonstrates the model’s awareness of the tools
available in PowerPoint’s interface for modifying text formatting.
Action. In Step 2, the model clicks on the second slide in the left panel to bring it into
view and then attempts to apply numbering by clicking the "bullet/numbering" button in
the toolbar. However, this initial action applies bullet points instead, an oversight
that the model notices in Step 3. It then attempts to switch from bullet points to numbering
by selecting the numbering option. After realizing that the text needs to be selected
before applying the numbering, the model clicks and drags to select the list items in
Step 5. Following this selection, it attempts again to apply numbering. Despite this
adjusted approach, the model fails to apply the numbering format correctly due to
limitations in selection accuracy.
Critic. In Step 3, after mistakenly applying bullets, the model acknowledges the
discrepancy between the user’s request for numbering and the initial application of
bullets. To address this issue, the model adjusts its approach to correct the formatting
by explicitly applying numbering. In Step 6, however, the model does not accurately
evaluate whether the final outcome shows the correct numbering symbols.
Error. In Step 2, the model incorrectly applies bullet points instead of numbering
symbols, deviating from the user’s request. Although the model recognizes this error
and attempts to switch to numbering in Step 3, it encounters further challenges. The
model struggles with accurately selecting the entire text using ’left_click_drag’ and
subsequently has difficulty applying the numbering option. Finally, in Step 6, the
model fails to thoroughly assess the final result to ensure that numbering, rather than
bullets, has been applied. The numbering already present within the slide’s text content
may have misled the model’s observation.
This error section highlights limitations in the model’s precision when selecting text
and locating small functional buttons. Enhanced selection control and a more robust
final verification step could improve the model’s ability to meet such requirements.
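The missing final check could be made explicit. As a hedged sketch, assuming the slide text can be read back (e.g., via OCR or an accessibility API), a critic might verify that items carry sequential numbers rather than bullet glyphs:

```python
import re

# A sketch of a final verification step: given the list items as plain text
# lines, confirm each begins with the next sequential number ("1.", "2)", ...)
# rather than a bullet glyph. How the lines are obtained is assumed, not
# part of the case study.
def numbering_applied(lines):
    expected = 1
    for line in lines:
        m = re.match(r"\s*(\d+)[.)]\s+", line)
        if not m or int(m.group(1)) != expected:
            return False
        expected += 1
    return bool(lines)

ok = numbering_applied(["1. Project Alpha", "2. Project Beta"])   # numbered
bad = numbering_applied(["• Project Alpha", "• Project Beta"])    # bulleted
```

Such a check would have caught both the bullets-instead-of-numbers mistake in Step 2 and the unverified outcome in Step 6.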
4.4.8 Find and Replace in a Worksheet
User Query: Replace all '$' with 'SGD' in the current Microsoft Excel worksheet.
Step 1: I'll help you perform this find and replace operation in Excel. First, let me verify if Excel is open by taking a screenshot, and if not, I'll open it. I see that Excel is already open with a wedding budget template.
Step 2: Now I'll perform the find and replace operation to change all '$' symbols to 'SGD'. Here's how I'll do it: 1. Press Ctrl+H to open the Find and Replace dialog: 'key': 'Ctrl+h'
Step 3: 2. Now I'll type '$' in the "Find what" field: 'type': '$'. 3. Now I'll press Tab to move to the "Replace with" field and type 'SGD': 'key': 'Tab', 'type': 'SGD'. 4. Now I'll click "Replace All" to replace all instances: 'left_click': '[959, 700]'
Step 4: Excel confirms that 190 replacements were made. All dollar signs have been successfully replaced with SGD in the worksheet.
Figure 16: Representative task in Office Productivity Software. We categorize and
highlight the response of the model into: Planning , Action , and Critic . Please check
Section 4.4.8 for more discussions. Zoom in for the best view.
Planning. The blue captions in Figure 16 illustrate the model’s planning process for
performing a find-and-replace operation in an open Excel worksheet. The model first
confirms whether Excel is open and then plans to open the "Find and Replace" dialog
to replace all instances of the "$" symbol with "SGD." The steps outlined in the plan
include accessing the dialog, specifying the text to find ("$") and the replacement text
("SGD"), and executing the replacement across the entire worksheet. Moreover, the
model plans to use the keyboard shortcut "Ctrl+H" to execute this replacement. This plan
reflects the model’s familiarity with Excel’s functionality and its ability to structure a
multi-step approach.
Action. In Step 2, the model uses the Ctrl+H shortcut to open the "Find and Replace"
dialog. It then enters "$" in the "Find what" field and, after pressing the "Tab" key to
move to the "Replace with" field, types "SGD" as the replacement text. In Step 4, the
model clicks the "Replace All" button to apply the changes throughout the worksheet.
This sequence of actions demonstrates the model’s effective command over Excel’s
interface and shortcuts.
Critic. The green captions reflect the model’s verification process. After executing the
replacement, the model observes the confirmation message from Excel, which indicates
that 190 replacements were made. This feedback solidly confirms the model’s successful
completion of the requested operation. Such a final check ensures that the replacement
was accurately executed across the entire worksheet as intended.
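The Replace All semantics the model relied on can be mirrored in plain Python. The grid below is illustrative sample data, not the actual wedding-budget template:

```python
# A plain-Python analogue of Replace All: walk every cell of a small
# in-memory grid, replace "$" with "SGD", and count the replacements made,
# mirroring the confirmation dialog ("190 replacements were made") that the
# model used as its success signal.
def replace_all(grid, find, repl):
    count = 0
    for row in grid:
        for i, cell in enumerate(row):
            if find in cell:
                row[i] = cell.replace(find, repl)
                count += cell.count(find)
    return count

sheet = [["ITEM", "ESTIMATED", "ACTUAL"],
         ["Reception", "$5,000.00", "$5,300.00"],
         ["Apparel", "$4,500.00", "$5,000.00"]]
made = replace_all(sheet, "$", "SGD")
# made == 4; e.g. sheet[1][2] == "SGD5,300.00"
```

The returned count plays the same role as Excel's confirmation message: an externally observable signal the critic can compare against expectations.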
4.4.9 Insert a Sum Equation over Cells
User Query: Insert an Excel function to sum over the 'ACTUAL' column in the missing 'total budget' in Microsoft Excel.
Step 1: I'll help you insert a SUM function for the 'ACTUAL' column in Excel. First, I need to check the current state of Excel to locate the 'ACTUAL' column and determine where to insert the sum formula. Let me take a screenshot to see what's on the screen: I can see the Excel spreadsheet with a wedding budget template.
Step 2: The 'ACTUAL' column has values that need to be summed up, and there's an empty cell in the 'Total budget' row. Let me help insert the SUM function: 1. First, I'll click on the empty cell in the 'Total budget' row under the 'ACTUAL' column: 'left_click': '[292, 692]'
Step 3: 2. Now I'll type the SUM formula. The range will be from C6 (first estimated value) to C15 (last estimated value): 'left_click_drag': '[260, 312]' to '[260, 645]'. 3. Now I'll press Enter to complete the formula: 'key': 'Enter'
Step 4: The SUM function has been successfully inserted and will calculate the total of all actual expenses from rows 6 to 15. The total should match the existing $21,000.00 value that was already in the template. The formula =SUM(C6:C15) adds up all the values in the ACTUAL column, including:
• Reception ($5,300.00)
• Apparel ($5,000.00)
• Photography ($2,950.00)
• Decorations ($2,300.00)
• Transportation ($1,400.00)
• Gifts ($1,300.00)
• Music ($1,250.00)
• Flowers ($800.00)
• Stationery/Printing ($400.00)
• Other Expenses ($0.00)
Figure 17: Representative failure task in Office Productivity Software. We categorize
and highlight the response of the model into: Planning , Action , and Critic . The Error
here is as follows: While the model correctly identified in Step 2 that the “ACTUAL”
column needed to be summed, an error occurred in Step 3 where the model mistakenly
selected the range (C6 to C15) instead of the correct (D6 to D16). Also, the “Other
Expenses” item was erroneously excluded from the calculation. In Step 4, although
the model successfully extracted the details of each expense to verify the total, it failed
to detect the incorrect selection of the summed column to follow the user’s request.
Please check Section 4.4.9 for more discussions. Zoom in for the best view.
Planning. The blue captions in Figure 17 illustrate the model’s planning phase for
inserting a SUM function in Excel. The model begins by identifying that it needs to
sum the values in the "ACTUAL" column and insert the result into the "Total budget"
row under this column. The planning specifies the steps to locate the empty cell in the
"Total budget" row, apply the SUM function to calculate the total of values from the
"ACTUAL" column, and ensure that this total reflects the correct aggregation of listed
expenses. This plan shows the model’s understanding of which range of cells requires
a summation operation within the task context.
Action. In Step 2, the model clicks on the empty cell in the "Total budget" row under the
"ACTUAL" column to begin entering the formula. In Step 3, it types the SUM formula,
selecting the range from cell C6 to C15 with a ’left_click_drag’ operation, and presses
"Enter" to complete the function.
Critic. After entering the formula, the model provides an explanation of the SUM
function used, describing that it sums all values in the range specified, from C6 to C15.
However, the model’s feedback assumes the calculation has been performed correctly
without verifying the accuracy of the selected range. This feedback indicates a lack of
more thorough final confirmation, particularly in ensuring that the selected cells align
with the context from user’s request.
Error. In Step 3, an error occurs when the model mistakenly selects the range from C6
to C15 instead of the correct range, D6 to D16, for the "ACTUAL" column. Besides, the
model also excludes the "Other Expenses" row from the summation. These two errors
lead to an incorrect calculation. Although the model provides a detailed breakdown
of each item in the range as part of its verification in Step 4, it fails to detect the
discrepancy in the range selection, overlooking both the correct column and the missing
cell needed to sum the entire "ACTUAL" column.
This error case mainly showcases a limitation in the model’s range selection and mathe-
matical reasoning processes. Improved self-critic feedback and selection accuracy would
enhance the model’s ability to meet specific data processing requirements in Excel tasks
such as this case.
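A self-check of the kind suggested here can be sketched by recomputing the candidate range against the column the user named. The cell values below are illustrative, and the A1-style addressing helper is a simplification:

```python
# A minimal sketch of how a critic could catch the range error: compute the
# sum over the model's chosen column and over the column the user named,
# then flag any mismatch before reporting success.
def sum_range(cells, start, end):
    """Sum a vertical range such as ('C', 6)..('C', 8) from a dict of cells."""
    col, r0 = start
    _, r1 = end
    return sum(cells.get(f"{col}{r}", 0.0) for r in range(r0, r1 + 1))

cells = {
    # ESTIMATED (column C) and ACTUAL (column D); values illustrative only
    "C6": 5000.0, "C7": 4500.0, "C8": 3000.0,
    "D6": 5300.0, "D7": 5000.0, "D8": 2950.0,
}
wrong = sum_range(cells, ("C", 6), ("C", 8))    # model's column choice
right = sum_range(cells, ("D", 6), ("D", 8))    # column the user asked for
# wrong != right, so a critic that re-reads the referenced cells could
# detect the mismatch instead of assuming success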
4.5
Case Study: Video Games
Video games present some of the most challenging tasks for GUI automation models
due to several factors. First, strong planning capabilities are required, as successful
gameplay involves developing strategies, managing resources, and reasoning through
exploration. Unlike standard software, exploration in games is often more complex
because important information or clues are not always immediately visible or easily
identifiable, requiring more advanced planning and adaptability. Second, video games
demand robust grounding abilities, as the visual style and interface elements differ
widely depending on the game’s theme or genre. Many in-game buttons and controls
are often represented by icons or symbols without text labels, requiring the model to
generalize its understanding across varying visual designs. In some cases, the model
must infer the function of a button or control based on context and reasoning. These
challenges make video games a uniquely demanding environment for automation models,
requiring a combination of dynamic planning and visual grounding.
In our case study, we select two popular video games: Hearthstone and Honkai: Star
Rail, to evaluate the model’s capabilities in handling complex gaming environments.
Hearthstone is a card game that emphasizes strategic deck building and tactical decision-
making during turn-based matches. This game tests the model’s ability to plan multi-step
actions, manage resources, and adapt strategies based on the evolving state of the game.
Honkai: Star Rail, although also turn-based, places more emphasis on role-playing, with
rich graphics and dynamic interfaces that require more diverse interaction. It challenges
the model’s visual grounding abilities, as the game is set in a science-fantasy universe
with intricate menus, icons without text labels, and rapidly changing anime
scenes. While the turn-based interactions in these games are simpler compared to real-
time action games, our focus is to illustrate the model’s versatility beyond traditional
software interfaces, showcasing its adaptability to more complex, visually richer gaming
environments.
4.5.1 Hearthstone: Create and Rename a New Deck for Battle
System Message: Hearthstone is a turn-based card game where players aim to reduce the opponent's health using decks of minion and spell cards.
User Query: Let's start with creating a new Mage deck to play with. Select the 'Core Mage' preset deck, and rename it to 'Sonnet's New Deck' after you create it.
Step 1: I see we're already in Hearthstone. I notice there's a "New Deck" button in the bottom right. Let me click on that to start creating a new deck. 'left_click': '[1107, 549]'
Step 2: I'll select the Mage class since that's what we want. The Mage portrait is in the middle row on the left. 'left_click': '[470, 443]'
Step 3: I see there's a "Choose" button on the right side of the screen. I'll click that to proceed with creating the Mage deck. 'left_click': '[1082, 633]'
Step 4: I see the "Core Mage" preset deck option, which is what we want. I'll click on that. 'left_click': '[302, 445]'
Step 5: I'll click the "Choose" button to proceed with creating the Core Mage deck. 'left_click': '[813, 637]'. Now let me take another screenshot to see if we need to confirm or name the deck.
Step 6: I see the Core Mage deck title in the top right. Let me try clicking on it to rename it. 'left_click': '[1019, 113]'
Step 7: Now I'll type the new name "Sonnet's New Deck". I'll press Enter to confirm the name change. 'type': 'Sonnet's New Deck', 'key': 'Enter'
Step 8: I'll click 'OK' to confirm the name change. 'left_click': '[637, 590]'. We're done with the deck creation process.
Step 9: Let me take a screenshot to confirm we've completed the deck creation process. Great! We've successfully created a new Mage deck using the Core Mage preset and renamed it to "Sonnet's New Deck". The deck is now ready to use.
Figure 18: Representative task in Video Game: Hearthstone. We categorize and highlight
the response of the model into: Planning , Action , and Critic . Please check Section
4.5.1 for more discussions. Zoom in for the best view.
Planning. The blue captions in Figure 18 detail the model’s planning process for
creating and renaming a new deck in Hearthstone. The model begins by identifying that
it needs to create a new deck using the Mage class and selecting the "Core Mage" preset
deck option. After creating the deck, the model plans to rename it to "Sonnet’s New
Deck" as user’s request. This ordered approach reflects the model’s in-context learning
of Hearthstone’s deck creation process, whereas is not expected to be previously learned
by model, involving navigating through class selection, choosing a preset, and confirming
deck options.
Action. The brown captions highlight the sequence of actions the model performs to accomplish the deck creation and renaming. In Step 1, the model clicks the "New Deck" button to start the deck creation process. In Step 2, it selects the Mage class by clicking on the Mage portrait, and then clicks "Choose" in Step 3 to confirm the selection. In Step 4, the model locates and selects the "Core Mage" preset deck option, clicking "Choose" again in Step 5 to proceed with creating the deck. In Step 6, the model clicks the deck's title at the top of the screen, initiating the renaming function. It then types the new name "Sonnet's New Deck" in Step 7, presses "Enter" to confirm, and completes the process by clicking "OK" in Step 8. The model succeeds in navigating Hearthstone's user interface accurately, executing multiple steps to finalize both the deck creation and renaming.
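The click-and-type trace above can be viewed as a small action language. The sketch below is hypothetical and only logs actions instead of controlling the OS (a real deployment would call an automation library such as PyAutoGUI); it replays such a trace through a simple dispatcher:

```python
# Hypothetical sketch: replaying a recorded GUI action trace.
# Handlers only append to a log so the trace can be inspected
# without touching the OS; a real agent would perform the actions.

def make_executor(log):
    handlers = {
        "left_click": lambda arg: log.append(("click", tuple(arg))),
        "type": lambda arg: log.append(("type", arg)),
        "key": lambda arg: log.append(("key", arg)),
    }
    def execute(trace):
        for action, arg in trace:
            handlers[action](arg)  # unknown action names raise KeyError
    return execute

# Trace adapted from the deck-creation steps in Figure 18.
trace = [
    ("left_click", [302, 445]),     # select the "Core Mage" preset
    ("left_click", [813, 637]),     # confirm with "Choose"
    ("left_click", [1019, 113]),    # click the deck title to rename it
    ("type", "Sonnet's New Deck"),  # enter the new name
    ("key", "Enter"),               # confirm the name change
]

log = []
make_executor(log)(trace)
```

The dispatcher-over-dict shape keeps the recorded trace declarative, which also makes it easy to diff an executed trace against an expected one when debugging an agent run.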
Critic. The green captions reflect the model's feedback and verification at each stage of the process. After renaming the deck in Step 8, the model confirms that the name change was successful, based on visual feedback that the deck is now titled "Sonnet's New Deck." In the final step, the model verifies that it has completed all tasks by confirming the deck's availability for use.
This example showcases the model's proficiency in handling a multi-step task within a game interface. Although the user's instructions spell out the task steps in detail, the model's ability to interpret and interact with varied visual elements in a gaming context still strongly demonstrates its adaptability across complex interface designs.
4.5.2 Hearthstone: Hero Power
System Message: Hearthstone is a turn-based card game where players aim to reduce the opponent's health using decks of minion and spell cards.
User Query: Now, it is your turn. You can attack with your minions or use the Mage's hero power.
(Figure 19 transcript: in Step 1 the model reads the game state, playing as a Mage with 3/3 mana, facing a 1/4 and a 2/1 enemy minion, with the 2-mana, 1-damage Hero Power available. It clicks the Hero Power, targets the 2/1 minion as the most efficient kill, confirms the minion is eliminated, and, seeing no useful 1-mana play for the remaining mana, ends the turn, noting the board is now in a better state.)
Figure 19: Representative task in Video Game: Hearthstone. We categorize and highlight the responses of the model into: Planning, Action, and Critic. Please check Section 4.5.2 for more discussion. Zoom in for the best view.
Planning. The blue captions in Figure 19 illustrate the model’s planning process,
where it demonstrates an awareness of the initial game state in Hearthstone. The
model accurately identifies that it is playing as a Mage with 3 available mana points.
Recognizing that the Mage’s Hero Power (which costs 2 mana and deals 1 damage) is
available, the model evaluates the opponent’s board to find the most efficient target for
this power. Notably, it selects the enemy minion with 1 health, as this is exactly enough
for the Hero Power to eliminate it. As confirmed by skilled players, this is one of the
optimal actions for this turn. This decision-making process showcases the model’s ability
to interpret both game-specific resources (like mana) and effective targeting strategies
based on current game conditions.
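The targeting rule described above, prefer a minion that the 1-damage Hero Power kills outright, can be sketched as follows. This is a hypothetical illustration; the tuple-based board encoding is ours, not the model's:

```python
# Hypothetical sketch of the targeting heuristic: among enemy minions,
# prefer one whose health is fully covered by the Hero Power's damage.

HERO_POWER_DAMAGE = 1
HERO_POWER_COST = 2

def choose_target(minions, mana):
    """minions: list of (attack, health) tuples on the enemy board."""
    if mana < HERO_POWER_COST:
        return None  # cannot afford the Hero Power this turn
    lethal = [m for m in minions if m[1] <= HERO_POWER_DAMAGE]
    # Among lethal targets, removing the highest-attack minion
    # denies the opponent the most future damage.
    return max(lethal, key=lambda m: m[0], default=None)

board = [(1, 4), (2, 1)]  # the board state in Figure 19
print(choose_target(board, mana=3))  # -> (2, 1)
```

With the Figure 19 board, the only minion the Hero Power kills outright is the 2/1, matching the model's choice.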
Action. Unlike standard software applications, which often have flat and straightforward interface designs, Hearthstone's interface is richly illustrated in a fantasy art style laid out over a game board, making icons and elements more challenging to distinguish. Despite this complexity, the model successfully locates the Hero Power icon and identifies the relevant minions on the opponent's board. Additionally, the model demonstrates the ability to interpret visual attributes, such as health points displayed as red numbers on minions, to evaluate each target's vulnerability. This capability enables the model to interact effectively within the stylized gaming environment and make well-informed moves.
Critic. Upon using the Hero Power to eliminate the 2/1 minion, the model verifies the game state, observing that the board now has only one enemy minion remaining, resulting in a more favorable situation. With no efficient way to spend the remaining 1 mana, the model decides to end the turn. Further actions might have been possible, but the core action of this turn, using the Hero Power to eliminate the 2/1 minion, was successfully achieved. This process reflects the model's ability to analyze the game state and make strategic decisions based on available resources. The model can also generalize its critic function to such a visually complex and stylized gaming context, as in its observation of the final board state here.
4.5.3 Honkai: Star Rail: Warp Automation
Planning. The blue captions in Figure 20 illustrate the model's planning for automating a 10-warp pull sequence in Honkai: Star Rail. For this task, we provide detailed step-by-step instructions for the model to follow. The model starts by analyzing the necessary steps: accessing the Warp menu, selecting the "Eyes of a Ninja" warp option for the 10-warp pull, and initiating the warp sequence. Once the pull starts, the model plans to skip the warp animations using the skip arrow in the upper-right corner whenever it appears, and finally to close the summary screen once the warp pull is complete.
Action. In Step 1, the model accesses the game menu by pressing "Escape" and then
navigates to the Warp icon in Step 2. Upon entering the Warp screen, the model locates
and selects the "Eyes of a Ninja" banner, choosing the 10-warp option in Step 3. Once
the warp sequence begins, the model repeatedly clicks the skip arrow in the animation
screen (as seen in Steps 4 through 7) to bypass the animations, expediting the process.
At the end of the warping sequence, the model proceeds to the summary screen in Step
8, where it clicks the "X" icon to exit and finish the pull.
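The screenshot-check-click loop of Steps 4 through 7 follows a simple pattern; below is a hypothetical sketch, with screens faked as strings so it runs without a game client:

```python
# Hypothetical sketch of the skip loop in Figure 20: inspect the screen
# after each (simulated) screenshot, click the skip arrow whenever an
# animation is showing, and stop once the summary page appears.

def skip_animations(screens, max_steps=20):
    clicks = 0
    for step, screen in enumerate(screens):
        if step >= max_steps:
            break
        if screen == "summary":
            return ("close_summary", clicks)  # click the cross and stop
        if screen == "animation":
            clicks += 1                       # click the skip arrow
        # "loading" or anything else: just take another screenshot
    return ("timeout", clicks)

result = skip_animations(["animation"] * 4 + ["loading", "summary"])
print(result)  # -> ('close_summary', 4)
```

The `max_steps` cap matters in practice: an agent that clicks in a loop with no bound can get stuck if the expected UI element never appears.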
Critic. During the warp animation, the model frequently checks for the skip arrow
in the upper right corner, confirming each click to skip the cutscene efficiently. On
the summary page, it verifies the pull results, noting the characters acquired and
ensuring the warp process is complete before clicking to exit. This consistent feedback
loop provides confidence that each phase was successfully executed, and the desired
outcomes—acquiring characters and closing the warp summary—were achieved as
intended.
This example fully demonstrates the model's ability to accurately follow multi-step sequences in unfamiliar game environments when aided by sufficiently detailed user instructions. This finding emphasizes the importance of user instructions and highlights the model's strong alignment with them. As an aiding tool, the model effectively automates and streamlines a repetitive gaming task, improving the player's experience.
4.5.4 Honkai: Star Rail: Daily Mission Cleanup Automation
Planning. The blue captions in Figure 21 describe the model's planning phase for automating a daily mission in Honkai: Star Rail. The model starts by analyzing the steps needed to accomplish the task: accessing the Interastral Guide, selecting the specific mission (Calyx "Golden for EXP"), adjusting the number of attempts, starting the challenge, activating the auto-battle mode, and finally exiting after the battle concludes. Each step is carefully designed based on the game's interface and the expected
System Prompt: You are currently playing Honkai: Star Rail, a turn-based role-playing game. Honkai: Star Rail follows the gameplay style of classic Japanese role-playing games: players build up a lineup of characters and control a team of up to four in turn-based combat.
User Query: Your objective is to perform a 10-warp pull for me. Press Escape on the keyboard to open the menu, then click Warp; it should open the warp page, and the first entry on the left side will be "Eyes of a Ninja", the destination pool. Then click on "Warp ×10" to perform a 10-warp pull, then click on the blank space at the upper-right corner to reveal the arrow there, then click on the arrow to skip the animation. Always click on the arrow to continue skipping the animation if there is an arrow at the upper-right corner. Only when all animations are skipped by clicking on the arrows will the pull summary page appear, with a cross there; click on the cross to finish the pull. Good luck!
(Figure 20 transcript: in Steps 1–3 the model presses Escape, clicks the Warp icon, confirms it is on the warp page with the "Eyes of a Ninja" banner, and clicks "Warp ×10". In Steps 4–7 it repeatedly clicks the upper-right corner between screenshots to reveal and press the skip arrow, noting 5-star characters along the way. In Steps 8–9 it clicks the cross on the summary page and reports the 10-warp pull complete.)
Figure 20: Representative task in Video Game: Honkai: Star Rail. We categorize and highlight the responses of the model into: Planning, Action, and Critic. Please check Section 4.5.3 for more discussion. Zoom in for the best view.
interactions, showcasing the model's understanding of the game's task flow and its strict adherence to the sequence required to automate this routine mission.
Action. In Step 1, the model presses the "Escape" key to open the game menu and navigates to the Interastral Guide. In Step 3, it selects "Calyx Golden for EXP" and then locates and clicks the "Bud of Memories" entry to begin the mission setup. In Step 5, the model increases the number of challenge attempts to 6 by clicking the "+" button multiple times. It then initiates the challenge by clicking "Start Challenge" in
System Message: You are currently playing Honkai: Star Rail, a turn-based role-playing game. Honkai: Star Rail follows the gameplay style of classic Japanese role-playing games: players build up a lineup of characters and control a team of up to four in turn-based combat.
User Query: Your objective is to finish a daily game task for me. Press Escape on the keyboard to open the menu, then click Interastral Guide, then click the "Calyx Golden for EXP" entry on the left side of the popped-up game window. Only then click the Teleport button on the same line as the first entry, named "Buds of MEMORIES" (you need to carefully check the name), then click the "+" button 5 times to increase attempts to 6, then click Challenge, then click Start Challenge. Then click the auto-battle button at the upper-right corner; carefully count from the right to the left, it should be the second icon, near the "pause" icon, and it looks like an "infinity" symbol. Then click Exit when the battle is completed.
(Figure 21 transcript: in Steps 1–6 the model presses Escape, opens the Interastral Guide, selects the Calyx (Golden) option, clicks the Teleport button for "Bud of Memories", clicks "+" five times to set 6 attempts, and starts the challenge. In Step 7 it clicks the auto-battle button, then in Steps 8–10 it waits and takes further screenshots while the battle runs, reading progress cues such as "Battle Start 3/6", and in Step 11 it clicks Exit once the battle is complete.)
Figure 21: Representative task in Video Game: Honkai: Star Rail. We categorize and highlight the responses of the model into: Planning, Action, and Critic. Please check Section 4.5.4 for more discussion. Zoom in for the best view.
Step 6. Once the battle starts, the model activates auto-battle mode by clicking the icon in Step 7. It waits for the battle to complete, periodically taking screenshots to check the battle's progress. Finally, after confirming that the mission is complete, the model exits the challenge in Step 11.
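The wait-and-recheck behavior of Steps 8 through 11 is a bounded polling loop; below is a hypothetical sketch with simulated battle states, where each poll stands in for a screenshot plus a model call:

```python
# Hypothetical sketch of the polling pattern in Steps 8-11: keep
# checking the (simulated) battle state until it reports completion,
# with a cap on how many polls the agent will spend waiting.

def wait_for_battle(states, max_polls=10):
    for polls, state in enumerate(states, start=1):
        if polls > max_polls:
            return ("gave_up", polls - 1)
        if state == "complete":
            return ("exit_clicked", polls)  # click Exit and stop
    return ("gave_up", len(states))

states = ["Battle Start 1/6", "Battle Start 3/6", "Battle Start 6/6", "complete"]
print(wait_for_battle(states))  # -> ('exit_clicked', 4)
```

In a real agent, a short sleep between polls keeps the screenshot-and-reason cycle from burning tokens while the auto-battle runs.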
Critic. After each action, the model confirms its current state, ensuring the targeted steps executed as planned. For instance, after initiating the auto-battle in Step 7, it monitors the battle's progress through periodic checks, observing visual cues (like "Battle Start" indicators) to determine the stage of completion. Upon reaching the completion screen in Step 11, the model verifies that the task has concluded successfully, indicating that the mission has been fully automated and completed. This feedback process demonstrates the model's capacity to adapt its actions based on in-game visual feedback, confirming each step's success across such a long-trajectory task.
This example mainly showcases the model's adeptness at navigating a complex game interface and handling a long-trajectory task that requires multi-step interactions. The model successfully automates a daily mission routine, setting the number of attempts and monitoring the battle's progress, all while staying consistent with the user's request. This capability highlights the model's great potential in aiding complicated or repetitive gaming tasks that blend strategy, automation, and real-time evaluation.
5 Discussion
5.1 Error Categorization
We present some representative failure cases from the evaluation in Sections 4.2.3, 4.4.4, 4.4.7, and 4.4.9. These cases highlight specific areas where the model's actions did not align with the user's intended outcomes, revealing limitations in its task comprehension and/or execution. Although the errors that cause task failures are varied, we propose to categorize them according to our evaluation aspects, explicitly into three sources: Planning Error (PE), Action Error (AE), and Critic Error (CE). These categories, with examples, may help in systematically identifying the root cause of each failure:
1. Planning Error: Planning errors occur when the model generates an incorrect plan from the task query, often due to misinterpreting the task instructions or misunderstanding the current computer state. For example, Task - Fox Sports Subscription, Figure 4.
2. Action Error: Action errors occur when the agent fails to perform the correct action even though the plan itself is accurate. These errors often stem from shortcomings in interface understanding, spatial recognition, or precise control within the GUI environment. For example, Task - Insert a Sum Equation over Cells, Figure 17.
3. Critic Error: Critic errors occur when the agent incorrectly assesses its own actions or the computer state, leading to erroneous task-completion feedback. For example, Task - Update Name and Phone Number on Resume Template, Figure 12, and Task - Insert Numbering Symbol, Figure 15.
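This three-way attribution can be encoded as a small helper. The sketch below is hypothetical; the stage flags (`plan_ok`, `action_ok`, `critic_ok`) are assumed to come from manual annotation of a failed run:

```python
# Hypothetical encoding of the PE/AE/CE taxonomy. A failure is
# attributed to the earliest stage that went wrong: a bad plan makes
# the action moot, and a bad action makes the critic's verdict moot.

from enum import Enum

class ErrorType(Enum):
    PLANNING = "PE"
    ACTION = "AE"
    CRITIC = "CE"

def classify_failure(plan_ok, action_ok, critic_ok):
    if not plan_ok:
        return ErrorType.PLANNING
    if not action_ok:
        return ErrorType.ACTION
    if not critic_ok:
        return ErrorType.CRITIC
    return None  # no error: the task succeeded

# Fox Sports Subscription (Figure 4): the plan itself was wrong.
print(classify_failure(False, True, True).value)  # -> PE
# Insert a Sum Equation (Figure 17): good plan, misplaced action.
print(classify_failure(True, False, True).value)  # -> AE
```

Ordering the checks this way makes the attribution deterministic even when several stages look faulty in hindsight.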
5.2 Toward Future GUI Agents
Future benchmarks for API-based Computer Use models. Future benchmarks critically need more dynamic and interactive environments that accurately reflect real-world complexities, e.g., different versions of the same software as providers release updates. Moreover, we find that screen resolution is vital for GUI agents, and this diversity should also be covered in future benchmarks. Current static datasets and limited interaction paradigms restrict the assessment of an agent's adaptability and capacity to respond to real-world applications.
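As one concrete aspect of this diversity, a coordinate predicted at one screen resolution must be rescaled for another; a minimal, hypothetical sketch:

```python
# Hypothetical sketch of resolution handling: linearly rescale a
# coordinate predicted on one screen resolution onto another. (The
# traces in Figures 20-21 were captured on a high-resolution display,
# hence coordinates like [3711, 112].)

def rescale(point, src, dst):
    """point: (x, y); src/dst: (width, height) resolutions."""
    x, y = point
    sw, sh = src
    dw, dh = dst
    return (round(x * dw / sw), round(y * dh / sh))

# A click at (3711, 112) on a 3840x2160 screen maps onto 1920x1080:
print(rescale((3711, 112), (3840, 2160), (1920, 1080)))  # -> (1856, 56)
```

Even a one-pixel rounding difference is usually harmless for buttons, but benchmarks that vary resolution would expose models whose coordinate predictions do not transfer this cleanly.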
Critic error correction. Our evaluations reveal that the model frequently misjudges task completion, in particular prematurely assuming a task has been completed. This tendency highlights a shortfall in the model's self-assessment mechanisms. Although some of these problems can be addressed through prompting, a complete solution may require improvements to the GUI agent framework, such as an internalized strict critic module.
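One possible shape for such a module is a wrapper that refuses to report success until an independent post-condition check on the observed state passes; a hypothetical sketch (the rename task and the state dictionary are illustrative, not part of any existing framework):

```python
# Hypothetical sketch of a strict critic wrapper: success is only
# reported after a post-condition check on the observed state, rather
# than trusting the model's own completion claim.

def with_strict_critic(action, postcondition):
    def wrapped(state):
        new_state = action(state)
        if not postcondition(new_state):
            raise RuntimeError("critic: postcondition failed, task not done")
        return new_state
    return wrapped

# Toy task: renaming a deck. The check reads the state back instead
# of assuming the rename happened.
rename = with_strict_critic(
    action=lambda s: {**s, "deck_name": "Sonnet's New Deck"},
    postcondition=lambda s: s.get("deck_name") == "Sonnet's New Deck",
)
print(rename({"deck_name": "Core Mage"})["deck_name"])
```

In a real agent, `postcondition` would be a fresh screenshot plus a grounded check, so the completion signal cannot come from the same reasoning step that produced the action.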
Discrepancy with real human computer use. The current model still fails to fully replicate nuanced human computer use, for example, page scrolling and rigorous browsing. An obvious drawback is that page scrolling based on "Page Up/Down" shortcuts loses much of the page's coherence, resulting in fragmented or incomplete interface information. These discrepancies are largely attributable to limitations in training data, which may not fully capture the variability and context-specific adaptations seen in human users.
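One mitigation in this direction is to replace full-page jumps with small, overlapping scroll steps, so consecutive views share context; a hypothetical sketch, with the page faked as a list of lines:

```python
# Hypothetical sketch of human-like scrolling: instead of one
# "Page Down" jump, take small steps with an overlap so consecutive
# views share a few lines of context.

def scroll_views(page, view=10, overlap=3):
    step = view - overlap
    views, top = [], 0
    while True:
        views.append(page[top:top + view])
        if top + view >= len(page):
            break  # the current view already reaches the page end
        top += step
    return views

page = [f"line {i}" for i in range(25)]
views = scroll_views(page)
# Each consecutive pair of views shares `overlap` lines of context.
print(len(views), views[1][0])  # -> 4 line 7
```

The overlap is the point: with it, information that straddles a view boundary appears whole in at least one view, which a "Page Up/Down" jump does not guarantee.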
6 Conclusion
In this study, we presented a preliminary case study of an API-based GUI agent, Claude 3.5 Computer Use, focusing on its performance across diverse desktop environments, including web navigation, workflow, productivity software, and video games. Our case study highlights both the potential and the limitations of the current model, particularly in the aspects of planning, action execution, and critic feedback. By providing an out-of-the-box framework, Computer Use Out-of-the-Box, we aim to bridge the accessibility gap and seamlessly deploy and benchmark these models in real-world scenarios. We hope our framework and evaluation approach will contribute to the foundation for further advancements in GUI agent research, driving progress toward more sophisticated and reliable automated computer use models.