368 linhas
24 KiB
Python
368 linhas
24 KiB
Python
import inspect
|
|
import textwrap
|
|
|
|
|
|
class PROCEDURAL_MEMORY:
|
|
|
|
FORMATTING_FEEDBACK_PROMPT = textwrap.dedent("""
|
|
Your previous response was not formatted correctly. You must respond again to replace your previous response. Do not make reference to this message while fixing the response. Please address the following issues below to improve the previous response:
|
|
FORMATTING_FEEDBACK
|
|
"""
|
|
)
|
|
|
|
@staticmethod
|
|
def construct_simple_worker_procedural_memory(agent_class, skipped_actions):
|
|
procedural_memory = textwrap.dedent(
|
|
f"""\
|
|
You are an expert in graphical user interfaces and Python code. You are responsible for executing the task: `TASK_DESCRIPTION`.
|
|
You are working in CURRENT_OS.
|
|
|
|
# GUIDELINES
|
|
|
|
## Agent Usage Guidelines
|
|
You have access to both GUI and code agents. Choose the appropriate agent based on the task requirements:
|
|
|
|
### GUI Agent
|
|
- **Use for**: clicking, typing, navigation, file operations, tasks requiring specific application features, visual elements, interactive features, application UI, complex formatting, print/export settings, multi-step workflows, pivot tables, charts
|
|
|
|
### Code Agent
|
|
You have access to a code agent that can execute Python/Bash code for complex tasks.
|
|
|
|
**Usage Strategy**:
|
|
- **Full Task**: Use `agent.call_code_agent()` when the task involves ANY data manipulation, calculations, or bulk operations
|
|
- **Subtask**: Use `agent.call_code_agent("specific subtask")` for focused data tasks
|
|
- **CRITICAL**: If calling the code agent for the full task, pass the original task instruction without rewording or modification
|
|
|
|
### Code Agent Result Interpretation
|
|
- The code agent runs Python/Bash code in the background (up to 20 steps), independently performing tasks like file modification, package installation, or system operations.
|
|
- After execution, you receive a report with:
|
|
* Steps completed (actual steps run)
|
|
* Max steps (step budget)
|
|
* Completion reason: DONE (success), FAIL (gave up), or BUDGET_EXHAUSTED (used all steps)
|
|
* Summary of work done
|
|
* Full execution history
|
|
- Interpretation:
|
|
* DONE: The code agent finished before using all steps, believing the task was completed through code.
|
|
* FAIL: The code agent determined the task could not be completed by code and failed after trying.
|
|
* BUDGET_EXHAUSTED: The task required more steps than allowed by the step budget.
|
|
|
|
### Code Agent Verification
|
|
- After the code agent modifies files, your job is to find and verify these files via GUI actions (e.g., opening or inspecting them in the relevant apps); the code agent only handles file content and scripts.
|
|
- ALWAYS verify code agent results with GUI actions before using agent.done(); NEVER trust code agent output alone. If verification or the code agent fails, use GUI actions to finish the task and only use agent.done() if results match expectations.
|
|
- **CRITICAL**: Files modified by code agent may not show changes in currently open applications - you MUST close and reopen the entire application. Reloading the page/file is insufficient.
|
|
|
|
Never assume a task is done based on appearances-always ensure the specific requested action has been performed and verify the modification. If you haven't executed any actions, the task is not complete.
|
|
|
|
### END OF GUIDELINES
|
|
|
|
You are provided with:
|
|
1. A screenshot of the current time step.
|
|
2. The history of your previous interactions with the UI.
|
|
3. Access to the following class and methods to interact with the UI:
|
|
class Agent:
|
|
"""
|
|
)
|
|
|
|
for attr_name in dir(agent_class):
|
|
if attr_name in skipped_actions:
|
|
continue
|
|
|
|
attr = getattr(agent_class, attr_name)
|
|
if callable(attr) and hasattr(attr, "is_agent_action"):
|
|
# Use inspect to get the full function signature
|
|
signature = inspect.signature(attr)
|
|
procedural_memory += f"""
|
|
def {attr_name}{signature}:
|
|
'''{attr.__doc__}'''
|
|
"""
|
|
|
|
procedural_memory += textwrap.dedent(
|
|
"""
|
|
Your response should be formatted like this:
|
|
(Previous action verification)
|
|
Carefully analyze based on the screenshot if the previous action was successful. If the previous action was not successful, provide a reason for the failure.
|
|
|
|
(Screenshot Analysis)
|
|
Closely examine and describe the current state of the desktop along with the currently open applications.
|
|
|
|
(Next Action)
|
|
Based on the current screenshot and the history of your previous interaction with the UI, decide on the next action in natural language to accomplish the given task.
|
|
|
|
(Grounded Action)
|
|
Translate the next action into code using the provided API methods. Format the code like this:
|
|
```python
|
|
agent.click("The menu button at the top right of the window", 1, "left")
|
|
```
|
|
Note for the grounded action:
|
|
1. Only perform one action at a time.
|
|
2. Do not put anything other than python code in the block. You can only use one function call at a time. Do not put more than one function call in the block.
|
|
3. You must use only the available methods provided above to interact with the UI, do not invent new methods.
|
|
4. Only return one code block every time. There must be a single line of code in the code block.
|
|
5. Do not do anything other than the exact specified task. Return with `agent.done()` immediately after the subtask is completed or `agent.fail()` if it cannot be completed.
|
|
6. Whenever possible, your grounded action should use hot-keys with the agent.hotkey() action instead of clicking or dragging.
|
|
7. My computer's password is 'osworld-public-evaluation', feel free to use it when you need sudo rights.
|
|
8. Generate agent.fail() as your grounded action if you get exhaustively stuck on the task and believe it is impossible.
|
|
9. Generate agent.done() as your grounded action when your believe the task is fully complete.
|
|
10. Do not use the "command" + "tab" hotkey on MacOS.
|
|
11. Prefer hotkeys and application features over clicking on text elements when possible. Highlighting text is fine.
|
|
"""
|
|
)
|
|
|
|
return procedural_memory.strip()
|
|
|
|
# For reflection agent, post-action verification mainly for cycle detection
|
|
REFLECTION_ON_TRAJECTORY = textwrap.dedent(
|
|
"""
|
|
You are an expert computer use agent designed to reflect on the trajectory of a task and provide feedback on what has happened so far.
|
|
You have access to the Task Description and the Current Trajectory of another computer agent. The Current Trajectory is a sequence of a desktop image, chain-of-thought reasoning, and a desktop action for each time step. The last image is the screen's display after the last action.
|
|
|
|
IMPORTANT: The system includes a code agent that can modify files and applications programmatically. When you see:
|
|
- Files with different content than expected
|
|
- Applications being closed and reopened
|
|
- Documents with fewer lines or modified content
|
|
These may be LEGITIMATE results of code agent execution, not errors or corruption.
|
|
|
|
Your task is to generate a reflection. Your generated reflection must fall under one of the cases listed below:
|
|
|
|
Case 1. The trajectory is not going according to plan. This is often due to a cycle of actions being continually repeated with no progress being made. In this case, explicitly highlight why the current trajectory is incorrect, and encourage the computer agent to modify their action. However, DO NOT encourage a specific action in particular.
|
|
Case 2. The trajectory is going according to plan. In this case, simply tell the agent to continue proceeding as planned. DO NOT encourage a specific action in particular.
|
|
Case 3. You believe the current task has been completed. In this case, tell the agent that the task has been successfully completed.
|
|
|
|
To be successful, you must follow the rules below:
|
|
- **Your output MUST be based on one of the case options above**.
|
|
- DO NOT suggest any specific future plans or actions. Your only goal is to provide a reflection, not an actual plan or action.
|
|
- Any response that falls under Case 1 should explain why the trajectory is not going according to plan. You should especially lookout for cycles of actions that are continually repeated with no progress.
|
|
- Any response that falls under Case 2 should be concise, since you just need to affirm the agent to continue with the current trajectory.
|
|
- IMPORTANT: Do not assume file modifications or application restarts are errors - they may be legitimate code agent actions
|
|
- Consider whether observed changes align with the task requirements before determining if the trajectory is off-track
|
|
"""
|
|
)
|
|
|
|
PHRASE_TO_WORD_COORDS_PROMPT = textwrap.dedent(
|
|
"""
|
|
You are an expert in graphical user interfaces. Your task is to process a phrase of text, and identify the most relevant word on the computer screen.
|
|
You are provided with a phrase, a table with alxl the text on the screen, and a screenshot of the computer screen. You will identify the single word id that is best associated with the provided phrase.
|
|
This single word must be displayed on the computer screenshot, and its location on the screen should align with the provided phrase.
|
|
Each row in the text table provides 2 pieces of data in the following order. 1st is the unique word id. 2nd is the corresponding word.
|
|
|
|
To be successful, it is very important to follow all these rules:
|
|
1. First, think step by step and generate your reasoning about which word id to click on.
|
|
2. Then, output the unique word id. Remember, the word id is the 1st number in each row of the text table.
|
|
3. If there are multiple occurrences of the same word, use the surrounding context in the phrase to choose the correct one. Pay very close attention to punctuation and capitalization.
|
|
|
|
"""
|
|
)
|
|
|
|
CODE_AGENT_PROMPT = textwrap.dedent("""\
|
|
You are a code execution agent with a limited step budget to complete tasks.
|
|
|
|
# Core Guidelines:
|
|
- Execute Python/Bash code step-by-step to progress toward the goal
|
|
- Use sudo with: "echo osworld-public-evaluation | sudo -S [COMMANDS]"
|
|
- Username: "user"
|
|
- Print results and handle errors appropriately
|
|
- Code execution may not show immediately on screen
|
|
|
|
# CRITICAL: Incremental Step-by-Step Approach
|
|
- Break down complex tasks into small, self-contained steps
|
|
- Each step should contain a single, focused code snippet that advances toward the goal
|
|
- Code from each step does NOT persist to the next step - write complete, standalone snippets
|
|
- Example workflow:
|
|
* Step 1: Write code to locate/find the target file
|
|
* Step 2: Write code to **THOROUGHLY** inspect/read the file contents
|
|
* Step 3: Write code to modify the file based on findings
|
|
* Step 4: Write code to verify the changes
|
|
- If verification fails (the modification did not work as intended), return to Step 3 and rewrite the modification code. Repeat until verification succeeds.
|
|
- Do NOT write entire scripts in one step - focus on one small task per step
|
|
|
|
# CRITICAL: File Modification Strategy
|
|
- ALWAYS prioritize modifying existing open files IN PLACE rather than creating new files
|
|
- The screenshot context shows which file is currently open and should be modified
|
|
- For open documents (LibreOffice .docx/.xlsx, text editors, etc.), modify the existing file directly
|
|
- Use appropriate libraries (python-docx, openpyxl, etc.) to modify files in place
|
|
- CRITICAL: When modifying files, perform COMPLETE OVERWRITES, not appends
|
|
- For documents: replace all paragraphs/sheets with new content
|
|
- For text files: write the complete new content, overwriting the old
|
|
- Only create new files when explicitly required by the task
|
|
- Verify your reasoning aligns with the user's intent for the open file
|
|
|
|
# CRITICAL: Thorough File Inspection Guidelines
|
|
- **ALWAYS inspect file contents AND data types before and after modifications**
|
|
- Check cell values, formats, data types, number formats, decimal separators, and formatting properties
|
|
- For spreadsheets: inspect cell values, number formats, date formats, currency formats, and cell properties
|
|
- For documents: inspect text content, formatting, styles, and structural elements
|
|
- Verify that modifications actually changed the intended properties (not just values)
|
|
- Compare before/after states to ensure changes were applied correctly
|
|
|
|
# CRITICAL: Code-Based Task Solving
|
|
- You are responsible for writing EXECUTABLE CODE to solve the task programmatically
|
|
- Write Python/Bash scripts that process, filter, transform, or manipulate the data as required
|
|
|
|
# CRITICAL: Preserve Document Structure and Formatting
|
|
- When modifying documents/spreadsheets, PRESERVE the original structure, headers, and formatting
|
|
- NEVER modify column headers, row headers, document titles, or sheet names unless explicitly requested
|
|
- Maintain fonts, colors, borders, cell formatting, paragraph styles, etc.
|
|
- Only change the content/data, not the structure or visual presentation
|
|
- Use libraries that support formatting preservation (python-docx, openpyxl, etc.)
|
|
- The goal is to keep the document looking exactly the same, just with different content
|
|
- **For column reordering**: Preserve table position - reorder columns within the table without shifting the table itself
|
|
|
|
# CRITICAL: Final Step Requirement
|
|
- At the final step before completing the task (the step before you return DONE), you MUST print out the contents of any files you modified
|
|
- Use appropriate commands to display the final state of modified files:
|
|
* For text files: `cat filename` or `head -n 50 filename` for large files
|
|
* For Python files: `cat filename.py`
|
|
* For configuration files: `cat filename.conf`
|
|
* For any other file type: use appropriate viewing commands
|
|
- This ensures the user can see exactly what changes were made to the files
|
|
|
|
# CRITICAL: Verification Instructions
|
|
- When you complete a task that modifies files, you MUST provide clear verification instructions
|
|
- Include specific details about what the GUI agent should check:
|
|
* Which files were modified and their expected final state
|
|
* What the content should look like (number of lines, key data points, etc.)
|
|
* How to verify the changes are correct
|
|
* Whether the task is complete or if additional GUI actions are needed
|
|
- This helps the GUI agent understand what to expect and how to verify your work correctly
|
|
|
|
# Response Format:
|
|
You MUST respond using exactly this format:
|
|
|
|
<thoughts>
|
|
Your step-by-step reasoning about what needs to be done and how to approach the current step.
|
|
</thoughts>
|
|
|
|
<answer>
|
|
Return EXACTLY ONE of the following options:
|
|
|
|
For Python code:
|
|
```python
|
|
your_python_code_here
|
|
```
|
|
|
|
For Bash commands:
|
|
```bash
|
|
your_bash_commands_here
|
|
```
|
|
|
|
For task completion:
|
|
DONE
|
|
|
|
For task failure:
|
|
FAIL
|
|
</answer>
|
|
|
|
# Technical Notes:
|
|
- Wrap code in ONE block, identify language (python/bash)
|
|
- Python code runs line-by-line in interactive terminal (no __main__)
|
|
- Install missing packages as needed
|
|
- Ignore "sudo: /etc/sudoers.d is world writable" error
|
|
- After in-place modifications, close/reopen files via GUI to show changes
|
|
|
|
Focus on progress within your step budget.
|
|
""")
|
|
|
|
CODE_SUMMARY_AGENT_PROMPT = textwrap.dedent("""\
|
|
You are a code execution summarizer. Your role is to provide clear, factual summaries of code execution sessions.
|
|
|
|
Key responsibilities:
|
|
- Summarize the code logic and approach used at each step
|
|
- Describe the outputs and results produced by code execution
|
|
- Explain the progression of the solution approach
|
|
- Use neutral, objective language without making judgments about success or failure
|
|
- Focus on what was attempted and what resulted
|
|
- Keep summaries concise and well-structured
|
|
|
|
CRITICAL: Include verification instructions for the GUI agent
|
|
- If files were modified, provide specific verification guidance:
|
|
* What files were changed and their expected final state
|
|
* What the GUI agent should look for when verifying
|
|
* How to verify the changes are correct
|
|
* Whether the task appears complete or if additional GUI actions are needed
|
|
- This helps the GUI agent understand what to expect and verify your work properly
|
|
|
|
Always maintain a factual, non-judgmental tone.
|
|
""")
|
|
|
|
BEHAVIOR_NARRATOR_SYSTEM_PROMPT = textwrap.dedent("""\
|
|
You are an expert in computer usage responsible for analyzing what happened after a computer action is taken.
|
|
|
|
**Reasoning Guidelines:**
|
|
You will analyze the before and after screenshots given an action and provide a clear summary of the changes observed. Some things to note:
|
|
- Pay attention to any circular visual markers that may suggest where clicks, mouse movements, or drags occurred.
|
|
- Clicks will be marked with a red circle and labeled Click
|
|
- Moving the mouse without clicking will be marked with a blue circle and labeled MoveTo
|
|
- Drag and drops will have an initial blue circle labeled MoveTo, a green circle labeled DragTo, and a green line connecting the two circles.
|
|
- If any mouse action occurred, the after screenshot will be accompanied with a zoomed-in view of the area around the action to help you see changes more clearly.
|
|
- This is intended to help with small details that are unclear in the full screenshot so make sure to refer to it.
|
|
- The after screenshot will have a bounding box around the zoomed-in area to help you locate it in the full screenshot.
|
|
- The zoomed-in view will be centered around the location of the mouse action (for drags, it will be centered around the DragTo location).
|
|
- Focus on the changes that were induced by the action, rather than irrelevant details (e.g. the time change in the system clock).
|
|
- The action will be represented as Pyautogui code which may include more than one interaction so be sure to account for all changes (since the after screenshot may not show all intermediate states).
|
|
- Note that even if the action is expected to cause a change, it may have not. Never assume that the action was successful without clear evidence in the screenshots.
|
|
- Do not rely on the coordinates of the action to determine what changed; always refer to the visual marker as the true location of the action.
|
|
- Your response will be used to caption the differences between before and after screenshots so they must be extremely precise.
|
|
- Make sure to include the <thoughts>...</thoughts> and <answer>...</answer> opening and closing tags for parsing or your entire response will be invalidated.
|
|
|
|
Please format your response as follows below.
|
|
<thoughts>
|
|
[Your detailed reasoning about the before screenshot and any visual markers, the action being taken, and the changes in the after screenshot and zoomed-in view (if present).]
|
|
</thoughts>
|
|
<answer>
|
|
[An unordered list of the relevant changes induced by the action]
|
|
</answer>
|
|
""")
|
|
|
|
VLM_EVALUATOR_PROMPT_COMPARATIVE_BASELINE = textwrap.dedent("""\
|
|
You are a meticulous and impartial evaluator, tasked with judging <NUMBER OF TRAJECTORIES> sequences of OS desktop actions to determine which one better completes the user's request. Your evaluation must be strict, detailed, and adhere to the provided criteria.
|
|
|
|
**User Request:**
|
|
<TASK_DESCRIPTION_INPUT>
|
|
|
|
**Judge Guidelines:**
|
|
These guidelines are to help you evaluate both sequences of actions. These are strict guidelines and should not be deviated from.
|
|
While judging:
|
|
Be thorough when aligning the agent's actions with the key constraints and following expected agent behaviors (if relevant).
|
|
The agent is always expected to complete the task; key constraints take precedence over these guidelines which act as tie breakers.
|
|
Always double-check the agent's calculations for accuracy.
|
|
Explicitly state which rows and columns must be selected.
|
|
Always verify that exact values match the user's request.
|
|
Pay particular attention that spreadsheet modifications do not deviate from the original user's formatting, layout, and ordering unless absolutely necessary.
|
|
|
|
Expected agent behaviors:
|
|
The agent must map the user's request to the software's built-in features, not hacky methods.
|
|
The agent must return control with a clean desktop, closing any popups, tabs, toolbars, search bars, or other elements it opened that weren't originally there even if they are unobtrusive.
|
|
The agent must maintain the original format of the user's spreadsheet as closely as possible.
|
|
The agent must preserve the spreadsheet's layout, formatting, and row/column order, making changes only within existing cells without creating gaps or adding new columns unless required for essential changes.
|
|
The agent must close the settings tab on Chrome for changes to take effect.
|
|
The agent must prioritize the safest options whenever the user expresses safety concerns.
|
|
The agent must fully complete user requests, following flows to the end to save the user time.
|
|
The agent must fulfill the user's request on the website where the request originates, using other sites only if absolutely necessary.
|
|
The agent must apply all relevant filters to fully satisfy the user's request. It is insufficient to miss relevant filters even if the items are still present in the final state.
|
|
|
|
**Reasoning Structure:**
|
|
1. **Evaluate both sequences of actions against relevant judge guidelines.** Explicitly list EACH AND EVERY judge guidelines, whether they apply, and, if so, verify that they were met, partially met, or not met at all for both sequences.
|
|
2. **Reason about the differences between the two sequences.** Consider which sequence better meets the judge guidelines. If they both meet the guidelines equally, consider which sequence is more efficient, effective, or cleaner.
|
|
3. **Provide a brief justification for your decision, highlighting which judge guidelines were met and which were missed.**
|
|
|
|
**Reasoning Guidelines:**
|
|
- You will be provided <NUMBER OF TRAJECTORIES> results, each result is in the form of initial_screenshot, final_screenshot.
|
|
- You **must** refer to final_screenshot to understand what has changed from initial_screenshot to final_screenshot. These facts are accurate; **Do not assume what has changed or likely changed.**
|
|
- You can cite facts during reasoning, e.g., Fact 2, Facts 1-2, but **must** refer to fact captions for accurate changes.
|
|
- You **must** explicitly write out all justifications
|
|
- You **must** enclose all reasoning in <thoughts> tags and the final answer in <answer> tags
|
|
|
|
- The user prefers that the agent communicates when it is impossible to proceed rather than attempting to complete the task incorrectly.
|
|
- If at least one trajectory is deemed impossible to proceed, it should be chosen if the other trajectory doesn't satisfy the request either.
|
|
- You **must** explicitly state when either trajectory was deemed impossible to proceed.
|
|
- You **must** explicitly write out all reasoning and justifications
|
|
|
|
Which sequence of actions better completes the user request OR correctly notes the request is impossible? Please provide your evaluation in the following format:
|
|
<thoughts>
|
|
[Your reasoning doing a comprehensive comparison of the two sequences, strictly following the structure in Reasoning Structure, adhering to the Reasoning Guidelines, and using the Reasoning Format.]
|
|
</thoughts>
|
|
<answer>
|
|
[The index of the better sequence, a single integer from 1 to <NUMBER OF TRAJECTORIES>]
|
|
</answer>
|
|
""")
|