In the documentation below, you may see function calls prefixed with `agent.`. If you use destructuring in Playwright (e.g., `async ({ ai, aiQuery }) => { /* ... */ }`), you can call these functions without the `agent.` prefix. This is purely a syntactic difference.
Each Agent in Midscene has its own constructor.
These Agents share some common constructor parameters:
- `generateReport: boolean` - If true, a report file will be generated. (Default: true)
- `reportFileName: string` - The name of the report file. (Default: generated by Midscene)
- `autoPrintReportMsg: boolean` - If true, report messages will be printed. (Default: true)
- `cacheId: string | undefined` - If provided, this cacheId will be used to save or match the cache. (Default: undefined, which means the cache feature is disabled)
- `actionContext: string` - Background knowledge that should be sent to the AI model when calling `agent.aiAction()`, like 'close the cookie consent dialog first if it exists'. (Default: undefined)

In Playwright and Puppeteer, there are also some common parameters:
- `forceSameTabNavigation: boolean` - If true, page navigation is restricted to the current tab. (Default: true)
- `waitForNetworkIdleTimeout: number` - The timeout for waiting for the network to become idle between each action. (Default: 2000ms; set to 0 to disable waiting for network idle)
- `waitForNavigationTimeout: number` - The timeout for waiting for navigation to finish. (Default: 5000ms; set to 0 to disable waiting for navigation)
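These options are passed to the agent constructor. A minimal sketch, assuming a Puppeteer setup with `PuppeteerAgent` imported from `@midscene/web/puppeteer` (adjust the import to your integration):

```typescript
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// Common constructor parameters described above
const agent = new PuppeteerAgent(page, {
  cacheId: 'my-test-case', // enable caching with this id
  actionContext: 'close the cookie consent dialog first if it exists',
  waitForNetworkIdleTimeout: 2000, // wait up to 2s for network idle between actions
});
```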
Below are the main APIs available for the various Agents in Midscene.

In Midscene, you can choose to use either auto planning or instant action:
- `agent.ai()` is for Auto Planning: Midscene will automatically plan the steps and execute them. It is smarter and closer to the typical style of AI agents, but it may be slower and relies heavily on the quality of the AI model.
- `agent.aiTap()`, `agent.aiHover()`, `agent.aiInput()`, `agent.aiKeyboardPress()`, `agent.aiScroll()`, `agent.aiRightClick()` are for Instant Action: Midscene will directly perform the specified action, while the AI model is only responsible for basic tasks such as locating elements. Instant actions are faster and more reliable when you are certain about the action you want to perform.

agent.aiAction() or .ai()
This method allows you to perform a series of UI actions described in natural language. Midscene automatically plans the steps and executes them.
Parameters:

- `prompt: string` - A natural language description of the UI steps.
- `options?: Object` - Optional, a configuration object containing:
  - `cacheable?: boolean` - Whether the result is cacheable when the caching feature is enabled. True by default.

Return Value:

Returns a `Promise<void>` that resolves when all the planned steps have been executed.
Examples:
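For instance (the instructions are illustrative):

```typescript
// Describe the whole flow in natural language; Midscene plans and executes the steps
await agent.aiAction('type "Headphones" in the search box, hit Enter, then click the first result');

// .ai() is the shorthand for .aiAction()
await agent.ai('scroll down until the "Load more" button appears, then click it');
```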
Under the hood, Midscene uses the AI model to split the instruction into a series of steps (a.k.a. "Planning"). It then executes these steps sequentially. If Midscene determines that the actions cannot be performed, an error will be thrown.
For optimal results, please provide clear and detailed instructions for `agent.aiAction()`. For guidance on writing prompts, see: Tips for Writing Prompts.
Related Documentation:
agent.aiTap()
Tap something.
Parameters:

- `locate: string | Object` - A natural language description of the element to tap, or prompting with images.
- `options?: Object` - Optional, a configuration object containing:
  - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
  - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
  - `cacheable?: boolean` - Whether the result is cacheable when the caching feature is enabled. True by default.

Return Value:

`Promise<void>`
Examples:
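For instance (the element description is illustrative):

```typescript
// Locate the element by natural language and tap it
await agent.aiTap('the "Submit" button at the bottom of the form');
```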
agent.aiHover()
Only available in web pages, not available in Android.
Move the mouse over something.
Parameters:

- `locate: string | Object` - A natural language description of the element to hover over, or prompting with images.
- `options?: Object` - Optional, a configuration object containing:
  - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
  - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
  - `cacheable?: boolean` - Whether the result is cacheable when the caching feature is enabled. True by default.

Return Value:

`Promise<void>`
Examples:
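For instance:

```typescript
// Hover to reveal a tooltip or a dropdown menu
await agent.aiHover('the user avatar in the top-right corner');
```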
agent.aiInput()
Input text into something.
Parameters:

- `text: string` - The final text content that should be placed in the input element. Use an empty string to clear the input.
- `locate: string | Object` - A natural language description of the element to input text into, or prompting with images.
- `options?: Object` - Optional, a configuration object containing:
  - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
  - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
  - `cacheable?: boolean` - Whether the result is cacheable when the caching feature is enabled. True by default.
  - `autoDismissKeyboard?: boolean` - If true, the keyboard will be dismissed after the text is entered. Only available in Android. (Default: true)

Return Value:

`Promise<void>`
Examples:
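For instance:

```typescript
// Fill in the search box
await agent.aiInput('Headphones', 'the search box at the top of the page');

// Pass an empty string to clear the input
await agent.aiInput('', 'the coupon code input');
```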
agent.aiKeyboardPress()
Press a keyboard key.
Parameters:

- `key: string` - The web key to press, e.g. 'Enter', 'Tab', 'Escape', etc. Key combinations are not supported.
- `locate?: string | Object` - Optional, a natural language description of the element to press the key on, or prompting with images.
- `options?: Object` - Optional, a configuration object containing:
  - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
  - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
  - `cacheable?: boolean` - Whether the result is cacheable when the caching feature is enabled. True by default.

Return Value:

`Promise<void>`
Examples:
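For instance:

```typescript
// Press Enter inside the search box to submit the query
await agent.aiKeyboardPress('Enter', 'the search box');

// Press Escape without targeting a specific element
await agent.aiKeyboardPress('Escape');
```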
agent.aiScroll()
Scroll a page or an element.
Parameters:

- `scrollParam: PlanningActionParamScroll` - The scroll parameter:
  - `direction: 'up' | 'down' | 'left' | 'right'` - The direction to scroll. On both Android and Web, the direction refers to which part of the page's content will be brought into view. For example, when the direction is `down`, the hidden content at the bottom of the page is gradually revealed, moving from the bottom of the screen upwards.
  - `scrollType: 'once' | 'untilBottom' | 'untilTop' | 'untilRight' | 'untilLeft'` - Optional, the type of scroll to perform.
  - `distance: number` - Optional, the distance to scroll in px.
- `locate?: string | Object` - Optional, a natural language description of the element to scroll on, or prompting with images. If not provided, Midscene will scroll at the current mouse position.
- `options?: Object` - Optional, a configuration object containing:
  - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
  - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
  - `cacheable?: boolean` - Whether the result is cacheable when the caching feature is enabled. True by default.

Return Value:

`Promise<void>`
Examples:
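For instance:

```typescript
// Scroll the product list down by roughly 500px
await agent.aiScroll(
  { direction: 'down', scrollType: 'once', distance: 500 },
  'the product list',
);

// Scroll the whole page until the bottom is reached
await agent.aiScroll({ direction: 'down', scrollType: 'untilBottom' });
```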
agent.aiRightClick()
Only available in web pages, not available in Android.
Right-click on an element. Please note that Midscene cannot interact with the browser's native context menu after right-clicking. This interface is usually used for elements that handle the right-click event themselves.
Parameters:

- `locate: string | Object` - A natural language description of the element to right-click on, or prompting with images.
- `options?: Object` - Optional, a configuration object containing:
  - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
  - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
  - `cacheable?: boolean` - Whether the result is cacheable when the caching feature is enabled. True by default.

Return Value:

`Promise<void>`
Examples:
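For instance:

```typescript
// Right-click an element that implements its own context menu
await agent.aiRightClick('the file named "report.pdf" in the file list');
```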
deepThink feature

The `deepThink` option allows Midscene to call the AI model twice to precisely locate an element. It is false by default, and is useful when the AI model finds it hard to distinguish the element from its surroundings.
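For instance:

```typescript
// A small icon surrounded by similar icons is a typical case for deepThink
await agent.aiTap('the settings (gear) icon in the toolbar', { deepThink: true });
```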
agent.aiAsk()
Ask the AI model any question about the current page. It returns the answer as a string from the AI model.
Parameters:

- `prompt: string | Object` - A natural language description of the question, or prompting with images.
- `options?: Object` - Optional, a configuration object containing:
  - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting attributes that are not visible on screen, like image links. If set to `'visible-only'`, only the visible elements will be sent. Default: false.
  - `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. Default: true.

Return Value:

`Promise<string>`, the answer from the AI model.
Examples:
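For instance:

```typescript
// Ask a free-form question about the current page
const answer = await agent.aiAsk('What is the main topic of this page?');
console.log(answer); // a plain-text answer from the AI model
```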
Besides `aiAsk`, you can also use `aiQuery` to extract structured data from the UI.
agent.aiQuery()
This method allows you to extract structured data from the current page. Simply define the expected format (e.g., string, number, JSON, or an array) in the `dataDemand`, and Midscene will return a result that matches the format.
Parameters:

- `dataDemand: T` - A description of the expected data and its return format.
- `options?: Object` - Optional, a configuration object containing:
  - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting attributes that are not visible on screen, like image links. If set to `'visible-only'`, only the visible elements will be sent. Default: false.
  - `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. Default: true.

Return Value:

Returns data in the format described in `dataDemand`; Midscene will return a matching result.

Examples:
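For instance (the prompt is illustrative):

```typescript
// Describe the expected structure inside the dataDemand string
const items = await agent.aiQuery(
  '{ name: string, price: number }[], the items currently shown in the shopping cart',
);
console.log(items.length, items[0].name);
```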
agent.aiBoolean()
Extract a boolean value from the UI.
Parameters:

- `prompt: string | Object` - A natural language description of the expected value, or prompting with images.
- `options?: Object` - Optional, a configuration object containing:
  - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting attributes that are not visible on screen, like image links. If set to `'visible-only'`, only the visible elements will be sent. Default: false.
  - `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. Default: true.

Return Value:

`Promise<boolean>`, resolved when the AI returns a boolean value.

Examples:
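For instance:

```typescript
const loggedIn = await agent.aiBoolean('Is the user currently logged in?');
if (!loggedIn) {
  await agent.aiTap('the "Sign in" button');
}
```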
agent.aiNumber()
Extract a number value from the UI.
Parameters:

- `prompt: string | Object` - A natural language description of the expected value, or prompting with images.
- `options?: Object` - Optional, a configuration object containing:
  - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting attributes that are not visible on screen, like image links. If set to `'visible-only'`, only the visible elements will be sent. Default: false.
  - `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. Default: true.

Return Value:

`Promise<number>`, resolved when the AI returns a number value.

Examples:
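For instance:

```typescript
const cartCount = await agent.aiNumber('the number shown on the cart badge in the header');
```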
agent.aiString()
Extract a string value from the UI.
Parameters:

- `prompt: string | Object` - A natural language description of the expected value, or prompting with images.
- `options?: Object` - Optional, a configuration object containing:
  - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting attributes that are not visible on screen, like image links. If set to `'visible-only'`, only the visible elements will be sent. Default: false.
  - `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. Default: true.

Return Value:

`Promise<string>`, resolved when the AI returns a string value.

Examples:
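For instance:

```typescript
const title = await agent.aiString('the title of the first search result');
```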
agent.aiAssert()
Specify an assertion in natural language, and the AI determines whether the condition is true. If the assertion fails, the SDK throws an error that includes both the optional `errorMsg` and a detailed reason generated by the AI.
Parameters:

- `assertion: string | Object` - The assertion described in natural language, or prompting with images.
- `errorMsg?: string` - An optional error message to append if the assertion fails.

Return Value:

`Promise<void>`; if the assertion fails, the thrown error contains the optional `errorMsg` and additional AI-provided information.

Example:
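For instance (the assertion text is illustrative):

```typescript
// Throws an error if the condition does not hold on the current page
await agent.aiAssert('"Sauce Labs Onesie" is in the cart, and the price is 7.99');
```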
Assertions are critical in test scripts. To reduce the risk of errors due to AI hallucination (e.g., missing an error), you can also combine `.aiQuery` with standard JavaScript assertions instead of using `.aiAssert`.
For example, you might replace the above code with:
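A sketch using `aiQuery` plus Node's built-in `assert` (the query prompt mirrors the assertion above):

```typescript
import assert from 'node:assert';

// Extract the value deterministically, then assert with plain JavaScript
const price = await agent.aiQuery('number, the price of the "Sauce Labs Onesie" item in the cart');
assert.strictEqual(price, 7.99);
```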
agent.aiLocate()
Locate an element using natural language.
Parameters:

- `locate: string | Object` - A natural language description of the element to locate, or prompting with images.
- `options?: Object` - Optional, a configuration object containing:
  - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
  - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
  - `cacheable?: boolean` - Whether the result is cacheable when the caching feature is enabled. True by default.

Return Value:

A `Promise` that resolves to a locate info object when the element is located.

Examples:
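For instance (the exact shape of the returned object depends on your Midscene version):

```typescript
// Locate an element without interacting with it
const location = await agent.aiLocate('the "Add to cart" button of the first product');
console.log(location); // locate info such as the element's rect and center point
```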
agent.aiWaitFor()
Wait until a specified condition, described in natural language, becomes true. Considering the cost of AI calls, the check interval will not exceed the specified `checkIntervalMs`.
Parameters:

- `assertion: string` - The condition described in natural language.
- `options?: object` - An optional configuration object containing:
  - `timeoutMs?: number` - Timeout in milliseconds (default: 15000).
  - `checkIntervalMs?: number` - Interval for checking in milliseconds (default: 3000).

Return Value:

`Promise<void>`; resolves when the condition is met, and throws an error if the timeout is reached.

Examples:
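For instance:

```typescript
// Check (via the AI model) every 5s, and give up after 30s
await agent.aiWaitFor('there is at least one completed order in the order history list', {
  timeoutMs: 30 * 1000,
  checkIntervalMs: 5 * 1000,
});
```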
Given the time consumption of AI calls, `.aiWaitFor` might not be the most efficient method. Sometimes, using a simple sleep function may be a better alternative.
agent.runYaml()
Execute an automation script written in YAML. Only the `tasks` part of the script is executed, and it returns the results of all `.aiQuery` calls within the script.
Parameters:

- `yamlScriptContent: string` - The YAML-formatted script content.

Return Value:

An object with a `result` property that includes the results of all `.aiQuery` calls.

Example:
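A minimal sketch (the task names and prompts are illustrative; check the YAML scripting guide for the exact schema):

```typescript
const { result } = await agent.runYaml(`
tasks:
  - name: search
    flow:
      - ai: type "Headphones" in the search box, hit Enter
      - sleep: 3000

  - name: extract
    flow:
      - aiQuery: "{ name: string, price: number }[], the products shown in the result list"
        name: products
`);
console.log(result.products);
```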
For more information about YAML scripts, please refer to Automate with Scripts in YAML.
agent.setAIActionContext()
Set the background knowledge that should be sent to the AI model when calling `agent.aiAction()`.
Parameters:

- `actionContext: string` - The background knowledge that should be sent to the AI model.

Example:
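For instance:

```typescript
// Background knowledge applied to subsequent aiAction() calls
await agent.setAIActionContext('close the cookie consent dialog first if it exists');
await agent.aiAction('subscribe to the newsletter with the email "test@example.com"');
```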
agent.evaluateJavaScript()
Only available in web pages, not available in Android.
Evaluate a JavaScript expression in the web page context.
Parameters:

- `script: string` - The JavaScript expression to evaluate.

Return Value:

A `Promise` that resolves to the value of the evaluated expression.

Example:
Example:
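For instance:

```typescript
// Evaluate an expression in the page context and read the result
const pageTitle = await agent.evaluateJavaScript('document.title');
console.log(pageTitle);
```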
agent.logScreenshot()
Log the current screenshot with a description in the report file.
Parameters:

- `title?: string` - Optional, the title of the screenshot; if not provided, the title will be 'untitled'.
- `options?: Object` - Optional, a configuration object containing:
  - `content?: string` - The description of the screenshot.

Return Value:

`Promise<void>`
Examples:
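For instance:

```typescript
// Record the current page state in the report
await agent.logScreenshot('cart before checkout', {
  content: 'two items in the cart, coupon applied',
});
```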
agent.freezePageContext()
Freeze the current page context, allowing all subsequent operations to reuse the same page snapshot without retrieving the page state repeatedly. This significantly improves performance when executing a large number of concurrent operations.
Some notes:

- Call `agent.unfreezePageContext()` in time to restore the real-time page state.

Return Value:

`Promise<void>`
Examples:
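A minimal sketch of freezing the context around a batch of read-only queries:

```typescript
await agent.freezePageContext();

// These queries reuse the same page snapshot instead of re-capturing the page each time
const [title, itemCount] = await Promise.all([
  agent.aiString('the page title shown in the header'),
  agent.aiNumber('the number of items in the product list'),
]);

await agent.unfreezePageContext();
```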
In the report, operations using frozen context will display a 🧊 icon in the Insight tab.
agent.unfreezePageContext()
Unfreezes the page context, restoring the use of real-time page state.
Return Value:
`Promise<void>`
.reportFile
The path to the report file.
You can override environment variables at runtime by calling the `overrideAIConfig` method.
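A minimal sketch; the import path depends on your integration (Puppeteer is assumed here), and the configuration keys mirror the environment variables you would otherwise set:

```typescript
import { overrideAIConfig } from '@midscene/web/puppeteer';

// Override model-related environment variables at runtime
overrideAIConfig({
  OPENAI_API_KEY: process.env.MY_OPENAI_KEY,
  MIDSCENE_MODEL_NAME: 'gpt-4o-2024-08-06',
});
```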
Set `DEBUG=midscene:ai:profile:stats` to view the execution time and usage for each AI call.
Set the `MIDSCENE_RUN_DIR` variable to customize the run artifact directory.
Set the `MIDSCENE_REPLANNING_CYCLE_LIMIT` variable to customize the maximum number of replanning cycles allowed during action execution (`aiAction`).
LangSmith is a platform for debugging large language models. To integrate LangSmith, follow these steps:
After starting Midscene, you should see logs similar to:
You can use images as supplements in the prompt to describe things that cannot be expressed in natural language.
When prompting with images, the format of the prompt parameters is as follows:
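A sketch of the object form when prompting with images (the field names shown here are assumptions; verify them against the prompting guide of your Midscene version):

```typescript
await agent.aiTap({
  prompt: 'the product that looks like the reference image',
  images: [
    {
      name: 'reference image',
      url: 'https://example.com/reference.png', // an https URL, local path, or base64 string
    },
  ],
});
```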
Notes on Image Size
When prompting with images, pay attention to your AI model provider's requirements regarding image size and dimensions. Images that are too large (e.g., exceeding 10 MB) or too small (e.g., less than 10 pixels) may cause errors when the model is invoked. Refer to the documentation of the AI model provider you are using for the specific restrictions.