Screen-Capture Programming: What You See is What You Script
by George Lawton
Researchers at the University of Maryland and Massachusetts Institute of Technology have developed a screen-capture–based scripting environment that could signal a new programming paradigm, one that treats the graphical interface as a sort of API. The Sikuli system lets users with minimal programming experience use GUI screen shots to create scripts that interact with applications. Ultimately, it will open opportunities to develop scripts that touch multiple applications without requiring any understanding of the underlying programs' APIs.
Tom Yeh is a post-doctoral researcher at the University of Maryland and one of Sikuli's developers, along with MIT graduate student Tsung-Hsiang Chang and associate professor Robert C. Miller. Yeh compares Sikuli with the "what you see is what you get" GUI metaphor. With Sikuli, "what you see is what you script." Users with a basic understanding of the Python scripting language will be able to write programs via screen shots rather than lines of code.
"Since the release of Sikuli," Yeh said, "we've received many emails expressing gratitude of how Sikuli helped deal with certain tedious tasks that had been done by hand."
Sikuli: Seeing Pixels
Many other tools, such as AutoHotkey, help automate routine tasks. But these tools are designed to run on only a single platform, and they don't support graphical interaction, as Sikuli does. Miller says Sikuli's greatest value is its generality: "If it has pixels that Sikuli can see, then it's open to automation," he said. (Sikuli means "God’s Eye" in the language of Mexico's Huichol people.)
The technique is open to any application with a GUI that can display on a Windows, Mac, or Linux desktop. "We've already seen users apply it to not just desktop applications," Miller said, "but also Web pages, video games, mobile phone apps (running in a simulator or using a remote connection between the desktop and the phone), and applications from other platforms running in a virtual machine."
Miller noted another benefit: programmers can use any GUI they're familiar with. "That significantly reduces the cognitive gap between what they want to do and what they can do in Sikuli," he said.
The core Sikuli Script technology supports programming through a combination of machine vision, optical character recognition, and automation technologies. It lets users interact with graphical screen elements, such as dialogue boxes, specific text strings, and icons. The IDE is built on top of Jython, a Python implementation that runs on top of Java.
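To make the machine-vision idea concrete, the following is a minimal sketch of locating an icon on a screen by exact pixel comparison. It is an illustration only, not Sikuli's actual matcher, which this article does not describe. The function name find_template and the tiny SCREEN and ICON grids are assumptions invented for the example, and a practical system would need to tolerate small differences in rendering rather than demand an exact match.

```python
def find_template(screen, template):
    """Return (row, col) of the first exact match of template in screen, or None.

    Both arguments are lists of rows; each row is a list of pixel values.
    This is an illustrative brute-force search, not Sikuli's algorithm.
    """
    screen_h, screen_w = len(screen), len(screen[0])
    tmpl_h, tmpl_w = len(template), len(template[0])
    # Slide the template over every position where it fits entirely.
    for row in range(screen_h - tmpl_h + 1):
        for col in range(screen_w - tmpl_w + 1):
            # Compare each template row with the matching slice of the screen.
            if all(
                screen[row + dr][col:col + tmpl_w] == template[dr]
                for dr in range(tmpl_h)
            ):
                return (row, col)
    return None


# A hypothetical 4x5 "screen" of grayscale values and a 2x2 "icon" to look for.
SCREEN = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 6, 0],
    [0, 0, 0, 0, 0],
]
ICON = [
    [9, 8],
    [7, 6],
]

print(find_template(SCREEN, ICON))  # prints (1, 2)
```

The search visits every candidate position, which hints at why matching against a full-resolution screen costs far more than a direct procedure call.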
Once the Sikuli IDE is installed, a user hits a keyboard command to activate a special box for highlighting screen elements. The user can then specify commands that use these elements to find icons, insert a cursor in a particular dialog box, or execute if-then statements.
Applications
A Sikuli-based application can interact with many types of screen elements. A simple script might automate tasks such as setting a computer's IP address by clicking on multiple icons and dialogue boxes in the right order and then typing in the appropriate strings.
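The shape of such a script can be sketched in plain Python. Everything here is hypothetical: RecordingScreen is a stand-in that only records the requested actions instead of driving a real GUI, the image file names represent screen shots a user would have captured beforehand, and the IP address is an arbitrary example value. None of these names come from Sikuli itself.

```python
class RecordingScreen:
    """Hypothetical stand-in for a GUI driver; it only records actions."""

    def __init__(self):
        self.actions = []

    def click(self, target):
        # In a real tool, target would be located on screen and clicked.
        self.actions.append(("click", target))

    def type_text(self, text):
        # In a real tool, text would be sent as keystrokes.
        self.actions.append(("type", text))


def set_ip_address(screen, ip_address):
    """Click through the (assumed) dialogs in order, then type the address."""
    for target in ("network_icon.png", "properties_button.png", "ip_field.png"):
        screen.click(target)
    screen.type_text(ip_address)
    screen.click("ok_button.png")


screen = RecordingScreen()
set_ip_address(screen, "192.168.0.10")
for action in screen.actions:
    print(action)
```

The point of the sketch is the ordering: the script is a fixed sequence of visual targets followed by typed input, with no call into the application's own API.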
More complex scripts can respond to screen events. For example, an airline tracking application might respond to a colored dot moving across a map and send alerts to interested parties when the airplane comes into proximity of the destination airport. Another application might notify a user when a particular event occurs on a Webcam being monitored in a desktop window.
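A rough sketch of the airline-tracking idea follows, assuming each screen capture has already been reduced to a small grid in which one character marks the colored dot. The function names, the "R" marker, the destination coordinates, and the alert radius are all assumptions made for illustration; a real script would work from actual screen pixels.

```python
import math


def find_dot(frame, dot_color):
    """Return (row, col) of the first cell equal to dot_color, or None."""
    for row, cells in enumerate(frame):
        for col, cell in enumerate(cells):
            if cell == dot_color:
                return (row, col)
    return None


def near_destination(frame, dot_color, destination, radius):
    """True if the dot is visible and within radius cells of destination."""
    position = find_dot(frame, dot_color)
    if position is None:
        return False
    distance = math.hypot(position[0] - destination[0],
                          position[1] - destination[1])
    return distance <= radius


# Two hypothetical snapshots of a 4x4 map; "R" is the airplane's dot.
FRAME_FAR = ["....", "R...", "....", "...."]
FRAME_NEAR = ["....", "....", "...R", "...."]
AIRPORT = (3, 3)  # assumed destination cell

for name, frame in (("far", FRAME_FAR), ("near", FRAME_NEAR)):
    if near_destination(frame, "R", AIRPORT, radius=1.5):
        print(name, "snapshot: send alert")
    else:
        print(name, "snapshot: keep watching")
```

Because the script only samples snapshots, how often it checks the screen determines how quickly it notices the airplane's approach.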
The Sikuli team has been working on Sikuli Test, an application to help automate the monitoring of the GUI's response to overall system developments and changes. A test engineer can create a script to look for particular changes to screen elements in response to user input. For example, in music player applications, an icon often toggles between play and pause after it's pressed. Sikuli Test could monitor such cases, noting any instances in which the expected behavior didn't occur and further evaluation is required.
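The play/pause check can be sketched as a small test against a simulated player. FakePlayer and check_toggle are hypothetical names invented here, and the simulation replaces the screen capture and image comparison that a tool such as Sikuli Test would perform; the sketch shows only the logic of comparing the visible state before and after an input.

```python
class FakePlayer:
    """Hypothetical music player whose button shows 'play' or 'pause'."""

    def __init__(self, broken=False):
        self.icon = "play"
        self.broken = broken  # a broken player never updates its icon

    def press(self):
        if not self.broken:
            self.icon = "pause" if self.icon == "play" else "play"


def check_toggle(player):
    """Press the button once and report whether the icon changed as expected."""
    expected = "pause" if player.icon == "play" else "play"
    player.press()
    return player.icon == expected


print(check_toggle(FakePlayer()))             # prints True
print(check_toggle(FakePlayer(broken=True)))  # prints False
```

A False result corresponds to the case the article describes, in which the expected behavior didn't occur and further evaluation is required.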
Another application, Sikuli Search, searches a database on the basis of what the screen shows. The Sikuli team found that users could retrieve help information for GUI elements faster by using Sikuli Search to select the elements on screen than by using traditional text-based searches. Users simply indicate the onscreen dialogue box for which they want more information, and Sikuli Search retrieves appropriate screen shots from online tutorials, official documentation, and computer books.
History
Miller said that researchers have for many years been exploring the idea of software agents that look at screen pixels and understand what's going on. In the mid-1990s, Richard Potter at the University of Maryland created a system called Triggers, which used low-level pixel patterns (such as a corner or an edge) to direct an automatic macro.
In the late 1990s, Robert St. Amant, associate professor at North Carolina State University, and his students created VisMap, an intelligent system that operated a user interface using machine-vision techniques to respond to basic patterns on the screen according to a set of templates stored in a database. VisMap could play Windows Solitaire, but the performance of the hardware then available held back its further development.
St. Amant said he became interested in systems that could interpret visual representations because of the richness of visual interfaces and the progress in interface agents. "Most such agents worked behind the scenes," he said, citing Web recommender systems as an example. "I was more interested in ways an agent might help users work through problems they might have in the interface to applications." This meant that an agent would have to know what the user was looking at.
St. Amant realized that a visual scripting language would bridge the gap between the way users see interfaces and the way conventional APIs and command languages handle lower-level behaviors. "The promise is a path toward end-user programming that doesn't require users to learn a huge amount for each new application," he said. "A visual scripting language would be standardized, at some level, in the same way that there are conventions for visual presentation."
St. Amant said that it took several seconds to derive useful information from a 1024 × 768 screen capture in early 2000. "This was far too slow for interactive assistance," he said. "And while we were able to build some interesting autonomous systems that walked through user interfaces, our original ideas just didn't seem practical. Systems like Sikuli show that I was wrong about the practicality of the approach."
New Challenges
Sikuli will bring new challenges for developers. For one thing, it has its own performance limitations. "Screen matching is certainly not as fast as making a direct procedure call into the application programming interface," Miller said.
Furthermore, Sikuli can't program what it can't see. For example, Yeh explained, if a window is hidden, Sikuli can't do anything with it. Because it's built on top of Java, advanced programmers can use Java APIs to work around this problem, but doing so negates the whole point of programming using the GUI, and it's not practical for beginners.
Program construction needs further simplification. Miller said they've addressed this in several ways, such as building on Python. They also developed a specialized editor that makes it easy to treat screenshot images as first-class constants that you can easily capture, edit, and move around like numbers and strings. Additionally, they've created a recorder to automatically capture screenshots associated with a sequence of user actions, but it's not yet released.
Another challenge lies in more complex machine-processing applications. Vision systems don't recognize many patterns that are perfectly obvious to humans. St. Amant noted that system developers could have a hard time figuring out how an application failed when a problem arises from the machine recognizing patterns differently from what we would expect. A related problem arises from the way people interpret 2D representations of 3D objects. Applications that must interpret the spatial relationships of 3D objects require considerable processing power, for example, to determine the relative depth of two objects that appear near each other on the 2D screen. Consequently, Sikuli's performance might degrade with 3D gaming environments and environments where spatial relationships between objects are important.
"Dealing with moving, animated objects was hard," St. Amant added. "Screen-capture tools work with snapshots, and it's not always easy to track objects by sampling them. These problems can be overcome, but it will take a lot of time and work."
The Sikuli development team envisions the technology underlying a wide variety of applications. It could help improve program documentation and tutorials. It could also help automate user interactions with PCs and Web-based applications as well as the interface between the two. Applications based on Webcam monitors could, for example, alert a parent if a baby rolls onto its stomach.
St. Amant sees developers using Sikuli to create application add-ons without having to access source code. He also envisions improved accessibility for blind users through the identification of application icons on the basis of their appearance and for users with physical limitations through tracking and guiding uneven mouse movements to small targets. It could also enable more robust and capable macro recording, independent of applications.
"Sikuli is mainly a research project and aims to inspire new ways to apply computer vision in everyday tasks," said Yeh. "We'd like to keep it open source and make it free to all."
To give Sikuli a try, go to groups.csail.mit.edu/uid/sikuli.
George Lawton is a freelance technology writer based in Guerneville, California. Contact him at http://glawton.com.