Small experiment: I wired a local model + Vision to click real macOS UI buttons from natural language. Great for “batch rename, zip, upload” chores; terrifying if the model mis-locates a destructive button.
Open questions I’m hitting:
- How do we sandbox an LLM so the worst failure is “did nothing,” not “clicked ERASE”?
- Is fuzzy element matching (Vision) enough, or do we need strict semantic maps?
- Could this realistically replace brittle UI test scripts?
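On the first question, one fail-closed pattern is to gate every proposed action before dispatch: refuse unknown verbs, low-confidence vision matches, and anything whose label smells destructive, and default to dry-run. A minimal sketch below; all names (`Action`, `gate`, the word lists) are hypothetical, not MacPilot’s actual API:

```python
# Hedged sketch: a fail-closed action gate so a mis-located click
# degrades to "did nothing", never "clicked ERASE".
# All names here are illustrative, not MacPilot's real interface.
from dataclasses import dataclass

DESTRUCTIVE_WORDS = {"erase", "delete", "format", "empty trash", "remove"}
ALLOWED_VERBS = {"click", "type", "open"}

@dataclass
class Action:
    verb: str            # e.g. "click"
    target_label: str    # label the vision layer read off the element
    confidence: float    # vision-match confidence, 0..1

def gate(action: Action, min_confidence: float = 0.9) -> bool:
    """Return True only if the action is safe to execute.
    Fail closed: unknown verbs, low-confidence matches, and
    destructive-sounding labels are all refused."""
    if action.verb not in ALLOWED_VERBS:
        return False
    if action.confidence < min_confidence:
        return False
    label = action.target_label.lower()
    if any(word in label for word in DESTRUCTIVE_WORDS):
        return False
    return True

def execute(action: Action, dry_run: bool = True) -> str:
    if not gate(action):
        return "refused"  # a model error becomes a no-op
    if dry_run:
        return f"would {action.verb} '{action.target_label}'"
    # real dispatch to the accessibility/automation layer would go here
    return f"did {action.verb} '{action.target_label}'"
```

A denylist like this is obviously incomplete (it won’t catch a localized “Löschen” button), which is exactly the fuzzy-matching-vs-semantic-map question above.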
Reference prototype (MIT) if you want to dissect: https://github.com/macpilotai/macpilot