New comment by bckmn in 'Show HN: Tarsier – Vision utilities for web interaction agents'

Reminds me of [Language as Intermediate Representation](https://chrisvoncsefalvay.com/posts/lair/): LLMs are optimized for language, so if you translate an image into language, they'll do better at modeling it.

www.joshbeckman.org/replies/hacker-news-item-40369904