LLMs are great with words but can't handle visual tasks like understanding images. Teaching them to use visual tools could make them much more capable.
A new paper introduces GPT4Tools - a method to efficiently teach open-source LLMs to invoke tools for visual tasks, without relying on proprietary models at inference time.
My highlights from the paper:
- Uses ChatGPT as a "teacher" to generate tool-related instruction-following data for other LLMs (first sketch after this list)
- Fine-tunes open-source LLMs like Vicuna on this data with LoRA-style selective weight tuning: the base model stays frozen and only small low-rank adapters are trained (second sketch after this list)
- Lets a 13B model match the 175B GPT-3.5 on tools seen during tuning
- Data augmentation with negative and context samples turned out to be the secret sauce that makes this work
- Can generalize to brand new visual tools in a zero-shot way
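For the data-generation step, the rough idea is: prompt a teacher chat model with a textual description of an image plus a list of available tools, and ask it for an instruction and the matching tool call. Here's a minimal sketch of that idea using the OpenAI Python SDK; the prompt wording, tool list, and helper function are my own illustrative assumptions, not the paper's actual prompt or pipeline.

```python
# Rough sketch (my assumptions, not the paper's exact prompt or pipeline):
# use a "teacher" chat model to turn image captions into tool-invocation
# instruction data that an open-source LLM can later be fine-tuned on.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOOLS = "Object Detection, Instance Segmentation, Image Captioning"  # illustrative list

def make_training_example(caption: str) -> str:
    """Ask the teacher model for one (instruction, tool call) pair grounded in the caption."""
    prompt = (
        f"Image content: {caption}\n"
        f"Available visual tools: {TOOLS}\n"
        "Write one user instruction that needs a tool, then the tool call that "
        "answers it, formatted as:\nInstruction: ...\nTool: ...\nTool input: ..."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(make_training_example("two dogs chasing a frisbee in a park"))
```

The negative and context samples mentioned above would then be mixed into this generated data before fine-tuning.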
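For the fine-tuning step, here's a minimal sketch of LoRA-style selective weight tuning with Hugging Face peft, assuming a Vicuna checkpoint; the checkpoint name, rank, and target modules are placeholders I picked, not the paper's configuration.

```python
# Minimal sketch of the LoRA setup (hyperparameters and checkpoint are
# assumptions, not the paper's config): the base model is frozen and only the
# small low-rank adapter matrices get trained on the generated instruction data.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_id = "lmsys/vicuna-13b-v1.5"  # assumed checkpoint; needs serious GPU memory
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
)
model = get_peft_model(model, lora_cfg)    # wraps the frozen base with adapters
model.print_trainable_parameters()         # only a tiny fraction of weights train

# From here you'd run standard supervised fine-tuning (e.g. with transformers'
# Trainer) on the tool-invocation instruction data generated above.
```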
This is big because it shows we may not need hyper-expensive training of massive models to give LLMs visual capabilities. Existing LLMs seem general enough to be taught to orchestrate image tools. Examples in the paper include counting objects and segmenting items in pictures by calling other tools.
With this approach, existing models can be made multi-modal! Pretty cool.
Full summary. Original paper is here.