LLMs are great with words but can't handle visual tasks like understanding images. Teaching them to use visual tools could make them much more capable.
A new paper introduces GPT4Tools - a method to efficiently teach open-source LLMs to invoke tools for visual tasks, without relying on proprietary models at inference time.
My highlights from the paper:
- Uses ChatGPT as a "teacher" to generate tool-related instruction-following data for other LLMs (first sketch after this list)
- Fine-tunes open-source LLMs like Vicuna on this data with LoRA-style selective weight tuning: the base model stays frozen and only small low-rank adapters are trained (second sketch after this list)
- Lets a 13B model match the 175B GPT-3.5 on tools seen during tuning
- Data augmentation with negative and context samples turned out to be the secret sauce that makes this work
- Can generalize to brand new visual tools in a zero-shot way
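For the data-generation step, the rough idea is: prompt a teacher chat model with a textual description of an image plus a list of available tools, and ask it for an instruction and the matching tool call. Here's a minimal sketch of that idea using the OpenAI Python SDK; the prompt wording, tool list, and helper function are my own illustrative assumptions, not the paper's actual prompt or pipeline.

```python
# Rough sketch (my assumptions, not the paper's exact prompt or pipeline):
# use a "teacher" chat model to turn image captions into tool-invocation
# instruction data that an open-source LLM can later be fine-tuned on.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOOLS = "Object Detection, Instance Segmentation, Image Captioning"  # illustrative list

def make_training_example(caption: str) -> str:
    """Ask the teacher model for one (instruction, tool call) pair grounded in the caption."""
    prompt = (
        f"Image content: {caption}\n"
        f"Available visual tools: {TOOLS}\n"
        "Write one user instruction that needs a tool, then the tool call that "
        "answers it, formatted as:\nInstruction: ...\nTool: ...\nTool input: ..."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(make_training_example("two dogs chasing a frisbee in a park"))
```

The negative and context samples mentioned above would then be mixed into this generated data before fine-tuning.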
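For the fine-tuning step, here's a minimal sketch of LoRA-style selective weight tuning with Hugging Face peft, assuming a Vicuna checkpoint; the checkpoint name, rank, and target modules are placeholders I picked, not the paper's configuration.

```python
# Minimal sketch of the LoRA setup (hyperparameters and checkpoint are
# assumptions, not the paper's config): the base model is frozen and only the
# small low-rank adapter matrices get trained on the generated instruction data.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_id = "lmsys/vicuna-13b-v1.5"  # assumed checkpoint; needs serious GPU memory
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
)
model = get_peft_model(model, lora_cfg)    # wraps the frozen base with adapters
model.print_trainable_parameters()         # only a tiny fraction of weights train

# From here you'd run standard supervised fine-tuning (e.g. with transformers'
# Trainer) on the tool-invocation instruction data generated above.
```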
This is big because it shows we may not need hyper-expensive training of massive models to give LLMs visual capabilities. Existing LLMs seem general enough to be taught to orchestrate image tools. Examples in the paper include counting objects and segmenting items in pictures by calling other tools.
With this approach, existing models can be made multi-modal! Pretty cool.
Full summary. Original paper is here.