Training Vision-Language Models for BLV-Aligned Diagram Descriptions using Sighted User Feedback

Sightation: Using Sighted Feedback to Build Better Diagram Descriptions for BLV Users

This paper introduces an approach to creating high-quality diagram descriptions for blind and low-vision (BLV) users: sighted users give feedback on VLM-generated descriptions rather than writing descriptions from scratch.

The key insight is that sighted users can evaluate descriptions effectively even if they aren't skilled at writing BLV-optimized ones. The researchers:

  1. Generate diverse candidate descriptions using GPT-4V with different prompting strategies
  2. Collect sighted user feedback on these candidates
  3. Validate with BLV educators that this approach creates useful descriptions
  4. Build comprehensive datasets for multiple tasks
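
The generate-then-evaluate loop in the steps above can be sketched in a few lines. This is a minimal illustration and not the authors' code: the `Candidate` record, the 1-5 rating scale, and the rule of picking the highest mean rating are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Candidate:
    """One VLM-generated description of a diagram (hypothetical record)."""
    diagram_id: str
    text: str
    # Ratings from sighted annotators; the 1-5 scale is an assumption.
    ratings: list[int] = field(default_factory=list)


def best_candidate(candidates: list[Candidate]) -> Candidate:
    """Return the rated candidate with the highest mean sighted rating."""
    rated = [c for c in candidates if c.ratings]
    if not rated:
        raise ValueError("no candidate has any ratings")
    return max(rated, key=lambda c: mean(c.ratings))


# Two made-up candidates for the same diagram, each rated by three annotators.
candidates = [
    Candidate("d1", "A flowchart with three boxes.", [3, 4, 3]),
    Candidate("d1", "A flowchart: Start leads to Check, which branches to Done or Retry.", [5, 4, 5]),
]
chosen = best_candidate(candidates)
print(chosen.text)
```

The point of the sketch is the division of labor: the model writes, the sighted annotator only scores, and the selection step turns those scores into a description that can later be validated with BLV educators.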

Key Technical Contributions:

  • Multi-pass inference approach: Used progressive prompting to generate diagram descriptions with increasing complexity/specificity
  • Annotation protocol: Designed efficient protocol for collecting sighted user evaluations of:

    • Description completion
    • Comparative preference
    • Verification of description accuracy
  • Dataset creation: Released 5 datasets (137K samples across 5K diagrams):

    • SightCOMPLETE: 50K samples with completion annotations
    • SightPREFER: 71K preference annotations between descriptions
    • SightRETRIEVE: 5K diagram-description matching samples
    • SightQA: 6K question-answer pairs about diagrams
    • SightREASON: 5K multi-step reasoning examples
  • Evaluation: BLV educators rated descriptions produced with sighted feedback as comparable to or better than expert-written ones in terms of content coverage, sequence, and additional information.

  • Fine-tuning results: Models fine-tuned on Sightation datasets showed significant improvements:

    • LLaVA-1.5 improved from 12.4% to 53.7% win rate against ChatGPT
    • GPT-4V improved from 44.7% to 68.5% win rate in blind evaluations
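
For readers unfamiliar with win rates, the figures above are the kind of number that pairwise comparisons produce. The sketch below shows one common way to compute it; the `"A"`/`"B"`/`"tie"` labels and the choice to exclude ties are assumptions, since the post does not say how the comparisons were recorded or scored.

```python
from collections import Counter


def win_rate(judgments: list[str], model: str = "A") -> float:
    """Fraction of decided pairwise judgments won by `model`.

    Each judgment is "A", "B", or "tie". Excluding ties from the
    denominator is an assumption, not something the post specifies.
    """
    counts = Counter(judgments)
    decided = counts["A"] + counts["B"]
    if decided == 0:
        raise ValueError("no decided comparisons")
    return counts[model] / decided


# Five made-up judgments: A wins three, B wins one, one tie.
judgments = ["A", "B", "A", "tie", "A"]
rate = win_rate(judgments)
print(f"{rate:.1%}")
```

Whether ties are dropped, split evenly, or counted as losses changes the reported number, so the original paper's definition is worth checking before comparing against other results.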

I think this approach could be a game-changer for accessibility. Rather than relying on expensive BLV expert annotations or settling for lower-quality direct annotations from sighted users, this feedback-based approach produces high-quality descriptions at scale. The methodology could extend beyond diagrams to other visual accessibility challenges where the consumer and producer of descriptions have different visual abilities.

TLDR: The researchers created a method and datasets that use sighted user feedback on AI-generated diagram descriptions to create high-quality, BLV-aligned content. Models fine-tuned on these datasets produce significantly better descriptions for visually impaired users.


submitted by /u/Successful-Western27