Sightation: Using Sighted Feedback to Build Better Diagram Descriptions for BLV Users
This paper introduces a novel approach to creating high-quality diagram descriptions for blind and low-vision (BLV) users by leveraging sighted user feedback on VLM-generated descriptions rather than asking sighted annotators to write descriptions from scratch.
The key insight is that sighted users can effectively evaluate diagram descriptions even if they aren't skilled at writing BLV-optimized descriptions themselves. The researchers:
- Generate diverse candidate descriptions using GPT-4V with different prompting strategies (a sketch follows this list)
- Collect sighted user feedback on these candidates
- Validate with BLV educators that this approach creates useful descriptions
- Build comprehensive datasets for multiple tasks
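A minimal sketch of the candidate-generation step, assuming the OpenAI Python SDK and a vision-capable chat model; the prompt texts and model name here are illustrative placeholders, not the paper's actual prompting strategies:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompting strategies -- not the paper's actual prompts.
PROMPTS = {
    "overview": "Describe this diagram in one short paragraph for a blind reader.",
    "structured": "List the diagram's components and how they are connected, step by step.",
    "detailed": "Describe this diagram in detail, including labels, arrows, and overall layout.",
}

def generate_candidates(image_url: str, prompts: dict[str, str] = PROMPTS) -> dict[str, str]:
    """Generate one candidate description per prompting strategy."""
    candidates = {}
    for name, prompt in prompts.items():
        response = client.chat.completions.create(
            model="gpt-4o",  # stand-in for whichever vision model is actually used
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
        )
        candidates[name] = response.choices[0].message.content
    return candidates
```

Sighted annotators then only judge and rank candidates like these rather than writing descriptions themselves.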
Key Technical Contributions:
Multi-pass inference approach: Used progressive prompting to generate diagram descriptions with increasing complexity and specificity
Annotation protocol: Designed an efficient protocol for collecting sighted user evaluations of:
- Description completion
- Comparative preference
- Verification of description accuracy
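One way to picture the resulting annotation record; the field names here are my own guesses for illustration, not the released schema:

```python
from dataclasses import dataclass

@dataclass
class SightedFeedback:
    """One sighted annotator's feedback on VLM-generated candidates for a single diagram.
    Hypothetical schema for illustration only."""
    diagram_id: str
    completion: str            # annotator's continuation of a partially shown description
    preferred_candidate: int   # index of the preferred candidate among those compared
    is_accurate: bool          # whether the chosen description matches the diagram
```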
Dataset creation: Released 5 datasets (137K samples across 5K diagrams):
- SightCOMPLETE: 50K samples with completion annotations
- SightPREFER: 71K preference annotations between descriptions
- SightRETRIEVE: 5K diagram-description matching samples
- SightQA: 6K question-answer pairs about diagrams
- SightREASON: 5K multi-step reasoning examples
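For anyone wanting to reuse the preference annotations for their own preference tuning, they presumably reduce to (prompt, chosen, rejected) triples; a hedged sketch of that conversion, using hypothetical field names rather than the actual Sightation schema:

```python
def to_preference_pairs(records: list[dict]) -> list[dict]:
    """Convert preference annotations into (prompt, chosen, rejected) triples,
    the format expected by common preference-tuning tooling.
    Input field names are hypothetical, not Sightation's actual schema."""
    pairs = []
    for rec in records:
        candidates = rec["candidates"]            # list of candidate descriptions
        chosen_idx = rec["preferred_candidate"]   # index picked by the sighted annotator
        for i, candidate in enumerate(candidates):
            if i == chosen_idx:
                continue
            pairs.append({
                "prompt": "Describe this diagram for a blind or low-vision reader.",
                "chosen": candidates[chosen_idx],
                "rejected": candidate,
            })
    return pairs
```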
Evaluation: BLV educators rated descriptions produced with sighted feedback as comparable to or better than expert-written ones in terms of content coverage, sequence, and additional information.
Fine-tuning results: Models fine-tuned on the Sightation datasets showed significant improvements:
- LLaVA-1.5 improved from 12.4% to 53.7% win rate against ChatGPT
- GPT-4V improved from 44.7% to 68.5% win rate in blind evaluations
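Win rates like these typically come from blinded pairwise comparisons. A minimal sketch of such an evaluation loop, with a generic `judge` callable standing in for a human or model judge; this is my framing, not necessarily the paper's exact protocol:

```python
import random

def blinded_verdicts(pairs: list[tuple[str, str]], judge) -> list[str]:
    """Collect blinded pairwise verdicts.
    `pairs` holds (model_A_description, model_B_description) per diagram;
    `judge` is any callable returning "first" or "second" for two anonymized texts.
    Presentation order is randomized so the judge cannot tell which model wrote which."""
    verdicts = []
    for desc_a, desc_b in pairs:
        swapped = random.random() < 0.5
        first, second = (desc_b, desc_a) if swapped else (desc_a, desc_b)
        picked_first = judge(first, second) == "first"
        # Map the position-based verdict back to the underlying model.
        verdicts.append("A" if picked_first != swapped else "B")
    return verdicts

def win_rate(verdicts: list[str]) -> float:
    """Fraction of comparisons won by model A."""
    return sum(v == "A" for v in verdicts) / len(verdicts)
```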
I think this approach could be a game-changer for accessibility. Rather than relying on expensive BLV expert annotations or settling for lower-quality direct annotations from sighted users, this feedback-based approach produces high-quality descriptions at scale. The methodology could extend beyond diagrams to other visual accessibility challenges where the consumer and producer of descriptions have different visual abilities.
TLDR: The researchers created a method and datasets that use sighted user feedback on AI-generated diagram descriptions to create high-quality, BLV-aligned content. Models fine-tuned on these datasets produce significantly better descriptions for visually impaired users.
Full summary is here. Paper here.