Top of Mind

A small, white robot stands beside a wooden easel in a sunlit forest. The robot has a backpack and is holding a paintbrush in its right hand. It appears to be observing the scene in front of it, which includes a small stream with a waterfall and a mossy bank. The background is a lush, green forest with tall trees and dappled sunlight.

When I write blog and social media posts on Bluesky and Mastodon, I want a tool to analyze a supplied image and generate a concise and accurate description, called "alt text," to benefit visually impaired readers and help me streamline the publishing effort. Thankfully, generative AI tools have expanded their ability to process and interpret images, making them ideal for this use case.

Not All LLMs Are Created Equal

The number of available LLMs has surged, each with distinct features and capabilities. Their performance, however, can vary significantly, even when you use the same prompt and image. These differences are often attributed to the “black box” nature of LLMs, that is, to differences in their internal structures and in how they are trained.

Choices, Choices. Which One Should I Use?

I ran a set of test images and prompts through multiple models, then evaluated the generated alt text for accuracy and presentation. This subjective critique helped me choose the model that best meets my needs. It also showed which model is best in class at image processing and recognition.
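
A comparison like this can be kept honest by writing the scores down as you go. The following is a minimal sketch, assuming each alt text gets a 1 to 5 score for accuracy and for presentation; the function name `rank_models`, the scoring scale, and the tuple layout are illustrative assumptions, not the method used for this post.

```python
from collections import defaultdict
from statistics import mean


def rank_models(ratings):
    """Rank models by their average score, best first.

    ratings: iterable of (model, image, accuracy, presentation) tuples,
    where accuracy and presentation are subjective scores from 1 to 5.
    """
    per_model = defaultdict(list)
    for model, _image, accuracy, presentation in ratings:
        # Weigh accuracy and presentation equally for each image.
        per_model[model].append((accuracy + presentation) / 2)
    averages = {model: mean(scores) for model, scores in per_model.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```

Call it with one tuple per model and image, for example `rank_models([("model_a", "img1", 5, 4), ("model_b", "img1", 3, 3)])`. The model names and scores here are placeholders, not results from the tests described in this post.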

Evaluating LLMs for Alt Text Generation

I selected four LLMs capable of image processing and recognition:

OpenAI ChatGPT-4.0
Cohere Command R Plus 08/2024
Anthropic Claude 3.5
Google Gemini 1.5

All models were tested in their free versions without subscriptions.

Prompt Engineering for Alt Text Output

Prompt engineering techniques ensured consistent output over multiple test runs and meaningful comparisons between the models. While I crafted a universal prompt for ChatGPT, Claude, and Gemini, I designed a distinct prompt for the HuggingChat-hosted Cohere model because it relies on an external tool, CogVLM, for image processing.

Example Prompt for Cohere:


        You are a skilled social media manager and blog author. Use the CogVLMv1
        Image Captioner tool to analyze the uploaded image. Rewrite its output
        into a concise and descriptive alt text for visually impaired readers.
        Provide factual image description. Include objects, background,
        interactions, gestures, poses, visible text, frequency. Describe colors,
        contrasts, textures, materials, composition, focus points, camera angle,
        perspective, context, lighting, shadows. Avoid subjective
        interpretations, speculation. Exclude introductory text, comments and
        administrative details. Write in English using its active voice, limited
        to five sentences. If the tool cannot be invoked, state: “Image
        Captioner tool not invoked, please try again in a new chat.” If the tool
        returns an error, relay the error message verbatim for troubleshooting.

Example Prompt for ChatGPT, Claude, and Gemini:


        You are a skilled social media manager and blog author. Analyze the
        supplied image to create concise and descriptive alt text for visually
        impaired readers. Provide factual image description. Include objects,
        background, interactions, gestures, poses, visible text, frequency.
        Describe colors, contrasts, textures, materials, composition, focus
        points, camera angle, perspective, context, lighting, shadows. Avoid
        subjective interpretations, speculation. Exclude meta-text, comments, or
        administrative details. Write in English using its active voice, limited
        to five sentences. If image processing fails, state: “I could not decode
        an image. Please try again.”
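
Both prompts set rules that are easy to check mechanically: a five-sentence limit, no introductory meta-text, and a fixed message when processing fails. The following is a minimal sketch of such a check, assuming a naive sentence splitter; the names `check_alt_text`, `FAILURE_SENTINELS`, and `META_OPENERS` are hypothetical, and the list of meta-text openers is illustrative rather than exhaustive.

```python
import re

# Failure messages requested by the two prompts above.
FAILURE_SENTINELS = (
    "I could not decode an image. Please try again.",
    "Image Captioner tool not invoked, please try again in a new chat.",
)

# Openers that count as meta-text (assumed list, extend as needed).
META_OPENERS = ("the image depicts", "the image shows", "this image", "here is")


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def check_alt_text(text, max_sentences=5):
    """Return a list of problems; an empty list means the text passed."""
    # Normalize curly quotes and drop quotes wrapping the whole reply.
    normalized = text.strip().replace("“", '"').replace("”", '"').strip('"')
    if any(sentinel in normalized for sentinel in FAILURE_SENTINELS):
        return ["model reported a failure"]
    problems = []
    count = len(split_sentences(normalized))
    if count > max_sentences:
        problems.append(f"{count} sentences, limit is {max_sentences}")
    if normalized.lower().startswith(META_OPENERS):
        problems.append("starts with meta-text")
    return problems
```

A check like this only covers form. Whether the description actually matches the image still needs a human reviewer.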

How Did They Compare?

ChatGPT-4.0 - ChatGPT excelled in accuracy and descriptive quality, delivering alt text that was clear, concise, and contextually appropriate. It consistently met my expectations and outperformed the others in execution speed, making it the top performer.
Cohere Command R Plus - Cohere demonstrated strong performance, producing alt text comparable to ChatGPT's. However, its reliance on the external CogVLM tool added complexity, and its execution speed was noticeably slower than ChatGPT's.
Anthropic Claude 3.5 - Claude’s output was solid but fell short of the top two models. Its alt text tended to adopt a third-person tone, such as “The image depicts people doing stuff and enjoying themselves,” which felt less natural compared to ChatGPT’s more direct descriptions like “People doing stuff and enjoying themselves.”
Google Gemini 1.5 - Gemini ranked last. While it handled some images well, it occasionally hallucinated, generating descriptions that didn’t match the image. Additionally, this model refused to process images containing people, a significant limitation for creating alt text for diverse content.

ChatGPT Leads the Pack

From a qualitative standpoint, ChatGPT-4.0 and Cohere emerged as the image processing and recognition frontrunners. However, ChatGPT’s faster processing speed and ease of use gave it the edge in overall performance.

Generative AI makes social media and blog content more inclusive by helping improve accessibility for visually impaired readers while decreasing the publishing effort for creators.

Sunday, December 15, 2024

Generative AI Aiding Accessibility, Quickly