When I write blog posts and social media posts for Bluesky and Mastodon, I want a tool that analyzes a supplied image and generates a concise, accurate description, called "alt text," to benefit visually impaired readers and streamline my publishing effort. Thankfully, generative AI tools have expanded their ability to process and interpret images, making them a good fit for this use case.
Not All LLMs Are Created Equal
The number of available LLMs has surged, each with distinct features and capabilities. Their performance, however, can vary significantly, even when you use the same prompt and image. These differences are often attributed to the “black box” nature of LLMs: their internal structures and training differ, and neither is fully visible to the people using them.
Choices, Choices. Which One Should I Use?
I processed a set of test images with prompts across multiple models, then evaluated the generated alt text for accuracy and presentation. This subjective critique helped me choose the model that best meets my needs. It also gives you a sense of which model is currently strongest at image processing and recognition.
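A comparison like this can be scripted so that every model sees the same images and the same prompt. The sketch below is only an illustration, not the workflow used for this post: `run_comparison` and the `DescribeFn` signature are hypothetical names, and each model is assumed to be wrapped in a plain function that takes an image path and a prompt and returns alt text. Stand-in functions replace the real models so the example runs offline.

```python
import time
from typing import Callable, Dict, List

# Hypothetical signature (an assumption, not any vendor's API): a wrapper that
# takes an image path and a prompt and returns the generated alt text.
DescribeFn = Callable[[str, str], str]


def run_comparison(models: Dict[str, DescribeFn], images: List[str], prompt: str) -> List[dict]:
    """Run every image through every model, recording the output and timing."""
    rows = []
    for image in images:
        for name, describe in models.items():
            start = time.perf_counter()
            alt_text = describe(image, prompt)
            elapsed = time.perf_counter() - start
            rows.append({
                "model": name,
                "image": image,
                "alt_text": alt_text,
                "chars": len(alt_text),   # rough proxy for conciseness
                "seconds": elapsed,       # wall-clock time for this call
            })
    return rows


if __name__ == "__main__":
    # Stand-in functions so the sketch runs without network access.
    stubs = {
        "model_a": lambda image, prompt: f"Placeholder description of {image}",
        "model_b": lambda image, prompt: f"The image depicts {image}",
    }
    for row in run_comparison(stubs, ["cat.jpg", "beach.png"], "Write alt text."):
        print(f'{row["model"]:8} {row["image"]:10} {row["seconds"]:.4f}s  {row["alt_text"]}')
```

The resulting rows still need a human read for accuracy and tone; the script only keeps the inputs identical and the timing comparable.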
Evaluating LLMs for Alt Text Generation
I selected four LLMs capable of image processing and recognition:
- OpenAI ChatGPT-4.0
- Cohere Command R Plus 08/2024
- Anthropic Claude 3.5
- Google Gemini 1.5
I tested the free version of each model, without a paid subscription.
Prompt Engineering for Alt Text Output
Prompt engineering techniques ensured consistent output over multiple test runs and meaningful comparisons between the models. I crafted one universal prompt for ChatGPT, Claude, and Gemini, but designed a distinct prompt for the Cohere model hosted on Hugging Chat because it relies on an external tool, CogVLM, for image processing.
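One rough way to put a number on "consistent output over multiple test runs" is to compare the alt text from repeated runs of the same prompt and image. The sketch below is an assumption about how that could be done, not the method used in this post: `consistency` is a hypothetical helper that averages the pairwise similarity reported by Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def consistency(outputs: List[str]) -> float:
    """Mean pairwise similarity (0.0 to 1.0) of outputs from repeated runs."""
    if len(outputs) < 2:
        return 1.0  # a single run is trivially consistent with itself
    # Ignore case and whitespace differences before comparing.
    normalized = [" ".join(text.lower().split()) for text in outputs]
    ratios = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(normalized, 2)
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    runs = [
        "A brown dog running on a beach.",
        "A brown dog runs along a beach.",
        "Brown dog running on the beach.",
    ]
    print(f"consistency: {consistency(runs):.2f}")
```

A score near 1.0 suggests the prompt constrains the model tightly. Character-level similarity says nothing about whether the description is accurate, so it complements a manual review rather than replacing it.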
Example Prompt for Cohere:
|
Example Prompt for ChatGPT, Claude, and Gemini:
|
How Did They Compare?
- ChatGPT-4.0 - ChatGPT excelled in accuracy and descriptive quality, delivering alt text that was clear, concise, and contextually appropriate. It consistently met my expectations and outperformed others in execution speed, making it the top performer.
- Cohere Command R Plus - Cohere demonstrated strong performance, producing alt text comparable to ChatGPT. However, its reliance on the external CogVLM tool added complexity, and its execution speed was noticeably slower than ChatGPT.
- Anthropic Claude 3.5 - Claude’s output was solid but fell short of the top two models. Its alt text tended to adopt a third-person tone, such as “The image depicts people doing stuff and enjoying themselves,” which felt less natural compared to ChatGPT’s more direct descriptions like “People doing stuff and enjoying themselves.”
- Google Gemini 1.5 - Gemini ranked last. While it handled some images well, I noticed it occasionally hallucinated, generating descriptions that didn’t match the image. Additionally, this model refused to process images containing people, a significant limitation for creating alt text for diverse content.
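The “The image depicts…” framing noted for Claude can also be handled after generation. The following is a minimal sketch, assuming a small, illustrative list of lead-in phrases. `strip_framing` is a hypothetical helper, not part of any model's API, and it would need extending to cover whatever phrasing a given model actually produces.

```python
import re

# Illustrative (not exhaustive) pattern for lead-ins such as
# "The image depicts" or "This photo shows".
_FRAMING = re.compile(
    r"^\s*(?:the|this)\s+(?:image|picture|photo|photograph)\s+"
    r"(?:depicts|shows|features|displays)\s+",
    re.IGNORECASE,
)


def strip_framing(alt_text: str) -> str:
    """Remove a leading framing phrase, then capitalize the first character."""
    stripped = _FRAMING.sub("", alt_text, count=1).strip()
    if not stripped:
        return stripped
    return stripped[0].upper() + stripped[1:]


if __name__ == "__main__":
    print(strip_framing("The image depicts people doing stuff and enjoying themselves."))
```

Text that already starts with the description passes through unchanged, so the same step can be applied to every model's output.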
ChatGPT Leads the Pack
From a qualitative standpoint, ChatGPT-4.0 and Cohere emerged as the image processing and recognition frontrunners. However, ChatGPT’s faster processing speed and ease of use gave it the edge in overall performance.
Generative AI makes social media and blog content more inclusive by helping improve accessibility for visually impaired readers while decreasing the publishing effort for creators.