Wednesday, April 24, 2024

Making AI Less of a Hallucination

A Practical Take on Prompt Engineering

A robot looking at itself in a mirror, reaching out to touch its image in it. Source: Dall-E 3
Source: DALL-E 3

Imagine crafting a sleek, minimalist generative AI application designed to streamline our all-too-human shortcomings in spelling and grammar. Sounds practical, right? Well, here’s a lesson I learned along the way: utilize prompt engineering to ensure that the app avoids pesky hallucinations.

Agile development

I quickly assembled this app using just fifty-ish lines of Python code. I also crafted a system prompt to guide the large language model (LLM) towards generating the desired output.

Here’s a peek at the original prompt:

You are an advanced language model trained to identify and correct spelling and grammar errors while enhancing the clarity and conciseness of professional communication. Please review the text provided below, correct any errors, revise the content for professional clarity and conciseness without altering the original meaning. Respond back to this prompt with your revised text, followed by an itemized list that highlights the corrections you made. Please format your response as markdown.

Well, perhaps too agile...

I discovered after running a few emails and blog posts through the app that its generated output was a bit too foot loose and fancy free. The LLM got confused about splitting the results into two distinct sections that I wanted. Plus, it started seeing spelling and grammar errors that were not there - classic LLM hallucinations.

An image of a robot looking confused since it is hallucinating.  Source: Dall-E 3
Source: DALL-E 3

The hallucinations were more frequent when I submitted text that was clean in the first place. The LLM appeared confused, trying to please the boss by attempting to correct nonexistent errors.

Here is the original prompt:

tech writer generative AI application original text used to test LLM

And here is the output the LLM generated. The blue box highlights an error that isn't, and the purple highlights corrected text that wasn't:

tech writer generative AI application output with unimproved system prompt

Making adjustments

No problem, though. I sharpened my pencil and rewrote the prompt to be more specific on what I wanted to see and how it should be formatted:

You are an advanced model trained to identify and correct English language spelling and grammar errors while enhancing the clarity and conciseness of professional communication. Please review the text provided below, correct any errors, revise the content for professional clarity and conciseness without altering the original meaning. Respond back to this prompt with two sections. The first section shall shall be titled Revised Text:, and contain your revised text. The second section shall be titled Corrections:, and contain an bulletized list highlighting the corrections you made. If you cannot make corrections to the provided text, just say the provided text is grammatically correct. Finally, please emit your response in markdown format so it can be streamed inside a web application. From a styling perspective, when you generate the section headers, use level two markup, e.g., ## Revised Text:, ## Corrections:.

This revision of the prompt clearly states the requirements. It also emphasizes what the model should and should not return.

Cleaner output

With increased focus and direction, the LLM got its act together, emitting well-organized responses. The new prompt provided the necessary, additional context, eliminating hallucinations. Now, the generated content is consistent and reliable, just as we expect.

tech writer generative AI application output with improved system prompt

Feel free to fork

Curiosity piqued? Dive into the code on my GitHub and experiment with different models and prompts.

This fascinating example of AI refinement not only will enhance written communication, but also offers a glimpse into the future of personalized technology, accessible straight from your laptop. If you'd like to discuss this further, feel free to connect with me on LinkedIn.

Saturday, March 16, 2024

AI Writing Co-pilot

image of a AI robot, typing at a computer, presumably writing a blog post article, and wanting an application to help it write more clearly, concisely and effectively
Source: DALL-E 3

Generative AI applications like Grammarly are revolutionizing writing, enhancing quality, and boosting speed. Yet, the elephant in the room is data privacy. To sidestep concerns, partnering with services that offer robust data protection is key. Better yet, consider solutions that don't require sending your precious words to a distant server.

Some things are best left unsaid ..

.. and not processed by remote systems outside of your control.

Enter the realm of open large language models (LLMs). These models, unlike their proprietary cousins from OpenAI (ChatGPT) and Google (Gemini, formerly Bard), process data locally, putting your privacy concerns to rest. And, equally important, open source LLM have licensing terms that stipulate the output generated by them is owned by you, and not the creators of the LLM models (Note: always read the licensing terms).

Driven by a need for data privacy, I crafted a simple AI tool leveraging an open-source LLM to spruce up my writing. This journey wasn't just about creating; it was about diving deep into the world of generative AI and LLM customization.

Hello Hal, improve my writing ...

HAL 9000 from the movie 2001 A Space Odyssey, saying hello to Dave

As I wrote about in a previous blog post, I have been upskilling in generative artificial intelligence and machine learning. This opportunity allowed me to learn about generative AI application development with a use case that would directly benefit my daily personal and professional life. As I got on with this endeavor, one common challenge quickly surfaced.

.. but, Hal, help in the way I want.

A common issue with using LLM is that if you do not set explicit expectations on how they should generate content, they tend to hallucinate i.e. produce gibberish and unpredictable content. The solution here is employ a common technique called prompt engineering to provide that missing context to the LLM to avoid it going off the rails.

Choices, choices ...

The number of available open source LLM is growing, and they differentiate themselves based on size, performance benchmark scores, and how they are fine tuned to meet specific use cases. Naturally, evaluating these models can be daunting. Ollama is changing the game by:

  1. Offering a rich selection of models, all while upholding strict privacy standards,
  2. Bridging the gap between open-source LLMs and applications,
  3. Simplifying integration with a uniform API.

I find Ollama to be invaluable when integrating AI into applications.

Meta vs. Google

Meta's Llama2 versus Google's Gemma LLM logos depicted side by side

Evaluating open source LLM options, I considered Meta's Llama2 and Google's Gemma.

From a licensing perspective, I found Google's terms more favorable versus Meta's terms. In particular, when I read Meta's license terms, I found it to be silent on the ownership issue of generated output.

From a technical perspective, I used Ollama to pull both models and, using prompt engineering, applied the same system prompt. Gemma produced more predictable and neatly formatted output when compared to Llama2.

Google's Gemma was the clear winner.

The final product

Remarkably, I implemented this application using less than 60 lines of Python code, and a model file that contains the specialized system prompt to the LLM.

If you are interested in the Python source and the system prompt, you can download it from my Github.

Feel free to reach out to me on LinkedIn for questions about this project and how to integrate open source LLM in your projects.

Sunday, January 28, 2024

A GenAI Platform Build

TinyLlama LLM project mascot
TinyLlama - a small yet effective open source LLM model

I wanted a platform to support my upskilling efforts in generative AI, machine learning and natural language processing. Finding one that met my needs took me some time, labor and a surprisingly modest amount of cash.

My first stop along this journey was discovering a software stack within the Docker Hub repository that provided the necessary components to accelerate NLP application development. This specific open source project is called 'docker/genai-stack'

The GenAI stack

The GenAI stack architecture
The GenAI stack architecture

The genai-stack offers the following resources to support generative AI application development:

  1. neo4j - a docker container from Neo4j of a vector store database critical for RAG application development.
  2. Ollama - the Ollama project provides a docker container that exposes an API to applications, allowing interactions with open source large language models. Open source LLM models execute locally on the host computers and avoid sending private data to third-party services, i.e., OpenAI. Ollama provides a library of LLM models. Developers select one, then the genai-stack pulls, runs, and exposes the selected model's resources via API. Currently the Ollama docker container supports Linux and MacOS hosts; support for Windows hosts is on the roadmap.
  3. CPU execution vs. GPU acceleration - The Ollama container can recognize the presence of a GPU installed on the host and leverage it for accelerated LLM processing. Optionally, the developer can select CPU execution. The genai-stack allows this selection via command line switches when starting the stack.
  4. Container orchestration - Since the genai-stack packages several containers, orchestration is essential. The stack achieves this via docker-compose, embedding health checks to test container health and error checks to abort the application build and start-up processes.
  5. Closed LLM selection - The developer can utilize closed LLM models from AWS and OpenAI if desired.
  6. Embedding model choice - Similar to providing a choice of LLM, the genai-stack offers the developer a choice between open and closed source embedding models. Ollama supports closed embedding models from OpenAI and AWS, open models from Ollama, and, in addition, the SentenceTransformer model from
  7. LangChain API - given that the stack aggregates various LLM and embedding model resources, the stack abstracts this complexity by providing a standard API interface. The project leverages the LangChain Python framework, providing the developer a wide choice of programming APIs to meet various use cases.
  8. Demo applications - the genai-stack provides several demonstration applications. One that I found very valuable is the pdf_bot application. It implements the common 'Ask Your PDF' use case out of the box.

I needed a hardware update

I always favor repurposing old hardware for new use cases, versus buying new.  For example, until I embarked on this journey, I was using an ancient HP laptop initially designed for Windows 7. I loaded Ubuntu onto it long ago, primarily for a web browser and email reader. But could it do the job with this project?

Naively, I loaded the genai-stack onto it. Executing basic LLM queries was painfully slow because of the vintage Intel Core i3 CPU (1st generation!) and lack of an installed GPU.

A further constraint: the vintage CPU lacked the necessary AVX instruction and thus could not run the Ollama container.

Time for an upgrade!

Returning from a recent business trip, I was sitting in the Denver, CO, airport, waiting for a flight back to the metro NYC area. I was browsing through Amazon when a refurbished Dell Optiplex 9020 server caught my eye. 

Dell Optiplex 9020 Mini Tower
Dell Optiplex 9020 Mini Tower

The listed price at USD 250 was modest. But did its CPU support that AVX instruction?

The Amazon post indicated that the model uses the Intel i7 Core processor. The Dell published tech specs indicated Dell built these 9020 systems using the Intel 4th generation Haswell processor, which did, indeed, support the AVX (and AVX2) instructions.

With 32 GB of RAM and 1 TB of SSD, I felt it had sufficient dimensions and was a very economical choice.

The need for speed

Not content without having the option to accelerate LLM queries, I set out looking for an NVIDIA GPU for this server. There were two main physical limitations, one I could easily overcome; the other set a constraint on the specific GPU I eventually bought.

Not enough power or room
Dell Optiplex 9020 Mini Tower internals
  1. The server's physical layout - the SSD mounting cages physically constrain the area around the PCI expansion, limiting me to purchasing older, shorter NVIDIA GPU cards vs. the more modern, physically larger GPU.
  2. The stock power supply capacity - Dell built the server I purchased with a power supply rated 290 Watts. More reading and research into GPU options revealed that any GPU I did buy would add to the power draw, and the stock power supply was under-dimensioned.

I found this post in the Dell community forum which guided me to a power supply upgrade and the final selection of an NVIDIA GeForce 1050 Ti GPU. I purchased the higher capacity power supply from NewEgg and a used GPU from eBay.

NVIDIA GeForce 1050 Ti GPU
NVIDIA GeForce 1050 Ti GPU

How did the server perform?

Quite well, actually.

I built the system with the upgraded power supply and GPU, then loaded Ubuntu LTS with the NVIDIA GPU drivers. I forked the upstream project, made a few modifications, and pushed the fork to my GitHub repository.

TinyLama's small footprint fits inside the NVIDIA GeForce 1050 Ti GPU
TinyLlama's small footprint fits inside this GPU
  1. Ollama with GPU - The Ollama project provides a wide selection of models, varying in size and fine-tuned to various use cases. Given the modest dimensions of the 1050 Ti GPU, I was delighted to find the TinyLlama model fit within the GPU memory space. Even though TinyLlama has a small memory footprint, it performs well regarding response speed and generated output, as observed with the pdf_bot application using its web browser interface. 
  2. modifications - After I forked the project, I modified the to allow scanning, parsing and vectorizing Microsoft Word .docx files. In addition to querying your PDF documents with natural language queries, I can now make similar queries against MS Word documents.
  3. modifications - I modified the genai-stack provided module, pointing it to select the vectorized documents from neo4j and away from another demo application provided by the upstream project.
  4. creation - Wanting a command line interface to supplement the supplied web interface, I created a bash shell wrapper script for the API. The shell script executes curl, obtains the JSON API responses and provides a clean output on the terminal console. I also created a command line option to feed a list of queries from an input text file for batch processing several queries against a vectorized document, which is quite handy. I quickly created the shell script; it is underperforming in terms of its response time. I intend to build a better-performing Python script that has improved streaming.

Where I finally landed

Overall, I spent approximately USD 450 on this project, which met my original goals: owning a hardware platform, capable of Generative AI and NLP application development using open source LLM models, to assist me in my upskilling efforts. I am also satisfied because I reused aftermarket, aging hardware for modern application development with surprising performance characteristics. This approach fits with my personal ethos of being environmentally responsible.  I achieved this through implementing the concept of "reduce, reuse and recycle".

If you have further questions on this project and what it can do, feel free to reach out via InMail on LinkedIn.

Tuesday, December 19, 2023

Installing Kubernetes on Raspberry Pi

Kubernetes cluster built from three Raspberry Pi 4 single board computers
Kubernetes cluster powered by Raspberry Pi

I recently deployed Kubernetes on a cluster of three Raspberry Pi 4 single board computers (SBCs), each with 32 Megabytes of microSD storage and 4 Gigabytes of RAM.

Initially, I struggled with two approaches based from guides I found online. The first attempt was with Raspberry Pi OS, followed by a second using Ubuntu LTS 22.04. Both used Kubernetes packages from official repositories and Flannel and MetalLB for networking.

Both approaches resulted in integration challenges between K8S and networking based plugins, i.e., Calico, Flannel, and MetalLB.

After several attempts, I discovered k3s, a pre-integrated, single-binary Kubernetes distribution.  It supports single-node deployments and scaling with worker nodes.  It minimizes external dependencies, including the network plugins, which challenged me.  Surprisingly, k3s with Ubuntu LTS 22.04 consumed only about 4.8 GB of microSD storage before I started to onboard container images.

The outcome with k3s was a three-node Kubernetes cluster deployed within 90 minutes.

Here are the steps I took to deploy k3s with Ubuntu LTS Server for ARM:

Install OS on microSD cards

  1. Download and run the Raspberry Pi imager (click here for installation instruction on Ubuntu.)
  2. Click the 'gear wheel' to access the configuration options to choose unique hostnames for your Pi boards (e.g., k3s-server for the master node).
  3. Create a filesystem image on a microSD card for each board in your cluster.

Static DHCP Address Assignments

Assign static IP addresses to each Pi's NIC on your router or DHCP server to avoid operational issues during pod deployment and access.


For each Raspberry Pi board:

  1. Log into the master node k3s-server.
  2. Run sudo visudo and enable NOPASSWD: ALL for sudo group users.
  3. Upload the following bootstrap shell script.
  4. Save the output to a log file for troubleshooting with script bootstrap.log.
  5. Execute the script ./

Repeat the imaging and bootstrapping for additional worker nodes.

Preparing the Master Node

The master node setup is streamlined thanks to k3s:

  1. Upload the following setup script to the k3s-server master node.
  2. Create a log with script master-node.log.
  3. Execute the setup script.
  4. End logging with exit.

Deploying a Management Node

For remote cluster administration:

  1. On your laptop, install kubectl and helm.
  2. Run the following commands:

Testing Services through the Load Balancer

On the management node, run kubectl get services --all-namespaces and look for the service name test-cluster.  From that line obtain and write down the external IP address.   You will also need to write down the external port number.

The  test-cluster pod is hardcoded to use an internal port 3000, and the k3s LoadBalancer service will assign it a random external port.  In the 'PORT(S)' column, you can see both.  On the management node, use curl to retrieve a JSON object from this port by invoking curl to the k3s server external IP address and external port. If you are successful, you should see this:


You now have a functional, cost-effective Raspberry Pi Kubernetes cluster for testing containerized services.

Sunday, September 10, 2023

Creating a Tumblr timeline

Tumblr logo credit
Credit: Mashable: Bob Al-Green

I have been using Tumblr on and off for years.  I created my account around 2007 when David Karp launched the platform.   I was fascinated then with the concept - create anything was the slogan - and the technology. Tumblr makes it easy for a blogger to create multi modal posts with childlike ease.   I also like how it does not pose a hard limit on the amount of text content you can place inside a post too.

Like most social media platforms, you can embed a Tumblr post within a web page using embed HTML code.  And also like most social media platforms, Tumblr provides an API that allows third party developers to create software that interacts with the Tumblr platform.

Tumblr currently has two versions of their API - version 1 and 2.   Version 2 provide a rich set of capabilities, and supports various scripting languages.  That said I found the older API Version 1 to be useful as well.

Version 1 is a straightforward Javascript based API.  It does not require any client authentication to the platform, either.  I found both characteristics ideal for using it in web browsers. I use Version 1 to embed a timeline of Tumblr blogs in static webpages.  By doing so, I provide a Tumblr's blogging utility to static websites.

Tumblr blogs support hashtags, and its Version 1 API provides an option to return posts that only contain specified hashtags.   Coupled with additional Javascript logic, it is possible to create a whitelist / blacklist function that allows specific blog content to be embedded in webpages.

The Javascript code is hosted on The code consumes the result of the API v1 call, then proceeds to build embed codes from the post information contained in the result.

The code is object based. You create an instance by calling a constructor, in the HTML HEAD section, with a few defined options:

  • debug is a boolean flag which toggles verbose debugging code to the browser console,
  • blog_id is the Tumblr blog unique identifier,
  • limit is the maximum number of posts embedded in the page,
  • blacklisted_hashtags is a list of hashtag strings to filter out,
  • top_dom_element_id is the HTML element id value under which to append the posts,
  • fancy_posts is a boolean flag to use the Tumblr returned iframe (1) or have the javascript build its own native HTML code (0)

To embed the posts within the HTML body, you need to:

  1. create a HTML DIV element with a ID attribute, and
  2. give the ID a unique name within the HTML document, and
  3. ensure the ID name is specified as a value associated with the top_dom_element_id key in the object's constructor call, then
  4. embed the following Javascript in the HTML BODY (ensure the HTML DIV name is used in the getElementById call)

If you want to see a website that contains two instances of embedTumblrPosts, head over to

Recent Posts