Sunday, January 28, 2024

A GenAI Platform Build

TinyLlama - a small yet effective open source LLM

I wanted a platform to support my upskilling efforts in generative AI, machine learning and natural language processing. Finding one that met my needs took me some time, labor and a surprisingly modest amount of cash.

My first stop along this journey was discovering a software stack in the Docker Hub repository that provides the necessary components to accelerate NLP application development. This open source project is called 'docker/genai-stack'.

The GenAI stack

The GenAI stack architecture

The genai-stack offers the following resources to support generative AI application development:

  1. neo4j - a docker container from Neo4j providing the vector store database, a critical component for RAG application development.
  2. Ollama - the Ollama project provides a docker container that exposes an API to applications, allowing interactions with open source large language models (see the first sketch after this list). Open source LLMs execute locally on the host computer and avoid sending private data to third-party services, e.g., OpenAI. Ollama provides a library of models; the developer selects one, then the genai-stack pulls, runs, and exposes the selected model's resources via the API. Currently the Ollama docker container supports Linux and macOS hosts; support for Windows hosts is on the roadmap.
  3. CPU execution vs. GPU acceleration - The Ollama container can recognize the presence of a GPU installed on the host and leverage it for accelerated LLM processing. Optionally, the developer can select CPU execution. The genai-stack allows this selection via command line switches when starting the stack.
  4. Container orchestration - since the genai-stack packages several containers, orchestration is essential. The stack achieves this via docker-compose, with health checks to verify each container and error handling that aborts the build and start-up process when a container fails.
  5. Closed LLM selection - The developer can utilize closed LLM models from AWS and OpenAI if desired.
  6. Embedding model choice - similar to the choice of LLM, the genai-stack offers the developer both open and closed source embedding models. The stack supports closed embedding models from OpenAI and AWS, open models served by Ollama, and, in addition, the SentenceTransformers model from SBERT.net.
  7. LangChain API - because the stack aggregates various LLM and embedding model resources, it abstracts this complexity behind a standard API. The project leverages the LangChain Python framework, giving the developer a wide choice of programming APIs to meet various use cases (see the second sketch after this list).
  8. Demo applications - the genai-stack provides several demonstration applications. One that I found very valuable is the pdf_bot application. It implements the common 'Ask Your PDF' use case out of the box.
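
To make item 2 concrete, here is a minimal sketch of calling Ollama's local HTTP API from Python. It assumes Ollama's default port 11434 and that a model such as tinyllama has already been pulled; it is an illustration, not code from the genai-stack itself.

    # Minimal call to Ollama's local HTTP API.
    # Assumes the default port 11434 and an already-pulled model.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "tinyllama", "prompt": "Why is the sky blue?", "stream": False},
    )
    resp.raise_for_status()
    print(resp.json()["response"])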
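
For item 7, here is a minimal sketch of how LangChain abstracts these resources, wiring Ollama and the neo4j vector store into a small RAG chain. The endpoints, credentials and model name below are assumptions matching the stack's defaults; adjust them to your own .env values. It also assumes a recent LangChain install that includes the langchain-community package.

    # Minimal RAG sketch wiring LangChain to the stack's services.
    # Endpoints, credentials and model are assumptions, not the stack's own code.
    from langchain.chains import RetrievalQA
    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.llms import Ollama
    from langchain_community.vectorstores import Neo4jVector

    llm = Ollama(base_url="http://localhost:11434", model="tinyllama")
    embeddings = OllamaEmbeddings(base_url="http://localhost:11434", model="tinyllama")

    # neo4j doubles as the vector store: embed and index a sample text.
    store = Neo4jVector.from_texts(
        ["Ollama serves open source LLMs over a local HTTP API."],
        embedding=embeddings,
        url="bolt://localhost:7687",
        username="neo4j",
        password="password",
    )

    # Retrieval-augmented question answering over the indexed text.
    qa = RetrievalQA.from_chain_type(llm=llm, retriever=store.as_retriever())
    print(qa.invoke({"query": "What does Ollama do?"})["result"])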

I needed a hardware update

I always favor repurposing old hardware for new use cases, versus buying new.  For example, until I embarked on this journey, I was using an ancient HP laptop initially designed for Windows 7. I loaded Ubuntu onto it long ago, primarily for a web browser and email reader. But could it do the job with this project?

Naively, I loaded the genai-stack onto it. Executing basic LLM queries was painfully slow because of the vintage Intel Core i3 CPU (1st generation!) and lack of an installed GPU.

A further constraint: the vintage CPU lacked the necessary AVX instruction set and thus could not run the Ollama container.
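
Whether a CPU supports AVX is easy to check up front: on Linux, the kernel exposes the CPU feature flags in /proc/cpuinfo. A quick sketch:

    # Check for the AVX / AVX2 instruction sets on a Linux host.
    # The kernel reports supported CPU features on the "flags" line.
    with open("/proc/cpuinfo") as f:
        flags = next(line for line in f if line.startswith("flags")).split()

    for isa in ("avx", "avx2"):
        print(f"{isa}: {'yes' if isa in flags else 'no'}")

The same information is available with a one-line grep of /proc/cpuinfo.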

Time for an upgrade!

Returning from a recent business trip, I was sitting in the Denver, CO, airport, waiting for a flight back to the metro NYC area. I was browsing through Amazon when a refurbished Dell Optiplex 9020 server caught my eye. 

Dell Optiplex 9020 Mini Tower

The listed price of USD 250 was modest. But did its CPU support the AVX instruction set?

The Amazon post indicated that the model uses an Intel Core i7 processor. Dell's published tech specs indicated these 9020 systems were built with the 4th generation Intel Haswell processor, which does indeed support the AVX (and AVX2) instruction sets.

With 32 GB of RAM and a 1 TB SSD, it was amply dimensioned for the job and a very economical choice.

The need for speed

Not content without the option to accelerate LLM queries, I set out looking for an NVIDIA GPU for this server. There were two main physical limitations: one I could easily overcome; the other constrained which specific GPU I eventually bought.

Not enough power or room
Dell Optiplex 9020 Mini Tower internals
  1. The server's physical layout - the SSD mounting cages constrain the area around the PCIe expansion slots, limiting me to older, shorter NVIDIA GPU cards rather than more modern, physically larger ones.
  2. The stock power supply capacity - Dell built the server I purchased with a power supply rated at 290 watts. More reading and research into GPU options revealed that any GPU I bought would add to the power draw, and the stock power supply was under-dimensioned for it.

I found this post in the Dell community forum, which guided me to a power supply upgrade and the final selection of an NVIDIA GeForce GTX 1050 Ti GPU, a card rated at roughly 75 watts that draws its power entirely from the PCIe slot. I purchased the higher capacity power supply from Newegg and a used GPU from eBay.

NVIDIA GeForce GTX 1050 Ti GPU

How did the server perform?

Quite well, actually.

I built the system with the upgraded power supply and GPU, then loaded Ubuntu LTS with the NVIDIA GPU drivers. I forked the upstream project, made a few modifications, and pushed the fork to my GitHub repository.

TinyLlama's small footprint fits inside the NVIDIA GeForce GTX 1050 Ti GPU
  1. Ollama with GPU - The Ollama project provides a wide selection of models, varying in size and fine-tuned to various use cases. Given the modest 4 GB of memory on the 1050 Ti, I was delighted to find that the TinyLlama model fits within the GPU memory space. Despite its small memory footprint, TinyLlama performs well in both response speed and output quality, as observed with the pdf_bot application through its web browser interface.
  2. pdf_bot.py modifications - after forking the project, I modified pdf_bot.py to scan, parse and vectorize Microsoft Word .docx files (a sketch of one approach follows this list). In addition to querying PDF documents with natural language, I can now make similar queries against MS Word documents.
  3. api.py modifications - I modified the genai-stack's api.py module, pointing it at the vectorized documents in neo4j rather than at the data of another demo application provided by the upstream project.
  4. pdf_bot_cli.sh creation - wanting a command line interface to supplement the supplied web interface, I created a bash shell wrapper script for the API. The script executes curl, captures the JSON API responses and prints clean output on the terminal console. I also added a command line option to feed a list of queries from an input text file, which is quite handy for batch processing several queries against a vectorized document. The shell script was quick to create but underperforms in response time, so I intend to build a better-performing Python script with improved streaming (a rough sketch follows this list).
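
For the .docx support in item 2, the sketch below shows one way to extract and chunk Word text for vectorization, using the python-docx package and LangChain's text splitter. The function name and parameters are illustrative rather than the exact code in my fork.

    # One way to extract .docx text for chunking and vectorization.
    # Illustrative only; requires: pip install python-docx
    from docx import Document
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    def docx_to_chunks(path: str, chunk_size: int = 1000, overlap: int = 200):
        # Pull the visible paragraph text out of the Word document.
        text = "\n".join(p.text for p in Document(path).paragraphs)
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=overlap
        )
        # The resulting chunks can be embedded and stored in neo4j,
        # just as pdf_bot does for PDF pages.
        return splitter.split_text(text)

    chunks = docx_to_chunks("report.docx")
    print(f"{len(chunks)} chunks ready for embedding")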
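
For item 4, here is a rough sketch of what that streaming Python client could look like. The endpoint URL, port, parameter name and JSON shape are assumptions; match them to whatever api.py actually serves in your fork.

    # Rough sketch of a streaming Python replacement for pdf_bot_cli.sh.
    # The endpoint, port and response shape below are assumptions.
    import json
    import sys

    import requests

    API_URL = "http://localhost:8504/query-stream"  # assumed endpoint

    def ask(question: str) -> None:
        # stream=True prints tokens as they arrive instead of waiting
        # for the full response, the main win over the curl wrapper.
        with requests.get(API_URL, params={"text": question}, stream=True) as r:
            r.raise_for_status()
            for line in r.iter_lines(decode_unicode=True):
                if line:
                    chunk = json.loads(line)
                    sys.stdout.write(chunk.get("token", ""))
                    sys.stdout.flush()
        print()

    if __name__ == "__main__":
        ask(" ".join(sys.argv[1:]) or "What is this document about?")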

Where I finally landed

Overall, I spent approximately USD 450 on this project, which met my original goals: owning a hardware platform capable of generative AI and NLP application development using open source LLMs, to assist me in my upskilling efforts. I am also satisfied that I reused aging, aftermarket hardware for modern application development with surprising performance. This approach fits my personal ethos of environmental responsibility, achieved by putting "reduce, reuse and recycle" into practice.

If you have further questions on this project and what it can do, feel free to reach out via InMail on LinkedIn.
