In the ever-evolving landscape of Artificial Intelligence, Large Language Models (LLMs) stand as towering figures, offering insights and interactions that were once the stuff of science fiction. However, as with any tool, their effectiveness hinges on their performance, which can vary based on numerous factors including hardware and configuration settings. This blog post delves into an in-depth evaluation of several Chat LLMs, focusing on their accuracy, response time, and relevance, with a specific emphasis on the effects of hardware and chunk size.
The Experiment Setup
The evaluation was conducted using Python and the LlamaIndex framework, adapting code from this source. The setup involved models such as OpenAI's various GPT versions (check out my previous blog post about Understanding OpenAI's Language Models) and Llama2, tested on a high-end machine running Windows 11 Pro.
- Processor: AMD Ryzen 9 7950X 16-Core Processor 4.50 GHz
- RAM: 64.0 GB at 4800 MHz
- Graphics Card: NVIDIA GeForce RTX 4090 24GB
- Storage: Samsung SSD 980 PRO 1TB
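The core of the evaluation loop can be sketched in plain Python: time each query and average the results for a given model and chunk size. The `fake_query` function below is a stand-in for the real LlamaIndex query-engine call, which is not shown here; only the timing logic is illustrated.

```python
import time
from statistics import mean

def benchmark(query_fn, questions):
    """Return the average response time (in seconds) of query_fn over questions."""
    timings = []
    for q in questions:
        start = time.perf_counter()
        query_fn(q)  # in the real setup: query_engine.query(q)
        timings.append(time.perf_counter() - start)
    return mean(timings)

# Stand-in for a query engine built with a given model and chunk size.
def fake_query(question):
    time.sleep(0.01)  # simulate model latency
    return f"answer to: {question}"

avg = benchmark(fake_query, ["What is an LLM?", "What is chunk size?"])
print(f"average response time: {avg:.3f} s")
```

In the actual experiment, this loop would be repeated per model and per chunk size, and the averages compared.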
Impact of Hardware on Performance
One of the pivotal findings was how significantly hardware can influence the performance of LLMs. When Llama-2-7b was initially run on the CPU, it exhibited, as expected, exceedingly long response times. Switching to GPU acceleration more than halved these times, underscoring the importance of appropriate hardware for optimal performance.
Comparative Analysis of Models
The tests revealed interesting insights into the response times and accuracy of different models. GPT-3.5 Turbo emerged as the fastest, clocking in at an average response time of 1.03 seconds with 128 token chunks. On the other hand, Llama-2-13b was the slowest (excluding the CPU tests of Llama2), with a response time of 29.22 seconds for 2048 token chunks.
Response time, however, is not the only metric that matters in this experiment.
Optimal Model Selection
The results were collected based on the following metrics:
- Response Time: the time between the user sending the question to the LLM and the user receiving the answer from the LLM.
- Faithfulness Evaluator: this tool is essential for assessing whether a response is fabricated. It evaluates so-called "AI hallucinations", i.e. the degree to which a response from a query engine aligns with the source nodes.
- Relevancy Evaluator: this tool is key for determining whether a response addresses the query. It assesses the extent to which both the response and the source nodes correspond to the initial query.
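The LlamaIndex evaluators use an LLM as a judge, so they are not reproduced here. As a rough, toy illustration of what "alignment with the source nodes" means, a simple word-overlap proxy can be sketched; this is NOT the actual evaluator, just an intuition aid.

```python
# Toy illustration only: the real LlamaIndex evaluators use an LLM as judge.
# This word-overlap proxy just conveys the idea of "alignment with the sources".
def overlap_score(response: str, reference: str) -> float:
    resp_words = set(response.lower().split())
    ref_words = set(reference.lower().split())
    if not resp_words:
        return 0.0
    # Fraction of response words that also appear in the source text.
    return len(resp_words & ref_words) / len(resp_words)

source = "the eiffel tower is located in paris france"
faithful = overlap_score("the eiffel tower is in paris", source)
hallucinated = overlap_score("the eiffel tower is in rome italy", source)
print(faithful, hallucinated)
```

A grounded answer scores higher than one that introduces words absent from the source, which is the same principle the LLM-based evaluators apply with far more nuance.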
Based on these tests, the “best” LLMs in terms of a balance between Speed, Faithfulness, and Relevance were:
- GPT-4: With an average response time of 2.72 seconds for 1024 token chunks, faithfulness score of 0.9, and relevance score of 0.8.
- Llama-2-13b: Achieving a response time of 4.6 seconds for 128 token chunks, faithfulness score of 0.9, and relevance score of 0.7.
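The notion of a "balance" between speed, faithfulness, and relevance can be made explicit with a weighted score. The weights below are illustrative assumptions, not values used in the original evaluation; speed is rewarded by taking the inverse of the response time.

```python
# Illustrative ranking; the weights are assumptions, not from the original study.
results = {
    "GPT-4":       {"response_time": 2.72, "faithfulness": 0.9, "relevance": 0.8},
    "Llama-2-13b": {"response_time": 4.6,  "faithfulness": 0.9, "relevance": 0.7},
}

def balance_score(m, w_speed=0.2, w_faith=0.4, w_rel=0.4):
    speed = 1.0 / m["response_time"]  # faster response -> higher score
    return w_speed * speed + w_faith * m["faithfulness"] + w_rel * m["relevance"]

ranked = sorted(results, key=lambda name: balance_score(results[name]), reverse=True)
print(ranked)
```

Changing the weights shifts the ranking, which is precisely why the "best" model depends on whether speed or answer quality matters more for a given use case.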
Table of Results
Here is a clear comparison of the performance metrics across different models and configurations.
This evaluation underscores a crucial fact in the realm of AI and machine learning: performance depends heavily on the hardware used and how it is configured. Any machine intended to host local language models should therefore be benchmarked in the same way (and ideally with more tests) to find the best setup.
We should point out that the LLMs tested in this study are among the most widely used, but they represent only a small fraction of the available models. Each has its own strengths and ideal use cases, making the world of LLMs a broad field that keeps growing over time.
This blog post offers a brief look at the complex world of Large Language Models and how their performance differs.
As AI advances, evaluations like these become increasingly important. They show how capable today's models are and help guide future development in this fascinating field.
Learn more about LLM Hallucinations on YouTube: