How to evaluate an LLM (A step-by-step guide)



May 31, 2024

When evaluating a language model, consider your specific use case and tailor the evaluation to it. Each language model is trained on particular datasets to carry out particular tasks, so models differ in what they do well. A use case is a list of actions that outlines the interactions between a user and a system to achieve an objective. Testing a model against your own use cases is therefore essential to finding out which language model best meets your needs. This article walks you through a simple step-by-step guide to deciding which LLM to use.

1. Define Your Use Case and Evaluation Criteria

Three common use cases in language model evaluation are:
- Text generation
- Question answering
- Summarisation

Throughout this blog, we will focus on one specific use case: question answering.

Start by clearly defining the specific use case or task you want to evaluate the LLM for. Ask yourself what objective you are trying to achieve, then turn it into an actionable workflow.

One part of that workflow is knowing how to write your own prompts. A good prompt has a clearly defined structure and instruction so that the language model can produce the intended output. This matters because language models are sensitive to word choice as well as word order.

Some guidelines for creating your prompts:

  1. State the target audience the output is written for.
  2. Use leading phrases such as ‘Think step by step’.
  3. Use delimiters to mark the boundaries between options.
  4. Order and format your prompt to match the output you want.

Following these guidelines, the prompt structure for question answering is ‘#Instruction’, followed by ‘#Question’ or ‘Example’, and finally ‘#Choices’. The principle is to give the context first, followed by the question itself.
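The structure described here can be sketched in a few lines of Python. This is a minimal illustration, not the exact template used in our tests: the function name `build_qa_prompt`, the lettered option format, and the placeholder option texts are all assumptions made for the example.

```python
from typing import Sequence


def build_qa_prompt(instruction: str, question: str, choices: Sequence[str]) -> str:
    """Assemble a multiple-choice prompt: instruction first, then question, then options.

    Hypothetical helper for illustration; the '#' section markers act as delimiters.
    """
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    if len(choices) > len(letters):
        raise ValueError("too many choices to label with single letters")

    lines = [
        "#Instruction",
        instruction.strip(),
        "",
        "#Question",
        question.strip(),
        "",
        "#Choices",
    ]
    # Label each option with a letter so the answer can be a single character.
    for letter, choice in zip(letters, choices):
        lines.append(f"{letter}. {choice.strip()}")
    return "\n".join(lines)


prompt = build_qa_prompt(
    instruction="Think step by step, then answer with a single letter.",
    question="What is a judge ad hoc?",
    # Placeholder option texts, not the wording of any real benchmark item.
    choices=["<option 1>", "<option 2>", "<option 3>", "<option 4>"],
)
print(prompt)
```

Keeping the sections in a fixed order makes it easier to compare models fairly, since every model sees exactly the same layout.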

This list of prompt guidelines is not exhaustive, and you are free to adapt it creatively.

2. Choose Appropriate Evaluation Metrics

Based on your use case, identify your own criteria for evaluating and selecting the best language model.

Some possible criteria:

  • Output quality (fluency, coherence, relevance)
  • Factual accuracy
  • Task-specific performance (e.g. summarisation, question-answering)
  • Bias and toxicity
  • Diversity of outputs
  • Processing time
  • Number of words accepted as input and generated as output
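Two of these criteria, processing time and word counts, are simple to measure yourself. The sketch below assumes your model is wrapped in a function that takes a prompt string and returns the generated text; `measure_response` and `stub_generate` are hypothetical names, and the stub stands in for a real model call.

```python
import time
from typing import Callable, Dict


def measure_response(generate: Callable[[str], str], prompt: str) -> Dict[str, float]:
    """Time one generation call and count whitespace-separated words in and out."""
    start = time.perf_counter()
    output = generate(prompt)
    elapsed = time.perf_counter() - start
    return {
        "seconds": elapsed,
        "prompt_words": len(prompt.split()),
        "output_words": len(output.split()),
    }


def stub_generate(prompt: str) -> str:
    """Stand-in for a real model call (assumption: replace with your own)."""
    return "This is a stub answer."


stats = measure_response(stub_generate, "What is a judge ad hoc?")
print(stats)
```

Word counts are only a rough proxy, since models count input and output in tokens rather than words, but they are enough for a first comparison between models.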

For question answering, we want to consider factual accuracy and output quality, so choosing the right language model is important. For instance, we evaluated question answering using an industry-grade benchmark, the MMLU dataset, on the following model: Gemma 2B-IT (16-bit Low Resource).

In our example, the Gemma 2B-IT model accurately identifies the definition of ‘Judge Ad Hoc’ and provides a reasonable explanation. This makes it an apt evaluation for the use case of acquiring legal knowledge.
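To put a number on factual accuracy across many multiple-choice questions, you can score each answer against the expected letter. The following is a rough sketch under stated assumptions: each item is a dict with hypothetical `prompt` and `answer` keys, the model is told to begin its reply with a single letter, and `always_b` is a stub in place of a real model. It is not the scoring code behind the result above.

```python
import re
from typing import Callable, Dict, List, Optional


def extract_choice(response: str, valid: str = "ABCD") -> Optional[str]:
    """Return the choice letter at the start of the response, or None.

    Naive by design: a reply such as 'A judge ...' would be read as choice A,
    so the prompt should ask the model to answer with the letter first.
    """
    match = re.match(rf"\s*\(?([{valid}])\b", response)
    return match.group(1) if match else None


def accuracy(items: List[Dict[str, str]], ask_model: Callable[[str], str]) -> float:
    """Fraction of items where the extracted letter equals the expected answer."""
    if not items:
        raise ValueError("need at least one item to score")
    correct = sum(
        1 for item in items if extract_choice(ask_model(item["prompt"])) == item["answer"]
    )
    return correct / len(items)


def always_b(prompt: str) -> str:
    """Stub model that always picks B (assumption: replace with a real call)."""
    return "B. Because it seems most plausible."


# Made-up items for illustration only; these are not real benchmark questions.
items = [
    {"prompt": "question 1", "answer": "B"},
    {"prompt": "question 2", "answer": "A"},
    {"prompt": "question 3", "answer": "B"},
    {"prompt": "question 4", "answer": "D"},
]
score = accuracy(items, always_b)
print(score)  # 0.5
```

Accuracy captures factual correctness only. Judging output quality, such as whether the explanation is reasonable, still needs a human reader or a separate rubric.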

Note: Llama 3-8B language models require a large amount of Random Access Memory (RAM). If your personal computer has 8GB of RAM or less, we recommend choosing another language model.
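A back-of-the-envelope calculation shows why. The memory needed just to hold a model's weights is roughly the number of parameters multiplied by the bytes used per parameter. The sketch below uses nominal parameter counts (8 billion and 2 billion) as an assumption and ignores everything else a running model needs, so real usage will be higher.

```python
def weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


# Nominal sizes only (assumption); exact parameter counts differ slightly.
for name, params in [("8B-parameter model", 8e9), ("2B-parameter model", 2e9)]:
    print(f"{name} at 16-bit: about {weight_memory_gb(params, 16):.0f} GB for weights")
```

At 16 bits per parameter, an 8B-parameter model needs about 16 GB for its weights alone, which already exceeds 8GB of RAM, while a 2B-parameter model needs about 4 GB.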

3. Consult Experts

Consulting a domain expert is paramount in navigating today’s ever-changing landscape of technology development. Opilot is a local and secure copilot, and our team is able to provide expertise in the AI domain.

Try Opilot for free here, or if you’re interested in enterprise solutions, reach out to us for more information. Follow us on LinkedIn too!

P.S. Join our Discord to ask questions, or find out about updates as we roll them out.