LLMs are notoriously hard to evaluate. Outputs can be non-deterministic, and it’s hard to know what the “correct” answer is. Unless you are performing a simple classification task, you cannot simply string match the LLM’s output against the correct answer.

Many Ways To Evaluate LLMs

  • Human in the loop
    • ๐Ÿ‘/๐Ÿ‘Ž + Reasoning
    • Oxen.ai Developer Docs
  • Simple Scoring with LLM as a judge
    • Summarization
  • RAG
    • WikiContradict
  • Agents (tool calling)
    • Tool calling dataset

Human Evaluation

The most direct way to evaluate an LLM is to put a human in the loop: have reviewers give each output a 👍 or 👎 along with a short written reasoning. This gives you high-quality labels and concrete examples of failure modes, but it quickly becomes slow and expensive as the number of outputs grows.
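As a rough sketch, this feedback can be stored as one record per model output. The field names and file name below are illustrative assumptions, not a required schema:

```python
import json

# Hypothetical human feedback records: one per model output.
# The field names here are illustrative, not a fixed schema.
feedback = [
    {
        "prompt": "Summarize the release notes for v2.3.",
        "model_output": "Version 2.3 adds dark mode and fixes two crashes.",
        "label": "thumbs_up",      # 👍 / 👎
        "reasoning": "Accurate and covers the main changes.",
    },
    {
        "prompt": "Summarize the release notes for v2.3.",
        "model_output": "Version 2.3 removes dark mode.",
        "label": "thumbs_down",
        "reasoning": "Contradicts the release notes.",
    },
]

# Writing the labels as JSONL keeps them easy to version and diff
# alongside the rest of the dataset.
with open("human_feedback.jsonl", "w") as f:
    for row in feedback:
        f.write(json.dumps(row) + "\n")
```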

LLM As A Judge

Often you will want to automate your evaluation process to save the time and money of manually labeling data. One way to do this is to use an LLM as a judge: a second model scores the outputs of the model you are evaluating.

Note: It is always a good idea to have a human inspect the LLM’s outputs as well, but automated evals can save you a lot of time and help surface patterns in the data that are hard to detect otherwise.
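As a minimal sketch of LLM-as-a-judge scoring for the summarization case, the snippet below asks a judge model to grade a summary on a 1–5 scale. It assumes the OpenAI Python client with an `OPENAI_API_KEY` in the environment; the model name, prompt wording, and scoring scale are all assumptions you would tune for your own task:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a summary of a document.
Score it from 1 (inaccurate or missing key points) to 5 (faithful and complete).
Respond with only the integer score.

Document:
{document}

Summary:
{summary}"""

def judge_summary(document: str, summary: str, model: str = "gpt-4o-mini") -> int:
    """Ask an LLM judge to score a summary on a 1-5 scale."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(document=document, summary=summary),
        }],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

# Example usage
score = judge_summary(
    document="Oxen.ai is a platform for versioning and evaluating machine learning datasets.",
    summary="Oxen.ai versions ML datasets.",
)
print(score)
```

Keeping the temperature at 0 and asking for a bare integer makes the judge’s output easy to parse and compare across runs.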

Example: Evaluating an LLM’s ability to extract information from context

Retrieval Augmented Generation (RAG) is a commonly used technique for generating responses from a known set of documents. For example, you may have an internal wiki that you want an LLM to answer questions about.

The problem is that LLMs are also trained on the internet and may hallucinate if they are not given enough context, or the right context. An LLM may have memorized information from its training data and fall back on that memory rather than generalizing to the new documents you give it.

Dataset: WikiContradict

We’ll use the WikiContradict dataset as an example.
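WikiContradict pairs questions with Wikipedia passages that contain contradictory evidence, which makes it a good stress test of whether a model sticks to the provided context. The sketch below generates answers from that context; the Hugging Face repo id, split, and column names are placeholders to check against the actual dataset card:

```python
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The repo id, split, and column names below are placeholders;
# check the WikiContradict dataset card for the real values.
dataset = load_dataset("ibm/Wiki_Contradict", split="test")

ANSWER_PROMPT = """Answer the question using only the context below.
If the context is contradictory or insufficient, say so explicitly.

Context:
{context}

Question:
{question}"""

def answer_from_context(context: str, question: str, model: str = "gpt-4o-mini") -> str:
    """Generate an answer that is supposed to rely only on the given context."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": ANSWER_PROMPT.format(context=context, question=question),
        }],
        temperature=0,
    )
    return response.choices[0].message.content

# Spot check a handful of rows; the answers can then be scored with an
# LLM judge (as above) or reviewed by a human.
for row in dataset.select(range(5)):
    print(row["question"])                                        # assumed column name
    print(answer_from_context(row["context"], row["question"]))   # assumed column names
    print("---")
```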