Before we dive into qualitative AI evaluation for large language models, let’s define some key terms:
- Model: A program trained on data to recognize specific patterns or make decisions.
- LLM (large language model): A model trained on large datasets of text, code, calculations, instructions, rules, processes, and more, which generates human-like text based on the patterns learned during training.
- Fine-tuning: Further training a pre-trained model on a specific, smaller dataset to improve its performance for a particular task.
- Outputs: LLM-generated content.
- Temperature: A setting that controls how “creative” the model is with its outputs. The higher the temperature, the greater the variety, but also the greater the potential for illogical output.
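If you’re wondering where temperature actually lives, here’s a minimal sketch using the OpenAI Python SDK as one example; the model name and prompt are placeholder assumptions, and other providers expose an equivalent setting.

```python
# A minimal sketch of where temperature is set when requesting an output,
# using the OpenAI Python SDK as one example. Model name and prompt are
# placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical model choice
    messages=[{"role": "user", "content": "Write a one-sentence product update."}],
    temperature=0.2,  # lower values: more predictable; higher values: more varied
)
print(response.choices[0].message.content)
```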
What is AI evaluation?
Simply put, evaluations provide a deeper understanding of LLM performance. Content design is key in LLM evaluation – almost all activities are led by or involve content designers.
The evaluation process consists of determining inputs (information that the model considers), crafting an evaluation rubric (or, at the very least, heuristics and guidelines), and establishing ground truths (information considered correct, used to train, validate, and test models). From a content design lens, evaluation helps measure the quality of the outputs. To break it down even further:
- Evaluation: The process of assessing outputs against a rubric or general guidelines.
- Rubric: A set of criteria used to evaluate the quality of the outputs.
- Criteria: A question or statement within a rubric.
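To make the relationship between a rubric, its criteria, and an evaluation concrete, here’s a minimal sketch of how they might be represented for a manual review. The criteria and field names are illustrative assumptions, not a standard format.

```python
# A hypothetical representation of an evaluation rubric: each criterion is a
# single question a human reviewer answers for every output they evaluate.
rubric = [
    {"id": "accuracy", "question": "Is the information in the output factually correct?"},
    {"id": "grammar", "question": "Is the output free of grammar and spelling errors?"},
    {"id": "voice", "question": "Does the output address the user in the second person?"},
]

# One reviewer's evaluation of a single output: criterion id -> yes/no answer.
evaluation = {"accuracy": "yes", "grammar": "yes", "voice": "no"}
```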
Why is AI evaluation important?
An overly simplified summary of content design’s role in a typical LLM development process looks something like this:
- Design guidelines and examples of the ideal output (North Stars).
- Collaborate with data science on the prompt.
- Define an evaluation rubric.
- Evaluate.
Without some form of evaluation, there’s a risk of shipping harmful or inaccurate content. Evaluation is a way to conduct due diligence by ensuring the LLM-generated outputs meet a minimum level of quality that you deem acceptable. It is a tool to help determine whether a model is ready to launch, which issues need to be addressed, and whether fine-tuning or further steering is required.
AI evaluation helps content designers determine whether the LLM is effectively executing the content strategy that’s been set. Lastly, working with LLMs involves collaboration between multiple disciplines – content design, product, data science, and engineering are in lockstep from prompting to fine-tuning. It’s through evaluation that we can see those efforts pay off.
Types of AI evaluations
There are different ways to evaluate, each with pros and cons.
- Manual evaluation: Humans manually review the outputs. This method ensures a high level of accuracy but is also the most time-consuming and least scalable.
- Automatic evaluation (also referred to as LLM-as-a-judge): A model evaluates another model’s outputs. This method is the fastest and provides the most coverage, but accuracy and reliability may not be guaranteed, especially with out-of-the-box solutions that have not been fine-tuned (see the sketch after this list).
- Hybrid evaluation: A mix of both, with humans reviewing the auto-evaluator’s answers and analyses for accuracy.
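To illustrate the automatic approach, here’s a rough sketch of LLM-as-a-judge, again using the OpenAI Python SDK as an example. The judge prompt, model name, and criterion are assumptions, and in practice the judge itself needs to be validated against human reviews.

```python
# A rough sketch of LLM-as-a-judge: one model grades another model's output
# against a single rubric criterion. Prompt wording, model name, and criterion
# are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def judge(output_text: str, criterion: str) -> str:
    """Ask a judge model to answer a yes/no rubric question about an output."""
    prompt = (
        "You are evaluating AI-generated content.\n"
        f"Criterion: {criterion}\n"
        f"Output to evaluate:\n{output_text}\n"
        "Answer with exactly 'yes' or 'no', then give a one-sentence rationale."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the judge as consistent as possible
    )
    return response.choices[0].message.content

print(judge("You're all set! Your changes have been saved.",
            "Does the output address the user in the second person?"))
```

In a hybrid setup, a human would spot-check a sample of these judge answers and rationales for accuracy.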
This article will focus on manual evaluation. Within manual AI content evaluation, there are also different approaches: a robust evaluation using a full rubric, or a smaller-scale “taste test” that quickly compares two outputs against each other. The latter can help compare different models, implementations, temperatures, and so on.
Depending on the situation, you might not need a full rubric. For example, when revising the prompt, it’s helpful to glance through and see if the prompt is generating acceptable outputs. This is where the North Star examples and/or guidelines come in handy – use them to determine if you’re on the right track.
It’s worth noting that the evaluations discussed in this article differ from traditional quantitative LLM metrics like perplexity (which measures a model’s predictive ability) and the F1 score (the harmonic mean of precision and recall). These quantitative metrics are typically used to track performance on specific tasks and guide model development.
In contrast, this article discusses evaluation through a UX content design lens, focusing on qualitative aspects such as readability, accessible language, and cultural appropriateness – and quantifying these human-centered quality metrics.
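For reference, those two quantitative metrics are commonly defined as follows (standard definitions, not specific to any tool):

```latex
\mathrm{F1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\qquad
\mathrm{Perplexity} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)
```

Here N is the number of tokens and p(x_i | x_{<i}) is the probability the model assigns to each token given the ones before it.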
Developing an evaluation rubric
It might seem counterintuitive, but you don’t necessarily need a criterion for every content design nuance. Think about specific things in your content strategy that you want to ensure the LLM is executing properly. Does the tone need to be in the second person? Is there a specific CTA (call-to-action) required?
These will be key things to evaluate. Collaborate with your product and data science partners when developing the rubric.
Just as AI evaluation helps determine whether the content strategy is followed, it’s equally important to ensure that key product goals are met. Working with data science is also invaluable, as they will know the capabilities of each model best. Depending on how complex the tasks are, the model may not be able to generate exactly what you’re hoping for. Sometimes, you need to meet the model where it’s currently at.
As content designers, it can be easy to go in with pitchforks (tone! Oxford comma! readability!) and identify everything wrong from a content standpoint when evaluating outputs. This is intuitive for content designers because we’re trained to focus on elements like style and sentence structure, but a slight mindset shift is required when evaluating. In general, there are two different types of criteria:
- Core: Criteria that are table stakes and applicable in every scenario (e.g., toxicity, grammar, accuracy, etc.)
- Feature-specific: Context-driven criteria specific to each project and typically revolve around the style of the content (e.g., voice, tone, readability, etc.)
Both types are essential, but in terms of priority, core criteria often take precedence over feature-specific ones. This also ties into how evaluation results are used to make decisions. For example, you might decide there is zero tolerance for specific core criteria, such as toxicity or discrimination, but a wider margin of error on the other criteria.
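As a sketch of how that prioritization might translate into a go/no-go decision, the snippet below blocks launch on any failure of a zero-tolerance criterion while allowing a margin of error elsewhere. All pass rates, labels, and thresholds here are made-up assumptions.

```python
# A hypothetical roll-up of manual evaluation results into a launch decision.
# Zero-tolerance criteria must pass 100% of the time; the rest only need to
# clear an agreed-upon threshold. All numbers below are made up.
results = [
    {"criterion": "toxicity", "zero_tolerance": True, "pass_rate": 1.00},
    {"criterion": "accuracy", "zero_tolerance": True, "pass_rate": 0.98},
    {"criterion": "tone", "zero_tolerance": False, "pass_rate": 0.91},
    {"criterion": "readability", "zero_tolerance": False, "pass_rate": 0.87},
]

THRESHOLD = 0.85  # assumed acceptable pass rate where a margin of error is allowed

blocking = [
    r["criterion"]
    for r in results
    if (r["zero_tolerance"] and r["pass_rate"] < 1.0)
    or (not r["zero_tolerance"] and r["pass_rate"] < THRESHOLD)
]

# Here "accuracy" blocks the launch because it misses the zero-tolerance bar.
print("Ready to launch" if not blocking else f"Blocked by: {', '.join(blocking)}")
```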
Lastly, because LLMs are non-deterministic (the same prompt can produce different results each time), you might not always see the same issues, especially after fine-tuning or other forms of steering. But this is also why existing criteria often remain fundamental.
Models may shift or implementations may change, so having foundational criteria in place helps avoid regression in key areas. Maintaining existing criteria also keeps reporting simple and allows apples-to-apples comparisons. It’s through AI content evaluation that content design improvements become quantifiable.
Tips for developing an evaluation rubric
- Design responsively: Sometimes, you might not know all the criteria you’ll need to include in your rubric immediately. Beyond the evaluation criteria that address key elements of your content strategy, other issues may arise. You may need to first look at a sample of outputs to identify problematic areas and add criteria to help “catch” those issues. As the implementation evolves, ongoing qualitative AI evaluation can help you identify nuanced language issues to address.
- Keep it simple: To alleviate the cognitive load on human evaluators, ensure the rubric is easy to understand. Use simple language and aim for clarity over conciseness. This is a prime example of how classic content design comes into play.
- Break it down: Try to keep each criterion to one concept. “Is this content appropriate?” is subjective and could refer to many different things. It can be broken down into more objective criteria, such as “Does the output use the first-person singular voice?”
- Think about who will be evaluating: Not all reviewers are the same. Content designers will likely need less guidance in evaluating tone than a product manager or a data scientist. To bridge this gap, include context and examples that clarify what the criterion is addressing. For example, “Does the text use softer, more cautious language like ‘could’, ‘might’, ‘may’, etc.?”
- Align on acceptable answers: Each person has a different mental model and a unique history that shape their perspective. This means two people may answer the same question differently, so it’s important to align on acceptable answers to avoid inconsistent interpretations. One way to gain alignment is to conduct calibration: have two reviewers evaluate a small number (5 or 10) of the same outputs, discuss differing answers, and reach a consensus. Calibration helps minimize misalignment, reduce individual biases, and increase consistency across evaluations.
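One simple way to quantify that calibration is to compute raw agreement between the two reviewers’ answers, plus Cohen’s kappa, which corrects for the agreement you’d expect by chance. The sketch below uses made-up yes/no answers for ten shared outputs.

```python
# A sketch of measuring reviewer agreement during calibration.
# The two answer lists are made-up yes/no answers for the same 10 outputs.
reviewer_a = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
reviewer_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes"]

n = len(reviewer_a)
agreement = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / n

# Cohen's kappa corrects raw agreement for agreement expected by chance.
p_a_yes = reviewer_a.count("yes") / n
p_b_yes = reviewer_b.count("yes") / n
p_chance = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)
kappa = (agreement - p_chance) / (1 - p_chance)

print(f"Raw agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```

Low agreement is usually a sign that the criterion needs clearer wording or examples, which feeds back into the tips above.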
There are many different approaches to evaluating LLM-generated content, and it comes down to what works for your needs and resources. At the end of the day, evaluation in any form and at any scale will always be beneficial, as it is a tool that helps mitigate risk and measure quality.
Alice Chen is a UX Content Designer at Indeed.