Recently, I have been talking to a lot of LLM developers to understand the issues they face while building production-grade LLM applications. One theme kept coming up in those interviews: most of them are not sure what to evaluate besides the extent of hallucinations.
To make that easy for you, here's a compiled list of the most important evaluation metrics you need to consider before launching your LLM application to production. I have also added notebooks for you to try them out:
Response Quality:
Metrics | Usage |
---|---|
Response Completeness | Evaluate if the response completely resolves the given user query. |
Response Relevance | Evaluate whether the generated response is relevant to the given question. |
Response Conciseness | Evaluate how concise the generated response is, i.e. how much additional irrelevant information it contains. |
Response Matching | Compare the LLM-generated text with the gold (ideal) response using the defined score metric. |
Response Consistency | Evaluate how consistent the response is with the question asked as well as with the context provided. |
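
To make Response Matching a bit more concrete, here is a minimal sketch of one possible score metric: token-level F1 between the generated response and the gold response. This is only an illustration, not how UpTrain implements it; the function names `tokenize` and `token_f1` are made up for this example, and the simple word-level tokenization is an assumption.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def token_f1(response: str, gold: str) -> float:
    """Token-overlap F1 between a generated response and a gold response.

    Hypothetical helper for illustration only, not a library API.
    """
    response_tokens = tokenize(response)
    gold_tokens = tokenize(gold)

    # If either side is empty, score 1.0 only when both are empty.
    if not response_tokens or not gold_tokens:
        return float(response_tokens == gold_tokens)

    # Count tokens shared by both texts, respecting multiplicity.
    overlap = sum((Counter(response_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(response_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("Paris is the capital of France", "The capital of France is Paris"))
```

A surface-overlap score like this rewards shared wording rather than shared meaning, so it is probably best treated as a cheap baseline next to embedding-based or LLM-graded matching.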
Quality of Retrieved Context and Response Groundedness:
Metrics | Usage |
---|---|
Factual Accuracy | Evaluate if the facts present in the response can be verified by the retrieved context. |
Response Completeness wrt Context | Evaluate how completely the response answers the given question, based on the information present in the context. |
Context Relevance | Evaluate if the retrieved context contains sufficient information to answer the given question. |
Prompt Security:
Metrics | Usage |
---|---|
Prompt Injection | Identify prompt leakage attacks |
Language Quality of Response:
Metrics | Usage |
---|---|
Tone Critique | Assess if the tone of machine-generated responses matches the desired persona. |
Language Critique | Evaluate LLM-generated responses on multiple aspects: fluency, politeness, grammar, and coherence. |
Conversation Quality:
Metrics | Usage |
---|---|
Conversation Satisfaction | Measures the user’s satisfaction with the conversation with the AI assistant based on completeness and user acceptance. |
Some other Custom Evaluations:
Metrics | Usage |
---|---|
Guideline Adherence | Grade how well the LLM adheres to a given custom guideline. |
Custom Prompt Evaluation | Evaluate by defining your custom grading prompt. |
Cosine Similarity | Calculate cosine similarity between embeddings of two texts. |
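
For reference, the Cosine Similarity metric boils down to a short formula: the dot product of the two embedding vectors divided by the product of their lengths. Below is a minimal pure-Python sketch of that computation. It assumes you already have the embeddings as plain lists of floats (how you produce them is up to you), and `cosine_similarity` is a name picked for this example, not UpTrain's API.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same dimension")

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))

    # Assumption: treat a zero vector as having no similarity to anything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0

    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings", invented for the example.
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The score ranges from -1 (opposite directions) to 1 (same direction), and it ignores vector length, which is why it is a common choice for comparing text embeddings.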
BTW all these metrics are maintained by UpTrain, by far the best open-source tool that I have used for LLM evaluations.