Analyzing RAG Evaluation Results

This page explains how to analyze the results presented in the RAG Evaluation Dashboard. The dashboard provides insights into the performance of different vectorization plans by measuring metrics such as NDCG and Relevancy scores. Below is a breakdown of the features you can explore in the dashboard to better understand the output of RAG (Retrieval-Augmented Generation) evaluations.

RAG Evaluation Dashboard Overview

The RAG Evaluation Dashboard gives a comprehensive view of the performance of different vectorization plans. Each vectorization plan is evaluated against a set of questions to determine how well the vectorized data can retrieve relevant documents. The image below shows a sample dashboard.

Key Features

  1. Vectorization Plans:

    • Each column on the dashboard represents a different Vectorization Plan (VP), each applying its own chunk size, dimensions, and other configuration settings to the vectorization process.

    • The header of each plan provides details such as:

      • Chunk Size: Determines the granularity of the document partitioning during vectorization.

      • Chunk Overlap: Defines the overlap between document chunks, impacting retrieval effectiveness.

      • Dimensions: The number of dimensions used in the vector space.

      • Top K: The number of top results retrieved during the evaluation.

      • NDCG: Normalized Discounted Cumulative Gain, which measures the quality of ranked results.

      • Avg. Relevancy: The average relevancy score of the retrieved documents for the given set of questions.

  2. Scoring Results:

    • NDCG (Normalized Discounted Cumulative Gain): This is a ranking quality metric that measures the effectiveness of the retrieval process based on the relevance of documents at various positions in the ranked list.

      • A higher NDCG indicates better ranking quality, meaning relevant documents appear nearer the top of the ranked list.

    • Relevancy Score: This score represents how relevant the retrieved documents are to the input query.

      • The score is averaged over all queries and measures how closely the retrieved results match the expected output (a computation sketch for both metrics follows this list).

    • The best-scoring vectorization plan is marked with a green ring and a trophy icon in the top right corner.

  3. Question Scoring:

    • The dashboard shows how many questions out of a total (e.g., 103/103 Questions) were evaluated using each vectorization plan.

    • The results for each vectorization plan are displayed, showing the percentage of questions for which the vectorized model provided relevant results.

  4. Comparative Graphs:

    • For each vectorization plan, comparative bar graphs showcase performance across evaluation metrics such as NDCG and Relevancy.

    • These graphs make it easy to visualize the differences between vectorization plans.

  5. Detailed Question Categories:

    • Beneath the NDCG and Relevancy charts, you can see the synthetic questions that were used to assess the performance of each vector index.

    • Each question category includes a set of questions used to evaluate the vectorization plans, helping users understand how the system performs for different knowledge areas.
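
To make the two headline metrics concrete, the sketch below shows one common way to compute NDCG@k and an average relevancy score for a single vectorization plan. The relevance grades, question count, and Top K value are hypothetical illustrations; the dashboard's own scoring implementation may differ in its grading scale and weighting.

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain for a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances, k):
    """NDCG@k: DCG of the actual ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical per-question results for a plan evaluated with Top K = 5:
# each inner list holds the relevance grades of the retrieved chunks, in rank order.
per_question_relevances = [
    [3, 2, 0, 1, 0],  # question 1
    [2, 3, 1, 0, 0],  # question 2
    [0, 1, 0, 0, 2],  # question 3
]

avg_ndcg = sum(ndcg(r, k=5) for r in per_question_relevances) / len(per_question_relevances)

# Avg. Relevancy: mean relevancy of the retrieved chunks, averaged over all questions.
avg_relevancy = sum(sum(r) / len(r) for r in per_question_relevances) / len(per_question_relevances)

print(f"NDCG: {avg_ndcg:.3f}  Avg. Relevancy: {avg_relevancy:.3f}")
```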

How to Analyze the Results

  • Identify the Best Performing Plan: By comparing the NDCG and Relevancy scores across different vectorization plans, you can determine which configuration performs best for your dataset. For example, a plan with a higher NDCG score will generally return more relevant ranked results (see the comparison sketch after this list).

  • Explore Question Categories: The dashboard allows you to dive into specific categories and see how well each vectorization plan handled particular types of questions, aiding in more targeted performance improvements.
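
As a purely illustrative example of the comparison the dashboard performs when it highlights a winner, the snippet below picks a best plan from hypothetical scores by ranking on NDCG and breaking ties with average relevancy. The plan names, parameters, and score values are invented for the example and are not produced by the dashboard.

```python
# Hypothetical plan results; an evaluation run would supply the real values.
plans = [
    {"name": "VP-1", "chunk_size": 256, "chunk_overlap": 32, "dimensions": 768, "top_k": 5,
     "ndcg": 0.71, "avg_relevancy": 0.64},
    {"name": "VP-2", "chunk_size": 512, "chunk_overlap": 64, "dimensions": 1536, "top_k": 5,
     "ndcg": 0.83, "avg_relevancy": 0.77},
    {"name": "VP-3", "chunk_size": 1024, "chunk_overlap": 128, "dimensions": 1536, "top_k": 5,
     "ndcg": 0.79, "avg_relevancy": 0.81},
]

# Rank primarily by NDCG, breaking ties with average relevancy.
best = max(plans, key=lambda p: (p["ndcg"], p["avg_relevancy"]))
print(f"Best plan: {best['name']} "
      f"(chunk_size={best['chunk_size']}, NDCG={best['ndcg']}, relevancy={best['avg_relevancy']})")
```

Whether NDCG or relevancy should dominate the ranking depends on your use case; weighting the two metrics differently is an equally reasonable choice.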

By understanding and leveraging these features, you can refine the RAG process to achieve optimal performance for your use case.
