Experiment results

Once an experiment has been completed, the results provide a comparison of each strategy's performance. Refer to the other topics in the Experiments documentation to learn how each choice influences the results.

The experiment results page is divided into two areas: the top section shows how accurately each strategy answered the given queries, while the bottom section lists the questions that made up those queries.

The “Create Public Link” toggle lets you share the experiment results publicly. Toggle the option ‘on’ to generate a shareable link that others can use to analyze the results, or toggle it ‘off’ to keep the results private.

Click the “Open in RAG Sandbox” button to open a sandbox pre-loaded with your experiment strategies and data. Learn more about RAG Sandbox.

Analyzing vectorization strategy results

Each vectorization strategy is a combination of an embedding model and a chunking configuration. Read more about embedding models and chunking in the Vectorization Strategies area. Below is an example result.

The embedding model used was OpenAI v2 Small, which has 1536 dimensions.

Chunking was done by paragraph, with a size of 500 characters and a 50-character overlap.

Given the questions, the queries returned results with an average relevancy of 0.1824.

The results had an NDCG score of 0.7485.

The Vectorize Platform formed 20 questions to query the data and all 20 questions (100%) ran successfully.

About average relevancy

When searching vectors for semantically similar values, the results are vectors that are mathematically close to the query. How closely a vector matches the query is described by its relevance score. A search typically returns many results, so the overall relevancy is the average of the individual scores.

If you had stored the vectors 1, 2, 3, 4, and 5 and searched for 2.5, the most relevant results would be 2 and 3. The other numbers are less similar to the query, so their scores would be lower.
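To make the idea concrete, here is a minimal sketch of that one-dimensional example. It treats relevance as inverse distance purely for illustration; real vector search scores embeddings with measures such as cosine similarity, and this is not the platform's actual scoring method.

```python
# Illustrative only: 1-D "relevance" as inverse distance from the query.
# (Real vector search uses similarity measures over high-dimensional embeddings.)
def relevance(stored: float, query: float) -> float:
    return 1.0 / (1.0 + abs(stored - query))

stored_vectors = [1, 2, 3, 4, 5]
query = 2.5

# Score every stored value against the query, most relevant first.
scored = sorted(
    ((v, relevance(v, query)) for v in stored_vectors),
    key=lambda pair: pair[1],
    reverse=True,
)

# 2 and 3 tie as the closest matches; 1, 4, and 5 score lower.
top_results = [v for v, _ in scored[:2]]

# The overall relevancy is the average of the individual scores.
average_relevancy = sum(score for _, score in scored) / len(scored)
```

Here `top_results` contains 2 and 3, and `average_relevancy` is pulled down by the less relevant values, mirroring how one weak result lowers the averaged score on the results page.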

About normalized discounted cumulative gain (NDCG)

The goal of NDCG is to rank how useful a search result ordering is. The assumption is that higher-ranked items should be given more credit than lower-ranked items.

NDCG compares the actual ordering of the search results against the ideal ordering by relevance. If all the results are highly relevant to the query, the NDCG score will be high. If the search could not find data that was very similar to the query, each result will have a low relevance, and the overall search will have a low NDCG score. The score ranges from 0 to 1, where 1 represents a 100% NDCG ranking.
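A minimal sketch of the standard NDCG calculation shows how the ordering, not just the scores themselves, drives the result (production systems typically use a library implementation such as scikit-learn's `ndcg_score` rather than hand-rolled code):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each relevance score is discounted
    # logarithmically by its position in the result list.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize actual DCG by the DCG of the ideal (descending) ordering,
    # yielding a score between 0 and 1.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# The same three relevance scores in the best possible order score 1.0...
best = ndcg([0.9, 0.7, 0.4])
# ...while the worst order scores lower, because the most relevant
# result was discounted by appearing last.
worst = ndcg([0.4, 0.7, 0.9])
```

Note that NDCG only measures ordering quality: a result set where every item is weakly relevant can still earn a high NDCG if the items are ranked sensibly, which is exactly the situation discussed below.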

Comparing average relevance and NDCG

To find the most performant experiment strategy for your data, you’ll want to balance both relevance and usefulness. Referring to the example above, there was an 18% average relevancy but a 75% NDCG rank. That may seem contradictory: a relevance of 18% doesn’t seem very useful, yet the ordering of the results still earned a fairly good usefulness ranking.

There are many factors that can influence these scores, such as:

  • How the original data was formatted (HTML, plain text, XML, etc.)

  • How much data was provided (2 KB, 5 MB, etc.)

  • How the data was structured (headlines, sub-headlines, a blob of text, etc.)

  • What chunking configuration was used (fixed length, recursive)

  • The detail of the query relative to the data it's searching

  • Possible nuances within the given embedding model

While the balance of relevance and usefulness will be specific to each experiment strategy, start with the goal of finding a strategy where both the relevancy and NDCG are above 85%.
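That selection rule can be sketched as a simple filter. The strategy names and scores below are hypothetical, invented purely to illustrate the "both above 85%" starting goal:

```python
# Hypothetical experiment results: (strategy name, avg relevancy, NDCG).
# These values are made up for illustration.
results = [
    ("openai-v2-small / paragraph-500", 0.18, 0.75),
    ("openai-v2-small / fixed-1000",    0.88, 0.91),
    ("openai-v2-large / recursive-500", 0.86, 0.87),
]

THRESHOLD = 0.85

# Keep only strategies where both metrics clear the 85% bar.
candidates = [
    name
    for name, relevancy, ndcg in results
    if relevancy > THRESHOLD and ndcg > THRESHOLD
]
```

With these sample numbers, the first strategy is rejected despite its decent NDCG, because its low average relevancy means the well-ordered results still aren't very similar to the queries.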
