Vectorization strategies

A vectorization strategy is the combination of an embedding model and a chunking configuration. The right strategy depends on the type of data your experiment uses and how you implement the RAG pattern. Typically, the right strategy is the one that consistently returns results with high relevance scores.
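
As a rough mental model, a strategy pairs a model choice with chunking settings. The sketch below is illustrative only; the class and field names are hypothetical, not a Vectorize API:

```python
from dataclasses import dataclass

@dataclass
class VectorizationStrategy:
    """Hypothetical container pairing an embedding model with chunking settings."""
    embedding_model: str   # e.g. "openai-v3-small"
    chunk_size: int        # characters (or tokens) per chunk
    chunk_overlap: int     # characters shared between adjacent chunks

# Two strategies applied to the same data can then be compared by the
# relevance scores their search results achieve.
strategy_a = VectorizationStrategy("openai-v3-small", chunk_size=500, chunk_overlap=50)
strategy_b = VectorizationStrategy("voyage-ai-2", chunk_size=1000, chunk_overlap=100)
```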

Choosing the right strategy

Vectorization models take in text and convert it to a vector (an ordered list of numbers) called an embedding. There are many models in the ecosystem, each performing the conversion in a slightly different way. The Vectorize Team has spent time distilling many of those models down to a “Simple”, “Moderate”, and “Advanced” list. You choose among those strategies based on how nuanced your data is and what each model is best suited for.
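
For example, generating a single embedding with the OpenAI Python client looks roughly like this. This is a minimal sketch, not the platform's internal code; it assumes the `openai` package (v1+) and uses the model behind the “OpenAI v3 Small” entry listed below:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",  # 1536 dimensions
    input="Little Red Riding Hood set off through the forest.",
)
embedding = response.data[0].embedding  # a list of 1536 floats
print(len(embedding))
```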

Chunking (or splitting) is the act of breaking up large text into smaller segments. An embedding is generated for each chunk and saved to a vector database. Breaking up data into smaller segments helps searches find more relevant results. Part of the RAG pattern is finding data that is semantically similar to a given query; the relevance score is a number that represents that similarity.
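
A minimal fixed-size chunker with overlap might look like the following sketch. It is illustrative only, not the platform's implementation:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks, with each chunk sharing
    `overlap` characters with the previous one to preserve context."""
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

story = "Once upon a time, ..."  # placeholder for a full fairy tale text
chunks = chunk_text(story)
# Each chunk would then be embedded and stored in the vector database.
```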

Experimenting with different strategies on the same data lets you find the chunking and embedding-model combination that best fits your needs. To understand these choices, consider the following example.

Example vectorization strategy

Say you wanted to use the RAG pattern to search the text of every children's fairy tale ever written. The goal would be to query the text with questions like “What color was Mary’s fleece?” and “What do Little Red Riding Hood and the Three Little Pigs have in common?”

The first question is specific to one fairy tale. The second is more general and expects to find a commonality across multiple tales. If we create a single embedding per fairy tale, the search will not be very good at finding small nuances: it won’t be able to return a result for the second question with a high relevance score.

If we instead chunk all the fairy tales into small sections, the search can find individual sections that are similar to our question, providing results that are more semantically similar and carry higher relevance scores.

The size of the sections and how much they overlap will determine how relevant the search results are.
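
To make “relevance score” concrete: vector search typically ranks results by cosine similarity between the query embedding and each stored embedding. The sketch below uses tiny invented vectors purely for illustration; real models produce hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Relevance score: closer to 1.0 means more semantically similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings with invented values.
query = [0.9, 0.1, 0.0]            # "What color was Mary's fleece?"
whole_tale = [0.4, 0.5, 0.6]       # one embedding for the entire story
fleece_chunk = [0.85, 0.15, 0.05]  # just the chunk mentioning the fleece

print(cosine_similarity(query, whole_tale))    # ~0.52, lower relevance
print(cosine_similarity(query, fleece_chunk))  # ~1.00, higher relevance
```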

Supported embedding models

The Vectorize platform includes the following embedding models in vectorization strategies.

  • Voyage AI 2 (Dimensions: 1024)

  • OpenAI Ada v2 (Dimensions: 1536)

  • OpenAI v3 Large (Dimensions: 3072)

  • OpenAI v3 Small (Dimensions: 1536)

  • Mistral M. E5 Small (Dimensions: 384)

Chunking types

  • Paragraph - Segments text by its natural paragraph breaks, maintaining the logical grouping and contextual integrity of the content.

  • Fixed - Divides text into segments of a predetermined size, ideal for uniform analysis or processing.

  • Sentence - Splits text into individual sentences, facilitating detailed analysis and processing at the sentence level.
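
A rough sketch of how these three types differ is shown below. The splitting rules are simplified for illustration; the platform's actual splitters may be more sophisticated:

```python
import re

text = ("Mary had a little lamb. Its fleece was white as snow.\n\n"
        "Everywhere that Mary went, the lamb was sure to go.")

# Paragraph: split on blank lines, keeping natural groupings intact.
paragraphs = [p for p in text.split("\n\n") if p.strip()]

# Fixed: uniform slices of a predetermined size (here, 40 characters).
fixed = [text[i : i + 40] for i in range(0, len(text), 40)]

# Sentence: split on terminal punctuation followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)
```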
