Uploading documents

After you have named your new experiment, the next step is to add data that will be chunked and vectorized. You provide data to the experiment by uploading certain types of files. Currently, the supported file types are:

  • PDF

    The file must contain indexable text. PDFs with forms, inputs, or other form objects will be ignored.

  • Markdown

    Any .md file is acceptable. The markdown is not rendered; it is indexed as plain text, as is.

  • Plain text

    The simplest of all file formats. The file doesn’t need a .txt extension; it only needs to contain UTF-8 encoded text.

  • HTML

    The HTML tags are stripped from the text.

  • Doc/Docx

    .doc is the extension of the legacy Microsoft Word document format. It was superseded by .docx years ago, but Word can still save to it. Its text content will be indexed as is.

    .docx is a Microsoft Word Open XML document. It can contain plain text (formatted as XML) as well as binary data. The binary data is of little use once vectorized.
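
Since a plain text file only needs to contain UTF-8 text, it can help to check the encoding before uploading. The following is a minimal sketch, not part of the product: the function names `is_utf8_text` and `file_is_utf8` are hypothetical, and the check only confirms that the bytes decode as UTF-8.

```python
from pathlib import Path


def is_utf8_text(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8(path: str) -> bool:
    """Read a file from disk and check that its contents are valid UTF-8."""
    return is_utf8_text(Path(path).read_bytes())
```

A file saved in another encoding, such as Latin-1, will usually fail this check as soon as it contains a non-ASCII character.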

Limits

  • You are limited to 5 files in an experiment.

  • PDF, DOC, and HTML files are limited to 5 MB.

  • MD and TXT files are limited to 500 KB.
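
The limits above can be checked locally before you upload. The sketch below is illustrative only and makes several assumptions that the limits do not state: that 1 MB means 1024 × 1024 bytes, that .docx and .htm files share the DOC and HTML limit, and that a file with any other extension is treated as plain text. The name `check_upload` is hypothetical.

```python
from pathlib import PurePath

MAX_FILES = 5

# Assumption: binary units (1 KB = 1024 bytes, 1 MB = 1024 KB).
LARGE_LIMIT = 5 * 1024 * 1024  # PDF, DOC, and HTML
SMALL_LIMIT = 500 * 1024       # MD and TXT

# Assumption: .docx and .htm fall under the DOC and HTML limits.
SIZE_LIMITS = {
    ".pdf": LARGE_LIMIT,
    ".doc": LARGE_LIMIT,
    ".docx": LARGE_LIMIT,
    ".html": LARGE_LIMIT,
    ".htm": LARGE_LIMIT,
    ".md": SMALL_LIMIT,
    ".txt": SMALL_LIMIT,
}


def check_upload(files: dict[str, int]) -> list[str]:
    """Return a list of problems for a mapping of file name -> size in bytes."""
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"{len(files)} files selected; the limit is {MAX_FILES}")
    for name, size in files.items():
        ext = PurePath(name).suffix.lower()
        # Assumption: unknown extensions are treated as plain text.
        limit = SIZE_LIMITS.get(ext, SMALL_LIMIT)
        if size > limit:
            problems.append(f"{name}: {size} bytes exceeds the {limit}-byte limit")
    return problems
```

An empty list means the selection fits within the limits as interpreted here; the service itself remains the authority on what it accepts.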

Choosing the right data format

The format of your data can significantly influence its searchability. If you upload product documentation as markdown files, the indexed text will include the markup for headings, tables, bold text, and so on. If you then search the documentation with unformatted plain text, your results will most likely be less accurate. If you formatted your searches as markdown to compensate, you would be defeating the purpose of vectorization.

Ideally, the data that you upload will be in its final format, preferably UTF-8 text with little to no formatting. If you have HTML or markdown, run it through a rendering engine first. If you have a .docx file, try “save as” text to remove any XML and binary data.
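
As one way to reduce HTML to plain text before uploading, the sketch below uses Python's standard `html.parser` module. It is a simple stand-in for a full rendering engine, not the method the service uses: it drops tags, skips `<script>` and `<style>` contents, and collapses whitespace. The names `TextExtractor`, `html_to_text`, and `BLOCK_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that usually end a visual block; a space is emitted around them
# so that adjacent blocks do not run together.
BLOCK_TAGS = {
    "p", "div", "br", "li", "ul", "ol", "tr", "td", "th", "table",
    "h1", "h2", "h3", "h4", "h5", "h6", "section", "article",
}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._parts.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # Collapse every run of whitespace into a single space.
        return " ".join("".join(self._parts).split())


def html_to_text(html: str) -> str:
    """Return the text content of an HTML string with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The result can be saved as a UTF-8 .txt file and uploaded in place of the original HTML. Table layout and other structure are lost, so review the output before relying on it.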
