Azure Blob Storage
The Azure Blob Storage Source Connector lets you use Azure Blob Storage as a data source for your pipelines. This guide explains the configuration options available when setting up an Azure Blob Storage connector.
Before starting, you'll need:
- Your Azure storage account name.
- Your Azure storage account key.
- The Azure container's name.
If you don't have Azure Blob Storage set up already, check out our guide How to Configure Azure Blob Storage.
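Before entering these values into Vectorize, you can sanity-check them with a short script. The sketch below uses the azure-storage-blob Python SDK to confirm that the account name, key, and container line up; the account, key, and container values are placeholders for your own.

```python
# pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient

# Placeholder values -- substitute your own account details.
ACCOUNT_NAME = "mystorageaccount"
ACCOUNT_KEY = "<storage-account-key>"
CONTAINER = "my-container"

# The default public endpoint for an Azure storage account.
service = BlobServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.blob.core.windows.net",
    credential=ACCOUNT_KEY,
)

container = service.get_container_client(CONTAINER)
if container.exists():
    print(f"Container '{CONTAINER}' is reachable.")
else:
    print(f"Container '{CONTAINER}' was not found -- check the name.")
```

If the script prints that the container is reachable, the same three values will work in the connector form.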
1. Click Source Connectors from the main menu.
2. Click New Source Connector on the Source Connectors page.
3. Select the Azure Blob Storage card.
4. Enter the connection parameters in the form, using the Azure Blob Parameters table below as a guide.
5. Click Create Azure Blob Integration to test connector connectivity and save your configuration.
Field | Notes | Required |
---|---|---|
Name | A descriptive name to identify the connector within Vectorize. | Yes |
Container | The name of the Azure container that holds your source data files. | Yes |
Storage Account Name | Your Azure Blob Storage account name. | Yes |
Storage Account Key | Your Azure Blob Storage account key. | Yes |
Endpoint | Your Azure Blob Storage endpoint. | No |
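The Endpoint field matters mainly when your blobs aren't served from the default https://&lt;account&gt;.blob.core.windows.net address. As a hedged illustration, the sketch below connects to the Azurite local emulator using Microsoft's published development account and key; with the real connector, you'd paste the equivalent endpoint URL into the Endpoint field.

```python
from azure.storage.blob import BlobServiceClient

# Azurite's well-known development account and published key.
AZURITE_CONN_STR = (
    "DefaultEndpointsProtocol=http;"
    "AccountName=devstoreaccount1;"
    "AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;"
    "BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
)

service = BlobServiceClient.from_connection_string(AZURITE_CONN_STR)

# List containers to confirm the custom endpoint is reachable.
for container in service.list_containers():
    print(container.name)
```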
When you select your configured Azure Blob Storage source in your pipeline configuration, Vectorize ingests all compatible files from the configured container.
You can think of the Azure Blob Storage connector as having two parts. The first is the container and its authorization. This part is reusable across pipelines and lets you connect to the same container from different pipelines without re-entering the credentials or container information each time.
The second part is the configuration specific to your RAG pipeline, such as which files and directories should be processed.
The following table outlines the fields available when configuring an Azure Blob Storage source for use within a Retrieval-Augmented Generation (RAG) pipeline.
Field | Description | Required |
---|---|---|
File Extensions | Specifies the types of files to be included (e.g., PDF, DOCX, HTML, Markdown, Text). | Yes |
Polling Interval (seconds) | Interval (in seconds) at which the connector will check the Azure container for updates. | Yes |
Recursively Scan | Whether the connector should recursively scan all folders in the Azure container. | No |
Path Prefix | A prefix path to filter the files in the Azure container (optional). | No |
Path Metadata Regex | A regex pattern used to extract metadata from the file paths (optional). | No |
Path Regex Group Names | Group names for the regex pattern (used in the Path Metadata Regex) to label extracted metadata (optional). | No |
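Path Metadata Regex and Path Regex Group Names work together: the regex captures pieces of each file's path, and the group names label the captured values as metadata fields. The sketch below is a minimal illustration of the underlying mechanics; the container layout and field names are made up, and depending on how the connector expects the groups to be supplied, you might provide plain capture groups in Path Metadata Regex and list the labels separately in Path Regex Group Names.

```python
import re

# Hypothetical container layout: <department>/<year>/<filename>
path = "finance/2024/q1-summary.pdf"

# Named groups label each captured value as a metadata field.
pattern = re.compile(r"^(?P<department>[^/]+)/(?P<year>\d{4})/")

match = pattern.match(path)
if match:
    print(match.groupdict())  # {'department': 'finance', 'year': '2024'}
```

With a pattern like this, each ingested file would carry department and year metadata derived from its location in the container.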
If you haven't yet built a connector to your vector database, go to Configuring Vector Database Connectors and select the platform you prefer to use for storing output vectors.
OR
If you're ready to start producing vector embeddings from your input data, head to Pipeline Basics. Select your new connector as the data source to use it in your pipeline.