Configuring an S3 Source Connector
The S3 Source Connector allows you to integrate Amazon S3 or S3-compatible storage services as a data source for your pipelines. This guide explains the configuration options available when setting up an S3 source connector.
Configuration Fields
Field Summary
Field | Description | Required |
---|---|---|
Name | A descriptive name to identify the connector within Vectorize. | Yes |
Bucket Name | The exact name of your S3 bucket | Yes |
Access Key | The Access Key ID for authentication | Yes |
Secret Key | The Secret Access Key corresponding to the Access Key ID | Yes |
Endpoint | The endpoint URL (only needed for S3-compatible services, optional for AWS S3) | No |
Region | The region where your bucket is located (only needed for S3-compatible services, optional for AWS S3) | No |
Allow as archive destination | Whether this connector can be used as an archive destination for pipeline results | No |
Important Note: The Region and Endpoint fields are only required when using S3-compatible services. For AWS S3, these fields can be left blank.
Finding Required Information in AWS S3
For a full walkthrough, please see the documentation for configuring an AWS S3 bucket for use with a Vectorize RAG Pipeline.
Bucket Name
To find your S3 bucket name:
- Log in to the AWS Management Console
- Navigate to the S3 service
- Your bucket name will be listed in the "Buckets" dashboard
IAM Credentials (Access Key and Secret Key)
To get your IAM credentials:
- Log in to the AWS Management Console
- Navigate to the IAM service
- Click on "Users" in the left sidebar
- Select the appropriate IAM user or create a new one
- Go to the "Security credentials" tab
- Under "Access keys", you can create a new access key or use an existing one
- When you create a new key, you'll be shown the Access Key ID and Secret Access Key. Make sure to save these securely.
Configuring the S3 Connector in a RAG Pipeline
You can think of the AWS S3 connector as having two parts. The first is the bucket and authorization. This part is reusable across pipelines and allows you to connect to the same bucket in different pipelines without providing the credentials or bucket information every time.
The second part is the configuration that is specific to your RAG Pipeline, such as which files and directories should be processed.
The following table outlines the fields available when configuring an S3 source for use within a Retrieval-Augmented Generation (RAG) pipeline.
Field | Description | Required |
---|---|---|
Includes File Types | Specifies the types of files to be included (e.g., PDF, DOCX, HTML, Markdown, Text). | Yes |
Check for Updates Every (s) | Interval (in seconds) at which the connector will check the S3 bucket for updates. | Yes |
Recursively Scan | Whether the connector should recursively scan all folders in the S3 bucket. | No |
Path Prefix | A prefix path to filter the files in the S3 bucket (optional). | No |
Path Metadata Regex | A regex pattern used to extract metadata from the file paths (optional). | No |
Path Regex Group Names | Group names for the regex pattern (used in the Path Metadata Regex) to label extracted metadata (optional). | No |
Understanding Pipeline-Specific Configuration Properties
Recursively Scan
- Description: When this option is unchecked (`false`), the connector only reads files from the root directory of the S3 bucket. When checked (`true`), the connector reads files from the root directory as well as all subdirectories in the bucket.
- Behavior:
  - If disabled, only objects located directly in the root directory of the S3 bucket are read and processed.
  - If enabled, the connector recursively reads and processes objects from all directories and subdirectories within the bucket.
- Relationship to `Path Prefix`: The `Path Prefix` is applied after the list of objects is retrieved. Therefore, if recursive scanning is enabled, the prefix will filter objects from the entire bucket hierarchy. If recursive scanning is disabled, the prefix will only filter objects in the root directory.
Path Prefix
- Description: The `Path Prefix` is a string filter applied to object names in the S3 bucket. It controls which objects are loaded into the pipeline based on their names. This is often used to target specific directories or patterns in the object names.
- Usage:
  - The prefix should not start with a leading slash (`/`), as S3 object names don't include one.
  - For example, using `pdfs/` as the prefix would limit the objects to those that start with `pdfs/`, such as `pdfs/doc1.pdf`.
  - The prefix can also be more general. For example, a prefix of `a` will include all objects whose names start with the letter "a".
- Interaction with `Recursively Scan`:
  - If recursive scanning is enabled, the connector will retrieve all objects, and then the `Path Prefix` will filter them based on the prefix.
  - If recursive scanning is disabled, only the objects in the root directory are retrieved, and the prefix filters those.
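The interaction between recursive scanning and the prefix can be sketched in plain Python. This is an illustrative model of the filtering order described above (retrieve first, then apply the prefix), not the connector's actual implementation, and the object names are hypothetical:

```python
# Illustrative model of how Recursively Scan and Path Prefix interact.
# Object names are hypothetical examples.
objects = ["a.txt", "pdfs/doc1.pdf", "pdfs/archive/doc2.pdf", "readme.md"]

def select_objects(objects, recursive: bool, prefix: str = ""):
    # Step 1: retrieval. Without recursion, only root-level objects
    # (names containing no "/") are retrieved.
    retrieved = objects if recursive else [o for o in objects if "/" not in o]
    # Step 2: the prefix filter is applied to whatever was retrieved.
    return [o for o in retrieved if o.startswith(prefix)]

print(select_objects(objects, recursive=True, prefix="pdfs/"))
# -> ['pdfs/doc1.pdf', 'pdfs/archive/doc2.pdf']

# With recursion off, a "pdfs/" prefix matches nothing: those objects
# were never retrieved from the root directory.
print(select_objects(objects, recursive=False, prefix="pdfs/"))
# -> []
```

Note that the prefix never widens the result set; it only narrows whatever the scan setting retrieved.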
Path Metadata Regex & Path Regex Group Names
These parameters work together to allow you to extract metadata that may be part of the pathname.
Path Metadata Regex
- Description: The `Path Metadata Regex` is a regular expression (regex) that extracts metadata from the full object name (the path of the file in S3). The extracted metadata is included in the vector database along with the file contents, providing additional context for retrieval.
- Usage:
  - The regex must return match groups, which are enclosed in parentheses (`()`) in regex syntax.
  - The extracted metadata can be used for filtering or querying documents within the vector database.
  - The `Path Metadata Regex` is particularly useful if S3 object names contain important metadata (e.g., user IDs, timestamps, experiment IDs), which is often encoded in the object names by convention.
- Examples:
  - To extract the directory part of the object name, use the regex: `^(.*\/)`.
  - To extract just the filename, use the regex: `([^\/]+)$`.
  - To extract both the directory and the filename in a single regex, use: `^(.*\/)?([^\/]+)$`.
- Example Use Case:
  - You have a file stored as `logs/2023/exp_123/log1.txt`.
  - A regex of `^(.*\/)?([^\/]+)$` would extract `logs/2023/exp_123/` as the first group and `log1.txt` as the second group.
  - Matched groups will be ingested into your vector search index. The name used in the metadata can be set using `Path Regex Group Names`.
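This use case can be checked with Python's `re` module, which uses the same parenthesized-group syntax. Treat it as a sketch for experimenting with your own patterns; how Vectorize's regex engine handles edge cases may differ.

```python
import re

# The combined directory-and-filename pattern from the examples above.
pattern = re.compile(r"^(.*/)?([^/]+)$")

match = pattern.match("logs/2023/exp_123/log1.txt")
print(match.group(1))  # -> logs/2023/exp_123/
print(match.group(2))  # -> log1.txt
```

Because `.*` is greedy, the first group consumes everything up to and including the last `/`, leaving only the filename for the second group.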
Path Regex Group Names
- Description: This field allows you to name the metadata fields extracted by the `Path Metadata Regex`. Each match group in the regex is assigned a name from this list.
- Usage:
  - If your regex contains multiple match groups (e.g., one for the directory and one for the filename), you can assign names such as `directory` and `filename`.
  - If fewer names are provided than there are match groups in the regex, only the first group(s) will be named, and any remaining groups will default to `meta_path_match`.
  - If no names are provided, the first match group will automatically be named `meta_path_match`.
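The naming rules can be modeled as a small helper. This is an illustrative sketch of the behavior described above, not the connector's actual code; in particular, skipping non-matching groups and applying the same default name to every unnamed group are assumptions based on this description.

```python
def name_groups(groups, names):
    """Pair regex match groups with configured group names.

    Unnamed groups fall back to "meta_path_match" (an assumption
    modeled on the rules described in the documentation above).
    """
    metadata = {}
    for i, value in enumerate(groups):
        if value is None:
            continue  # group did not match; no metadata entry
        key = names[i] if i < len(names) else "meta_path_match"
        metadata[key] = value
    return metadata

print(name_groups(("mydir/", "file2.pdf"), ["directory", "filename"]))
# -> {'directory': 'mydir/', 'filename': 'file2.pdf'}

print(name_groups(("x.pdf",), []))
# -> {'meta_path_match': 'x.pdf'}
```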
Example: Extracting Directory and File Name as Metadata
- The Path Metadata Regex `^(.*\/)?([^\/]+)$` has two matching groups, represented with parentheses:
  - The first matching group is all the characters up to and including the last `/` in the fully qualified object path.
  - The second matching group is all of the characters at the end of the fully qualified object path after the last `/`.
- The Path Regex Group Names are `directory` and `filename`:
  - When the path has a match on the first group, it will be entered as metadata in the search index entry with a name of `directory`.
  - When the path has a match on the second group, it will be entered as metadata in the search index entry with a name of `filename`.
If you use this configuration and your S3 bucket contains two files:

- `file.pdf`
- `mydir/file2.pdf`
The metadata for the extracted chunks for each file would have values of:
Object path | directory | filename |
---|---|---|
file.pdf | | file.pdf |
mydir/file2.pdf | mydir/ | file2.pdf |
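You can reproduce these metadata values with Python's `re` module (a sketch for experimentation). Note that the root-level file produces no match for the first group, which is why its directory column is empty:

```python
import re

# The example pattern: optional directory group, then filename group.
pattern = re.compile(r"^(.*/)?([^/]+)$")

for path in ["file.pdf", "mydir/file2.pdf"]:
    directory, filename = pattern.match(path).groups()
    print(path, directory, filename)
# -> file.pdf None file.pdf
# -> mydir/file2.pdf mydir/ file2.pdf
```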
Troubleshooting
If you encounter issues while creating the integration:
- Double-check that your Access Key and Secret Key are correct and active.
- Ensure that the IAM user associated with these credentials has the necessary permissions to access the specified S3 bucket.
- For S3-compatible services, verify that the Endpoint and Region information is accurate.
For further assistance, please see the documentation on how to contact support.
S3 Archive Feature
The S3 Archive feature allows you to write the results of your pipeline to both a vector database and an S3 bucket. This provides a backup of your pipeline results and enables additional use cases for the processed data.
Configuring S3 for Archive Use
To use an S3 bucket for archiving, you need to:
- Configure an S3 source connector with the "Allow as archive destination" option enabled
- Ensure the AWS credentials have write permissions to the bucket
Required Permissions
When using S3 as an archive destination, your IAM user needs the following additional permissions:
```json
{
  "Effect": "Allow",
  "Action": [
    "s3:PutObject",
    "s3:PutObjectAcl"
  ],
  "Resource": ["arn:aws:s3:::YOUR_BUCKET_NAME/*"]
}
```
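The statement above is one entry in a full IAM policy document, which also requires a `Version` field and a `Statement` array. A complete policy combining it with read access might look like the following sketch; `s3:GetObject` and `s3:ListBucket` are the usual read actions for a source connector, but they are an assumption here, so adjust to your own security requirements:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET_NAME"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET_NAME/*"]
    }
  ]
}
```

Note that `s3:ListBucket` applies to the bucket ARN itself, while the object-level actions apply to the `/*` resource.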
Enabling a Connector for Archive Use
When creating or editing an S3 source connector, you can enable it for archive use by checking the "Allow as archive destination" option.
This makes the connector available for selection during pipeline creation when configuring the archive destination.