Configuring Google Cloud Storage Source Connector

The Google Cloud Storage (GCS) Source Connector allows you to integrate GCS as a data source for your pipelines. This guide explains the configuration options available when setting up a GCS source connector.

Configuration Fields

Field Summary

Field	Description	Required
Name	A descriptive name to identify the connector within Vectorize.	Yes
Bucket	The name of your Google Cloud Storage bucket.	Yes
Service Account JSON	The JSON key file for the service account.	Yes

Prerequisites

Before configuring the connector, ensure you have:

Created a Google Cloud Storage bucket
Set up a Service Account with the necessary permissions
Generated a JSON key file for the Service Account

For detailed instructions, refer to our guide on how to set up a GCS bucket.

Finding Required Information in Google Cloud

Bucket Name

Go to the Google Cloud Console
Navigate to Cloud Storage > Buckets
Copy the name of the bucket you want to use

Service Account JSON

In the Google Cloud Console, go to IAM & Admin > Service Accounts
Find the service account you want to use or create a new one
Create a new key or use an existing one, downloading it in JSON format

Make sure the Service Account has access to your bucket

Configuring the Integration

From the main menu, click on Source Connectors
Click New Source Connector
Select the Google Cloud Storage card
Fill in the required fields:
- Enter a descriptive name for the connector
- Input your GCS Bucket name
- Paste the entire contents of your Service Account JSON key file
Click Create Google Cloud Storage Integration to test and save your configuration

Configuring GCS for RAG Pipeline

Field	Description	Required
Includes File Types	Specifies the types of files to be included (e.g., PDF, DOCX, HTML, Markdown, Text).	Yes
Check for Updates Every (s)	Interval (in seconds) at which the connector will check the GCS bucket for updates.	Yes
Recursively Scan	Whether the connector should recursively scan all folders in the GCS bucket.	No
Path Prefix	A prefix used to filter the objects in the GCS bucket by name.	No
Path Metadata Regex	A regex pattern used to extract metadata from object paths.	No
Path Regex Group Names	Names for the match groups in the Path Metadata Regex, used to label the extracted metadata.	No

Understanding Pipeline-Specific Configuration Properties for GCS

Recursively Scan

Description: When this option is unchecked (false), the connector only reads files from the root directory of the GCS bucket. When checked (true), the connector reads files from the root directory as well as all subdirectories in the bucket.
Behavior:
- If disabled, only objects located directly in the root directory of the GCS bucket are read and processed.
- If enabled, the connector recursively reads and processes objects from all directories and subdirectories within the bucket.
Relationship to Path Prefix: The Path Prefix is applied after the list of objects is retrieved. Therefore, if recursive scanning is enabled, the prefix will filter objects from the entire bucket hierarchy. If recursive scanning is disabled, the prefix will only filter objects in the root directory.

Path Prefix

Description: The Path Prefix is a string filter applied to object names in the GCS bucket. It controls which objects are loaded into the pipeline based on their names. This is often used to target specific directories or patterns in the object names.
Usage:
- The prefix should not start with a leading slash (/), as GCS object names don’t include one.
- For example, using pdfs/ as the prefix would limit the objects to those that start with pdfs/, such as pdfs/doc1.pdf.
- The prefix can also be more general. For example, a prefix of a will include all objects whose names start with the letter "a".
Interaction with Recursively Scan:
- If recursive scanning is enabled, the connector will retrieve all objects, and then the Path Prefix will filter them based on the prefix.
- If recursive scanning is disabled, only the objects in the root directory are retrieved, and the prefix filters these.
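
The interaction between Recursively Scan and Path Prefix described above can be sketched in Python. This is a minimal illustration, not the connector's implementation: `select_objects` and the sample object names are hypothetical, and the sketch assumes that a root-level object is one whose name contains no `/`.

```python
def select_objects(names, recursive, prefix=""):
    """Return object names that pass the scan setting, then the prefix filter."""
    if recursive:
        candidates = list(names)
    else:
        # Assumption: root-level objects are those with no "/" in their name.
        candidates = [n for n in names if "/" not in n]
    # The prefix is a plain string match against the start of the object name.
    return [n for n in candidates if n.startswith(prefix)]


# Hypothetical bucket contents, for illustration only.
names = ["a.txt", "apple.pdf", "notes.md", "pdfs/doc1.pdf", "pdfs/2023/doc2.pdf"]

print(select_objects(names, recursive=True, prefix="pdfs/"))
print(select_objects(names, recursive=False, prefix="a"))
```

With recursive scanning on, the `pdfs/` prefix keeps both objects under `pdfs/`. With it off, the prefix `a` keeps only `a.txt` and `apple.pdf` from the root.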

Path Metadata Regex and Path Regex Group Names

These two fields work together to extract metadata that is encoded in an object's path.

Path Metadata Regex

Description: The Path Metadata Regex is a regular expression (regex) that extracts metadata from the full object name (the path of the file in GCS). The extracted metadata is included in the vector database along with the file contents, providing additional context for retrieval.
Usage:
- The regex must contain at least one match group. Match groups are the parts of the pattern enclosed in parentheses.
- The extracted metadata can be used for filtering or querying documents within the vector database.
- The Path Metadata Regex is particularly useful if GCS object names contain important metadata (e.g., user IDs, timestamps, experiment IDs), which are often encoded in the object names by convention.
Examples:
- To extract the directory part of the object name, use the regex: ^(.*\/).
- To extract just the filename, use the regex: ([^\/]+)$.
- To extract both the directory and the filename in a single regex, use: ^(.*\/)?([^\/]+)$.
Example Use Case:
- You have a file stored as logs/2023/exp_123/log1.txt.
- A regex of ^(.*\/)?([^\/]+)$ would extract logs/2023/exp_123/ as the first group and log1.txt as the second group.
- Matched groups are ingested into your vector search index. The metadata name for each group can be set using Path Regex Group Names.
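
The three example patterns can be tried locally before entering them in the connector. The sketch below uses Python's `re` module, which may differ in edge cases from the regex engine the connector uses; the patterns here rely only on basic syntax.

```python
import re

path = "logs/2023/exp_123/log1.txt"

# Directory part only: everything up to and including the last "/".
directory_only = re.search(r"^(.*\/)", path)
# Filename only: the run of non-"/" characters at the end of the path.
filename_only = re.search(r"([^\/]+)$", path)
# Both parts in one pattern; the directory group is optional.
both = re.search(r"^(.*\/)?([^\/]+)$", path)

print(directory_only.group(1))  # logs/2023/exp_123/
print(filename_only.group(1))   # log1.txt
print(both.groups())            # ('logs/2023/exp_123/', 'log1.txt')
```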

Path Regex Group Names

Description: This field allows you to name the metadata fields extracted by the Path Metadata Regex. Each match group in the regex is assigned a name from this list.
Usage:
- If your regex contains multiple match groups (e.g., one for the directory and one for the filename), you can assign names such as directory and filename.
- If fewer names are provided than there are match groups in the regex, only the first group(s) will be named, and any remaining groups will default to meta_path_match.
- If no names are provided, the first match group will automatically be named meta_path_match.
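
The naming rules above can be sketched as follows. This is an illustration under stated assumptions, not the connector's code: `name_groups` is a hypothetical helper, every unnamed group falls back to `meta_path_match`, and results are returned as a list of pairs because how the connector stores several groups that share the default name is not specified here.

```python
import re

DEFAULT_NAME = "meta_path_match"


def name_groups(pattern, path, group_names=()):
    """Pair each matched group with its name, falling back to DEFAULT_NAME."""
    match = re.search(pattern, path)
    if match is None:
        return []
    pairs = []
    for index, value in enumerate(match.groups()):
        if value is None:
            continue  # optional group that did not take part in the match
        name = group_names[index] if index < len(group_names) else DEFAULT_NAME
        pairs.append((name, value))
    return pairs


pattern = r"^(.*\/)?([^\/]+)$"
path = "logs/2023/exp_123/log1.txt"

print(name_groups(pattern, path, ["directory", "filename"]))
print(name_groups(pattern, path, ["directory"]))
```

In the second call only one name is supplied, so the filename group is labeled `meta_path_match`.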

Example: Extracting Directory and File Name as Metadata

The Path Metadata Regex ^(.*\/)?([^\/]+)$ has two match groups, each enclosed in parentheses:
- The first match group is all the characters up to and including the last / in the fully qualified object path.
- The second match group is all the characters after the last / in the fully qualified object path.
The Path Regex Group Names are directory and filename:
- When the path matches the first group, the matched text is added to the search index entry as metadata named directory.
- When the path matches the second group, the matched text is added to the search index entry as metadata named filename.

If you use this configuration and your GCS bucket contains two files:

file.pdf

mydir/file2.pdf

The metadata for the extracted chunks for each file would have values of:

Object path	`directory`	`filename`
file.pdf		file.pdf
mydir/file2.pdf	mydir/	file2.pdf
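
The table above can be reproduced with a short Python sketch. `path_metadata` is a hypothetical helper, and it assumes that a group which does not take part in the match (the directory for file.pdf) is simply left out of the metadata; whether the connector omits the field or stores an empty value is not stated here.

```python
import re

PATTERN = re.compile(r"^(.*\/)?([^\/]+)$")
GROUP_NAMES = ["directory", "filename"]


def path_metadata(object_path):
    """Map group names to matched text for one object path."""
    match = PATTERN.search(object_path)
    if match is None:
        return {}
    # Keep only the groups that took part in the match.
    return {
        name: value
        for name, value in zip(GROUP_NAMES, match.groups())
        if value is not None
    }


for object_path in ["file.pdf", "mydir/file2.pdf"]:
    print(object_path, path_metadata(object_path))
```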

Troubleshooting

If you encounter issues while creating the integration:

Verify that your Bucket name is correct
Ensure that the Service Account JSON is complete and correctly formatted
Check that the Service Account has the necessary permissions to access the bucket
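
A quick local check can catch a truncated or malformed key before you paste it. The sketch below is an assumption-laden illustration: `check_service_account_json` is a hypothetical helper, and the fields it looks for are ones typically present in a Google service account key file, not a list of what Vectorize itself validates.

```python
import json

# Assumption: fields typically present in a Google service account key file.
EXPECTED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_json(text):
    """Return a list of problems found in the pasted key; empty if none."""
    try:
        key = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(key, dict):
        return ["top-level value is not a JSON object"]
    problems = [
        f"missing field: {name}" for name in EXPECTED_FIELDS if not key.get(name)
    ]
    if key.get("type") and key["type"] != "service_account":
        problems.append('"type" is not "service_account"')
    return problems


# A paste that was cut off before the closing brace.
print(check_service_account_json('{"type": "service_account"'))
```

An empty list means the basic structure looks right. It does not confirm that the key is still active or that the Service Account can access the bucket.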

For further assistance, please refer to the Google Cloud Storage documentation or contact Vectorize support.
