Google Drive

The Google Drive Source Connector allows you to integrate Google Drive as a data source for your pipelines. This guide explains the configuration options available when setting up a Google Drive connector.

Before you begin

Before starting, you'll need:

  • The JSON key for your GCP service account.

  • Your GCP service account's email address.

If you don't have a GCP service account created yet, check out our guide How to Create a GCP Service Account for use with Google Drive.

Configure the Connector

To configure a connector to your Google Drive instance:

  1. Click Source Connectors from the main menu.

  2. Click New Source Connector from the Source Connectors page.

  3. Select the Google Drive card.

  4. Enter a name for the connector and your service account's JSON key, then click Create Google Drive Integration.

Configuring the Google Drive Connector in a RAG Pipeline

The Google Drive connector has two parts. The first is authorization with your service account. This part is reusable across pipelines, so you can connect to the same service account from different pipelines without providing the credentials each time.

The second part is the configuration that's specific to your RAG Pipeline, such as which files and directories should be processed.

The following table outlines the fields available when configuring a Google Drive source for use within a Retrieval-Augmented Generation (RAG) pipeline.

| Field | Description | Required |
|---|---|---|
| Root Parents | Specifies the root folder ID(s) to pull data from. These folders must be shared with the service account. | No |
| Include MIME Types | MIME types to include. | Yes |
| Exclude MIME Types | MIME types to exclude. | Yes |
| Path Metadata Regex | A regex pattern used to extract metadata from the file paths (optional). | No |
| Polling Interval | Interval (in seconds) at which the connector will check Google Drive for updates. | No |
| Path Regex Group Names | Group names for the regex pattern (used in the Path Metadata Regex) to label extracted metadata (optional). | No |
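The Root Parents field takes folder IDs rather than folder URLs. In the Google Drive web interface, a folder's ID is the path segment that follows /folders/ in its URL. The following is a minimal Python sketch for pulling the ID out of a copied URL. The helper name folder_id_from_url and the ID in the example are hypothetical placeholders, not part of the connector.

```python
import re

def folder_id_from_url(url):
    """Return the folder ID from a Google Drive folder URL, or None.

    Hypothetical helper for illustration only. It assumes the common
    URL shape https://drive.google.com/drive/folders/<ID>.
    """
    match = re.search(r"/folders/([A-Za-z0-9_-]+)", url)
    return match.group(1) if match else None

# The ID below is a made-up placeholder, not a real folder.
example_url = "https://drive.google.com/drive/folders/1AbC_dEf-123?usp=sharing"
folder_id = folder_id_from_url(example_url)
print(folder_id)
```

The query string (for example ?usp=sharing) is ignored because the character class stops at the first character that cannot appear in the matched ID.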

Understanding Pipeline-Specific Configuration Properties for Google Drive

MIME Types

A MIME type (Multipurpose Internet Mail Extensions) is a standard way of identifying the format of a file. It's composed of a type and a subtype (e.g., image/png for PNG images or application/pdf for PDFs).

You can include specific MIME types to allow certain file formats, or exclude them to block unwanted formats, so that only the file formats you want are processed by your pipeline.

Examples:

  • To include PDFs and plain text files: include application/pdf and text/plain.

  • To block Google Docs files: exclude application/vnd.google-apps.document.
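The include and exclude examples above can be sketched as a simple filter in Python. This is an illustration only, not the connector's implementation. The function name is_included is hypothetical, and two behaviors are assumptions: an empty include list allows every type, and exclude takes precedence when a type appears in both lists.

```python
def is_included(mime_type, include=None, exclude=None):
    """Return True if a file with this MIME type passes the filters.

    Assumptions (not confirmed by the connector documentation):
    - an empty include list means "allow everything"
    - exclude wins when a type is listed in both include and exclude
    """
    include = set(include or [])
    exclude = set(exclude or [])
    if mime_type in exclude:
        return False
    return not include or mime_type in include

# Hypothetical file listing: name -> MIME type
files = {
    "report.pdf": "application/pdf",
    "notes.txt": "text/plain",
    "draft": "application/vnd.google-apps.document",
    "logo.png": "image/png",
}

selected = [
    name
    for name, mime in files.items()
    if is_included(
        mime,
        include=["application/pdf", "text/plain"],
        exclude=["application/vnd.google-apps.document"],
    )
]
print(selected)  # ['report.pdf', 'notes.txt']
```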

Path Metadata Regex & Path Regex Group Names

These parameters work together to allow you to extract metadata that may be part of the pathname.

Path Metadata Regex

  • Description: The Path Metadata Regex is a regular expression (regex) that extracts metadata from the full path of the file in Google Drive. The extracted metadata is stored in the vector database along with the file contents, providing additional context for retrieval.

  • Usage:

    • The regex must return match groups, which are enclosed in parentheses (()) in regex syntax.

    • The extracted metadata can be used for filtering or querying documents within the vector database.

    • The Path Metadata Regex is particularly useful if your Google Drive file paths contain important metadata (e.g., user IDs, timestamps, experiment IDs), which are often encoded in folder and file names by convention.

  • Examples:

    • To extract the directory part of the object name, use the regex: ^(.*\/).

    • To extract just the filename, use the regex: ([^\/]+)$.

    • To extract both the directory and the filename in a single regex, use: ^(.*\/)?([^\/]+)$.

  • Example Use Case:

    • You have a file stored as logs/2023/exp_123/log1.txt.

    • A regex of ^(.*\/)?([^\/]+)$ would extract logs/2023/exp_123/ as the first group and log1.txt as the second group.

    • Matched groups will be ingested into your vector search index. The name used in the metadata can be set using Path Regex Group Names.
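The three example patterns can be tried out with Python's standard re module before you enter them in the connector. This sketch only shows what each pattern captures from the example path. Whether the connector's regex engine behaves identically to Python's in every edge case is an assumption.

```python
import re

path = "logs/2023/exp_123/log1.txt"

# Directory part: everything up to and including the last "/"
directory_only = re.search(r"^(.*\/)", path).groups()

# Filename part: everything after the last "/"
filename_only = re.search(r"([^\/]+)$", path).groups()

# Both parts in a single pattern (the directory group is optional)
both = re.search(r"^(.*\/)?([^\/]+)$", path).groups()

print(directory_only)  # ('logs/2023/exp_123/',)
print(filename_only)   # ('log1.txt',)
print(both)            # ('logs/2023/exp_123/', 'log1.txt')
```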

Path Regex Group Names

  • Description: This field allows you to name the metadata fields extracted by the Path Metadata Regex. Each match group in the regex is assigned a name from this list.

  • Usage:

    • If your regex contains multiple match groups (e.g., one for the directory and one for the filename), you can assign names such as directory and filename.

    • If fewer names are provided than there are match groups in the regex, only the first group(s) will be named, and any remaining groups will default to meta_path_match.

    • If no names are provided, the first match group will automatically be named meta_path_match.
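The naming rules above can be sketched in Python as follows. This is an illustration of the described behavior, not the connector's code. The function name extract_path_metadata is hypothetical, and two details are assumptions: optional groups that do not match are skipped, and if several unnamed groups fall back to meta_path_match, the later value overwrites the earlier one.

```python
import re

DEFAULT_NAME = "meta_path_match"

def extract_path_metadata(path, pattern, group_names=()):
    """Map each regex match group to a metadata field name.

    Hypothetical helper: groups beyond the supplied names fall back
    to DEFAULT_NAME, as the rules above describe.
    """
    match = re.search(pattern, path)
    if match is None:
        return {}
    metadata = {}
    for index, value in enumerate(match.groups()):
        if value is None:
            continue  # assumption: unmatched optional groups are skipped
        name = group_names[index] if index < len(group_names) else DEFAULT_NAME
        metadata[name] = value  # assumption: a repeated name is overwritten
    return metadata

path = "logs/2023/exp_123/log1.txt"
pattern = r"^(.*\/)?([^\/]+)$"

fully_named = extract_path_metadata(path, pattern, ["directory", "filename"])
partly_named = extract_path_metadata(path, pattern, ["directory"])
unnamed = extract_path_metadata(path, r"([^\/]+)$")

print(fully_named)   # {'directory': 'logs/2023/exp_123/', 'filename': 'log1.txt'}
print(partly_named)  # {'directory': 'logs/2023/exp_123/', 'meta_path_match': 'log1.txt'}
print(unnamed)       # {'meta_path_match': 'log1.txt'}
```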

Example: Extracting Directory and File Name as Metadata

  1. The Path Metadata Regex ^(.*\/)?([^\/]+)$ has two matching groups, each enclosed in parentheses:

    • The first matching group is all the characters up to and including the last / in the fully qualified object path.

    • The second matching group is all of the characters at the end of the fully qualified object path after the last /.

  2. The Path Regex Group Names are set to directory and filename:

    • When the path has a match on the first group, it will be entered as metadata in the search index entry with a name of directory.

    • When the path has a match on the second group, it will be entered as metadata in the search index entry with a name of filename.

If you use this configuration and your Google Drive folder contains two files:

file.pdf

mydir/file2.pdf

The metadata for the extracted chunks for each file would have values of:

| File path | directory | filename |
|---|---|---|
| file.pdf | | file.pdf |
| mydir/file2.pdf | mydir/ | file2.pdf |
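The values for the two example files can be checked with a short Python sketch. It shows why file.pdf has no directory value: the optional first group does not match when the path contains no /. Leaving that field out of the metadata entirely is an assumption about how an unmatched group is handled.

```python
import re

PATTERN = re.compile(r"^(.*\/)?([^\/]+)$")
GROUP_NAMES = ["directory", "filename"]

results = {}
for path in ["file.pdf", "mydir/file2.pdf"]:
    groups = PATTERN.match(path).groups()
    # Pair each group with its name; drop groups that did not match.
    results[path] = {
        name: value
        for name, value in zip(GROUP_NAMES, groups)
        if value is not None
    }

print(results["file.pdf"])         # {'filename': 'file.pdf'}
print(results["mydir/file2.pdf"])  # {'directory': 'mydir/', 'filename': 'file2.pdf'}
```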

Share Google Drive Content with your Service Account Email

In order for your RAG Pipeline to ingest content from Google Drive, you'll need to share all content you'd like to make available to your pipeline with your service account's email address. You can share individual files and folders. Sharing a parent folder will share all content inside the folder, including content in any sub-folders.

What's next?

  • If you haven't yet built a connector to your vector database, go to Configuring Vector Database Connectors and select the platform you prefer to use for storing output vectors.

    OR

  • If you're ready to start producing vector embeddings from your input data, head to Pipeline Basics. Select your new connector as the data source to use it in your pipeline.
