Couchbase Capella

Approximate time to complete: 5-10 minutes, excluding prerequisites

This quickstart will walk you through creating and scheduling a pipeline that collects data from an Amazon S3 bucket, creates vector embeddings using an OpenAI embedding model, and writes the vectors to a search index in your Couchbase Capella cluster.

Before you begin

Before starting, ensure you have access to the credentials, connection parameters, and API keys for the following:

  • A Vectorize account
  • A Couchbase Capella account
  • An AWS account with an S3 bucket and an IAM user with an access key and secret key
  • An OpenAI API key

Step 1: Create a Couchbase Capella cluster and search index

Create a Couchbase Capella cluster

For new Couchbase users

  1. If you've just created your Couchbase account, you'll be prompted to create a Provisioned Trial Cluster. Select Provisioned Trial.

  2. Choose AWS as the cloud provider, select your preferred region, and click Deploy Now.

For existing Couchbase users

  1. Navigate to the Couchbase Capella dashboard ↗. Select the Operational Clusters tab, then click Create Cluster.

  2. You'll be asked to select an existing project to associate your cluster with, then select Continue. If you'd like to create a new project, click on the provided link to create it, then return to the cluster dashboard and click Create Cluster.

  3. Choose AWS as the cloud provider, select your preferred region, and adjust the configuration as desired, then click Create Cluster.

Create a Couchbase Capella Bucket

  1. Navigate to the Couchbase Capella dashboard ↗. Click on your cluster's name to go to the home page for your cluster.

  2. From the cluster's home page, click Data Tools, then select Create.

  3. In the pop-up, click New under Bucket. Check the box under Scope to use _default for the scope and collection, then click Create.

Create a Couchbase Capella Search Index

  1. Save the following JSON to a file named index.json. It contains the index definition you'll need to create your search index (see the note after these steps for details on the vector field).

    {
      "type": "fulltext-index",
      "name": "vectorize-quickstart-index",
      "uuid": "",
      "sourceType": "gocbcore",
      "sourceName": "vectorize-quickstart",
      "sourceUUID": "",
      "planParams": {
        "maxPartitionsPerPIndex": 1024,
        "indexPartitions": 1
      },
      "params": {
        "doc_config": {
          "docid_prefix_delim": "",
          "docid_regexp": "",
          "mode": "scope.collection.type_field",
          "type_field": "type"
        },
        "mapping": {
          "analysis": {},
          "default_analyzer": "standard",
          "default_datetime_parser": "dateTimeOptional",
          "default_field": "_all",
          "default_mapping": {
            "dynamic": false,
            "enabled": false
          },
          "default_type": "_default",
          "docvalues_dynamic": false,
          "index_dynamic": true,
          "store_dynamic": true,
          "type_field": "_type",
          "types": {
            "_default._default": {
              "dynamic": false,
              "enabled": true,
              "properties": {
                "chunk_id": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "analyzer": "keyword",
                      "index": true,
                      "name": "chunk_id",
                      "type": "text"
                    }
                  ]
                },
                "origin_id": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "analyzer": "keyword",
                      "index": true,
                      "name": "origin_id",
                      "type": "text"
                    }
                  ]
                },
                "source": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "analyzer": "keyword",
                      "index": true,
                      "name": "source",
                      "type": "text"
                    }
                  ]
                },
                "text": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "analyzer": "keyword",
                      "index": true,
                      "name": "text",
                      "type": "text"
                    }
                  ]
                },
                "unique_source": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "analyzer": "keyword",
                      "index": true,
                      "name": "unique_source",
                      "type": "text"
                    }
                  ]
                },
                "user": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "analyzer": "keyword",
                      "index": true,
                      "name": "user",
                      "type": "text"
                    }
                  ]
                },
                "vector": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "dims": 1536,
                      "index": true,
                      "name": "vector",
                      "similarity": "dot_product",
                      "type": "vector",
                      "vector_index_optimized_for": "recall"
                    }
                  ]
                }
              }
            }
          }
        },
        "store": {
          "indexType": "scorch",
          "segmentVersion": 16
        }
      },
      "sourceParams": {}
    }

  2. Navigate to the Couchbase Capella dashboard ↗. Click on your cluster's name to go to the home page for your cluster, then select Data Tools.

  3. Go to the Search section in the secondary menu, then click Create Search Index.

  4. Upload the index definition.

    • Select Advanced Mode.
    • Click Import from File, then upload the JSON file you just saved.

  5. Configure your index.

    • Select the bucket to use.
    • Adjust the index name as desired.
    • Click Create Index to complete the process.

    Capella will adjust the JSON in the index definition to include your bucket name and your modified index name.
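
A note on the index definition you saved in step 1: the vector field is indexed with dims set to 1536, which matches the output size of OpenAI embedding models such as text-embedding-ada-002, and its similarity metric is dot_product. If you later choose an embedding model with a different output size, update dims to match.

If you'd like to sanity-check the file before uploading it, here's a short Python sketch that loads it and verifies the vector dimensions. The optional REST call at the end uses Couchbase Server's standard Search (FTS) API; the host, port, and credentials are placeholders, and that endpoint may not be reachable on Capella, where the Advanced Mode import described above is the supported path.

    import json

    import requests  # pip install requests

    # Load the definition you saved above as index.json
    with open("index.json") as f:
        definition = json.load(f)

    # The vector field's dims must match your embedding model's output size
    props = definition["params"]["mapping"]["types"]["_default._default"]["properties"]
    assert props["vector"]["fields"][0]["dims"] == 1536  # e.g. text-embedding-ada-002

    # Optional: create the index via the standard Search REST API.
    # CLUSTER_HOST, CLUSTER_ACCESS_NAME, and PASSWORD are placeholders.
    resp = requests.put(
        f"https://CLUSTER_HOST:18094/api/index/{definition['name']}",
        auth=("CLUSTER_ACCESS_NAME", "PASSWORD"),
        json=definition,
    )
    resp.raise_for_status()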

Create Couchbase Cluster Access Credentials

  1. To configure an integration with Vectorize, you'll need credentials to access your cluster. Navigate to the Couchbase Capella dashboard ↗. Click on your cluster's name to go to the home page for your cluster, then click Settings. Under Security, click on Cluster Access, then select Create Cluster Access.

  2. Enter the following details, then click Create Cluster Access.

    • Cluster Access Name: The name you'll use to access this cluster.

    • Password: The password you'll use to access this cluster.

    • Bucket: Select the bucket you created earlier.

    • Scope: Select _default for the scope.

    • Access: Select Read/Write.

Configure Allowed IP Addresses

  1. Navigate to the Couchbase Capella dashboard ↗. Click on your cluster's name to go to the home page for your cluster, then click Settings. Under Networking, click on Allowed IP Addresses, then select Add Allowed IP.

  2. Select Allow Access from Anywhere. After confirming your selection, click Add Allowed IP.

Access your Couchbase Connection String

  1. Navigate to the Couchbase Capella dashboard ↗. Click on your cluster's name to go to the home page for your cluster, then select Connect.

  2. Click the copy icon next to the connection string to copy it and store it safely. You'll need it for accessing your cluster through Vectorize.
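
Optionally, you can verify the connection string and cluster access credentials before wiring them into Vectorize. Below is a minimal sketch using the Couchbase Python SDK (pip install couchbase); the connection string, credentials, and bucket name are placeholders for the values you created above, and the machine running it must be covered by your Allowed IP list:

    from datetime import timedelta

    from couchbase.auth import PasswordAuthenticator
    from couchbase.cluster import Cluster
    from couchbase.options import ClusterOptions

    # Placeholders: your connection string and cluster access credentials
    cluster = Cluster(
        "couchbases://cb.example.cloud.couchbase.com",
        ClusterOptions(PasswordAuthenticator("CLUSTER_ACCESS_NAME", "PASSWORD")),
    )
    cluster.wait_until_ready(timedelta(seconds=10))

    # Open the bucket you created earlier (name is a placeholder)
    bucket = cluster.bucket("vectorize-quickstart")
    print(bucket.name, "is reachable")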

Step 2: Create a RAG pipeline

Set Up Pipeline and Vector Database

  1. Open the Vectorize Application Console ↗
  2. From the dashboard, click on + New RAG Pipeline under the "RAG Pipelines" section.

  3. Enter a name for your pipeline. For example, you can name it quickstart-pipeline.
  4. Click on + New Vector DB to create a new vector database.

  5. Select Couchbase Capella from the list of vector databases.

  6. In the Couchbase Capella configuration screen, enter the following details, then click Create Capella Access.

  • Name: The name you'll use to refer to this integration in Vectorize.

  • Username: The cluster access name you created earlier.

  • Password: The cluster access password you created earlier.

  • Connection String: The connection string you saved earlier.

  7. Once your Capella integration has been created, you'll set configuration values for your pipeline. Enter the following details:
  • Name: The name you'll use to refer to this pipeline in Vectorize.

  • Bucket Name: The Capella bucket you created earlier.

  • Scope Name: Enter _default, which is what you used earlier when configuring your cluster.

  • Collection Name: Enter _default, which is what you used earlier when configuring your cluster.

  • Search Index Name: The name of the Capella search index you created earlier.

Configure AI Platform

  1. Click on + New AI Platform.
  2. Select OpenAI from the AI platform options.

  3. In the OpenAI configuration screen:
    • Enter a descriptive name for your OpenAI integration.
    • Enter your OpenAI API Key.

  4. Leave the default values for embedding model, chunk size, and chunk overlap for the quickstart.
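
For context: chunk size caps how much text goes into each embedded chunk, and chunk overlap repeats a small window of text between neighboring chunks so that sentences near a boundary keep their surrounding context. The sketch below illustrates the idea with naive character-based splitting; it's an illustration only, not Vectorize's actual splitter, which may be token- or sentence-aware.

    # Illustration only: naive character-based chunking showing how
    # chunk size and overlap interact.
    def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
        step = chunk_size - overlap
        return [text[i:i + chunk_size] for i in range(0, len(text), step)]

    chunks = chunk_text("a long document ... " * 100)
    # Neighboring chunks share a 50-character window:
    assert chunks[0][-50:] == chunks[1][:50]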

Add Source Connector

  1. Click + Add source connector to add a source connector to your pipeline.
  2. Choose Amazon S3 from the list of source connector options.

  3. In the Amazon S3 configuration screen:
    • Name your integration. It can be the same as your bucket name, but it doesn't have to be.
    • Enter your Bucket Name exactly as it appears in AWS.
    • Provide the Access Key and Secret Key for your AWS IAM user.

  4. Accept the default values for file extensions and other options.
  5. Click Save Configuration.

Finalize Pipeline Creation

  1. After configuring the S3 integration, you should see it listed under Source Connectors.

  2. Click Next: Schedule Pipeline to continue.

  3. Set the schedule type and frequency for the pipeline. For this quickstart, leave the default values.

  4. Click Create RAG Pipeline.

Monitor Pipeline Creation and Backfilling

  1. After clicking Create RAG Pipeline, you will see the pipeline creation progress.
  2. The stages include:
    • Creating pipeline
    • Deploying pipeline
    • Starting backfilling process

  3. Once the pipeline is created and deployed, it will begin the backfilling process.
  4. You can monitor the pipeline status and view the progress of document ingestion and vector creation.
  5. If your S3 bucket is empty, the pipeline will show 0 Documents, 0 Chunks, and 0 Vectors.

Step 3: Upload Files to Your S3 Bucket

Prepare Sample Data

  1. Download the friends-scripts.zip file from the following location:

    Download: Friends Scripts (ZIP file)

  2. After downloading the friends-scripts.zip file, extract it to a location on your local machine. On most operating systems, you can do this by right-clicking the zip file and selecting Extract All or Unzip.
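
If you prefer to script the extraction, Python's standard library handles it; the filenames below match the download above:

    import zipfile

    # Extract the downloaded archive into a local "friends-scripts" folder
    with zipfile.ZipFile("friends-scripts.zip") as archive:
        archive.extractall("friends-scripts")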

Upload Files to S3

  1. Log in to your AWS account and navigate to the S3 Buckets section.
  2. Filter to find your bucket by typing its name in the search bar.

  3. Click on your bucket name to open the detailed bucket view.

  4. Click on the Upload button in the top right corner of the bucket's detail view.

  5. You can either drag and drop the extracted files from the friends-scripts directory into the upload area, or click on Add files to browse your local machine and select them manually.

  6. After adding the files, you should see them listed under the Files and folders section of the upload screen.

  7. Once you've confirmed that all the files are listed, click on the Upload button at the bottom of the screen to start the upload process.

    Your files will now be uploaded to your S3 bucket.
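
As an alternative to the console upload, here's a minimal boto3 sketch (pip install boto3) that uploads the extracted files from a script. It assumes your AWS credentials are already configured locally (for example via aws configure or environment variables); the bucket name is a placeholder:

    import os

    import boto3  # pip install boto3

    s3 = boto3.client("s3")  # picks up credentials from your environment
    bucket_name = "your-bucket-name"   # placeholder: your S3 bucket
    local_dir = "friends-scripts"      # directory extracted earlier

    for filename in os.listdir(local_dir):
        path = os.path.join(local_dir, filename)
        if os.path.isfile(path):
            s3.upload_file(path, bucket_name, filename)
            print("uploaded", filename)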

Verify Pipeline Processing

  1. Within a few seconds of the upload completing, you should see the contents of your files begin to populate the RAG pipeline.
  2. The backfilling process will show progress as it reads and processes the documents from your S3 bucket.

  3. Total Documents and Total Chunks will increase as the documents are embedded and processed.
  4. You can track the number of documents being embedded and vectors being written.
  5. After a minute or two of processing, you should see the total number of uploaded documents reflected in the pipeline's statistics.
  6. If you used the Friends Scripts documents as recommended, you will see 228 documents displayed in the Total Documents field.

Step 4: Play with Your Data in the RAG Sandbox

Access the RAG Sandbox

  1. From the main pipeline overview, click on the RAG Pipelines menu item to view your active pipelines.

  2. Find your pipeline in the list of pipelines.
  3. Click on the magnifying glass icon under the RAG Sandbox column to open the sandbox for your selected pipeline.

Query Your Data

  1. In the sandbox, you can ask questions about the data you've ingested.
  2. Type a question related to your dataset in the Question field. For example, if you're working with the Friends TV show scripts, try "What characteristics define the relationship between Ross and Monica?"
  3. Click Submit to send the question.

Review Results

  1. After submitting your question, the sandbox will retrieve relevant chunks from your vector database and display them in the Retrieved Context section.
  2. The response from the language model (LLM) will be displayed in the LLM Response section.
    • The Retrieved Context section shows the chunks that were matched with your question.
    • The LLM Response section provides the final output based on the retrieved chunks.

  3. You can continue to ask different questions or refine your queries to explore your dataset further.
  4. The sandbox allows for dynamic interactions with the data stored in your vector database.
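
Beyond the sandbox, you can also query the search index directly once vectors are written. The sketch below is not the sandbox's implementation; it assumes the Couchbase Python SDK 4.2 or later (for vector search support) and the openai package, reuses the placeholder connection details from the earlier connection check, and embeds the question with text-embedding-ada-002 to match the index's 1536-dimension vector field:

    from datetime import timedelta

    from couchbase.auth import PasswordAuthenticator
    from couchbase.cluster import Cluster
    from couchbase.options import ClusterOptions, SearchOptions
    from couchbase.search import SearchRequest
    from couchbase.vector_search import VectorQuery, VectorSearch
    from openai import OpenAI  # pip install openai couchbase

    # Embed the question with the same dimensionality as the index.
    # Assumes OPENAI_API_KEY is set in your environment.
    question = "What characteristics define the relationship between Ross and Monica?"
    embedding = OpenAI().embeddings.create(
        model="text-embedding-ada-002", input=question
    ).data[0].embedding  # 1536 floats

    # Placeholders: your connection string and cluster access credentials
    cluster = Cluster(
        "couchbases://cb.example.cloud.couchbase.com",
        ClusterOptions(PasswordAuthenticator("CLUSTER_ACCESS_NAME", "PASSWORD")),
    )
    cluster.wait_until_ready(timedelta(seconds=10))

    # Nearest-neighbor search against the index's "vector" field
    request = SearchRequest.create(
        VectorSearch.from_vector_query(VectorQuery.create("vector", embedding))
    )
    result = cluster.search(
        "vectorize-quickstart-index", request, SearchOptions(limit=3, fields=["text"])
    )
    for row in result.rows():
        print(row.id, row.score)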

That's it! You're now able to explore your data using the RAG Sandbox.
