Embed

The Embed transform generates embeddings for your Documents or Elements. These embeddings are stored in a special embedding property on each document.

The Embed transform takes a single argument, the embedder, which encapsulates a specific embedding model and its parameters. The currently supported models are listed below. More information can be found in the API documentation.

SentenceTransformers

The SentenceTransformerEmbedder embeds the text representation of each document using any of the models from the popular SentenceTransformers framework. The embeddings are computed locally, and Sycamore will automatically batch records and leverage GPUs where appropriate.

The following example code embeds a DocSet with the all-MiniLM-L6-v2 model:

from sycamore.transforms.embed import SentenceTransformerEmbedder

embedder = SentenceTransformerEmbedder(batch_size=100, model_name="sentence-transformers/all-MiniLM-L6-v2")
embedded_doc_set = docset.embed(embedder)
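
To make the batch_size parameter more concrete, the following is a minimal, library-independent sketch of how records might be grouped into batches before being handed to a model. The helper `batched` and the stand-in `fake_encode` are hypothetical names used only for illustration; they are not part of Sycamore, which handles batching internally.

```python
from typing import Iterator, List, Sequence


def batched(records: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `records`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield list(records[start:start + batch_size])


def fake_encode(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for a model call: one tiny vector per text."""
    return [[float(len(t))] for t in texts]


texts = [f"document {i}" for i in range(250)]
embeddings: List[List[float]] = []
batch_sizes: List[int] = []
for batch in batched(texts, batch_size=100):
    batch_sizes.append(len(batch))         # record how large each batch was
    embeddings.extend(fake_encode(batch))  # one "model call" per batch
```

With 250 records and a batch size of 100, the loop makes three calls, the last one covering the remaining 50 records.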

OpenAI Embeddings

The OpenAIEmbedder embeds the text representation of each document using the text-embedding-ada-002 model. The model_batch_size parameter controls how many documents are sent to the OpenAI endpoint in a single call. For example, the following snippet sends records to OpenAI in batches of 1000, using the default embedding model (text-embedding-ada-002).

from sycamore.transforms.embed import OpenAIEmbedder

embedded_doc_set = docset.embed(OpenAIEmbedder(model_batch_size=1000))

By default, the transform looks for an OpenAI API key in the OPENAI_API_KEY environment variable. You can also pass the key explicitly via the api_key parameter.
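
The key lookup described above can be sketched roughly as follows. This is an illustration only: `resolve_api_key` is a hypothetical helper, not a Sycamore function, and the assumption that an explicitly passed key takes precedence over the environment variable is ours rather than something stated in this guide.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Return the explicit key if given, else fall back to OPENAI_API_KEY."""
    if api_key:
        # Assumption: an explicitly passed key wins over the environment.
        return api_key
    env_key = os.environ.get("OPENAI_API_KEY")
    if not env_key:
        raise ValueError("No API key: pass api_key or set OPENAI_API_KEY")
    return env_key
```

Failing early with a clear error when neither source provides a key is usually easier to debug than an authentication failure in the middle of a long embedding job.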

Amazon Bedrock Embeddings

The BedrockEmbedder calls the Amazon Bedrock service to compute embeddings. Currently, the only supported embedding model in Amazon Bedrock is amazon.titan-embed-text-v1. Sycamore makes its API calls to Bedrock using the boto3 library. Because Sycamore computes embeddings in parallel, you do not pass in a boto3 client directly; instead, you pass in the arguments necessary to construct a boto3 Session object. This is most often used to configure AWS credentials. For example, to use a specific access key id and secret access key, you could call embed as follows:

from sycamore.transforms.embed import BedrockEmbedder

embedded_doc_set = docset.embed(BedrockEmbedder(boto_session_kwargs={
    "aws_access_key_id": "<access_key>",
    "aws_secret_access_key": "<secret access key>",
}))

Sycamore will then construct a boto3 Session object on each executor using the specified credentials. If you do not specify credentials, the standard credential resolution mechanisms are used. More information on AWS credentials can be found here.
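
The pattern of carrying session arguments and building the session where it is needed can be sketched as below. This is a simplified illustration, not Sycamore's implementation: `FakeSession` and `LazySessionHolder` are hypothetical names, and `FakeSession` stands in for a real boto3 Session so that the sketch runs without AWS dependencies. With boto3 installed, the factory would presumably be `boto3.Session(**kwargs)`.

```python
from typing import Any, Dict, Optional


class FakeSession:
    """Hypothetical stand-in for a boto3 Session; records its kwargs."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


class LazySessionHolder:
    """Carries only plain keyword arguments and builds the session on first use."""

    def __init__(self, session_kwargs: Optional[Dict[str, Any]] = None) -> None:
        self.session_kwargs = dict(session_kwargs or {})
        self._session: Optional[FakeSession] = None

    @property
    def session(self) -> FakeSession:
        # Built lazily, so each worker that receives the holder creates its own.
        if self._session is None:
            self._session = FakeSession(**self.session_kwargs)
        return self._session


holder = LazySessionHolder({"region_name": "us-east-1"})
```

Because the holder contains only a dictionary of arguments until `session` is first accessed, it is cheap to hand to parallel workers, each of which then creates its own session locally.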

The Bedrock APIs do not support batching, so an API call will be made for each document in the DocSet.
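
As a rough way to estimate the number of requests involved, the sketch below compares one call per document with batched calls. The function `api_calls` is a hypothetical helper for back-of-the-envelope arithmetic and is not part of Sycamore.

```python
import math


def api_calls(num_documents: int, batch_size: int = 1) -> int:
    """Number of requests needed when each request carries up to batch_size documents."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(num_documents / batch_size)


per_document = api_calls(2500)        # no batching: one call per document
batched_calls = api_calls(2500, 1000)  # batches of 1000 documents
```

For a DocSet of 2500 documents, the unbatched case needs 2500 calls, while batches of 1000 need only 3.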