Embed

class sycamore.transforms.embed.BedrockEmbedder(model_name: str = 'amazon.titan-embed-text-v1', batch_size: int | None = None, pre_process_document: Callable[[Document], str] | None = None, boto_session_args: list[Any] = [], boto_session_kwargs: dict[str, Any] = {})

Bases: Embedder

Embedder implementation using Amazon Bedrock.

Parameters:
  • model_name – The Bedrock embedding model to use. Currently the only available model is amazon.titan-embed-text-v1.

  • batch_size – The Ray batch size.

  • boto_session_args – Positional arguments to pass to the boto3.session.Session constructor. These are used to create a boto3 session on each executor.

  • boto_session_kwargs – Keyword arguments to pass to the boto3.session.Session constructor.
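
The idea of passing constructor arguments rather than a ready-made session, so that each executor can build its own, can be sketched as follows. This is only an illustration under stated assumptions: LazySession and FakeSession are hypothetical names invented here, FakeSession stands in for boto3.session.Session so the sketch runs without AWS credentials, and none of this is claimed to match how BedrockEmbedder is implemented internally.

```python
from typing import Any, Callable, Dict, List, Optional


class FakeSession:
    """Hypothetical stand-in for boto3.session.Session (not the real class)."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        self.args = args
        self.kwargs = kwargs


class LazySession:
    """Hypothetical helper: keeps constructor arguments and builds the session on first use."""

    def __init__(
        self,
        factory: Callable[..., Any],
        args: Optional[List[Any]] = None,
        kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._factory = factory
        self._args = list(args or [])      # copy, so caller-side mutation has no effect
        self._kwargs = dict(kwargs or {})
        self._session: Any = None

    def get(self) -> Any:
        # The session is created only when first requested, e.g. on the worker.
        if self._session is None:
            self._session = self._factory(*self._args, **self._kwargs)
        return self._session


lazy = LazySession(FakeSession, kwargs={"profile_name": "my_profile"})
session = lazy.get()
same_session = lazy.get()  # the second call reuses the session built by the first
```

With boto3 installed, boto3.session.Session could be passed as the factory in place of FakeSession.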

Example

embedder = BedrockEmbedder(boto_session_kwargs={'profile_name': 'my_profile'})
docset_with_embeddings = docset.embed(embedder=embedder)

class sycamore.transforms.embed.BedrockEmbeddingModels(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

class sycamore.transforms.embed.Embed(child: Node, embedder: Embedder, **resource_args)

Bases: Transform

Embed is a transformation that generates embeddings for a docset using an Embedder.

The generated embeddings are stored in a special embedding property on each document.

Parameters:
  • child – The source node or component that provides the dataset to be embedded.

  • embedder – An instance of an Embedder class that defines the embedding method to be applied.

  • resource_args – Additional resource-related arguments that can be passed to the embedding operation.
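
To make "stored in a special embedding property on each document" concrete, here is a toy sketch. ToyDocument, toy_embed_fn, and embed_all are hypothetical stand-ins written for this illustration. They are not Sycamore classes, and the hash-based vector is a placeholder, not a real embedding model.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document; not the Sycamore Document class."""

    text_representation: str
    embedding: Optional[List[float]] = None


def toy_embed_fn(text: str, dim: int = 4) -> List[float]:
    """Deterministic placeholder vector derived from a SHA-256 digest."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def embed_all(
    docs: List[ToyDocument], embed_fn: Callable[[str], List[float]]
) -> List[ToyDocument]:
    """Attach a vector to each document's embedding field, in place."""
    for doc in docs:
        doc.embedding = embed_fn(doc.text_representation)
    return docs


docs = embed_all([ToyDocument("hello"), ToyDocument("world")], toy_embed_fn)
```

After the call, each document carries its own vector, which is the shape of result the transform is described as producing.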

Example

source_node = ...  # Define a source node or component that provides a dataset.
custom_embedder = MyEmbedder(embedding_params)
embed_transform = Embed(child=source_node, embedder=custom_embedder)
embedded_dataset = embed_transform.execute()

class sycamore.transforms.embed.OpenAIEmbedder(model_name: str | OpenAIEmbeddingModels = 'text-embedding-ada-002', batch_size: int | None = None, model_batch_size: int = 100, pre_process_document: Callable[[Document], str] | None = None, api_key: str | None = None, client_wrapper: OpenAIClientWrapper | None = None, params: OpenAIClientWrapper | None = None, **kwargs)

Bases: Embedder

Embedder implementation using the OpenAI embedding API.

Parameters:
  • model_name – The name of the OpenAI embedding model to use.

  • batch_size – The Ray batch size.

  • model_batch_size – The number of documents to send in a single OpenAI request.
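
The effect of model_batch_size can be pictured as splitting the input texts into fixed-size groups, one group per request. The sketch below shows only that grouping step. The helper name batched is hypothetical, no OpenAI call is made, and the actual batching logic inside OpenAIEmbedder may differ.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 250 texts with a batch size of 100 produce three groups.
texts = [f"document {i}" for i in range(250)]
batch_sizes = [len(batch) for batch in batched(texts, 100)]
```

Here batch_sizes comes out as two full groups of 100 followed by a final group of 50.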

class sycamore.transforms.embed.OpenAIEmbeddingModels(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

class sycamore.transforms.embed.SentenceTransformerEmbedder(model_name: str, batch_size: int | None = None, model_batch_size: int = 100, pre_process_document: Callable[[Document], str] | None = None, device: str | None = None)

Bases: Embedder

SentenceTransformerEmbedder is an Embedder class for generating sentence embeddings using the SentenceTransformer model.

Parameters:
  • model_name – The name or path of the SentenceTransformer model to use for embedding.

  • batch_size – The dataset batch size for embedding, if specified. Default is None.

  • model_batch_size – The batch size used by the underlying SentenceTransformer model for embedding.

  • device – The device (e.g., "cpu" or "cuda") on which to perform embedding.
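
One way a caller might choose a value for device is to check whether CUDA is available and fall back to the CPU otherwise. The helper below is a hypothetical convenience, not part of Sycamore. It assumes PyTorch may or may not be installed and makes no claim about what the embedder does when device is left as None.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # optional dependency; absent in CPU-only setups
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
```

The resulting string could then be passed as the device argument when constructing the embedder.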

Example

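import sycamore
from sycamore.transforms.embed import SentenceTransformerEmbedder
from sycamore.transforms.partition import UnstructuredPdfPartitioner

model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformerEmbedder(batch_size=100, model_name=model_name)

# `paths` is assumed to be defined elsewhere: the location(s) of the PDFs to read.
context = sycamore.init()
pdf_docset = (
    context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .explode()
    .embed(embedder=embedder)
)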