Embed#
- class sycamore.transforms.embed.BedrockEmbedder(model_name: str = 'amazon.titan-embed-text-v1', batch_size: int | None = None, pre_process_document: Callable[[Document], str] | None = None, boto_session_args: list[Any] = [], boto_session_kwargs: dict[str, Any] = {})[source]#
Bases:
Embedder
Embedder implementation using Amazon Bedrock.
- Parameters:
model_name -- The Bedrock embedding model to use. Currently the only supported model is amazon.titan-embed-text-v1.
batch_size -- The Ray batch size.
pre_process_document -- Callable that takes a Document and returns the text to embed for it.
boto_session_args -- Positional arguments to pass to the boto3.session.Session constructor. These will be used to create a boto3 session on each executor.
boto_session_kwargs -- Keyword arguments to pass to the boto3.session.Session constructor.
Example
embedder = BedrockEmbedder(boto_session_kwargs={'profile_name': 'my_profile'})
docset_with_embeddings = docset.embed(embedder=embedder)
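The pre_process_document parameter accepted by the Embedder subclasses on this page lets you control what text is actually embedded for each document. A minimal sketch, assuming a Document exposes a text_representation attribute and a properties dict (the function name and the "title" property are illustrative, not part of the sycamore API):

```python
def title_plus_text(doc) -> str:
    """Prepend the document's title property, when present, to its text."""
    title = doc.properties.get("title", "")
    text = doc.text_representation or ""
    return f"{title}\n{text}".strip()

# Pass it to any Embedder subclass, e.g.:
# embedder = BedrockEmbedder(pre_process_document=title_plus_text)
```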
- class sycamore.transforms.embed.BedrockEmbeddingModels(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
- class sycamore.transforms.embed.Embed(child: Node, embedder: Embedder, **resource_args)[source]#
Bases:
MapBatch
Embed is a transformation that generates embeddings for the documents in a docset using an Embedder.
The generated embeddings are stored in a special embedding property on each document.
- Parameters:
child -- The source node or component that provides the dataset to be embedded.
embedder -- An instance of an Embedder class that defines the embedding method to be applied.
resource_args -- Additional resource-related arguments that can be passed to the embedding operation.
Example
source_node = ...  # Define a source node or component that provides a dataset.
custom_embedder = MyEmbedder(embedding_params)
embed_transform = Embed(child=source_node, embedder=custom_embedder)
embedded_dataset = embed_transform.execute()
- class sycamore.transforms.embed.OpenAIEmbedder(model_name: str | OpenAIEmbeddingModels = 'text-embedding-ada-002', batch_size: int | None = None, model_batch_size: int = 100, pre_process_document: Callable[[Document], str] | None = None, api_key: str | None = None, client_wrapper: OpenAIClientWrapper | None = None, params: OpenAIClientWrapper | None = None, **kwargs)[source]#
Bases:
Embedder
Embedder implementation using the OpenAI embedding API.
- Parameters:
model_name -- The name of the OpenAI embedding model to use.
batch_size -- The Ray batch size.
model_batch_size -- The number of documents to send in a single OpenAI request.
pre_process_document -- Callable that takes a Document and returns the text to embed for it.
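To illustrate the difference between the two batch parameters: batch_size controls how many documents Ray hands to the embedder at a time, while model_batch_size caps how many texts go into a single OpenAI request. A rough sketch of that inner chunking (illustrative only; the real batching happens inside OpenAIEmbedder):

```python
def chunks(texts: list[str], model_batch_size: int = 100) -> list[list[str]]:
    """Split texts into consecutive groups of at most model_batch_size,
    one group per embedding request."""
    return [texts[i:i + model_batch_size]
            for i in range(0, len(texts), model_batch_size)]
```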
- class sycamore.transforms.embed.OpenAIEmbeddingModels(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
- class sycamore.transforms.embed.SentenceTransformerEmbedder(model_name: str, batch_size: int | None = None, model_batch_size: int = 100, pre_process_document: Callable[[Document], str] | None = None, device: str | None = None)[source]#
Bases:
Embedder
SentenceTransformerEmbedder is an Embedder class for generating sentence embeddings using the SentenceTransformer model.
- Parameters:
model_name -- The name or path of the SentenceTransformer model to use for embedding.
batch_size -- The dataset batch size for embedding, if specified. Default is None.
model_batch_size -- The batch size used by the underlying SentenceTransformer model for embedding.
pre_process_document -- Callable that takes a Document and returns the text to embed for it.
device -- The device (e.g., "cpu" or "cuda") on which to perform embedding.
Example
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformerEmbedder(batch_size=100, model_name=model_name)
context = sycamore.init()
pdf_docset = (
    context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .explode()
    .embed(embedder=embedder)
)