DocSet#
- class sycamore.docset.DocSet(context: Context, plan: Node)[source]#
A DocSet, short for “document set,” is a distributed collection of documents bundled together for processing. Sycamore provides a variety of transformations on DocSets to help customers handle unstructured data easily.
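For example, a typical pipeline initializes a context, reads a collection of binary files into a DocSet, and chains transformations. A minimal sketch, assuming paths points at a directory of PDF files:

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .explode())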
- augment_text(augmentor: TextAugmentor, **resource_args) DocSet [source]#
Augments text_representation with external information.
- Parameters:
augmentor (TextAugmentor) – A TextAugmentor instance that defines how to augment the text
Example
augmentor = FStringTextAugmentor(sentences=[
    "This pertains to the part {doc.properties['part_name']}.",
    "{doc.text_representation}"
])
entity_extractor = OpenAIEntityExtractor("part_name",
                                         llm=openai_llm,
                                         prompt_template=part_name_template)
context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .extract_entity(entity_extractor)
    .explode()
    .augment_text(augmentor))
- count() int [source]#
Counts the number of documents in the resulting dataset. It is a convenient way to determine the size of the dataset generated by the plan.
- Returns:
The number of documents in the docset.
Example
context = sycamore.init()
num_docs = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .count())
- embed(embedder: Embedder, **kwargs) DocSet [source]#
Applies the Embed transform on the Docset.
- Parameters:
embedder – An instance of an Embedder class that defines the embedding method to be applied.
Example
model_name="sentence-transformers/all-MiniLM-L6-v2" embedder = SentenceTransformerEmbedder(batch_size=100, model_name=model_name) context = sycamore.init() pdf_docset = context.read.binary(paths, binary_format="pdf") .partition(partitioner=SycamorePartitioner()) .explode() .embed(embedder=embedder)
- explode(**resource_args) DocSet [source]#
Applies the Explode transform on the Docset.
Example
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .explode())
- extract_batch_schema(schema_extractor: SchemaExtractor, **kwargs) DocSet [source]#
Extracts a common schema from the documents in this DocSet.
This transform is similar to extract_schema, except that it will add the same schema to each document in the DocSet rather than inferring a separate schema per Document. This is most suitable for document collections that share a common format. If you have a heterogeneous document collection and want a different schema for each type, consider using extract_schema instead.
- Parameters:
schema_extractor – A SchemaExtractor instance to extract the schema for each document.
Example
openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
schema_extractor = OpenAISchemaExtractor("Corporation", llm=openai_llm, num_of_elements=35)
context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .extract_batch_schema(schema_extractor=schema_extractor))
- extract_entity(entity_extractor: EntityExtractor, **kwargs) DocSet [source]#
Applies the ExtractEntity transform on the Docset.
- Parameters:
entity_extractor – An instance of an EntityExtractor class that defines the entity extraction method to be applied.
Example
title_context_template = "template" openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value) entity_extractor = OpenAIEntityExtractor("title", llm=openai_llm, prompt_template=title_context_template) context = sycamore.init() pdf_docset = context.read.binary(paths, binary_format="pdf") .partition(partitioner=SycamorePartitioner()) .extract_entity(entity_extractor=entity_extractor)
- extract_properties(property_extractor: PropertyExtractor, **kwargs) DocSet [source]#
Extracts properties from each Document in this DocSet based on the _schema property.
The schema can be computed using extract_schema or extract_batch_schema or can be provided manually in JSON-schema format in the _schema field under Document.properties.
Example
openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
property_extractor = OpenAIPropertyExtractor(llm=openai_llm)
context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .extract_properties(property_extractor))
- extract_schema(schema_extractor: SchemaExtractor, **kwargs) DocSet [source]#
Extracts a JSON schema of extractable properties from each document in this DocSet.
Each schema is a mapping of names to types that corresponds to fields that are present in the document. For example, calling this method on a financial document containing information about companies might yield a schema like
{ "company_name": "string", "revenue": "number", "CEO": "string" }
This method will extract a unique schema for each document in the DocSet independently. If the documents in the DocSet represent instances with a common schema, consider ExtractBatchSchema which will extract a common schema for all documents.
The dataset is returned with an additional _schema property that contains the JSON-encoded schema, if one is detected.
- Parameters:
schema_extractor – A SchemaExtractor instance to extract the schema for each document.
Example
openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
schema_extractor = OpenAISchemaExtractor("Corporation", llm=openai_llm, num_of_elements=35)
context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .extract_schema(schema_extractor=schema_extractor))
- filter(f: Callable[[Document], bool], **resource_args) DocSet [source]#
Applies the Filter transform on the Docset.
- Parameters:
f – A callable function that takes a Document object and returns a boolean indicating whether the document should be included in the filtered Docset.
Example
def custom_filter(doc: Document) -> bool:
    # Define your custom filtering logic here.
    return doc.some_property == some_value

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .filter(custom_filter))
- flat_map(f: Callable[[Document], list[Document]], **resource_args) DocSet [source]#
Applies the FlatMap transformation on the Docset.
- Parameters:
f – The function to apply to each document.
Example
def custom_flat_mapping_function(document: Document) -> list[Document]:
    # Custom logic to transform the document and return a list of documents
    return [transformed_document_1, transformed_document_2]

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .flat_map(custom_flat_mapping_function))
- limit(limit: int = 20) DocSet [source]#
Applies the Limit transform on the Docset.
- Parameters:
limit – The maximum number of documents to include in the resulting Docset.
Example
context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .explode()
    .limit())
- map(f: Callable[[Document], Document], **resource_args) DocSet [source]#
Applies the Map transformation on the Docset.
- Parameters:
f – The function to apply to each document.
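Example
A minimal sketch; custom_map_function is a hypothetical user-defined function, and paths and context are assumed as in the examples above:

def custom_map_function(doc: Document) -> Document:
    # Custom logic to modify each document
    doc.properties["processed"] = True
    return doc

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .map(custom_map_function))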
- map_batch(f: Callable[[list[Document]], list[Document]], f_args: Iterable[Any] | None = None, f_kwargs: dict[str, Any] | None = None, f_constructor_args: Iterable[Any] | None = None, f_constructor_kwargs: dict[str, Any] | None = None, **resource_args) DocSet [source]#
The map_batch transform is similar to map, except that it processes a list of documents and returns a list of documents. map_batch is ideal for transformations that get performance benefits from batching.
Example
def custom_map_batch_function(documents: list[Document]) -> list[Document]:
    # Custom logic to transform the documents
    return transformed_documents

map_ds = input_ds.map_batch(f=custom_map_batch_function)

class CustomMappingClass:
    def __init__(self, arg1, arg2, *, kw_arg1=None, kw_arg2=None):
        self.arg1 = arg1
        # ...

    def _process(self, doc: Document) -> Document:
        doc.properties["arg1"] = self.arg1
        return doc

    def __call__(self, docs: list[Document], fnarg1, *, fnkwarg1=None) -> list[Document]:
        return [self._process(d) for d in docs]

map_ds = input_ds.map_batch(f=CustomMappingClass,
                            f_args=["fnarg1"],
                            f_kwargs={"fnkwarg1": "stuff"},
                            f_constructor_args=["arg1", "arg2"],
                            f_constructor_kwargs={"kw_arg1": 1, "kw_arg2": 2})
- map_elements(f: Callable[[Element], Element], **resource_args) DocSet [source]#
Applies the given mapping function to each element in each Document in this DocSet.
- Parameters:
f – A Callable that takes an Element and returns an Element. Elements for which f evaluates to None are dropped.
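Example
A minimal sketch; strip_element_whitespace is a hypothetical element-level function:

def strip_element_whitespace(element: Element) -> Element:
    # Trim leading/trailing whitespace from the element's text
    if element.text_representation is not None:
        element.text_representation = element.text_representation.strip()
    return element

pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .map_elements(strip_element_whitespace))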
- mark_bbox_preset(tokenizer: Tokenizer, token_limit: int = 512, **kwargs) DocSet [source]#
- Convenience composition of:
- SortByPageBbox
- MarkDropTiny minimum=2
- MarkDropHeaderFooter top=0.05 bottom=0.05
- MarkBreakPage
- MarkBreakByColumn
- MarkBreakByTokens limit=512
Meant to work in concert with MarkedMerger.
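Example
A sketch of the intended pairing with MarkedMerger; the tokenizer class and import paths are assumptions and may differ by Sycamore version:

from sycamore.functions.tokenizer import HuggingFaceTokenizer
from sycamore.transforms.merge_elements import MarkedMerger

tokenizer = HuggingFaceTokenizer("sentence-transformers/all-MiniLM-L6-v2")
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .mark_bbox_preset(tokenizer=tokenizer, token_limit=512)
    .merge(merger=MarkedMerger()))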
- merge(merger: ElementMerger, **kwargs) DocSet [source]#
Applies a merge operation to the list of elements in each Document of the DocSet.
Example
from transformers import AutoTokenizer
tk = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
merger = GreedyElementMerger(tk, 512)
context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .merge(merger=merger))
- partition(partitioner: Partitioner, table_extractor: TableExtractor | None = None, **kwargs) DocSet [source]#
Applies the Partition transform on the Docset.
More information can be found in the Partition documentation.
Example
context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner()))
- query(query_executor: QueryExecutor, **resource_args) DocSet [source]#
Applies a query execution transform on a DocSet of queries.
- Parameters:
query_executor – Implementation for the query execution.
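Example
A sketch, assuming a concrete QueryExecutor implementation (the OpenSearchQueryExecutor name and its constructor arguments are assumptions) and a DocSet whose Documents represent queries:

query_executor = OpenSearchQueryExecutor(os_client_args)
result_docset = query_docset.query(query_executor=query_executor)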
- random_sample(fraction: float, seed: int | None = None) DocSet [source]#
Retain a random sample of documents from this DocSet.
The number of documents in the output will be approximately fraction * self.count()
- Parameters:
fraction – The fraction of documents to retain.
seed – Optional seed to use for the RNG.
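Example
For instance, to retain roughly 10% of the documents with a fixed seed for reproducibility:

sampled_docset = pdf_docset.random_sample(fraction=0.1, seed=42)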
- regex_replace(spec: list[tuple[str, str]], **kwargs) DocSet [source]#
Performs regular expression replacement (using re.sub()) on the text_representation of every Element in each Document.
Example
from sycamore.transforms import COALESCE_WHITESPACE
ds = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .regex_replace(COALESCE_WHITESPACE)
    .regex_replace([(r"\d+", "1313"), (r"old", "new")])
    .explode())
- show(limit: int = 20, show_elements: bool = True, num_elements: int = -1, show_binary: bool = False, show_embedding: bool = False, truncate_content: bool = True, truncate_length: int = 100, stream=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>) None [source]#
Prints the content of the docset in a human-readable format. It is useful for debugging and inspecting the contents of objects during development.
- Parameters:
limit – The maximum number of items to display.
show_elements – Whether to display individual elements or not.
num_elements – The number of elements to display. Use -1 to show all elements.
show_binary – Whether to display binary data or not.
show_embedding – Whether to display embedding information or not.
truncate_content – Whether to truncate long content when displaying.
truncate_length – The maximum length of content to display when truncating.
stream – The output stream where the information will be displayed.
Example
context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner()))
pdf_docset.show()
- sketch(window: int = 17, number: int = 16, **kwargs) DocSet [source]#
For each Document, uses shingling to hash sliding windows of the text_representation. The set of shingles is called the sketch. Documents’ sketches can be compared to determine if they have near-duplicate content.
- Parameters:
window – Number of bytes in the sliding window that is hashed (default: 17)
number – Count of hashes comprising a shingle (default: 16)
Example
ds = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .explode()
    .sketch(window=17))
- split_elements(tokenizer: Tokenizer, max_tokens: int = 512, **kwargs) DocSet [source]#
Splits elements if they are larger than the maximum number of tokens.
Example
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .split_elements(tokenizer=tokenizer, max_tokens=512)
    .explode())
- spread_properties(props: list[str], **resource_args) DocSet [source]#
Copies listed properties from parent document to child elements.
Example
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .spread_properties(["title"])
    .explode())
- summarize(summarizer: Summarizer, **kwargs) DocSet [source]#
Applies the Summarize transform on the Docset.
Example
llm_model = OpenAILanguageModel("gpt-3.5-turbo")
element_operator = my_element_selector  # A custom element selection function
summarizer = LLMElementTextSummarizer(llm_model, element_operator)
context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .summarize(summarizer=summarizer))
- take(limit: int = 20) list[Document] [source]#
Returns up to limit documents from the dataset.
- Parameters:
limit – The maximum number of Documents to return.
- Returns:
A list of up to limit Documents from the DocSet.
Example
context = sycamore.init()
docs = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .take())
- take_all(limit: int | None = None) list[Document] [source]#
Returns all of the Documents in this DocSet.
If limit is set, this method will raise an error if this Docset has more than limit Documents.
- Parameters:
limit – The number of Documents above which this method will raise an error.
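Example
For instance, to materialize a DocSet that is expected to be small, raising an error if it contains more than 1000 Documents:

docs = pdf_docset.take_all(limit=1000)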
- transform(cls: Type[Transform], **kwargs) DocSet [source]#
Adds the specified transform class to the pipeline. See the API reference section on transforms.
- Parameters:
cls – Class of transform to instantiate into pipeline
... – Other keyword arguments are passed to class constructor
Example
from sycamore.transforms import FooBar
ds = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner())
    .transform(cls=FooBar, arg=123))
- with_properties(property_map: Mapping[str, Callable[[Document], Any]], **resource_args) DocSet [source]#
Adds multiple properties to each Document.
- Parameters:
property_map – A mapping of property names to functions to generate those properties
Example
docset.with_properties({
    "text_size": lambda doc: len(doc.text_representation),
    "truncated_text": lambda doc: doc.text_representation[0:256]
})
- with_property(name, f: Callable[[Document], Any], **resource_args) DocSet [source]#
Applies a function to each document and adds the result as a property.
- Parameters:
name – The name of the property to add to each Document.
f – The function to apply to each Document.
Example
To add a property that contains the length of the text representation:

docset.with_property("text_size", lambda doc: len(doc.text_representation))
- property write: DocSetWriter#
Exposes an interface for writing a DocSet to OpenSearch or other external storage. See DocSetWriter for more information about writers and their arguments.
Example
The following example shows reading a DocSet from a collection of PDFs, partitioning it using the SycamorePartitioner, and then writing it to a new OpenSearch index.

os_client_args = {
    "hosts": [{"host": "localhost", "port": 9200}],
    "http_auth": ("user", "password"),
}

index_settings = {
    "body": {
        "settings": {
            "index.knn": True,
        },
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {"name": "hnsw", "engine": "faiss"},
                },
            },
        },
    },
}

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=SycamorePartitioner()))

pdf_docset.write.opensearch(
    os_client_args=os_client_args,
    index_name="my_index",
    index_settings=index_settings)