DocSet#

class sycamore.docset.DocSet(context: Context, plan: Node)[source]#

A DocSet, short for “document set,” is a distributed collection of documents bundled together for processing. Sycamore provides a variety of transformations on DocSets to help customers handle unstructured data easily.
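
Example

DocSets are typically created by reading raw data through the context returned from sycamore.init(), as in the method examples below (paths is assumed to be a list of input file locations):

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")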

augment_text(augmentor: TextAugmentor, **resource_args) DocSet[source]#

Augments text_representation with external information.

Parameters:

augmentor (TextAugmentor) – A TextAugmentor instance that defines how to augment the text.

Example

augmentor = FStringTextAugmentor(sentences=[
    "This pertains to the part {doc.properties['part_name']}.",
    "{doc.text_representation}"
])
entity_extractor = OpenAIEntityExtractor("part_name",
                                         llm=openai_llm,
                                         prompt_template=part_name_template)
context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .extract_entity(entity_extractor)
    .explode()
    .augment_text(augmentor))
count() int[source]#

Counts the number of documents in the resulting dataset. It is a convenient way to determine the size of the dataset generated by the plan.

Returns:

The number of documents in the docset.

Example

context = sycamore.init()
num_docs = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .count())
embed(embedder: Embedder, **kwargs) DocSet[source]#

Applies the Embed transform on the Docset.

Parameters:

embedder – An instance of an Embedder class that defines the embedding method to be applied.

Example

model_name="sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformerEmbedder(batch_size=100, model_name=model_name)

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .explode()
    .embed(embedder=embedder)
explode(**resource_args) DocSet[source]#

Applies the Explode transform on the Docset.

Example

pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .explode())
extract_batch_schema(schema_extractor: SchemaExtractor, **kwargs) DocSet[source]#

Extracts a common schema from the documents in this DocSet.

This transform is similar to extract_schema, except that it will add the same schema to each document in the DocSet rather than inferring a separate schema per Document. This is most suitable for document collections that share a common format. If you have a heterogeneous document collection and want a different schema for each type, consider using extract_schema instead.

Parameters:

schema_extractor – A SchemaExtractor instance to extract the schema for each document.

Example

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
schema_extractor = OpenAISchemaExtractor("Corporation", llm=openai_llm, num_of_elements=35)

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .extract_batch_schema(schema_extractor=schema_extractor))
extract_entity(entity_extractor: EntityExtractor, **kwargs) DocSet[source]#

Applies the ExtractEntity transform on the Docset.

Parameters:

entity_extractor – An instance of an EntityExtractor class that defines the entity extraction method to be applied.

Example

title_context_template = "template"

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
entity_extractor = OpenAIEntityExtractor("title",
                       llm=openai_llm,
                       prompt_template=title_context_template)

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .extract_entity(entity_extractor=entity_extractor))
extract_properties(property_extractor: PropertyExtractor, **kwargs) DocSet[source]#

Extracts properties from each Document in this DocSet based on the _schema property.

The schema can be computed using extract_schema or extract_batch_schema or can be provided manually in JSON-schema format in the _schema field under Document.properties.

Example

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
property_extractor = OpenAIPropertyExtractor(llm=openai_llm)

context = sycamore.init()

pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .extract_properties(property_extractor))
extract_schema(schema_extractor: SchemaExtractor, **kwargs) DocSet[source]#

Extracts a JSON schema of extractable properties from each document in this DocSet.

Each schema is a mapping of names to types that corresponds to fields that are present in the document. For example, calling this method on a financial document containing information about companies might yield a schema like

{
  "company_name": "string",
  "revenue": "number",
  "CEO": "string"
}

This method will extract a unique schema for each document in the DocSet independently. If the documents in the DocSet represent instances with a common schema, consider ExtractBatchSchema which will extract a common schema for all documents.

The dataset is returned with an additional _schema property that contains the JSON-encoded schema, if any is detected.

Parameters:

schema_extractor – A SchemaExtractor instance to extract the schema for each document.

Example

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
schema_extractor = OpenAISchemaExtractor("Corporation", llm=openai_llm, num_of_elements=35)

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .extract_schema(schema_extractor=schema_extractor))
filter(f: Callable[[Document], bool], **resource_args) DocSet[source]#

Applies the Filter transform on the Docset.

Parameters:

f – A callable function that takes a Document object and returns a boolean indicating whether the document should be included in the filtered Docset.

Example

def custom_filter(doc: Document) -> bool:
    # Define your custom filtering logic here.
    return doc.some_property == some_value

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .filter(custom_filter))
flat_map(f: Callable[[Document], list[Document]], **resource_args) DocSet[source]#

Applies the FlatMap transformation on the Docset.

Parameters:

f – The function to apply to each document.

Example

def custom_flat_mapping_function(document: Document) -> list[Document]:
    # Custom logic to transform the document and return a list of documents.
    return [transformed_document_1, transformed_document_2]

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .flat_map(custom_flat_mapping_function))
limit(limit: int = 20) DocSet[source]#

Applies the Limit transform on the Docset.

Parameters:

limit – The maximum number of documents to include in the resulting Docset.

Example

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .explode()
    .limit())
map(f: Callable[[Document], Document], **resource_args) DocSet[source]#

Applies the Map transformation on the Docset.

Parameters:

f – The function to apply to each document.
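
Example

A minimal sketch; add_filetype below is a hypothetical function that tags each Document with a static property:

def add_filetype(doc: Document) -> Document:
    # Attach a property to every document in the DocSet.
    doc.properties["filetype"] = "pdf"
    return doc

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .map(add_filetype))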

map_batch(f: Callable[[list[Document]], list[Document]], f_args: Iterable[Any] | None = None, f_kwargs: dict[str, Any] | None = None, f_constructor_args: Iterable[Any] | None = None, f_constructor_kwargs: dict[str, Any] | None = None, **resource_args) DocSet[source]#

The map_batch transform is similar to map, except that it processes a list of documents and returns a list of documents. map_batch is ideal for transformations that get performance benefits from batching.

Example

def custom_map_batch_function(documents: list[Document]) -> list[Document]:
    # Custom logic to transform the documents
    return transformed_documents

map_ds = input_ds.map_batch(f=custom_map_batch_function)

class CustomMappingClass:
    def __init__(self, arg1, arg2, *, kw_arg1=None, kw_arg2=None):
        self.arg1 = arg1
        # ...

    def _process(self, doc: Document) -> Document:
        doc.properties["arg1"] = self.arg1
        return doc

    def __call__(self, docs: list[Document], fnarg1, *, fnkwarg1=None) -> list[Document]:
        return [self._process(d) for d in docs]

map_ds = input_ds.map_batch(f=CustomMappingClass,
                            f_args=["fnarg1"], f_kwargs={"fnkwarg1": "stuff"},
                            f_constructor_args=["arg1", "arg2"],
                            f_constructor_kwargs={"kw_arg1": 1, "kw_arg2": 2})
map_elements(f: Callable[[Element], Element], **resource_args) DocSet[source]#

Applies the given mapping function to each element in each Document in this DocSet.

Parameters:

f – A Callable that takes an Element and returns an Element. Elements for which f evaluates to None are dropped.
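
Example

A minimal sketch, assuming a hypothetical strip_whitespace function:

def strip_whitespace(element: Element) -> Element:
    # Trim leading and trailing whitespace from each element's text.
    if element.text_representation is not None:
        element.text_representation = element.text_representation.strip()
    return element

pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .map_elements(strip_whitespace))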

mark_bbox_preset(tokenizer: Tokenizer, token_limit: int = 512, **kwargs) DocSet[source]#

Convenience composition of:
  • SortByPageBbox

  • MarkDropTiny minimum=2

  • MarkDropHeaderFooter top=0.05 bottom=0.05

  • MarkBreakPage

  • MarkBreakByColumn

  • MarkBreakByTokens limit=512

Meant to work in concert with MarkedMerger.
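
Example

A minimal sketch pairing the preset with MarkedMerger; the tokenizer setup mirrors the merge example below and is an assumption:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .mark_bbox_preset(tokenizer=tokenizer, token_limit=512)
    .merge(merger=MarkedMerger()))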

merge(merger: ElementMerger, **kwargs) DocSet[source]#

Applies a merge operation to the list of elements in each Document of the DocSet.

Example

from transformers import AutoTokenizer
tk = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
merger = GreedyElementMerger(tk, 512)

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .merge(merger=merger))
partition(partitioner: Partitioner, table_extractor: TableExtractor | None = None, **kwargs) DocSet[source]#

Applies the Partition transform on the Docset.

Example

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner()))
query(query_executor: QueryExecutor, **resource_args) DocSet[source]#

Applies a query execution transform on a DocSet of queries.

Parameters:

query_executor – Implementation for the query execution.
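
Example

A minimal sketch, assuming a hypothetical MyQueryExecutor class that implements the QueryExecutor interface (both names below are illustrative):

query_executor = MyQueryExecutor()  # hypothetical QueryExecutor implementation
result_docset = query_docset.query(query_executor=query_executor)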

random_sample(fraction: float, seed: int | None = None) DocSet[source]#

Retain a random sample of documents from this DocSet.

The number of documents in the output will be approximately fraction * self.count().

Parameters:
  • fraction – The fraction of documents to retain.

  • seed – Optional seed to use for the RNG.
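
Example

A minimal sketch that retains roughly 10% of the documents, with a fixed seed for reproducibility:

pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .random_sample(fraction=0.1, seed=42))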

regex_replace(spec: list[tuple[str, str]], **kwargs) DocSet[source]#

Performs regular expression replacement (using re.sub()) on the text_representation of every Element in each Document.

Example

from sycamore.transforms import COALESCE_WHITESPACE
ds = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .regex_replace(COALESCE_WHITESPACE)
    .regex_replace([(r"\d+", "1313"), (r"old", "new")])
    .explode())
show(limit: int = 20, show_elements: bool = True, num_elements: int = -1, show_binary: bool = False, show_embedding: bool = False, truncate_content: bool = True, truncate_length: int = 100, stream=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>) None[source]#

Prints the content of the docset in a human-readable format. It is useful for debugging and inspecting the contents of objects during development.

Parameters:
  • limit – The maximum number of items to display.

  • show_elements – Whether to display individual elements or not.

  • num_elements – The number of elements to display. Use -1 to show all elements.

  • show_binary – Whether to display binary data or not.

  • show_embedding – Whether to display embedding information or not.

  • truncate_content – Whether to truncate long content when displaying.

  • truncate_length – The maximum length of content to display when truncating.

  • stream – The output stream where the information will be displayed.

Example

context = sycamore.init()
(context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .show())
sketch(window: int = 17, number: int = 16, **kwargs) DocSet[source]#

For each Document, uses shingling to hash sliding windows of the text_representation. The set of shingles is called the sketch. Documents’ sketches can be compared to determine if they have near-duplicate content.

Parameters:
  • window – Number of bytes in the sliding window that is hashed (default: 17)

  • number – Count of hashes comprising a shingle (default: 16)

Example

ds = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .explode()
    .sketch(window=17))
split_elements(tokenizer: Tokenizer, max_tokens: int = 512, **kwargs) DocSet[source]#

Splits elements if they are larger than the maximum number of tokens.

Example

pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .split_elements(tokenizer=tokenizer, max_tokens=512)
    .explode())
spread_properties(props: list[str], **resource_args) DocSet[source]#

Copies listed properties from parent document to child elements.

Example

pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .spread_properties(["title"])
    .explode())
summarize(summarizer: Summarizer, **kwargs) DocSet[source]#

Applies the Summarize transform on the Docset.

Example

llm_model = OpenAILanguageModel("gpt-3.5-turbo")
element_operator = my_element_selector  # A custom element selection function
summarizer = LLMElementTextSummarizer(llm_model, element_operator)

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .summarize(summarizer=summarizer))
take(limit: int = 20) list[Document][source]#

Returns up to limit documents from the dataset.

Parameters:

limit – The maximum number of Documents to return.

Returns:

A list of up to limit Documents from the Docset.

Example

context = sycamore.init()
docs = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .take())
take_all(limit: int | None = None) list[Document][source]#

Returns all of the Documents in this DocSet.

If limit is set, this method will raise an error if this Docset has more than limit Documents.

Parameters:

limit – The number of Documents above which this method will raise an error.
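
Example

A minimal sketch; setting limit guards against accidentally materializing a very large DocSet in local memory:

docs = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .take_all(limit=1000))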

transform(cls: Type[Transform], **kwargs) DocSet[source]#

Adds the specified transform class to the pipeline. See the API reference section on transforms.

Parameters:
  • cls – Class of transform to instantiate into pipeline

  • ... – Other keyword arguments are passed to the class constructor

Example

from sycamore.transforms import FooBar
ds = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .transform(cls=FooBar, arg=123)
with_properties(property_map: Mapping[str, Callable[[Document], Any]], **resource_args) DocSet[source]#

Adds multiple properties to each Document.

Parameters:

property_map – A mapping of property names to functions to generate those properties

Example

docset.with_properties({
    "text_size": lambda doc: len(doc.text_representation),
    "truncated_text": lambda doc: doc.text_representation[0:256]
})
with_property(name, f: Callable[[Document], Any], **resource_args) DocSet[source]#

Applies a function to each document and adds the result as a property.

Parameters:
  • name – The name of the property to add to each Document.

  • f – The function to apply to each Document.

Example

To add a property that contains the length of the text representation:

docset.with_property("text_size", lambda doc: len(doc.text_representation))

property write: DocSetWriter#

Exposes an interface for writing a DocSet to OpenSearch or other external storage. See DocSetWriter for more information about writers and their arguments.

Example

The following example shows reading a DocSet from a collection of PDFs, partitioning it using the UnstructuredPdfPartitioner, and then writing it to a new OpenSearch index.

os_client_args = {
    "hosts": [{"host": "localhost", "port": 9200}],
    "http_auth": ("user", "password"),
}

index_settings = {
    "body": {
        "settings": {
            "index.knn": True,
        },
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {"name": "hnsw", "engine": "faiss"},
                },
            },
        },
    },
}

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner()))

pdf_docset.write.opensearch(
     os_client_args=os_client_args,
     index_name="my_index",
     index_settings=index_settings)