DocSet#

class sycamore.docset.DocSet(context: Context, plan: Node)[source]#

A DocSet, short for “Document Set”, is a distributed collection of documents bundled together for processing. Sycamore provides a variety of transformations on DocSets to help customers handle unstructured data easily.

augment_text(augmentor: TextAugmentor, **resource_args) DocSet[source]#

Augments text_representation with external information.

Parameters:

augmentor (TextAugmentor) -- A TextAugmentor instance that defines how to augment the text

Example

augmentor = FStringTextAugmentor(sentences = [
    "This pertains to the part {doc.properties['part_name']}.",
    "{doc.text_representation}"
])
entity_extractor = OpenAIEntityExtractor("part_name",
                            llm=openai_llm,
                            prompt_template=part_name_template)
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_entity(entity_extractor)
    .explode()
    .augment_text(augmentor)
clear_materialize(path: Path | str | None = None, *, clear_non_local=False) None[source]#

Deletes all of the materialized files referenced by the docset.

path will use PurePath.match to check if the specified path matches against the directory used for each materialize transform. Only matching directories will be cleared.

Set clear_non_local=True to clear non-local filesystems. Note that filesystems like NFS/CIFS count as local; pyarrow.fs.SubTreeFileSystem is treated as non-local.
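
Example

A minimal sketch, assuming paths points to PDF files; the materialize directory under /tmp is hypothetical:

context = sycamore.init()
docset = context.read.binary(paths, binary_format="pdf").partition(partitioner=ArynPartitioner())
docset = docset.materialize(path="/tmp/sycamore/partitioned")
docset.execute()

# Later, remove the cached output so the next run recomputes it.
# With no path argument, all materialize directories in the plan are cleared.
docset.clear_materialize()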

count(include_metadata=False, **kwargs) int[source]#

Counts the number of documents in the resulting dataset. It is a convenient way to determine the size of the dataset generated by the plan.

Parameters:
  • include_metadata -- Determines whether or not to count MetaDataDocuments

  • **kwargs --

Returns:

The number of documents in the docset.

Example

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .count()
count_distinct(field: str, **kwargs) int[source]#

Counts the number of documents in the resulting dataset with a unique value for field.

Parameters:
  • field -- Field (in dotted notation) to perform a unique count based on.

  • **kwargs --

Returns:

The number of documents with a unique value for field.

Example

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .count("doc_id")
embed(embedder: Embedder, **kwargs) DocSet[source]#

Applies the Embed transform on the Docset.

Parameters:

embedder -- An instance of an Embedder class that defines the embedding method to be applied.

Example

model_name="sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformerEmbedder(batch_size=100, model_name=model_name)

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .explode()
    .embed(embedder=embedder)
execute(**kwargs) None[source]#

Execute the pipeline and discard the results. Useful for side effects.
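
Example

A minimal sketch, assuming paths points to PDF files; the output path is hypothetical:

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf").partition(partitioner=ArynPartitioner())
pdf_docset.materialize(path="/tmp/sycamore/partitioned").execute()  # run the pipeline for its side effects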

explode(**resource_args) DocSet[source]#

Applies the Explode transform on the Docset.

Example

pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .explode()
extract_batch_schema(schema_extractor: SchemaExtractor, **kwargs) DocSet[source]#

Extracts a common schema from the documents in this DocSet.

This transform is similar to extract_schema, except that it will add the same schema to each document in the DocSet rather than inferring a separate schema per Document. It is most suitable for document collections that share a common format. If you have a heterogeneous document collection and want a different schema for each type, consider using extract_schema instead.

Parameters:

schema_extractor -- A SchemaExtractor instance to extract the schema for each document.

Example

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
schema_extractor = OpenAISchemaExtractor("Corporation", llm=openai_llm, num_of_elements=35)

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_batch_schema(schema_extractor=schema_extractor)
extract_document_structure(structure: DocumentStructure, **kwargs)[source]#

Represents documents as Hierarchical documents organized by their structure.

Parameters:

structure -- An instance of DocumentStructure which determines how documents are organized

Example

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_document_structure(structure=StructureBySection)
    .explode()
extract_entity(entity_extractor: EntityExtractor, **kwargs) DocSet[source]#

Applies the ExtractEntity transform on the Docset.

Parameters:

entity_extractor -- An instance of an EntityExtractor class that defines the entity extraction method to be applied.

Example

title_context_template = "template"

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
entity_extractor = OpenAIEntityExtractor("title",
                       llm=openai_llm,
                       prompt_template=title_context_template)

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_entity(entity_extractor=entity_extractor)
extract_graph_entities(extractors: list[GraphEntityExtractor] = [], **kwargs) DocSet[source]#

Extracts entities from document children. Entities are stored as nodes within each child of a document.

Parameters:

extractors -- A list of GraphEntityExtractor objects which determines how entities are extracted

extract_graph_relationships(extractors: list[GraphRelationshipExtractor] = [], **kwargs) DocSet[source]#

Extracts relationships from document children. Relationships are stored within the nodes they reference within each child of a document.

Parameters:

extractors -- A list of GraphRelationshipExtractor objects which determines how relationships are extracted

extract_properties(property_extractor: PropertyExtractor, **kwargs) DocSet[source]#

Extracts properties from each Document in this DocSet based on the _schema property.

The schema can be computed using extract_schema or extract_batch_schema or can be provided manually in JSON-schema format in the _schema field under Document.properties.

Example

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
property_extractor = OpenAIPropertyExtractor(llm=openai_llm)

context = sycamore.init()

pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_properties(property_extractor)
extract_schema(schema_extractor: SchemaExtractor, **kwargs) DocSet[source]#

Extracts a JSON schema of extractable properties from each document in this DocSet.

Each schema is a mapping of names to types that corresponds to fields that are present in the document. For example, calling this method on a financial document containing information about companies might yield a schema like

{
  "company_name": "string",
  "revenue": "number",
  "CEO": "string"
}

This method will extract a unique schema for each document in the DocSet independently. If the documents in the DocSet represent instances with a common schema, consider extract_batch_schema, which will extract a single schema for all documents.

The DocSet is returned with an additional _schema property that contains the JSON-encoded schema, if any is detected.

Parameters:

schema_extractor -- A SchemaExtractor instance to extract the schema for each document.

Example

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
schema_extractor = OpenAISchemaExtractor("Corporation", llm=openai_llm, num_of_elements=35)

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_schema(schema_extractor=schema_extractor)
field_in(docset2: DocSet, field1: str, field2: str, **kwargs) DocSet[source]#

Joins two DocSets based on the specified fields: this DocSet (self) is filtered to documents whose field1 value appears among the field2 values of docset2.

SQL Equivalent: SELECT * FROM docset1 WHERE field1 IN (SELECT field2 FROM docset2);

Parameters:
  • docset2 -- DocSet that supplies the values to filter on.

  • field1 -- Field in docset1 to filter based on.

  • field2 -- Field in docset2 that provides the allowed values.

Returns:

A left semi-join between docset (self) and docset2.
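
Example

A minimal sketch, assuming docset1 and docset2 are existing DocSets; the property names are hypothetical:

# Keep only documents from docset1 whose part_name appears as a name in docset2.
filtered_docset = docset1.field_in(docset2, field1="properties.part_name", field2="properties.name")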

filter(f: Callable[[Document], bool], **kwargs) DocSet[source]#

Applies the Filter transform on the Docset.

Parameters:

f -- A callable function that takes a Document object and returns a boolean indicating whether the document should be included in the filtered Docset.

Example

def custom_filter(doc: Document) -> bool:
    # Define your custom filtering logic here.
    return doc.some_property == some_value

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .filter(custom_filter)
filter_elements(f: Callable[[Element], bool], **resource_args) DocSet[source]#

Applies the given filter function to each element in each Document in this DocSet.

Parameters:

f -- A Callable that takes an Element and returns True if the element should be retained.
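
Example

A minimal sketch, assuming docset is an existing DocSet:

from sycamore.data import Element

def keep_non_empty(element: Element) -> bool:
    # Retain only elements that carry some text content.
    return element.text_representation is not None and len(element.text_representation) > 0

docset = docset.filter_elements(keep_non_empty)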

flat_map(f: Callable[[Document], list[Document]], **resource_args) DocSet[source]#

Applies the FlatMap transformation on the Docset.

Parameters:

f -- The function to apply to each document.

See the FlatMap documentation for advanced features.

Example

def custom_flat_mapping_function(document: Document) -> list[Document]:
    # Custom logic to transform the document and return a list of documents
    return [transformed_document_1, transformed_document_2]

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .flat_map(custom_flat_mapping_function)
groupby_count(field: str, unique_field: str | None = None, **kwargs) DocSet[source]#

Performs a count aggregation on a DocSet.

Parameters:
  • field -- Field to aggregate based on.

  • unique_field -- Determines what makes a unique document.

  • **kwargs --

Returns:

A DocSet with "properties.key" (unique values of document field) and "properties.count" (frequency counts for unique values).

limit(limit: int = 20, **kwargs) DocSet[source]#

Applies the Limit transforms on the Docset.

Parameters:

limit -- The maximum number of documents to include in the resulting Docset.

Example

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .explode()
    .limit()
llm_cluster_entity(llm: LLM, instruction: str, field: str, **kwargs) DocSet[source]#

Normalizes a particular field of a DocSet. Identifies and assigns each document to a "group".

Parameters:
  • llm -- LLM client.

  • instruction -- Instruction about groups to form, e.g. 'Form groups for different types of food'

  • field -- Field to make/assign groups based on, e.g. 'properties.entity.food'

Returns:

A DocSet with an additional field "properties._autogen_ClusterAssignment" that contains the assigned group. For example, if "properties.entity.food" has values 'banana', 'milk', 'yogurt', 'chocolate', 'orange', "properties._autogen_ClusterAssignment" would contain values like 'fruit', 'dairy', and 'dessert'.
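
Example

A minimal sketch reusing the food example above, assuming "properties.entity.food" was extracted earlier:

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
clustered_docset = docset.llm_cluster_entity(
    llm=openai_llm,
    instruction="Form groups for different types of food",
    field="properties.entity.food")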

llm_filter(llm: LLM, new_field: str, prompt: list[dict] | str, field: str = 'text_representation', threshold: int = 3, keep_none: bool = False, use_elements: bool = False, similarity_query: str | None = None, similarity_scorer: SimilarityScorer | None = None, **resource_args) DocSet[source]#

Filters the DocSet to keep only documents whose LLM-computed score is greater than or equal to the specified threshold value.

Parameters:
  • llm -- LLM to use.

  • new_field -- The field that will be added to the DocSet with the outputs.

  • prompt -- LLM prompt.

  • field -- Document field to filter based on.

  • threshold -- If the value of the computed result is an integer value greater than or equal to this threshold, the document will be kept.

  • keep_none -- keep records with a None value for the provided field to filter on. Warning: using this might hide data corruption issues.

  • use_elements -- use contents of a document's elements to filter as opposed to document level contents.

  • similarity_query -- query string to compute similarity against. Also requires a 'similarity_scorer'.

  • similarity_scorer -- scorer used to generate similarity scores used in element sorting. Also requires a 'similarity_query'.

  • **resource_args --

Returns:

A filtered DocSet.
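
Example

A minimal sketch, assuming docset is an existing DocSet; the prompt text and the relevance_score field name are illustrative:

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
filter_prompt = "Rate from 0 to 5 how relevant the following text is to renewable energy."
filtered_docset = docset.llm_filter(
    llm=openai_llm,
    new_field="relevance_score",
    prompt=filter_prompt,
    field="text_representation",
    threshold=3)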

llm_query(query_agent: LLMTextQueryAgent, **kwargs) DocSet[source]#

Executes an LLM query on a specified field (element or document) and returns the response.

Parameters:
  • prompt -- A prompt to be passed into the underlying LLM execution engine

  • llm -- The LLM Client to be used here. It is defined as an instance of the LLM class in Sycamore.

  • output_property -- (Optional, default="llm_response") The output property of the document or element to add results in.

  • format_kwargs -- (Optional, default=None) If passed in, details the formatting that must be passed into the underlying Jinja sandbox.

  • number_of_elements -- (Optional, default=None) When per_element is true, limits the number of elements that receive an output_property. Otherwise, the response is added to the entire document using a limited prefix subset of the elements.

  • llm_kwargs -- (Optional, default={}) Keyword arguments to be passed into the underlying LLM execution engine.

  • per_element -- (Optional, default=True) Whether to execute the LLM query on each element individually rather than on the whole document.

  • element_type -- (Optional) Parameter to only execute the LLM query on a particular element type. If not specified, the query will be executed on all elements.
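
Example

A minimal sketch, assuming docset is an existing DocSet; the LLMTextQueryAgent options shown are taken from the parameter list above, and the prompt and output property are illustrative:

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
query_agent = LLMTextQueryAgent(
    prompt="Summarize the following text in one sentence.",
    llm=openai_llm,
    output_property="summary")
queried_docset = docset.llm_query(query_agent=query_agent)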

map(f: Callable[[Document], Document], **resource_args) DocSet[source]#

Applies the Map transformation on the Docset.

Parameters:

f -- The function to apply to each document.

See the Map documentation for advanced features.
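
Example

A minimal sketch, assuming docset is an existing DocSet:

from sycamore.data import Document

def normalize_text(doc: Document) -> Document:
    # Illustrative per-document transformation: lowercase the text representation.
    if doc.text_representation is not None:
        doc.text_representation = doc.text_representation.lower()
    return doc

docset = docset.map(normalize_text)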

map_batch(f: Callable[[list[Document]], list[Document]], f_args: Iterable[Any] | None = None, f_kwargs: dict[str, Any] | None = None, f_constructor_args: Iterable[Any] | None = None, f_constructor_kwargs: dict[str, Any] | None = None, **resource_args) DocSet[source]#

The map_batch transform is similar to map, except that it processes a list of documents and returns a list of documents. map_batch is ideal for transformations that get performance benefits from batching.

See the MapBatch documentation for advanced features.

Example

def custom_map_batch_function(documents: list[Document]) -> list[Document]:
    # Custom logic to transform the documents
    return transformed_documents

map_ds = input_ds.map_batch(f=custom_map_batch_function)

class CustomMappingClass:
    def __init__(self, arg1, arg2, *, kw_arg1=None, kw_arg2=None):
        self.arg1 = arg1
        # ...

    def _process(self, doc: Document) -> Document:
        doc.properties["arg1"] = self.arg1
        return doc

    def __call__(self, docs: list[Document], fnarg1, *, fnkwarg1=None) -> list[Document]:
        return [self._process(d) for d in docs]

map_ds = input_ds.map_batch(f=CustomMappingClass,
                            f_args=["fnarg1"], f_kwargs={"fnkwarg1": "stuff"},
                            f_constructor_args=["arg1", "arg2"],
                            f_constructor_kwargs={"kw_arg1": 1, "kw_arg2": 2})
map_elements(f: Callable[[Element], Element], **resource_args) DocSet[source]#

Applies the given mapping function to each element in each Document in this DocSet.

Parameters:

f -- A Callable that takes an Element and returns an Element. Elements for which f evaluates to None are dropped.
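
Example

A minimal sketch, assuming docset is an existing DocSet; the text_length property name is illustrative:

from sycamore.data import Element

def tag_element(element: Element) -> Element:
    # Record the length of each element's text as a property.
    element.properties["text_length"] = len(element.text_representation or "")
    return element

docset = docset.map_elements(tag_element)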

mark_bbox_preset(tokenizer: Tokenizer, token_limit: int = 512, **kwargs) DocSet[source]#

Convenience composition of:

  • SortByPageBbox

  • MarkDropTiny minimum=2

  • MarkDropHeaderFooter top=0.05 bottom=0.05

  • MarkBreakPage

  • MarkBreakByColumn

  • MarkBreakByTokens limit=512

Meant to work in concert with MarkedMerger.
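
Example

A minimal sketch, assuming paths points to PDF files and that MarkedMerger can be constructed without arguments:

tokenizer = OpenAITokenizer("gpt-3.5-turbo")

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf").partition(partitioner=ArynPartitioner())
merged_docset = pdf_docset.mark_bbox_preset(tokenizer=tokenizer, token_limit=512).merge(merger=MarkedMerger())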

markdown(**kwargs) DocSet[source]#

Modifies Document to have a single Element containing the Markdown representation of all the original elements.

Example

context = sycamore.init()
ds = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .markdown()
    .explode()
materialize(path: Path | str | dict | None = None, source_mode: MaterializeSourceMode = MaterializeSourceMode.RECOMPUTE) DocSet[source]#

The materialize transform writes out documents up to that point, marks the materialized path as successful if execution is successful, and allows for reading from the materialized data as a source. This transform is helpful if you are using show and take() as part of a notebook to incrementally inspect output. You can use materialize to avoid re-computation.

path: a Path or string that represents the "directory" for the materialized elements. The filesystem and naming convention will be inferred. The dictionary variant allows finer control, and supports {root=Path|str, fs=pyarrow.fs, name=lambda Document -> str, clean=True, tobin=Document.serialize()}. root is required.

source_mode: how this materialize step should be used as an input:

  • RECOMPUTE: (default) the transform does not act as a source; previous transforms will be recomputed.

  • USE_STORED: if the materialize step has successfully run to completion, or if it has no prior step, use the stored contents of the directory as the input. No previous transform will be computed. WARNING: if you change the input files or any of the steps before the materialize step, you need to use clear_materialize() or change the source_mode to force re-execution.

Note: you can write the source mode as MaterializeSourceMode.SOMETHING after importing MaterializeSourceMode, or as sycamore.MATERIALIZE_SOMETHING after importing sycamore.
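
Example

A minimal sketch, assuming paths points to PDF files; the directory under /tmp is hypothetical and MaterializeSourceMode is imported as described in the note above:

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf").partition(partitioner=ArynPartitioner())
pdf_docset = pdf_docset.materialize(
    path="/tmp/sycamore/partitioned",
    source_mode=MaterializeSourceMode.USE_STORED)
pdf_docset.execute()  # subsequent runs read the stored output instead of re-partitioning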

merge(merger: ElementMerger, **kwargs) DocSet[source]#

Applies the given merge operation to the list of elements in each Document of the DocSet.

Example

from transformers import AutoTokenizer
tk = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
merger = GreedyElementMerger(tk, 512)

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .merge(merger=merger)
partition(partitioner: Partitioner, table_extractor: TableExtractor | None = None, **kwargs) DocSet[source]#

Applies the Partition transform on the Docset.

More information can be found in the Partition documentation.

Example

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
query(query_executor: QueryExecutor, **resource_args) DocSet[source]#

Applies a query execution transform on a DocSet of queries.

Parameters:

query_executor -- Implementation for the query execution.

random_sample(fraction: float, seed: int | None = None) DocSet[source]#

Retain a random sample of documents from this DocSet.

The number of documents in the output will be approximately fraction * self.count()

Parameters:
  • fraction -- The fraction of documents to retain.

  • seed -- Optional seed to use for the RNG.
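
Example

A minimal sketch, assuming docset is an existing DocSet:

sampled_docset = docset.random_sample(fraction=0.1, seed=42)  # keep roughly 10% of the documents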

regex_replace(spec: list[tuple[str, str]], **kwargs) DocSet[source]#

Performs regular expression replacement (using re.sub()) on the text_representation of every Element in each Document.

Example

from sycamore.transforms import COALESCE_WHITESPACE
ds = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .regex_replace(COALESCE_WHITESPACE)
    .regex_replace([(r"\d+", "1313"), (r"old", "new")])
    .explode()
rerank(similarity_scorer: SimilarityScorer, query: str, score_property_name: str = '_rerank_score', limit: int | None = None, **kwargs) DocSet[source]#

Sort a DocSet given a scoring class.

Parameters:
  • similarity_scorer -- An instance of a SimilarityScorer class that executes the scoring function.

  • query -- The query string to compute similarity against.

  • score_property_name -- The name of the key where the score will be stored in document.properties

  • limit -- Limit scoring and sorting to fixed size.

resolve_graph_entities(resolvers: list[EntityResolver] = [], resolve_duplicates=True, **kwargs) DocSet[source]#

Resolves graph entities across documents so that duplicate entities can be merged into a single entity, based on the criteria of the EntityResolver objects.

Parameters:
  • resolvers -- A list of EntityResolvers that are used to determine what entities are duplicates

  • resolve_duplicates -- Whether exact duplicate entities and relationships should be merged. Defaults to True.

show(limit: int = 20, show_elements: bool = True, num_elements: int = -1, show_binary: bool = False, show_embedding: bool = False, truncate_content: bool = True, truncate_length: int = 100, stream=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>) None[source]#

Prints the content of the docset in a human-readable format. It is useful for debugging and inspecting the contents of objects during development.

Parameters:
  • limit -- The maximum number of items to display.

  • show_elements -- Whether to display individual elements or not.

  • num_elements -- The number of elements to display. Use -1 to show all elements.

  • show_binary -- Whether to display binary data or not.

  • show_embedding -- Whether to display embedding information or not.

  • truncate_content -- Whether to truncate long content when displaying.

  • truncate_length -- The maximum length of content to display when truncating.

  • stream -- The output stream where the information will be displayed.

Example

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .show()
sketch(window: int = 17, number: int = 16, **kwargs) DocSet[source]#

For each Document, uses shingling to hash sliding windows of the text_representation. The set of shingles is called the sketch. Documents' sketches can be compared to determine if they have near-duplicate content.

Parameters:
  • window -- Number of bytes in the sliding window that is hashed (17)

  • number -- Count of hashes comprising a shingle (16)

Example

ds = context.read.binary(paths, binary_format="pdf")
     .partition(partitioner=ArynPartitioner())
     .explode()
     .sketch(window=17)
sort(descending: bool, field: str, default_val: Any | None = None) DocSet[source]#

Sort DocSet by specified field.

Parameters:
  • descending -- Whether or not to sort in descending order (first to last).

  • field -- Document field to sort by, in dotted notation, e.g. properties.filetype

  • default_val -- Default value to use if field does not exist in Document
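
Example

A minimal sketch, assuming docset is an existing DocSet and using the properties.filetype field mentioned above:

sorted_docset = docset.sort(descending=False, field="properties.filetype", default_val="unknown")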

split_elements(tokenizer: Tokenizer, max_tokens: int = 512, **kwargs) DocSet[source]#

Splits elements if they are larger than the maximum number of tokens.

Example

pdf_docset = context.read.binary(paths, binary_format="pdf")
     .partition(partitioner=ArynPartitioner())
     .split_elements(tokenizer=tokenizer, max_tokens=512)
     .explode()
spread_properties(props: list[str], **resource_args) DocSet[source]#

Copies listed properties from parent document to child elements.

Example

pdf_docset = context.read.binary(paths, binary_format="pdf")
     .partition(partitioner=ArynPartitioner())
     .spread_properties(["title"])
     .explode()
summarize(summarizer: Summarizer, **kwargs) DocSet[source]#

Applies the Summarize transform on the Docset.

Example

llm_model = OpenAILanguageModel("gpt-3.5-turbo")
element_operator = my_element_selector  # A custom element selection function
summarizer = LLMElementTextSummarizer(llm_model, element_operator)

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .summarize(summarizer=summarizer)
take(limit: int = 20, include_metadata: bool = False, **kwargs) list[Document][source]#

Returns up to limit documents from the dataset.

Parameters:

limit -- The maximum number of Documents to return.

Returns:

A list of up to limit Documents from the Docset.

Example

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .take()
take_all(limit: int | None = None, include_metadata: bool = False, **kwargs) list[Document][source]#

Returns all of the Documents in this DocSet.

If limit is set, this method will raise an error if this Docset has more than limit Documents.

Parameters:

limit -- The number of Documents above which this method will raise an error.
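
Example

A minimal sketch, assuming docset is an existing DocSet:

docs = docset.take_all()            # retrieve every Document in the DocSet
docs = docset.take_all(limit=1000)  # raises an error if the DocSet has more than 1000 Documents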

term_frequency(tokenizer: Tokenizer, with_token_ids: bool = False, **kwargs) DocSet[source]#

For each document, computes a frequency table over the text representation, as tokenized by tokenizer. Useful for enabling hybrid search in Pinecone.

Example

tk = OpenAITokenizer("gpt-3.5-turbo")
context = sycamore.init()
context.read.binary(paths, binary_format="pdf")
    .partition(ArynPartitioner())
    .explode()
    .term_frequency(tokenizer=tk)
    .show()
top_k(llm: LLM | None, field: str, k: int | None, descending: bool = True, llm_cluster: bool = False, unique_field: str | None = None, llm_cluster_instruction: str | None = None, **kwargs) DocSet[source]#

Determines the top k occurrences for a document field.

Parameters:
  • llm -- LLM client.

  • field -- Field to determine top k occurrences of.

  • k -- Number of top occurrences. If k is not specified, all occurrences are returned.

  • llm_cluster_instruction -- Instruction describing the purpose of the operation, e.g. 'Find the most common cities'

  • descending -- Indicates whether to return most or least frequent occurrences.

  • llm_cluster -- Indicates whether an LLM should be used to normalize values of document field.

  • unique_field -- Determines what makes a unique document.

  • **kwargs --

Returns:

A DocSet with "properties.key" (unique values of document field) and "properties.count" (frequency counts for unique values) which is sorted based on descending and contains k records.

transform(cls: Type[Transform], **kwargs) DocSet[source]#

Add specified transform class to pipeline. See the API reference section on transforms.

Parameters:
  • cls -- Class of transform to instantiate into pipeline

  • ... -- Other keyword arguments are passed to class constructor

Example

from sycamore.transforms import FooBar
ds = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .transform(cls=FooBar, arg=123)
with_properties(property_map: Mapping[str, Callable[[Document], Any]], **resource_args) DocSet[source]#

Adds multiple properties to each Document.

Parameters:

property_map -- A mapping of property names to functions to generate those properties

Example

docset.with_properties({
    "text_size": lambda doc: len(doc.text_representation),
    "truncated_text": lambda doc: doc.text_representation[0:256]
})
with_property(name, f: Callable[[Document], Any], **resource_args) DocSet[source]#

Applies a function to each document and adds the result as a property.

Parameters:
  • name -- The name of the property to add to each Document.

  • f -- The function to apply to each Document.

Example

To add a property that contains the length of the text representation of each Document:

docset.with_property("text_size", lambda doc: len(doc.text_representation))

property write: DocSetWriter#

Exposes an interface for writing a DocSet to OpenSearch or other external storage. See DocSetWriter for more information about writers and their arguments.

Example

The following example shows reading a DocSet from a collection of PDFs, partitioning it using the ArynPartitioner, and then writing it to a new OpenSearch index.

os_client_args = {
    "hosts": [{"host": "localhost", "port": 9200}],
    "http_auth": ("user", "password"),
}

index_settings = {
    "body": {
        "settings": {
            "index.knn": True,
        },
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {"name": "hnsw", "engine": "faiss"},
                },
            },
        },
    },
}

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())

pdf_docset.write.opensearch(
     os_client_args=os_client_args,
     index_name="my_index",
     index_settings=index_settings)