DocSet#

class sycamore.docset.DocSet(context: Context, plan: Node)[source]#

A DocSet, short for “Document Set”, is a distributed collection of documents bundled together for processing. Sycamore provides a variety of transformations on DocSets to help customers handle unstructured data easily.

augment_text(augmentor: TextAugmentor, **resource_args) DocSet[source]#

Augments text_representation with external information.

Parameters:

augmentor (TextAugmentor) -- A TextAugmentor instance that defines how to augment the text

Example

augmentor = FStringTextAugmentor(sentences = [
    "This pertains to the part {doc.properties['part_name']}.",
    "{doc.text_representation}"
])
entity_extractor = OpenAIEntityExtractor("part_name",
                            llm=openai_llm,
                            prompt_template=part_name_template)
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_entity(entity_extractor)
    .explode()
    .augment_text(augmentor)
clear_materialize(path: Path | str | None = None, *, clear_non_local=False) None[source]#

Deletes all of the materialized files referenced by the docset.

path will use PurePath.match to check if the specified path matches against the directory used for each materialize transform. Only matching directories will be cleared.

Set clear_non_local=True to clear non-local filesystems. Note that filesystems like NFS/CIFS count as local; pyarrow.fs.SubTreeFileSystem is treated as non-local.
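
Example

A minimal sketch, assuming paths points to PDF files; the materialize directory under /tmp is hypothetical:

context = sycamore.init()
docset = context.read.binary(paths, binary_format="pdf").partition(partitioner=ArynPartitioner())
docset = docset.materialize(path="/tmp/sycamore/partitioned")
docset.execute()

# Later, remove the cached output so the next run recomputes it.
# With no path argument, all materialize directories in the plan are cleared.
docset.clear_materialize()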

count(include_metadata=False, **kwargs) int[source]#

Counts the number of documents in the resulting dataset. It is a convenient way to determine the size of the dataset generated by the plan.

Parameters:
  • include_metadata -- Determines whether or not to count MetaDataDocuments

  • **kwargs --

Returns:

The number of documents in the docset.

Example

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .count()
count_distinct(field: str, **kwargs) int[source]#

Counts the number of documents in the resulting dataset with a unique value for field.

Parameters:
  • field -- Field (in dotted notation) to perform a unique count based on.

  • **kwargs --

Returns:

The number of documents with a unique value for field.

Example

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .count("doc_id")
embed(embedder: Embedder, **kwargs) DocSet[source]#

Applies the Embed transform on the Docset.

Parameters:

embedder -- An instance of an Embedder class that defines the embedding method to be applied.

Example

model_name="sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformerEmbedder(batch_size=100, model_name=model_name)

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .explode()
    .embed(embedder=embedder)
execute(**kwargs) None[source]#

Execute the pipeline and discard the results. Useful for side effects.
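
Example

A minimal sketch, assuming paths points to PDF files; the output path is hypothetical:

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf").partition(partitioner=ArynPartitioner())
pdf_docset.materialize(path="/tmp/sycamore/partitioned").execute()  # run the pipeline for its side effects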

explode(**resource_args) DocSet[source]#

Applies the Explode transform on the Docset.

Example

pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .explode()
extract_batch_schema(schema_extractor: SchemaExtractor, **kwargs) DocSet[source]#

Extracts a common schema from the documents in this DocSet.

This transform is similar to extract_schema, except that it will add the same schema to each document in the DocSet rather than inferring a separate schema per Document. It is most suitable for document collections that share a common format. If you have a heterogeneous document collection and want a different schema for each type, consider using extract_schema instead.

Parameters:

schema_extractor -- A SchemaExtractor instance to extract the schema for each document.

Example

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
schema_extractor = OpenAISchemaExtractor("Corporation", llm=openai_llm, num_of_elements=35)

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_batch_schema(schema_extractor=schema_extractor)
extract_document_structure(structure: DocumentStructure, **kwargs)[source]#

Represents documents as Hierarchical documents organized by their structure.

Parameters:

structure -- An instance of DocumentStructure which determines how documents are organized

Example

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_document_structure(structure=StructureBySection)
    .explode()
extract_entity(entity_extractor: EntityExtractor, **kwargs) DocSet[source]#

Applies the ExtractEntity transform on the Docset.

Parameters:

entity_extractor -- An instance of an EntityExtractor class that defines the entity extraction method to be applied.

Example

title_context_template = "template"

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
entity_extractor = OpenAIEntityExtractor("title",
                       llm=openai_llm,
                       prompt_template=title_context_template)

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_entity(entity_extractor=entity_extractor)
extract_graph_entities(extractors: list[GraphEntityExtractor] = [], **kwargs) DocSet[source]#

Extracts entities from document children. Entities are stored as nodes within each child of a document.

Parameters:

extractors -- A list of GraphEntityExtractor objects which determines how entities are extracted

extract_graph_relationships(extractors: list[GraphRelationshipExtractor] = [], **kwargs) DocSet[source]#

Extracts relationships from document children. Relationships are stored within the nodes they reference within each child of a document.

Parameters:

extractors -- A list of GraphRelationshipExtractor objects which determines how relationships are extracted

extract_properties(property_extractor: PropertyExtractor, **kwargs) DocSet[source]#

Extracts properties from each Document in this DocSet based on the _schema property.

The schema can be computed using extract_schema or extract_batch_schema or can be provided manually in JSON-schema format in the _schema field under Document.properties.

Example

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
property_extractor = OpenAIPropertyExtractor(llm=openai_llm)

context = sycamore.init()

pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_properties(property_extractor)
extract_schema(schema_extractor: SchemaExtractor, **kwargs) DocSet[source]#

Extracts a JSON schema of extractable properties from each document in this DocSet.

Each schema is a mapping of names to types that corresponds to fields that are present in the document. For example, calling this method on a financial document containing information about companies might yield a schema like

{
  "company_name": "string",
  "revenue": "number",
  "CEO": "string"
}

This method will extract a unique schema for each document in the DocSet independently. If the documents in the DocSet represent instances with a common schema, consider extract_batch_schema, which will extract a single schema for all documents.

The DocSet is returned with an additional _schema property that contains the JSON-encoded schema, if any is detected.

Parameters:

schema_extractor -- A SchemaExtractor instance to extract the schema for each document.

Example

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
schema_extractor = OpenAISchemaExtractor("Corporation", llm=openai_llm, num_of_elements=35)

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_schema(schema_extractor=schema_extractor)
field_in(docset2: DocSet, field1: str, field2: str, **kwargs) DocSet[source]#

Joins two DocSets based on the specified fields: this DocSet (self) is filtered to documents whose field1 value appears among the field2 values of docset2.

SQL Equivalent: SELECT * FROM docset1 WHERE field1 IN (SELECT field2 FROM docset2);

Parameters:
  • docset2 -- DocSet that supplies the values to filter on.

  • field1 -- Field in docset1 to filter based on.

  • field2 -- Field in docset2 that provides the allowed values.

Returns:

A left semi-join between docset (self) and docset2.
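
Example

A minimal sketch, assuming docset1 and docset2 are existing DocSets; the property names are hypothetical:

# Keep only documents from docset1 whose part_name appears as a name in docset2.
filtered_docset = docset1.field_in(docset2, field1="properties.part_name", field2="properties.name")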

filter(f: Callable[[Document], bool], **kwargs) DocSet[source]#

Applies the Filter transform on the Docset.

Parameters:

f -- A callable function that takes a Document object and returns a boolean indicating whether the document should be included in the filtered Docset.

Example

def custom_filter(doc: Document) -> bool:
    # Define your custom filtering logic here.
    return doc.some_property == some_value

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .filter(custom_filter)
filter_elements(f: Callable[[Element], bool], **resource_args) DocSet[source]#

Applies the given filter function to each element in each Document in this DocSet.

Parameters:

f -- A Callable that takes an Element and returns True if the element should be retained.
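
Example

A minimal sketch, assuming docset is an existing DocSet:

from sycamore.data import Element

def keep_non_empty(element: Element) -> bool:
    # Retain only elements that carry some text content.
    return element.text_representation is not None and len(element.text_representation) > 0

docset = docset.filter_elements(keep_non_empty)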

flat_map(f: Callable[[Document], list[Document]], **resource_args) DocSet[source]#

Applies the FlatMap transformation on the Docset.

Parameters:

f -- The function to apply to each document.

See the FlatMap documentation for advanced features.

Example

def custom_flat_mapping_function(document: Document) -> list[Document]:
    # Custom logic to transform the document and return a list of documents
    return [transformed_document_1, transformed_document_2]

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .flat_map(custom_flat_mapping_function)
groupby_count(field: str, unique_field: str | None = None, **kwargs) DocSet[source]#

Performs a count aggregation on a DocSet.

Parameters:
  • field -- Field to aggregate based on.

  • unique_field -- Determines what makes a unique document.

  • **kwargs --

Returns:

A DocSet with "properties.key" (unique values of document field) and "properties.count" (frequency counts for unique values).

limit(limit: int = 20, **kwargs) DocSet[source]#

Applies the Limit transforms on the Docset.

Parameters:

limit -- The maximum number of documents to include in the resulting Docset.

Example

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .explode()
    .limit()
llm_cluster_entity(llm: LLM, instruction: str, field: str, **kwargs) DocSet[source]#

Normalizes a particular field of a DocSet. Identifies and assigns each document to a "group".

Parameters:
  • llm -- LLM client.

  • instruction -- Instruction about groups to form, e.g. 'Form groups for different types of food'

  • field -- Field to make/assign groups based on, e.g. 'properties.entity.food'

Returns:

A DocSet with an additional field "properties._autogen_ClusterAssignment" that contains the assigned group. For example, if "properties.entity.food" has values 'banana', 'milk', 'yogurt', 'chocolate', 'orange', "properties._autogen_ClusterAssignment" would contain values like 'fruit', 'dairy', and 'dessert'.
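
Example

A minimal sketch reusing the food example above, assuming "properties.entity.food" was extracted earlier:

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
clustered_docset = docset.llm_cluster_entity(
    llm=openai_llm,
    instruction="Form groups for different types of food",
    field="properties.entity.food")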

llm_filter(llm: LLM, new_field: str, prompt: list[dict] | str, field: str = 'text_representation', threshold: int = 3, keep_none: bool = False, use_elements: bool = False, similarity_query: str | None = None, similarity_scorer: SimilarityScorer | None = None, **resource_args) DocSet[source]#

Filters the DocSet to keep only documents whose LLM-computed score is greater than or equal to the specified threshold value.

Parameters:
  • llm -- LLM to use.

  • new_field -- The field that will be added to the DocSet with the outputs.

  • prompt -- LLM prompt.

  • field -- Document field to filter based on.

  • threshold -- If the value of the computed result is an integer value greater than or equal to this threshold, the document will be kept.

  • keep_none -- keep records with a None value for the provided field to filter on. Warning: using this might hide data corruption issues.

  • use_elements -- use contents of a document's elements to filter as opposed to document level contents.

  • similarity_query -- query string to compute similarity against. Also requires a 'similarity_scorer'.

  • similarity_scorer -- scorer used to generate similarity scores used in element sorting. Also requires a 'similarity_query'.

  • **resource_args --

Returns:

A filtered DocSet.
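
Example

A minimal sketch, assuming docset is an existing DocSet; the prompt text and the relevance_score field name are illustrative:

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
filter_prompt = "Rate from 0 to 5 how relevant the following text is to renewable energy."
filtered_docset = docset.llm_filter(
    llm=openai_llm,
    new_field="relevance_score",
    prompt=filter_prompt,
    field="text_representation",
    threshold=3)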

llm_query(query_agent: LLMTextQueryAgent, **kwargs) DocSet[source]#

Executes an LLM query on a specified field (element or document) and returns the response.

Parameters:
  • prompt -- A prompt to be passed into the underlying LLM execution engine

  • llm -- The LLM Client to be used here. It is defined as an instance of the LLM class in Sycamore.

  • output_property -- (Optional, default="llm_response") The output property of the document or element to add results in.

  • format_kwargs -- (Optional, default=None) If passed in, details the formatting that must be passed into the underlying Jinja sandbox.

  • number_of_elements -- (Optional, default=None) When per_element is true, limits the number of elements that receive an output_property. Otherwise, the response is added to the entire document using a limited prefix subset of the elements.

  • llm_kwargs -- (Optional, default={}) Keyword arguments to be passed into the underlying LLM execution engine.

  • per_element -- (Optional, default=True) Whether to execute the LLM query on each element individually rather than on the whole document.

  • element_type -- (Optional) Parameter to only execute the LLM query on a particular element type. If not specified, the query will be executed on all elements.
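
Example

A minimal sketch, assuming docset is an existing DocSet; the LLMTextQueryAgent options shown are taken from the parameter list above, and the prompt and output property are illustrative:

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
query_agent = LLMTextQueryAgent(
    prompt="Summarize the following text in one sentence.",
    llm=openai_llm,
    output_property="summary")
queried_docset = docset.llm_query(query_agent=query_agent)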

map(f: Callable[[Document], Document], **resource_args) DocSet[source]#

Applies the Map transformation on the Docset.

Parameters:

f -- The function to apply to each document.

See the Map documentation for advanced features.
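
Example

A minimal sketch, assuming docset is an existing DocSet:

from sycamore.data import Document

def normalize_text(doc: Document) -> Document:
    # Illustrative per-document transformation: lowercase the text representation.
    if doc.text_representation is not None:
        doc.text_representation = doc.text_representation.lower()
    return doc

docset = docset.map(normalize_text)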

map_batch(f: Callable[[list[Document]], list[Document]], f_args: Iterable[Any] | None = None, f_kwargs: dict[str, Any] | None = None, f_constructor_args: Iterable[Any] | None = None, f_constructor_kwargs: dict[str, Any] | None = None, **resource_args) DocSet[source]#

The map_batch transform is similar to map, except that it processes a list of documents and returns a list of documents. map_batch is ideal for transformations that get performance benefits from batching.

See the MapBatch documentation for advanced features.

Example

def custom_map_batch_function(documents: list[Document]) -> list[Document]:
    # Custom logic to transform the documents
    return transformed_documents

map_ds = input_ds.map_batch(f=custom_map_batch_function)

class CustomMappingClass:
    def __init__(self, arg1, arg2, *, kw_arg1=None, kw_arg2=None):
        self.arg1 = arg1
        # ...

    def _process(self, doc: Document) -> Document:
        doc.properties["arg1"] = self.arg1
        return doc

    def __call__(self, docs: list[Document], fnarg1, *, fnkwarg1=None) -> list[Document]:
        return [self._process(d) for d in docs]

map_ds = input_ds.map_batch(f=CustomMappingClass,
                            f_args=["fnarg1"], f_kwargs={"fnkwarg1": "stuff"},
                            f_constructor_args=["arg1", "arg2"],
                            f_constructor_kwargs={"kw_arg1": 1, "kw_arg2": 2})
map_elements(f: Callable[[Element], Element], **resource_args) DocSet[source]#

Applies the given mapping function to each element in each Document in this DocSet.

Parameters:

f -- A Callable that takes an Element and returns an Element. Elements for which f evaluates to None are dropped.
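
Example

A minimal sketch, assuming docset is an existing DocSet; the text_length property name is illustrative:

from sycamore.data import Element

def tag_element(element: Element) -> Element:
    # Record the length of each element's text as a property.
    element.properties["text_length"] = len(element.text_representation or "")
    return element

docset = docset.map_elements(tag_element)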

mark_bbox_preset(tokenizer: Tokenizer, token_limit: int = 512, **kwargs) DocSet[source]#

Convenience composition of:

  • SortByPageBbox

  • MarkDropTiny minimum=2

  • MarkDropHeaderFooter top=0.05 bottom=0.05

  • MarkBreakPage

  • MarkBreakByColumn

  • MarkBreakByTokens limit=512

Meant to work in concert with MarkedMerger.
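
Example

A minimal sketch, assuming paths points to PDF files and that MarkedMerger can be constructed without arguments:

tokenizer = OpenAITokenizer("gpt-3.5-turbo")

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf").partition(partitioner=ArynPartitioner())
merged_docset = pdf_docset.mark_bbox_preset(tokenizer=tokenizer, token_limit=512).merge(merger=MarkedMerger())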

markdown(**kwargs) DocSet[source]#

Modifies Document to have a single Element containing the Markdown representation of all the original elements.

Example

context = sycamore.init()
ds = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .markdown()
    .explode()
materialize(path: Path | str | dict | None = None, source_mode: MaterializeSourceMode = MaterializeSourceMode.RECOMPUTE) DocSet[source]#

The materialize transform writes out documents up to that point, marks the materialized path as successful if execution is successful, and allows for reading from the materialized data as a source. This transform is helpful if you are using show and take() as part of a notebook to incrementally inspect output. You can use materialize to avoid re-computation.

path: a Path or string that represents the "directory" for the materialized elements. The filesystem and naming convention will be inferred. The dictionary variant allows finer control, and supports {root=Path|str, fs=pyarrow.fs, name=lambda Document -> str, clean=True, tobin=Document.serialize()}. root is required.

source_mode: how this materialize step should be used as an input:

  • RECOMPUTE: (default) the transform does not act as a source; previous transforms will be recomputed.

  • USE_STORED: if the materialize step has successfully run to completion, or if it has no prior step, use the stored contents of the directory as the input. No previous transform will be computed. WARNING: if you change the input files or any of the steps before the materialize step, you need to use clear_materialize() or change the source_mode to force re-execution.

Note: you can write the source mode as MaterializeSourceMode.SOMETHING after importing MaterializeSourceMode, or as sycamore.MATERIALIZE_SOMETHING after importing sycamore.
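
Example

A minimal sketch, assuming paths points to PDF files; the directory under /tmp is hypothetical and MaterializeSourceMode is imported as described in the note above:

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf").partition(partitioner=ArynPartitioner())
pdf_docset = pdf_docset.materialize(
    path="/tmp/sycamore/partitioned",
    source_mode=MaterializeSourceMode.USE_STORED)
pdf_docset.execute()  # subsequent runs read the stored output instead of re-partitioning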

merge(merger: ElementMerger, **kwargs) DocSet[source]#

Applies the given merge operation to the list of elements in each Document of the DocSet.

Example

from transformers import AutoTokenizer
tk = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
merger = GreedyElementMerger(tk, 512)

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .merge(merger=merger)
partition(partitioner: Partitioner, table_extractor: TableExtractor | None = None, **kwargs) DocSet[source]#

Applies the Partition transform on the Docset.

More information can be found in the Partition documentation.

Example

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
query(query_executor: QueryExecutor, **resource_args) DocSet[source]#

Applies a query execution transform on a DocSet of queries.

Parameters:

query_executor -- Implementation for the query execution.

random_sample(fraction: float, seed: int | None = None) DocSet[source]#

Retain a random sample of documents from this DocSet.

The number of documents in the output will be approximately fraction * self.count()

Parameters:
  • fraction -- The fraction of documents to retain.

  • seed -- Optional seed to use for the RNG.
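
Example

A minimal sketch, assuming docset is an existing DocSet:

sampled_docset = docset.random_sample(fraction=0.1, seed=42)  # keep roughly 10% of the documents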

regex_replace(spec: list[tuple[str, str]], **kwargs) DocSet[source]#

Performs regular expression replacement (using re.sub()) on the text_representation of every Element in each Document.

Example

from sycamore.transforms import COALESCE_WHITESPACE
ds = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .regex_replace(COALESCE_WHITESPACE)
    .regex_replace([(r"\d+", "1313"), (r"old", "new")])
    .explode()
rerank(similarity_scorer: SimilarityScorer, query: str, score_property_name: str = '_rerank_score', limit: int | None = None, **kwargs) DocSet[source]#

Sort a DocSet given a scoring class.

Parameters:
  • similarity_scorer -- An instance of a SimilarityScorer class that executes the scoring function.

  • query -- The query string to compute similarity against.

  • score_property_name -- The name of the key where the score will be stored in document.properties

  • limit -- Limit scoring and sorting to fixed size.

resolve_graph_entities(resolvers: list[EntityResolver] = [], resolve_duplicates=True, **kwargs) DocSet[source]#

Resolves graph entities across documents so that duplicate entities can be merged into a single entity, based on the criteria of the EntityResolver objects.

Parameters:
  • resolvers -- A list of EntityResolvers that are used to determine what entities are duplicates

  • resolve_duplicates -- Whether exact duplicate entities and relationships should be merged. Defaults to True.

show(limit: int = 20, show_elements: bool = True, num_elements: int = -1, show_binary: bool = False, show_embedding: bool = False, truncate_content: bool = True, truncate_length: int = 100, stream=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>) None[source]#

Prints the content of the docset in a human-readable format. It is useful for debugging and inspecting the contents of objects during development.

Parameters:
  • limit -- The maximum number of items to display.

  • show_elements -- Whether to display individual elements or not.

  • num_elements -- The number of elements to display. Use -1 to show all elements.

  • show_binary -- Whether to display binary data or not.

  • show_embedding -- Whether to display embedding information or not.

  • truncate_content -- Whether to truncate long content when displaying.

  • truncate_length -- The maximum length of content to display when truncating.

  • stream -- The output stream where the information will be displayed.

Example

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .show()
sketch(window: int = 17, number: int = 16, **kwargs) DocSet[source]#

For each Document, uses shingling to hash sliding windows of the text_representation. The set of shingles is called the sketch. Documents' sketches can be compared to determine if they have near-duplicate content.

Parameters:
  • window -- Number of bytes in the sliding window that is hashed (17)

  • number -- Count of hashes comprising a shingle (16)

Example

ds = context.read.binary(paths, binary_format="pdf")
     .partition(partitioner=ArynPartitioner())
     .explode()
     .sketch(window=17)
sort(descending: bool, field: str, default_val: Any | None = None) DocSet[source]#

Sort DocSet by specified field.

Parameters:
  • descending -- Whether or not to sort in descending order (first to last).

  • field -- Document field to sort by, in dotted notation, e.g. properties.filetype

  • default_val -- Default value to use if field does not exist in Document
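
Example

A minimal sketch, assuming docset is an existing DocSet and using the properties.filetype field mentioned above:

sorted_docset = docset.sort(descending=False, field="properties.filetype", default_val="unknown")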

split_elements(tokenizer: Tokenizer, max_tokens: int = 512, **kwargs) DocSet[source]#

Splits elements if they are larger than the maximum number of tokens.

Example

pdf_docset = context.read.binary(paths, binary_format="pdf")
     .partition(partitioner=ArynPartitioner())
     .split_elements(tokenizer=tokenizer, max_tokens=512)
     .explode()
spread_properties(props: list[str], **resource_args) DocSet[source]#

Copies listed properties from parent document to child elements.

Example

pdf_docset = context.read.binary(paths, binary_format="pdf")
     .partition(partitioner=ArynPartitioner())
     .spread_properties(["title"])
     .explode()
summarize(summarizer: Summarizer, **kwargs) DocSet[source]#

Applies the Summarize transform on the Docset.

Example

llm_model = OpenAILanguageModel("gpt-3.5-turbo")
element_operator = my_element_selector  # A custom element selection function
summarizer = LLMElementTextSummarizer(llm_model, element_operator)

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .summarize(summarizer=summarizer)
take(limit: int = 20, include_metadata: bool = False, **kwargs) list[Document][source]#

Returns up to limit documents from the dataset.

Parameters:

limit -- The maximum number of Documents to return.

Returns:

A list of up to limit Documents from the Docset.

Example

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .take()
take_all(limit: int | None = None, include_metadata: bool = False, **kwargs) list[Document][source]#

Returns all of the Documents in this DocSet.

If limit is set, this method will raise an error if this Docset has more than limit Documents.

Parameters:

limit -- The number of Documents above which this method will raise an error.
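
Example

A minimal sketch, assuming docset is an existing DocSet:

docs = docset.take_all()            # retrieve every Document in the DocSet
docs = docset.take_all(limit=1000)  # raises an error if the DocSet has more than 1000 Documents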

term_frequency(tokenizer: Tokenizer, with_token_ids: bool = False, **kwargs) DocSet[source]#

For each document, computes a frequency table over the text representation, as tokenized by tokenizer. Useful for enabling hybrid search in Pinecone.

Example

tk = OpenAITokenizer("gpt-3.5-turbo")
context = sycamore.init()
context.read.binary(paths, binary_format="pdf")
    .partition(ArynPartitioner())
    .explode()
    .term_frequency(tokenizer=tk)
    .show()
top_k(llm: LLM | None, field: str, k: int | None, descending: bool = True, llm_cluster: bool = False, unique_field: str | None = None, llm_cluster_instruction: str | None = None, **kwargs) DocSet[source]#

Determines the top k occurrences for a document field.

Parameters:
  • llm -- LLM client.

  • field -- Field to determine top k occurrences of.

  • k -- Number of top occurrences. If k is not specified, all occurrences are returned.

  • llm_cluster_instruction -- Instruction describing the purpose of the operation, e.g. 'Find the most common cities'

  • descending -- Indicates whether to return most or least frequent occurrences.

  • llm_cluster -- Indicates whether an LLM should be used to normalize values of document field.

  • unique_field -- Determines what makes a unique document.

  • **kwargs --

Returns:

A DocSet with "properties.key" (unique values of document field) and "properties.count" (frequency counts for unique values) which is sorted based on descending and contains k records.

transform(cls: Type[Transform], **kwargs) DocSet[source]#

Add specified transform class to pipeline. See the API reference section on transforms.

Parameters:
  • cls -- Class of transform to instantiate into pipeline

  • ... -- Other keyword arguments are passed to class constructor

Example

from sycamore.transforms import FooBar
ds = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .transform(cls=FooBar, arg=123)
with_properties(property_map: Mapping[str, Callable[[Document], Any]], **resource_args) DocSet[source]#

Adds multiple properties to each Document.

Parameters:

property_map -- A mapping of property names to functions to generate those properties

Example

docset.with_properties({
    "text_size": lambda doc: len(doc.text_representation),
    "truncated_text": lambda doc: doc.text_representation[0:256]
})
with_property(name, f: Callable[[Document], Any], **resource_args) DocSet[source]#

Applies a function to each document and adds the result as a property.

Parameters:
  • name -- The name of the property to add to each Document.

  • f -- The function to apply to each Document.

Example

To add a property that contains the length of the text representation of each Document:

docset.with_property("text_size", lambda doc: len(doc.text_representation))

property write: DocSetWriter#

Exposes an interface for writing a DocSet to OpenSearch or other external storage. See DocSetWriter for more information about writers and their arguments.

Example

The following example shows reading a DocSet from a collection of PDFs, partitioning it using the ArynPartitioner, and then writing it to a new OpenSearch index.

os_client_args = {
    "hosts": [{"host": "localhost", "port": 9200}],
    "http_auth": ("user", "password"),
}

index_settings = {
    "body": {
        "settings": {
            "index.knn": True,
        },
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {"name": "hnsw", "engine": "faiss"},
                },
            },
        },
    },
}

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())

pdf_docset.write.opensearch(
     os_client_args=os_client_args,
     index_name="my_index",
     index_settings=index_settings)