Partition

class sycamore.transforms.partition.HtmlPartitioner(skip_headers_and_footers: bool = True, extract_tables: bool = False, text_chunker: ~sycamore.functions.chunker.Chunker = <sycamore.functions.chunker.TextOverlapChunker object>, tokenizer: ~sycamore.functions.tokenizer.Tokenizer = <sycamore.functions.tokenizer.CharacterTokenizer object>)

Bases: Partitioner

HtmlPartitioner processes HTML documents and extracts structured content from them.

Parameters:

skip_headers_and_footers -- Whether to skip headers and footers in the document. Default is True.
extract_tables -- Whether to extract tables from the HTML document. Default is False.
text_chunker -- The text chunking strategy to use for processing text content.
tokenizer -- The tokenizer to use for tokenizing text content.

Example

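import sycamore
from sycamore.functions.chunker import TextOverlapChunker
from sycamore.functions.tokenizer import CharacterTokenizer
from sycamore.transforms.partition import HtmlPartitioner

html_partitioner = HtmlPartitioner(
    skip_headers_and_footers=True,
    extract_tables=True,
    text_chunker=TextOverlapChunker(chunk_token_count=1000, chunk_overlap_token_count=100),
    tokenizer=CharacterTokenizer(),
)

context = sycamore.init()
# Parentheses let the method chain span several lines.
html_docset = (
    context.read.binary(paths, binary_format="html")
    .partition(partitioner=html_partitioner)
)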
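
The chunk_token_count and chunk_overlap_token_count arguments in the example above describe overlapping windows over a token sequence. The following is a minimal, standalone sketch of that idea, not Sycamore's implementation: the function name overlap_chunks is hypothetical, and it assumes that each window starts chunk_token_count - chunk_overlap_token_count tokens after the previous one.

```python
from typing import List


def overlap_chunks(
    tokens: List[str], chunk_token_count: int, chunk_overlap_token_count: int
) -> List[List[str]]:
    """Split tokens into windows that share a fixed number of tokens.

    Hypothetical helper for illustration only; it is not part of Sycamore.
    """
    if chunk_token_count <= 0:
        raise ValueError("chunk_token_count must be positive")
    if not 0 <= chunk_overlap_token_count < chunk_token_count:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")
    # Each window starts this many tokens after the previous one.
    step = chunk_token_count - chunk_overlap_token_count
    return [tokens[start:start + chunk_token_count] for start in range(0, len(tokens), step)]


# A character-level tokenizer treats every character as one token.
tokens = list("abcdefghij")
chunks = overlap_chunks(tokens, chunk_token_count=4, chunk_overlap_token_count=1)
print(["".join(chunk) for chunk in chunks])  # ['abcd', 'defg', 'ghij', 'j']
```

Each chunk repeats the last token of the previous one, so text that straddles a chunk boundary still appears intact in at least one chunk when the overlap is large enough.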

class sycamore.transforms.partition.Partition(child: Node, partitioner: Partitioner, table_extractor: TableExtractor | None = None, **resource_args)

Bases: CompositeTransform

The Partition transform segments documents into elements. For example, a typical partitioner might chunk a document into elements corresponding to paragraphs, images, and tables. For almost all use cases you should use the ArynPartitioner, which calls a remote service to perform partitioning.

Parameters:

child -- The source node or component that provides the dataset to be partitioned.
partitioner -- An instance of a Partitioner class to be applied.
resource_args -- Additional resource-related arguments that can be passed to the Partition operation.

Example

source_node = ...  # Define a source node or component that provides a dataset.
custom_partitioner = MyPartitioner(partitioner_params)
partition_transform = Partition(child=source_node, partitioner=custom_partitioner)
partitioned_dataset = partition_transform.execute()
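
To make "segments documents into elements" concrete, here is a toy, standalone sketch that does not use Sycamore at all. The Element class, the partition_text function, and the rule that pipe-delimited blocks are tables are all assumptions made for illustration; real partitioners such as the ArynPartitioner rely on a layout model instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Hypothetical element record: a label plus the text it covers."""
    type: str
    text: str


def partition_text(document: str) -> List[Element]:
    """Split a plain-text document on blank lines and label each block.

    Blocks in which every line contains a pipe are labeled "table";
    everything else is labeled "paragraph". Illustrative heuristic only.
    """
    elements = []
    for block in document.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        is_table = all("|" in line for line in block.splitlines())
        elements.append(Element(type="table" if is_table else "paragraph", text=block))
    return elements


document = "Intro paragraph.\n\na | b\n1 | 2\n\nClosing paragraph."
elements = partition_text(document)
print([element.type for element in elements])  # ['paragraph', 'table', 'paragraph']
```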

class sycamore.transforms.partition.ArynPartitioner(model_name_or_path='Aryn/deformable-detr-DocLayNet', threshold: float | Literal['auto'] | None = None, use_ocr: bool = False, ocr_model: str = 'easyocr', per_element_ocr: bool = True, extract_table_structure: bool = False, table_structure_extractor: TableStructureExtractor | None = None, table_extraction_options: dict[str, Any] = {}, extract_images: bool = False, extract_image_format: str = 'PPM', device=None, batch_size: int = 1, use_partitioning_service: bool = True, aryn_api_key: str = '', aryn_partitioner_address: str = 'https://api.aryn.cloud/v1/document/partition', use_cache=False, pages_per_call: int = -1, cache: Cache | None = None, output_format: str | None = None, text_extraction_options: dict[str, Any] = {}, source: str = '', output_label_options: dict[str, Any] = {}, sort_mode: str | None = None, **kwargs)

Bases: Partitioner

The ArynPartitioner uses an object recognition model to partition the document into structured elements.

Parameters:

model_name_or_path -- The HuggingFace coordinates or local path of the model. Should be left at the default ARYN_DETR_MODEL unless you are testing a custom model. Ignored when running remotely.
threshold -- The threshold to use for accepting the model's predicted bounding boxes. When using Aryn DocParse, this defaults to "auto", where the service will automatically find the best predictions. You can override this or set it locally by specifying a numerical threshold between 0 and 1. A lower value will include more objects, but may have overlaps, while a higher value will reduce the number of overlaps, but may miss legitimate objects.
use_ocr -- Whether to use OCR to extract text from the PDF. If false, we will attempt to extract the text from the underlying PDF. default: False
ocr_model -- The model to use for OCR. Choices are "easyocr", "paddle", "tesseract" and "legacy". The first three correspond to EasyOCR, PaddleOCR, and Tesseract respectively, while "legacy" is a combination of Tesseract for text and EasyOCR for tables. If you choose paddle, make sure to install paddlepaddle or paddlepaddle-gpu depending on whether you have a CPU or GPU. Further details are found at: https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html. Note: this will be ignored for Aryn DocParse, which uses its own OCR implementation. default: "easyocr"
per_element_ocr -- If true, will run OCR on each element individually instead of the entire page. Note: this will be ignored for Aryn DocParse, which uses its own OCR implementation. default: True
extract_table_structure -- If true, runs a separate table extraction model to extract cells from regions of the document identified as tables.
table_structure_extractor -- The table extraction implementation to use when extract_table_structure is True. The default is the TableTransformerStructureExtractor. Ignored when running remotely.
table_extraction_options -- Dictionary of options that are sent to the TableExtractor implementation. Currently supports union_tokens, which is a boolean that controls whether to union OCR / PDFMiner tokens in the table cells. default: {"union_tokens": False}
extract_images -- If true, crops each region identified as an image and attaches it to the associated ImageElement. This can later be fed into the SummarizeImages transform. default: False
device -- Device on which to run the partitioning model locally. One of 'cpu', 'cuda', and 'mps'. If not set, Sycamore will choose based on what's available. If running remotely, this doesn't matter.
batch_size -- How many pages to partition at once, when running locally. Default is 1. Ignored when running remotely.
use_partitioning_service -- If true, runs the partitioner remotely. Defaults to True.
aryn_api_key -- The account token used to authenticate with Aryn's servers.
aryn_partitioner_address -- The address of the server to use to partition the document
use_cache -- Cache results from the partitioner for faster inferences on the same documents in future runs. default: False
pages_per_call -- Number of pages to send in a single call to the remote service. Default is -1, which means send all pages in one call.
output_format -- controls output representation: json (default) or markdown.
text_extraction_options -- Dict of options that are sent to the TextExtractor implementation, either pdfminer or OCR. Currently supports the 'object_type' property for pdfminer, which can be set to 'boxes' or 'lines' to control the granularity of output.
source -- The application that is using the partitioner. This is used for logging purposes.
output_label_options -- A dictionary for configuring output label behavior. It supports two options:

promote_title, a boolean specifying whether to pick the largest element by font size on the first page, from among the elements on that page that have one of the types specified in title_candidate_elements, and promote it to type "Title" if there is no element of type "Title" on the first page already.

title_candidate_elements, a list of strings representing the label types allowed to be promoted to a title.

Here is an example set of output label options:
{"promote_title": True, "title_candidate_elements": ["Section-header", "Caption"]}

default: {} (no element is promoted to "Title")
sort_mode -- Reading order algorithm: bbox (default) or xycut.
kwargs -- Additional keyword arguments to pass to the remote partitioner.
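
The promote_title rule described under output_label_options can be sketched in plain Python. This is an illustration of the stated rule under assumptions, not the partitioner's code: the function name promote_title and the dictionary keys "page", "type", "font_size", and "text" are hypothetical.

```python
from typing import Dict, List, Optional


def promote_title(elements: List[Dict], title_candidate_elements: List[str]) -> Optional[Dict]:
    """Promote the largest first-page candidate to "Title" if no Title exists.

    Returns the promoted element, or None when nothing was promoted.
    Hypothetical helper; the element layout is an assumption.
    """
    first_page = [e for e in elements if e["page"] == 1]
    # Nothing to do when the first page already has a title.
    if any(e["type"] == "Title" for e in first_page):
        return None
    candidates = [e for e in first_page if e["type"] in title_candidate_elements]
    if not candidates:
        return None
    best = max(candidates, key=lambda e: e["font_size"])
    best["type"] = "Title"
    return best


elements = [
    {"page": 1, "type": "Section-header", "font_size": 18.0, "text": "Annual Report"},
    {"page": 1, "type": "Caption", "font_size": 9.0, "text": "Figure 1"},
    {"page": 1, "type": "Text", "font_size": 24.0, "text": "Large body text"},
    {"page": 2, "type": "Section-header", "font_size": 30.0, "text": "Appendix"},
]
promoted = promote_title(elements, ["Section-header", "Caption"])
print(promoted["text"])  # Annual Report
```

The larger "Text" element and the larger header on page 2 are both skipped: the first is not a candidate type, and the second is not on the first page.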

Example

The following example uses the ArynPartitioner to partition a PDF and extract both table structure and images:

import sycamore
from sycamore.transforms.partition import ArynPartitioner

context = sycamore.init()
partitioner = ArynPartitioner(use_partitioning_service=False, extract_table_structure=True, extract_images=True)
context.read.binary(paths, binary_format="pdf").partition(partitioner=partitioner)
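
The pages_per_call parameter listed above controls how a document is split across calls to the remote service. A minimal sketch of that batching, assuming 1-indexed inclusive page ranges and a hypothetical helper named page_batches (the real client may split pages differently):

```python
from typing import Iterator, Tuple


def page_batches(page_count: int, pages_per_call: int = -1) -> Iterator[Tuple[int, int]]:
    """Yield (first_page, last_page) ranges, 1-indexed and inclusive.

    pages_per_call == -1 means all pages go in a single call.
    Hypothetical helper for illustration; not part of Sycamore.
    """
    if pages_per_call == -1:
        if page_count > 0:
            yield (1, page_count)
        return
    if pages_per_call <= 0:
        raise ValueError("pages_per_call must be -1 or a positive integer")
    for first in range(1, page_count + 1, pages_per_call):
        yield (first, min(first + pages_per_call - 1, page_count))


print(list(page_batches(10)))     # [(1, 10)]
print(list(page_batches(10, 4)))  # [(1, 4), (5, 8), (9, 10)]
```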

class sycamore.transforms.partition.SycamorePartitioner(model_name_or_path='Aryn/deformable-detr-DocLayNet', threshold: float = 0.4, use_ocr=False, ocr_tables=False, extract_table_structure=False, table_structure_extractor=None, extract_images=False, device=None, batch_size: int = 1)

Bases: ArynPartitioner

The SycamorePartitioner is equivalent to the ArynPartitioner, except that it only runs locally. This class mostly exists for backwards compatibility with scripts written before the remote partitioning service existed. Please use ArynPartitioner instead.