Partition
- class sycamore.transforms.partition.HtmlPartitioner(skip_headers_and_footers: bool = True, extract_tables: bool = False, text_chunker: Chunker = <sycamore.functions.chunker.TextOverlapChunker object>, tokenizer: Tokenizer = <sycamore.functions.tokenizer.CharacterTokenizer object>)
Bases: Partitioner
HtmlPartitioner processes HTML documents, extracting structured content.
- Parameters:
skip_headers_and_footers -- Whether to skip headers and footers in the document. Default is True.
extract_tables -- Whether to extract tables from the HTML document. Default is False.
text_chunker -- The text chunking strategy to use for processing text content.
tokenizer -- The tokenizer to use for tokenizing text content.
Example
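    import sycamore
    from sycamore.functions.chunker import TextOverlapChunker
    from sycamore.functions.tokenizer import CharacterTokenizer
    from sycamore.transforms.partition import HtmlPartitioner

    html_partitioner = HtmlPartitioner(
        skip_headers_and_footers=True,
        extract_tables=True,
        text_chunker=TextOverlapChunker(chunk_token_count=1000, chunk_overlap_token_count=100),
        tokenizer=CharacterTokenizer(),
    )

    context = sycamore.init()
    # paths: the path or list of paths of the HTML files to read
    html_docset = (
        context.read.binary(paths, binary_format="html")
        .partition(partitioner=html_partitioner)
    )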
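The text_chunker arguments in the example above (chunk_token_count and chunk_overlap_token_count) describe overlapping windows over a token sequence. The sketch below illustrates that general idea in plain Python. It is not Sycamore's implementation: the function name overlap_chunks is hypothetical, and the exact boundary behavior of TextOverlapChunker may differ.

```python
from typing import List, Sequence


def overlap_chunks(
    tokens: Sequence[str],
    chunk_token_count: int,
    chunk_overlap_token_count: int,
) -> List[List[str]]:
    """Split tokens into windows of at most chunk_token_count tokens,
    where consecutive windows share chunk_overlap_token_count tokens.

    Hypothetical helper for illustration only.
    """
    if chunk_token_count <= 0:
        raise ValueError("chunk_token_count must be positive")
    if not 0 <= chunk_overlap_token_count < chunk_token_count:
        raise ValueError("overlap must be in [0, chunk_token_count)")

    # Each new window starts this many tokens after the previous one.
    step = chunk_token_count - chunk_overlap_token_count
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_token_count]))
        # Stop once a window has reached the end of the sequence.
        if start + chunk_token_count >= len(tokens):
            break
    return chunks


# With a character-level tokenizer, every token is a single character.
chars = list("abcdefghij")
chunks = overlap_chunks(chars, chunk_token_count=4, chunk_overlap_token_count=1)
print(["".join(c) for c in chunks])
```

A larger overlap repeats more context across neighboring chunks at the cost of producing more chunks for the same input.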
- class sycamore.transforms.partition.Partition(child: Node, partitioner: Partitioner, table_extractor: TableExtractor | None = None, **resource_args)
Bases: CompositeTransform
The Partition transform segments documents into elements. For example, a typical partitioner might chunk a document into elements corresponding to paragraphs, images, and tables. Partitioners are format-specific: for HTML you can use the HtmlPartitioner, and for PDFs we provide the UnstructuredPdfPartitioner, which uses the open-source Unstructured library.
- Parameters:
child -- The source node or component that provides the dataset to be partitioned.
partitioner -- An instance of a Partitioner class to be applied.
resource_args -- Additional resource-related arguments that can be passed to the Partition operation.
Example
    from sycamore.transforms.partition import Partition

    source_node = ...  # Define a source node or component that provides a dataset.
    custom_partitioner = MyPartitioner(partitioner_params)  # MyPartitioner: your own Partitioner subclass
    partition_transform = Partition(child=source_node, partitioner=custom_partitioner)
    partitioned_dataset = partition_transform.execute()
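To make "segmenting a document into elements" concrete, here is a minimal, library-free sketch. ToyElement and toy_partition are hypothetical names invented for this illustration. They are not Sycamore's Element or Partitioner classes, and real partitioners use far richer format-specific logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyElement:
    """Hypothetical stand-in for a document element (not Sycamore's Element)."""
    type: str
    text: str


def toy_partition(text: str) -> List[ToyElement]:
    """Split plain text on blank lines and give each block a rough type."""
    elements: List[ToyElement] = []
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        lines = block.splitlines()
        # Crude heuristic: a block whose lines all contain "|" is table-like.
        kind = "table" if all("|" in line for line in lines) else "paragraph"
        elements.append(ToyElement(type=kind, text=block))
    return elements


doc = "Intro paragraph.\n\nname | qty\npen | 2\n\nClosing paragraph."
elements = toy_partition(doc)
for element in elements:
    print(element.type, repr(element.text))
```

The output is a flat list of typed elements, which is the shape downstream transforms generally expect to work with after partitioning.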
- class sycamore.transforms.partition.ArynPartitioner(model_name_or_path='Aryn/deformable-detr-DocLayNet', threshold: float | Literal['auto'] | None = None, use_ocr: bool = False, ocr_images: bool = False, ocr_model: str = 'easyocr', per_element_ocr: bool = True, extract_table_structure: bool = False, table_structure_extractor: TableStructureExtractor | None = None, extract_images: bool = False, device=None, batch_size: int = 1, use_partitioning_service: bool = True, aryn_api_key: str = '', aryn_partitioner_address: str = 'https://api.aryn.cloud/v1/document/partition', use_cache=False, pages_per_call: int = -1, cache: Cache | None = None, output_format: str | None = None)
Bases: Partitioner
The ArynPartitioner uses an object recognition model to partition the document into structured elements.
- Parameters:
model_name_or_path -- The HuggingFace coordinates or local path of the model. Should be left at the default ARYN_DETR_MODEL unless you are testing a custom model. Ignored when running remotely.
threshold -- The threshold to use for accepting the model's predicted bounding boxes. When using the Aryn Partitioning Service, this defaults to "auto", where the service will automatically find the best predictions. You can override this or set it locally by specifying a numerical threshold between 0 and 1. A lower value will include more objects, but may have overlaps, while a higher value will reduce the number of overlaps, but may miss legitimate objects.
use_ocr -- Whether to use OCR to extract text from the PDF. If false, we will attempt to extract the text from the underlying PDF. default: False
ocr_images -- If set with use_ocr, will attempt to OCR regions of the document identified as images. default: False
ocr_model -- The model to use for OCR. Choices are "easyocr", "paddle", "tesseract", and "legacy". The first three correspond to EasyOCR, PaddleOCR, and Tesseract respectively; "legacy" is a combination of Tesseract for text and EasyOCR for tables. If you choose "paddle", make sure to install paddlepaddle or paddlepaddle-gpu, depending on whether you have a CPU or GPU. Further details are found at: https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html. Note: this will be ignored for the Aryn Partitioning Service, which uses its own OCR implementation. default: "easyocr"
per_element_ocr -- If true, will run OCR on each element individually instead of the entire page. Note: this will be ignored for the Aryn Partitioning Service, which uses its own OCR implementation. default: True
extract_table_structure -- If true, runs a separate table extraction model to extract cells from regions of the document identified as tables.
table_structure_extractor -- The table extraction implementation to use when extract_table_structure is True. The default is the TableTransformerStructureExtractor. Ignored when running remotely.
extract_images -- If true, crops each region identified as an image and attaches it to the associated ImageElement. This can later be fed into the SummarizeImages transform. default: False
device -- Device on which to run the partitioning model locally. One of 'cpu', 'cuda', and 'mps'. If not set, Sycamore will choose based on what's available. If running remotely, this doesn't matter.
batch_size -- How many pages to partition at once, when running locally. Default is 1. Ignored when running remotely.
use_partitioning_service -- If true, runs the partitioner remotely using the Aryn Partitioning Service; if false, runs it locally. Defaults to True.
aryn_api_key -- The account token used to authenticate with Aryn's servers.
aryn_partitioner_address -- The address of the server to use to partition the document.
use_cache -- Cache results from the partitioner for faster inferences on the same documents in future runs. default: False
pages_per_call -- Number of pages to send in a single call to the remote service. Default is -1, which means send all pages in one call.
output_format -- Controls the output representation: json (default) or markdown.
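The pages_per_call parameter described above can be pictured as splitting a document into page ranges, one range per request, with -1 meaning a single request for all pages. The following is a rough sketch of that arithmetic only. The function name page_batches and the 1-based inclusive ranges are assumptions for illustration, not the way Sycamore or the remote service represents calls.

```python
from typing import List, Tuple


def page_batches(num_pages: int, pages_per_call: int = -1) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges, one per call.

    Hypothetical helper: -1 means "send every page in a single call".
    """
    if num_pages <= 0:
        return []
    if pages_per_call == -1:
        return [(1, num_pages)]
    if pages_per_call < 1:
        raise ValueError("pages_per_call must be -1 or a positive integer")
    return [
        (first, min(first + pages_per_call - 1, num_pages))
        for first in range(1, num_pages + 1, pages_per_call)
    ]


print(page_batches(10))      # one call covering the whole document
print(page_batches(10, 4))   # several smaller calls
```

Smaller batches mean more requests but less data per request, which can matter for very long documents.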
Example
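The following example uses the ArynPartitioner to partition a PDF and extract both table structure and images:
    import sycamore
    from sycamore.transforms.partition import ArynPartitioner

    context = sycamore.init()
    # use_partitioning_service=False runs the model locally instead of calling the remote service.
    partitioner = ArynPartitioner(use_partitioning_service=False, extract_table_structure=True, extract_images=True)
    context.read.binary(paths, binary_format="pdf").partition(partitioner=partitioner)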
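The trade-off described for the threshold parameter (a lower value keeps more predicted objects, a higher value keeps fewer) can be illustrated with a simple score filter. This is only a sketch under stated assumptions: the Prediction class and filter_by_threshold function are hypothetical, the scores are made-up sample values, and the "auto" mode of the Aryn Partitioning Service is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    """Hypothetical detector output: label, confidence score, bounding box."""
    label: str
    score: float
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def filter_by_threshold(predictions: List[Prediction], threshold: float) -> List[Prediction]:
    """Keep only predictions whose confidence is at least the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return [p for p in predictions if p.score >= threshold]


# Made-up sample predictions for a single page.
predictions = [
    Prediction("Text", 0.92, (10, 10, 200, 40)),
    Prediction("Table", 0.55, (10, 60, 200, 180)),
    Prediction("Picture", 0.18, (15, 65, 190, 170)),  # overlaps the table region
]

permissive = filter_by_threshold(predictions, 0.1)
strict = filter_by_threshold(predictions, 0.8)
print(len(permissive), len(strict))
```

With the permissive setting the low-confidence, overlapping "Picture" box survives; with the strict setting the legitimate "Table" box is dropped along with it.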
- class sycamore.transforms.partition.SycamorePartitioner(model_name_or_path='Aryn/deformable-detr-DocLayNet', threshold: float = 0.4, use_ocr=False, ocr_images=False, ocr_tables=False, extract_table_structure=False, table_structure_extractor=None, extract_images=False, device=None, batch_size: int = 1)
Bases: ArynPartitioner
The SycamorePartitioner is equivalent to the ArynPartitioner, except that it only runs locally. This class mostly exists for backwards compatibility with scripts written before the remote partitioning service existed. Please use ArynPartitioner instead.
- class sycamore.transforms.partition.UnstructuredPPTXPartitioner(include_page_breaks: bool = False, include_metadata: bool = True, include_slide_notes: bool = False, chunking_strategy: str | None = None, **kwargs)
Bases: Partitioner
UnstructuredPPTXPartitioner uses the open-source Unstructured library to extract structured elements from unstructured PPTX files.
- Parameters:
include_page_breaks -- Whether to include page breaks as separate elements.
strategy -- The partitioning strategy to use ("auto" for automatic detection).
infer_table_structure -- Whether to infer table structures in the document.
ocr_languages -- The languages to use for OCR. Default is "eng" (English).
max_partition_length -- The maximum length of each partition (in characters).
include_metadata -- Whether to include metadata in the partitioned elements.
include_slide_notes -- Whether to include slide notes in the partitioned elements. Default is False.
chunking_strategy -- The chunking strategy to pass to the Unstructured library. Default is None.
Example
    import sycamore
    from sycamore.transforms.partition import UnstructuredPPTXPartitioner

    pptx_partitioner = UnstructuredPPTXPartitioner(
        include_page_breaks=False,
        include_metadata=True,
        include_slide_notes=False,
        chunking_strategy=None,
    )

    context = sycamore.init()
    pptx_docset = (
        context.read.binary(paths, binary_format="pptx")
        .partition(partitioner=pptx_partitioner)
    )
- class sycamore.transforms.partition.UnstructuredPdfPartitioner(include_page_breaks: bool = False, strategy: str = 'auto', infer_table_structure: bool = False, languages: list[str] = ['eng'], max_partition_length: int | None = None, min_partition_length: int | None = 500, include_metadata: bool = True, retain_coordinates: bool = False)
Bases: Partitioner
UnstructuredPdfPartitioner uses the open-source Unstructured library to extract structured elements from unstructured PDFs.
- Parameters:
include_page_breaks -- Whether to include page breaks as separate elements.
strategy -- The partitioning strategy to use ("auto" for automatic detection).
infer_table_structure -- Whether to infer table structures in the document.
languages -- The languages to use for OCR. Default is ["eng"] (English).
max_partition_length -- The maximum length of each partition (in characters).
min_partition_length -- The minimum length of each partition (in characters). Default is 500.
include_metadata -- Whether to include metadata in the partitioned elements.
retain_coordinates -- Whether to keep the coordinates property from unstructured. Default is False. In either case, bbox will be populated.
Example
    import sycamore
    from sycamore.transforms.partition import UnstructuredPdfPartitioner

    pdf_partitioner = UnstructuredPdfPartitioner(
        include_page_breaks=True,
        strategy="auto",
        infer_table_structure=True,
        languages=["eng"],
        max_partition_length=2000,
        include_metadata=True,
    )

    context = sycamore.init()
    pdf_docset = (
        context.read.binary(paths, binary_format="pdf")
        .partition(partitioner=pdf_partitioner)
    )