Partition
To process raw documents and bring them into DocSets, Sycamore must first segment and chunk the document and label each element, such as headings, tables, and figures. This process is called document partitioning, and it is carried out by the Partition transform. Partitioners are format specific, and we recommend:
PDF: Aryn Partitioner
HTML: HTML Partitioner
Sycamore also includes the UnstructuredPdfPartitioner for PDFs.
partitioned_docset = docset.partition(partitioner=ArynPartitioner())
Aryn Partitioner
The Aryn Partitioner was built from the ground up for high-quality segmentation, with a new AI vision model at its core. This model is a Deformable DEtection Transformer (DETR) model trained on DocLayNet, an open source, human-annotated document layout segmentation dataset. This model is 100% open source with an Apache v2.0 license.
There are several options you can use in the Aryn Partitioner for table extraction, OCR, and more.
Parameters:
use_partitioning_service: If True, the partitioner uses Aryn DocParse (formerly known as the Aryn Partitioning Service). Defaults to True. For the options the service supports, see the Aryn Partitioning Service docs.
model_name_or_path: The HuggingFace coordinates or local path of the model. Defaults to SYCAMORE_DETR_MODEL; change it only if you are testing a custom model. Ignored when use_partitioning_service is True.
threshold: The threshold for accepting the model’s predicted bounding boxes. Defaults to "auto", where the service uses a processing method to find the best prediction for each possible bounding box. This is the recommended setting. You can override it by specifying a numerical threshold between 0 and 1, in which case only bounding boxes with confidence scores higher than the threshold are returned (instead of using the processing method described above). A lower value includes more objects but may produce overlaps, while a higher value reduces the number of overlaps but may miss legitimate objects. If you set the threshold manually, we recommend starting with a value of 0.32.
ocr_model: The model to use for OCR. Choices are "easyocr", "paddle", "tesseract", and "legacy". The first three correspond to EasyOCR, PaddleOCR, and Tesseract respectively; "legacy" is a combination of Tesseract for text and EasyOCR for tables. If you choose "paddle", make sure to install paddlepaddle or paddlepaddle-gpu depending on whether you have a CPU or GPU. Further details are at https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html. Note: this is ignored for Aryn DocParse, which uses its own OCR implementation. Defaults to "easyocr".
use_ocr: If True, the partitioner uses OCR to extract text from the PDF. Defaults to False, where the partitioner attempts to directly extract the text from the underlying PDF in the bounding box. It currently uses Tesseract for extraction.
per_element_ocr: If True, runs OCR on each element individually instead of on the entire page. Note: this is ignored for Aryn DocParse, which uses its own OCR implementation. Defaults to True.
extract_table_structure: If True, the partitioner runs a separate table extraction model to extract cells from regions of the document identified as tables.
table_structure_extractor: The table extraction implementation to use when extract_table_structure is True. The default is the TableTransformerStructureExtractor. Ignored when use_partitioning_service is True.
extract_images: If True, the partitioner crops each region identified as an image and attaches it to the associated ImageElement. This can later be fed into the SummarizeImages transform.
aryn_api_key: The account token used to authenticate with Aryn's servers.
aryn_partitioner_address: The address of the server to use to partition the document.
use_cache: Cache results from the partitioner for faster inference on the same documents in future runs. Defaults to False.
pages_per_call: The number of pages to send in a single call to the remote service. Defaults to -1, which means all pages are sent in one call.
output_format: Controls the output representation: json (default) or markdown.
text_extraction_options: A dictionary of options that are sent to the TextExtractor implementation, either pdfminer or OCR. Currently supports the 'object_type' property for pdfminer, which can be set to 'boxes' or 'lines' to control the granularity of output. Note that this has a separate implementation for the Aryn DocParse service.
output_label_options: A dictionary for configuring output label behavior. It supports two options: title_candidate_elements, a list of strings representing the label types allowed to be promoted to a title; and promote_title, a boolean specifying whether to promote an element to type "Title" when the first page has no element of type "Title" already. The promoted element is the one with the largest font size among the first-page elements whose type is listed in title_candidate_elements. Here is an example set of output label options: {"promote_title": True, "title_candidate_elements": ["Section-header", "Caption"]}. Defaults to None (no element is promoted to "Title").
**kwargs: Additional keyword arguments to pass to the remote partitioner. See the Aryn Partitioning Service docs for more information.
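To make the numeric threshold setting concrete, here is a minimal sketch of confidence-based filtering. It is not the partitioner's actual code: the PredictedBox class, its fields, and filter_by_threshold are hypothetical names used only for illustration, and the "auto" mode is not modeled.

```python
from dataclasses import dataclass


@dataclass
class PredictedBox:
    """Hypothetical stand-in for one layout prediction (not a Sycamore class)."""
    label: str
    score: float  # model confidence in [0, 1]


def filter_by_threshold(boxes, threshold):
    """Keep only boxes whose confidence is strictly higher than the threshold."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return [box for box in boxes if box.score > threshold]


predictions = [
    PredictedBox("Title", 0.91),
    PredictedBox("Table", 0.40),
    PredictedBox("Text", 0.15),
]

# A lower threshold keeps more boxes; a higher one keeps fewer.
kept_low = filter_by_threshold(predictions, 0.10)
kept_recommended = filter_by_threshold(predictions, 0.32)
kept_high = filter_by_threshold(predictions, 0.80)
print([box.label for box in kept_recommended])
```

In this made-up data, lowering the threshold to 0.10 admits the low-confidence "Text" box, while raising it to 0.80 drops the "Table" box, which mirrors the trade-off described for the threshold parameter.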
Here is an example of chunking and using table extraction:
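import sycamore
from sycamore.transforms.partition import ArynPartitioner
ctx = sycamore.init()
# Read PDFs from S3, then partition them with table structure extraction enabled
docset = (ctx.read.binary("s3://my-bucket/my-folder/", binary_format="pdf")
          .partition(partitioner=ArynPartitioner(extract_table_structure=True)))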
Here is an example of chunking and using OCR:
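import sycamore
from sycamore.transforms.partition import ArynPartitioner
ctx = sycamore.init()
# Read PDFs from S3, then partition them using OCR for text extraction
docset = (ctx.read.binary("s3://my-bucket/my-folder/", binary_format="pdf")
          .partition(partitioner=ArynPartitioner(use_ocr=True)))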
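The promote_title rule described under output_label_options can be pictured as a small selection step. The sketch below is an illustration under stated assumptions, not the partitioner's implementation: the Element class, its fields (type, page, font_size, text), and the promote_title function are hypothetical, and pages are assumed to be numbered from 1.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical element record (not a Sycamore class)."""
    type: str
    page: int         # assumed 1-based page number
    font_size: float
    text: str


def promote_title(elements, title_candidate_elements):
    """Promote the largest-font candidate on page 1 to "Title" if page 1 has no Title.

    Returns the promoted element, or None when nothing was promoted.
    """
    first_page = [e for e in elements if e.page == 1]
    if any(e.type == "Title" for e in first_page):
        return None  # a Title already exists, so nothing is promoted
    candidates = [e for e in first_page if e.type in title_candidate_elements]
    if not candidates:
        return None
    best = max(candidates, key=lambda e: e.font_size)
    best.type = "Title"
    return best


options = {"promote_title": True, "title_candidate_elements": ["Section-header", "Caption"]}
elements = [
    Element("Section-header", 1, 18.0, "Annual Report"),
    Element("Caption", 1, 9.0, "Figure 1"),
    Element("Text", 1, 24.0, "Large body text"),     # not a candidate type
    Element("Section-header", 2, 30.0, "Appendix"),  # not on the first page
]

promoted = None
if options["promote_title"]:
    promoted = promote_title(elements, options["title_candidate_elements"])
print(promoted.text if promoted else "no promotion")
```

Note that the larger "Text" element and the page-2 header are both skipped: only first-page elements with a listed candidate type compete for promotion.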
HTML Partitioner
The HtmlPartitioner segments and chunks HTML documents by using the embedded structure of the HTML format.
Parameters:
skip_headers_and_footers: Whether to skip headers and footers in the document. Defaults to True.
extract_tables: Whether to extract tables from the HTML document. Defaults to False.
text_chunker: The text chunking strategy to use for processing text content. The default is the TextOverlapChunker; see its documentation for more information. Its default values are chunk_token_count: 1000 and chunk_overlap_token_count: 100.
tokenizer: The tokenizer to use for tokenizing text content. By default, the CharacterTokenizer is used.
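To illustrate how chunk_token_count and chunk_overlap_token_count interact, here is a minimal sketch of overlapping chunking over character tokens. It shows the general technique rather than Sycamore's chunker or tokenizer: character_tokenize and chunk_with_overlap are hypothetical helpers, and stopping once a window reaches the end of the input is a choice made for this sketch.

```python
def character_tokenize(text):
    """Treat every character as one token (hypothetical helper)."""
    return list(text)


def chunk_with_overlap(tokens, chunk_token_count=1000, chunk_overlap_token_count=100):
    """Split tokens into windows of chunk_token_count tokens, where each window
    shares chunk_overlap_token_count tokens with the previous one."""
    if chunk_token_count <= 0:
        raise ValueError("chunk size must be positive")
    if not 0 <= chunk_overlap_token_count < chunk_token_count:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")
    stride = chunk_token_count - chunk_overlap_token_count
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_token_count])
        if start + chunk_token_count >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


tokens = character_tokenize("abcdefghij")
chunks = ["".join(c) for c in chunk_with_overlap(tokens, chunk_token_count=4, chunk_overlap_token_count=1)]
print(chunks)
```

With the defaults of 1000 and 100, each new chunk in this sketch starts 900 tokens after the previous one, so neighboring chunks share 100 tokens of context.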
Here is an example of chunking and using table extraction:
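import sycamore
from sycamore.transforms.partition import HtmlPartitioner
ctx = sycamore.init()
# Read HTML files from S3, then partition them with table extraction enabled
docset = (ctx.read.binary("s3://my-bucket/my-folder/", binary_format="html")
          .partition(partitioner=HtmlPartitioner(extract_tables=True)))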
Here is an example of chunking and adjusting the chunking strategy:
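import sycamore
from sycamore.functions import TextOverlapChunker  # import path assumed; adjust to your Sycamore version
from sycamore.transforms.partition import HtmlPartitioner
ctx = sycamore.init()
docset = (ctx.read.binary("s3://my-bucket/my-folder/", binary_format="html")
          .partition(partitioner=HtmlPartitioner(text_chunker=TextOverlapChunker(chunk_token_count=800, chunk_overlap_token_count=150))))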