Partition

class sycamore.transforms.partition.HtmlPartitioner(skip_headers_and_footers: bool = True, extract_tables: bool = False, text_chunker: sycamore.functions.chunker.Chunker = <sycamore.functions.chunker.TextOverlapChunker object>, tokenizer: sycamore.functions.tokenizer.Tokenizer = <sycamore.functions.tokenizer.CharacterTokenizer object>)

Bases: Partitioner

HtmlPartitioner processes HTML documents and extracts structured content from them.

Parameters:
  • skip_headers_and_footers – Whether to skip headers and footers in the document. Default is True.

  • extract_tables – Whether to extract tables from the HTML document. Default is False.

  • text_chunker – The text chunking strategy to use for processing text content. Default is a TextOverlapChunker.

  • tokenizer – The tokenizer to use for tokenizing text content. Default is a CharacterTokenizer.

Example

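import sycamore
from sycamore.functions.chunker import TextOverlapChunker
from sycamore.functions.tokenizer import CharacterTokenizer
from sycamore.transforms.partition import HtmlPartitioner

html_partitioner = HtmlPartitioner(
    skip_headers_and_footers=True,
    extract_tables=True,
    text_chunker=TextOverlapChunker(chunk_token_count=1000, chunk_overlap_token_count=100),
    tokenizer=CharacterTokenizer(),
)

# `paths` is the location of the input HTML files; it is not defined in this snippet.
context = sycamore.init()
html_docset = (
    context.read.binary(paths, binary_format="html")
    .partition(partitioner=html_partitioner)
)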
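
The chunker arguments in the example above can be easier to read with a small standalone sketch of overlapping windows. This is not Sycamore's implementation: `character_tokenize` and `overlap_chunks` are hypothetical helpers, and the exact windowing rules of TextOverlapChunker may differ. The sketch only assumes that `chunk_token_count` bounds the size of each chunk and that `chunk_overlap_token_count` is the number of tokens shared by consecutive chunks.

```python
from typing import List


def character_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a character tokenizer: one token per character."""
    return list(text)


def overlap_chunks(
    tokens: List[str], chunk_token_count: int, chunk_overlap_token_count: int
) -> List[List[str]]:
    """Split tokens into windows of at most chunk_token_count tokens.

    Each window starts chunk_token_count - chunk_overlap_token_count tokens
    after the previous one, so consecutive windows share the overlap.
    """
    if chunk_overlap_token_count >= chunk_token_count:
        raise ValueError("overlap must be smaller than the chunk size")
    step = chunk_token_count - chunk_overlap_token_count
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_token_count])
        # Stop once a window has reached the end of the input.
        if start + chunk_token_count >= len(tokens):
            break
    return chunks


tokens = character_tokenize("abcdefghij")
chunks = ["".join(c) for c in overlap_chunks(tokens, chunk_token_count=4, chunk_overlap_token_count=1)]
print(chunks)  # ['abcd', 'defg', 'ghij']
```

With a chunk size of 4 and an overlap of 1, each chunk begins with the last character of the previous chunk.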

class sycamore.transforms.partition.Partition(child: Node, partitioner: Partitioner, table_extractor: TableExtractor | None = None, **resource_args)

Bases: Transform

The Partition transform segments documents into elements. For example, a typical partitioner might chunk a document into elements corresponding to paragraphs, images, and tables. Partitioners are format specific: for HTML, use the HtmlPartitioner; for PDFs, use the UnstructuredPdfPartitioner, which relies on the open-source Unstructured library.

Parameters:
  • child – The source node or component that provides the dataset to be partitioned.

  • partitioner – An instance of a Partitioner class to be applied to each document.

  • table_extractor – An optional TableExtractor instance. Default is None.

  • resource_args – Additional resource-related arguments that can be passed to the Partition operation.

Example

source_node = ...  # Define a source node or component that provides a dataset.
custom_partitioner = MyPartitioner(partitioner_params)
partition_transform = Partition(child=source_node, partitioner=custom_partitioner)
partitioned_dataset = partition_transform.execute()
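
To make "segmenting documents into elements" concrete, here is a minimal standalone sketch. It does not use Sycamore: `ToyDocument`, `ToyElement`, `ToyPartitioner`, `ParagraphPartitioner`, and `toy_partition` are hypothetical names invented for illustration, and the real Partition transform runs inside a Sycamore execution plan instead of a plain list comprehension.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyElement:
    kind: str  # e.g. "paragraph", "table", "image"
    text: str


@dataclass
class ToyDocument:
    text: str
    elements: List[ToyElement] = field(default_factory=list)


class ToyPartitioner:
    """Hypothetical format-specific partitioner interface."""

    def partition(self, document: ToyDocument) -> ToyDocument:
        raise NotImplementedError


class ParagraphPartitioner(ToyPartitioner):
    """Treats each blank-line-separated block of plain text as one paragraph."""

    def partition(self, document: ToyDocument) -> ToyDocument:
        blocks = [block.strip() for block in document.text.split("\n\n")]
        document.elements = [ToyElement("paragraph", b) for b in blocks if b]
        return document


def toy_partition(documents: List[ToyDocument], partitioner: ToyPartitioner) -> List[ToyDocument]:
    """Apply the partitioner to every document, like a map over the dataset."""
    return [partitioner.partition(doc) for doc in documents]


docs = [ToyDocument("First paragraph.\n\nSecond paragraph."), ToyDocument("Only one.")]
partitioned = toy_partition(docs, ParagraphPartitioner())
for doc in partitioned:
    print([(e.kind, e.text) for e in doc.elements])
```

Swapping in a different partitioner subclass changes how each document is split without touching the code that walks the dataset, which is the separation the `partitioner` parameter expresses.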

class sycamore.transforms.partition.SycamorePartitioner(model_name_or_path, threshold: float = 0.4, use_ocr=False, ocr_images=False, ocr_tables=False)

Bases: Partitioner

class sycamore.transforms.partition.UnstructuredPPTXPartitioner(include_page_breaks: bool = False, include_metadata: bool = True, include_slide_notes: bool = False, chunking_strategy: str | None = None, **kwargs)

Bases: Partitioner

UnstructuredPPTXPartitioner uses the open-source Unstructured library to extract structured elements from PPTX files.

Parameters:
  • include_page_breaks – Whether to include page breaks as separate elements. Default is False.

  • include_metadata – Whether to include metadata in the partitioned elements. Default is True.

  • include_slide_notes – Whether to include slide notes in the output. Default is False.

  • chunking_strategy – The chunking strategy to apply, if any. Default is None.

  • kwargs – Additional keyword arguments accepted by the constructor.

Example

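import sycamore
from sycamore.transforms.partition import UnstructuredPPTXPartitioner

pptx_partitioner = UnstructuredPPTXPartitioner(
    include_page_breaks=False,
    include_metadata=True,
    include_slide_notes=False,
    chunking_strategy=None,
)

# `paths` is the location of the input PPTX files; it is not defined in this snippet.
context = sycamore.init()
pptx_docset = (
    context.read.binary(paths, binary_format="pptx")
    .partition(partitioner=pptx_partitioner)
)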

class sycamore.transforms.partition.UnstructuredPdfPartitioner(include_page_breaks: bool = False, strategy: str = 'auto', infer_table_structure: bool = False, languages: list[str] = ['eng'], max_partition_length: int | None = None, min_partition_length: int | None = 500, include_metadata: bool = True, retain_coordinates: bool = False)

Bases: Partitioner

UnstructuredPdfPartitioner uses the open-source Unstructured library to extract structured elements from PDFs.

Parameters:
  • include_page_breaks – Whether to include page breaks as separate elements.

  • strategy – The partitioning strategy to use (“auto” for automatic detection).

  • infer_table_structure – Whether to infer table structures in the document.

  • languages – The list of languages to use for OCR. Default is [“eng”] (English).

  • max_partition_length – The maximum length of each partition (in characters). Default is None.

  • min_partition_length – The minimum length of each partition (in characters). Default is 500.

  • include_metadata – Whether to include metadata in the partitioned elements.

  • retain_coordinates – Whether to keep the coordinates property from Unstructured. Default is False. In either case, bbox will be populated.

Example

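import sycamore
from sycamore.transforms.partition import UnstructuredPdfPartitioner

pdf_partitioner = UnstructuredPdfPartitioner(
    include_page_breaks=True,
    strategy="auto",
    infer_table_structure=True,
    languages=["eng"],
    max_partition_length=2000,
    include_metadata=True,
)

# `paths` is the location of the input PDF files; it is not defined in this snippet.
context = sycamore.init()
pdf_docset = (
    context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=pdf_partitioner)
)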
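
As a rough illustration of what "page breaks as separate elements" means for `include_page_breaks`, the standalone sketch below flattens per-page text into one element list. It is a toy under stated assumptions, not the Unstructured library's behavior: `flatten_pages` and the `("PageBreak", "")` marker are hypothetical, and the real library may place or represent page breaks differently.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (element type, text)


def flatten_pages(pages: List[List[str]], include_page_breaks: bool = False) -> List[Element]:
    """Flatten per-page paragraphs into a single list of elements.

    When include_page_breaks is True, a marker element is inserted between
    consecutive pages (an assumption made for this sketch).
    """
    elements: List[Element] = []
    for index, paragraphs in enumerate(pages):
        if include_page_breaks and index > 0:
            elements.append(("PageBreak", ""))
        elements.extend(("Text", paragraph) for paragraph in paragraphs)
    return elements


pages = [["Intro", "Details"], ["Summary"]]
without_breaks = flatten_pages(pages)
with_breaks = flatten_pages(pages, include_page_breaks=True)
print(len(without_breaks), len(with_breaks))  # 3 4
```

Keeping the marker as its own element lets downstream steps recover page boundaries without changing the text elements themselves.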