Merge Elements

class sycamore.transforms.merge_elements.ElementMerger

Bases: ABC

merge_elements(document: Document) → Document

Use self.should_merge and self.merge to greedily merge consecutive elements. If the next element should be merged into the last 'accumulation' element, merge it.

Parameters: document (Document) -- A document with elements to be merged.
Returns: The same document, with its elements merged
Return type: Document
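The greedy pass described above can be sketched in plain Python. This is a minimal illustration, not Sycamore's implementation: `greedy_merge` is a hypothetical helper, and the `should_merge` and `merge` callables stand in for the ElementMerger methods of the same names, here operating on strings rather than Elements.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def greedy_merge(
    elements: List[T],
    should_merge: Callable[[T, T], bool],
    merge: Callable[[T, T], T],
) -> List[T]:
    """Fold consecutive items into an accumulator while should_merge allows it."""
    if not elements:
        return []
    merged: List[T] = []
    accumulator = elements[0]
    for nxt in elements[1:]:
        if should_merge(accumulator, nxt):
            # Grow the current accumulation element.
            accumulator = merge(accumulator, nxt)
        else:
            # Close the accumulator and start a new one at nxt.
            merged.append(accumulator)
            accumulator = nxt
    merged.append(accumulator)
    return merged


# Hypothetical rule for illustration: merge words that share a first letter.
words = ["apple", "avocado", "banana", "blueberry", "cherry"]
result = greedy_merge(
    words,
    should_merge=lambda a, b: a[0] == b[0],
    merge=lambda a, b: a + " " + b,
)
print(result)  # ['apple avocado', 'banana blueberry', 'cherry']
```

Because each decision compares only the accumulator with the next element, the pass is a single left-to-right scan and never revisits an earlier merge.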

class sycamore.transforms.merge_elements.GreedySectionMerger(tokenizer: Tokenizer, max_tokens: int, merge_across_pages: bool = True)

Bases: ElementMerger

The GreedySectionMerger groups together different elements in a Document according to three rules. All rules are subject to the max_tokens limit and merge_across_pages flag.

1. It merges adjacent text elements.
2. It merges an adjacent Section-header and an image. The new element type is called Section-header+image.
3. It merges an Image and subsequent adjacent text elements.

merge(elt1: Element, elt2: Element) → Element

Merge two elements; the new element's fields will be set as:

type: "Section"
binary_representation: elt1.binary_representation + elt2.binary_representation
text_representation: elt1.text_representation + elt2.text_representation
bbox: the minimal bbox that contains both elt1's and elt2's bboxes
properties: elt1's properties + any of elt2's properties that are not in elt1

note: if elt1 and elt2 have different values for the same property, we take elt1's value
note: if any input field is None we take the other element's field without merge logic

Parameters:

elt1 (Element) -- the first element (along with the number of tokens in it)
elt2 (Element) -- the second element (along with the number of tokens in it)

Returns:

a new merged element from the inputs (along with the number of tokens in it)

Return type:

Element
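The field rules listed for merge can be sketched with plain dictionaries. This is an illustration under stated assumptions, not the library's code: `merge_fields`, `concat`, and `union_bbox` are hypothetical helpers, elements are modelled as dicts, and a bbox is assumed to be an (x1, y1, x2, y2) tuple.

```python
from typing import Any, Dict, Optional, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def union_bbox(a: Optional[BBox], b: Optional[BBox]) -> Optional[BBox]:
    """Smallest box containing both inputs; a None input yields the other box."""
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def concat(a: Optional[Any], b: Optional[Any]) -> Optional[Any]:
    """a + b, except that a None input yields the other value unchanged."""
    if a is None:
        return b
    if b is None:
        return a
    return a + b


def merge_fields(elt1: Dict[str, Any], elt2: Dict[str, Any]) -> Dict[str, Any]:
    # Start from elt2's properties, then overwrite with elt1's so elt1 wins conflicts.
    props = dict(elt2.get("properties") or {})
    props.update(elt1.get("properties") or {})
    return {
        "type": "Section",
        "binary_representation": concat(
            elt1.get("binary_representation"), elt2.get("binary_representation")
        ),
        "text_representation": concat(
            elt1.get("text_representation"), elt2.get("text_representation")
        ),
        "bbox": union_bbox(elt1.get("bbox"), elt2.get("bbox")),
        "properties": props,
    }


first = {
    "text_representation": "Intro. ",
    "bbox": (0.1, 0.1, 0.5, 0.2),
    "properties": {"page_number": 1, "lang": "en"},
}
second = {
    "text_representation": "Details.",
    "bbox": (0.2, 0.15, 0.9, 0.4),
    "properties": {"page_number": 2, "source": "b"},
}
merged = merge_fields(first, second)
print(merged["text_representation"])  # Intro. Details.
print(merged["bbox"])  # (0.1, 0.1, 0.9, 0.4)
```

In this sketch `page_number` stays 1 because elt1's value takes precedence, while `source` is carried over from elt2 because elt1 does not define it.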

class sycamore.transforms.merge_elements.GreedyTextElementMerger(tokenizer: Tokenizer, max_tokens: int, merge_across_pages: bool = True)

Bases: ElementMerger

The GreedyTextElementMerger takes a tokenizer and a token limit, and greedily merges elements together until adding the next element would overflow the token limit, at which point the merger starts work on a new merged element. If an element is already too big, the GreedyTextElementMerger will leave it alone.
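A minimal sketch of this token-limited packing, using plain strings: `count_tokens` is a hypothetical whitespace tokenizer standing in for the Tokenizer argument, `pack_by_tokens` is a hypothetical helper, the single-space joiner is an assumption, and the merge_across_pages flag is not modelled.

```python
from typing import List, Optional


def count_tokens(text: str) -> int:
    """Hypothetical tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def pack_by_tokens(texts: List[str], max_tokens: int) -> List[str]:
    """Greedily join consecutive texts while the running total fits max_tokens."""
    packed: List[str] = []
    current: Optional[str] = None
    current_tokens = 0
    for text in texts:
        n = count_tokens(text)
        if current is not None and current_tokens + n <= max_tokens:
            # Still under the limit: extend the element being built.
            current = current + " " + text
            current_tokens += n
        else:
            # Adding this text would overflow: close the element and start anew.
            if current is not None:
                packed.append(current)
            current, current_tokens = text, n
    if current is not None:
        packed.append(current)
    return packed


packed = pack_by_tokens(["a b", "c d", "e f g h i j", "k"], max_tokens=4)
print(packed)  # ['a b c d', 'e f g h i j', 'k']
```

The six-token text exceeds the limit on its own, so nothing can be merged into it and it passes through unchanged, matching the "leave it alone" behavior described above.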

merge(elt1: Element, elt2: Element) → Element

Merge two elements; the new element's fields will be set as:

type: "Section"
binary_representation: elt1.binary_representation + elt2.binary_representation
text_representation: elt1.text_representation + elt2.text_representation
bbox: the minimal bbox that contains both elt1's and elt2's bboxes
properties: elt1's properties + any of elt2's properties that are not in elt1

note: if elt1 and elt2 have different values for the same property, we take elt1's value
note: if any input field is None we take the other element's field without merge logic

Parameters:

elt1 (Element) -- the first element (along with the number of tokens in it)
elt2 (Element) -- the second element (along with the number of tokens in it)

Returns:

a new merged element from the inputs (along with the number of tokens in it)

Return type:

Element

class sycamore.transforms.merge_elements.MarkedMerger

Bases: ElementMerger

The MarkedMerger merges elements by referencing "marks" placed on the elements by the transforms in sycamore.transforms.bbox_merge and sycamore.transforms.mark_misc. The marks are "_break" and "_drop". The MarkedMerger will merge elements until it hits a "_break" mark, whereupon it will start a new element. It handles elements marked with "_drop" by, well, dropping them entirely.
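The mark handling can be sketched as follows. This is an illustration only: elements are modelled as dicts with a hypothetical "text" key, `merge_marked` is a hypothetical helper, the newline joiner is an assumption, and a "_break" mark is assumed to mean that a new merged element starts at the marked element.

```python
from typing import Dict, List


def merge_marked(elements: List[Dict]) -> List[str]:
    """Merge consecutive texts, honoring '_drop' and '_break' marks."""
    merged: List[str] = []
    for elt in elements:
        if elt.get("_drop"):
            # Dropped elements never reach the output.
            continue
        if not merged or elt.get("_break"):
            # First kept element, or an explicit break: start a new merged element.
            merged.append(elt["text"])
        else:
            # Otherwise keep accumulating into the current merged element.
            merged[-1] += "\n" + elt["text"]
    return merged


marked = [
    {"text": "Title"},
    {"text": "First paragraph."},
    {"text": "Page 1", "_drop": True},
    {"text": "Second section", "_break": True},
    {"text": "More text."},
]
sections = merge_marked(marked)
print(sections)  # ['Title\nFirst paragraph.', 'Second section\nMore text.']
```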

merge_elements(document: Document) → Document

Use self.should_merge and self.merge to greedily merge consecutive elements. If the next element should be merged into the last 'accumulation' element, merge it.

Parameters: document (Document) -- A document with elements to be merged.
Returns: The same document, with its elements merged
Return type: Document

class sycamore.transforms.merge_elements.TableMerger(regex_pattern: Pattern | None = None, llm_prompt: str | None = None, llm: LLM | None = None, sort_mode: str | None = None, *args, **kwargs)

Bases: ElementMerger

The TableMerger performs three operations:

1. If a text element (Caption, Section-header, Text, ...) anywhere on a page matches the regex pattern, its text is attached to the text_representation of the table on that page.

2. LLMQuery adds a table_continuation property to each table element: true if the table is a continuation of a previous table, false otherwise.

3. After LLMQuery, table elements that are continuations are merged into a single element.

Example

import re

import sycamore
from sycamore.llms import OpenAI, OpenAIModels
from sycamore.transforms.merge_elements import TableMerger
from sycamore.transforms.partition import ArynPartitioner

llm = OpenAI(OpenAIModels.GPT_4O, api_key="")

prompt = (
    "Analyze two CSV tables that may be parts of a single table split across pages. "
    "Determine if the second table is a continuation of the first with 100% certainty. "
    "Check either of the following: "
    "1. Column headers: Must be near identical in terms of text (the ordering/text may "
    "contain minor errors because of OCR quality) in both tables. If the headers are "
    "almost the same check the number of columns, they should be roughly the same. "
    "2. Missing headers: If the header/columns in the second table are missing, then the "
    "first row in the second table should logically be in continuation of the last row "
    "in the first table. "
    "Respond with only 'true' or 'false' based on your certainty that the second table "
    "is a continuation. Certainty is determined if either of the two conditions is true."
)

# The constructor signature types regex_pattern as a Pattern, so compile it
# and pass it to TableMerger rather than to the reader.
regex_pattern = re.compile(r"table \d+")

merger = TableMerger(regex_pattern=regex_pattern, llm_prompt=prompt, llm=llm)

context = sycamore.init()
# `paths` is the location of the input PDFs, supplied by the caller.
pdf_docset = (
    context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .merge(merger=merger)
)
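The first operation, attaching regex-matched text to a table on the same page, can be sketched without Sycamore. Everything here is an assumption made for illustration: elements are plain dicts with hypothetical "type" and "text" keys, `attach_captions` is a hypothetical helper, matching is made case-insensitive so that "Table 2" matches, and the matched text is prepended to the table text with a newline.

```python
import re
from typing import Dict, List

# Assumption: case-insensitive matching, so "Table 2" matches r"table \d+".
PATTERN = re.compile(r"table \d+", re.IGNORECASE)


def attach_captions(page_elements: List[Dict]) -> List[Dict]:
    """Prepend matching non-table text on a page to each table's text on that page."""
    captions = [
        e["text"]
        for e in page_elements
        if e["type"] != "Table" and PATTERN.search(e["text"])
    ]
    out: List[Dict] = []
    for e in page_elements:
        if e["type"] == "Table" and captions:
            # Build a new dict so the input elements are left untouched.
            e = {**e, "text": "\n".join(captions + [e["text"]])}
        out.append(e)
    return out


page = [
    {"type": "Caption", "text": "Table 2: Revenue by region"},
    {"type": "Text", "text": "Unrelated paragraph."},
    {"type": "Table", "text": "region,revenue\nEMEA,10"},
]
augmented = attach_captions(page)
print(augmented[2]["text"])
```

Only the table element changes; the caption and the unrelated paragraph are returned as they were.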

merge_elements(document: Document) → Document

Use self.should_merge and self.merge to greedily merge consecutive elements. If the next element should be merged into the last 'accumulation' element, merge it.

Parameters: document (Document) -- A document with elements to be merged.
Returns: The same document, with its elements merged
Return type: Document

class sycamore.transforms.merge_elements.HeaderAugmenterMerger(tokenizer: Tokenizer, max_tokens: int, merge_across_pages: bool = True)

Bases: ElementMerger

The HeaderAugmenterMerger groups together different elements in a Document and enhances the text representation of the elements by adding the preceding section-header/title.

1. It merges certain elements ("Text", "List-item", "Caption", "Footnote", "Formula", "Page-footer", "Page-header").
2. It merges consecutive ("Section-header", "Title") elements.
3. It adds the preceding section-header/title to the text representation of the elements (including tables/images).
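The header-augmentation idea can be sketched with plain dicts. This is a simplified illustration, not the library's behavior: `augment_with_headers` is a hypothetical helper, elements carry hypothetical "type" and "text" keys, the newline joiner is an assumption, headers are folded into the elements that follow them rather than emitted on their own, and the token limit and the merging of adjacent text elements are not modelled.

```python
from typing import Dict, List, Optional

HEADER_TYPES = {"Section-header", "Title"}


def augment_with_headers(elements: List[Dict]) -> List[Dict]:
    """Prefix each non-header element's text with the most recent header text."""
    current_header: Optional[str] = None
    previous_was_header = False
    out: List[Dict] = []
    for elt in elements:
        if elt["type"] in HEADER_TYPES:
            if previous_was_header and current_header is not None:
                # Consecutive headers combine into one header block.
                current_header = current_header + "\n" + elt["text"]
            else:
                # A header after body content replaces the previous header.
                current_header = elt["text"]
            previous_was_header = True
            continue
        previous_was_header = False
        text = elt["text"] if current_header is None else current_header + "\n" + elt["text"]
        out.append({"type": elt["type"], "text": text})
    return out


doc = [
    {"type": "Title", "text": "Annual Report"},
    {"type": "Section-header", "text": "1. Overview"},
    {"type": "Text", "text": "Revenue grew."},
    {"type": "Table", "text": "q,rev\n1,10"},
    {"type": "Section-header", "text": "2. Risks"},
    {"type": "Text", "text": "Supply delays."},
]
augmented_doc = augment_with_headers(doc)
for item in augmented_doc:
    print(repr(item["text"]))
```

Carrying the header into every following element, tables included, means each chunk keeps its section context even when it is later embedded or retrieved on its own.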

merge(elt1: Element, elt2: Element) → Element

Merge two elements; the new element's fields will be set as:

type: "Section-header" or "Text"
binary_representation: elt1.binary_representation + elt2.binary_representation
text_representation: elt1.text_representation + elt2.text_representation
bbox: the minimal bbox that contains both elt1's and elt2's bboxes
properties: elt1's properties + any of elt2's properties that are not in elt1

note: if elt1 and elt2 have different values for the same property, we take elt1's value
note: if any input field is None we take the other element's field without merge logic

Parameters:

elt1 (Element) -- the first element (the number of tokens in it is stored by preprocess_element as elt1["token_count"])
elt2 (Element) -- the second element (the number of tokens in it is stored by preprocess_element as elt2["token_count"])

Returns:

a new merged element from the inputs (and number of tokens in it)

Return type:

Element

merge_elements(document: Document) → Document

Use self.should_merge and self.merge to greedily merge consecutive elements. If the next element should be merged into the last 'accumulation' element, merge it.

Parameters: document (Document) -- A document with elements to be merged.
Returns: The same document, with its elements merged
Return type: Document

class sycamore.transforms.merge_elements.Merge(child: Node, merger: ElementMerger, **kwargs)

Bases: SingleThreadUser, NonGPUUser, Map

Merges elements into fewer, larger elements.