Merge Elements

class sycamore.transforms.merge_elements.ElementMerger

Bases: ABC

merge_elements(document: Document) → Document

Use self.should_merge and self.merge to greedily merge consecutive elements. If the next element should be merged into the last 'accumulation' element, merge it.

Parameters:

document (Document) -- A document with elements to be merged.

Returns:

The same document, with its elements merged

Return type:

Document
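The greedy accumulation described above can be sketched in plain Python. This is a minimal illustration, not Sycamore's implementation: `greedy_merge` is a hypothetical helper, strings stand in for elements, and the `should_merge` and `merge` callables play the role of the methods a concrete ElementMerger would supply.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def greedy_merge(
    elements: List[T],
    should_merge: Callable[[T, T], bool],
    merge: Callable[[T, T], T],
) -> List[T]:
    """Fold consecutive elements into an accumulator while should_merge allows it."""
    if not elements:
        return []
    merged: List[T] = []
    acc = elements[0]  # the current 'accumulation' element
    for nxt in elements[1:]:
        if should_merge(acc, nxt):
            acc = merge(acc, nxt)  # grow the accumulation element
        else:
            merged.append(acc)  # close it and start a new accumulation
            acc = nxt
    merged.append(acc)
    return merged


# Toy run: merge while the combined string stays within 8 characters.
result = greedy_merge(
    ["ab", "cd", "efghij", "k"],
    should_merge=lambda a, b: len(a) + len(b) <= 8,
    merge=lambda a, b: a + b,
)
print(result)
```

Because the loop only ever compares the accumulator with the next element, the outcome depends on element order; a different ordering of the same elements can produce different groups.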

class sycamore.transforms.merge_elements.GreedySectionMerger(tokenizer: Tokenizer, max_tokens: int, merge_across_pages: bool = True)

Bases: ElementMerger

The GreedySectionMerger groups together different elements in a Document according to three rules. All rules are subject to the max_tokens limit and merge_across_pages flag.

  • It merges adjacent text elements.

  • It merges an adjacent Section-header and an image. The new element type is called Section-header+image.

  • It merges an Image and subsequent adjacent text elements.
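One way to read the three rules is as a pairwise predicate that is checked only after the token budget and the page constraint. The sketch below is an assumption-laden illustration: the `Elt` class, the type strings "Text", "Section-header" and "Image", and the `page` field are hypothetical stand-ins, and the sketch does not model what type the accumulated element carries after a merge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Elt:
    """Hypothetical stand-in for an element: a type, a token count, a page."""
    type: str
    tokens: int
    page: Optional[int] = None


def should_merge(acc: Elt, nxt: Elt, max_tokens: int,
                 merge_across_pages: bool = True) -> bool:
    # Every rule is subject to the token limit...
    if acc.tokens + nxt.tokens > max_tokens:
        return False
    # ...and to the merge_across_pages flag.
    if not merge_across_pages and acc.page != nxt.page:
        return False
    # Rule 1: adjacent text elements.
    if acc.type == "Text" and nxt.type == "Text":
        return True
    # Rule 2: a section header followed by an image.
    if acc.type == "Section-header" and nxt.type == "Image":
        return True
    # Rule 3: an image followed by text.
    if acc.type == "Image" and nxt.type == "Text":
        return True
    return False
```

For example, under these assumptions `should_merge(Elt("Section-header", 3, 1), Elt("Image", 4, 1), max_tokens=20)` is true, while the same pair is rejected when `max_tokens` is 5.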

merge(elt1: Element, elt2: Element) → Element
Merge two elements; the new element's fields will be set as:
  • type: "Section"

  • binary_representation: elt1.binary_representation + elt2.binary_representation

  • text_representation: elt1.text_representation + elt2.text_representation

  • bbox: the minimal bbox that contains both elt1's and elt2's bboxes

  • properties: elt1's properties + any of elt2's properties that are not in elt1

Note: if elt1 and elt2 have different values for the same property, elt1's value is kept.

Note: if a field is None on one of the inputs, the other element's field is used as-is, without merge logic.

Parameters:
  • elt1 (Element) -- the first element

  • elt2 (Element) -- the second element

Returns:

a new element merged from the two inputs

Return type:

Element
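The field rules listed for merge can be illustrated with plain dictionaries. This is a sketch under stated assumptions, not the library's code: `merge_fields` and its helpers are hypothetical, the bbox is assumed to be an (x1, y1, x2, y2) tuple, and the representations are concatenated directly because the rules above do not mention a separator.

```python
from typing import Any, Dict, Optional, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def _concat(a, b):
    """Concatenate two values; if either is None, return the other unchanged."""
    if a is None:
        return b
    if b is None:
        return a
    return a + b


def _union_bbox(a: Optional[BBox], b: Optional[BBox]) -> Optional[BBox]:
    """Smallest box containing both inputs; None inputs are skipped."""
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_fields(e1: Dict[str, Any], e2: Dict[str, Any]) -> Dict[str, Any]:
    # Start from elt2's properties, then overwrite with elt1's so elt1 wins.
    props = dict(e2.get("properties") or {})
    props.update(e1.get("properties") or {})
    return {
        "type": "Section",
        "binary_representation": _concat(
            e1.get("binary_representation"), e2.get("binary_representation")
        ),
        "text_representation": _concat(
            e1.get("text_representation"), e2.get("text_representation")
        ),
        "bbox": _union_bbox(e1.get("bbox"), e2.get("bbox")),
        "properties": props,
    }


a = {"text_representation": "Hello ", "binary_representation": None,
     "bbox": (0.1, 0.1, 0.5, 0.2), "properties": {"page_number": 1, "lang": "en"}}
b = {"text_representation": "world", "binary_representation": b"w",
     "bbox": (0.2, 0.15, 0.9, 0.4), "properties": {"page_number": 2, "font": "serif"}}
merged = merge_fields(a, b)
```

In this toy run the merged element keeps `page_number` 1 from the first element, gains `font` from the second, and takes the second element's binary representation untouched because the first one is None.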

class sycamore.transforms.merge_elements.GreedyTextElementMerger(tokenizer: Tokenizer, max_tokens: int, merge_across_pages: bool = True)

Bases: ElementMerger

The GreedyTextElementMerger takes a tokenizer and a token limit and greedily merges elements until adding the next one would overflow the token limit, at which point it starts a new merged element. If a single element already exceeds the limit, the GreedyTextElementMerger leaves it alone.
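The token-budget behaviour can be sketched with a toy tokenizer. Everything here is illustrative: `greedy_text_merge` is a hypothetical function, whitespace splitting stands in for a real Tokenizer, and joining with a single space is an assumption rather than documented behaviour.

```python
from typing import Callable, List, Optional


def greedy_text_merge(
    texts: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> List[str]:
    """Greedily pack consecutive texts into chunks of at most max_tokens tokens."""
    merged: List[str] = []
    acc: Optional[str] = None
    acc_tokens = 0
    for text in texts:
        n = count_tokens(text)
        if acc is not None and acc_tokens + n <= max_tokens:
            acc = acc + " " + text  # still fits: extend the current chunk
            acc_tokens += n
        else:
            if acc is not None:
                merged.append(acc)  # would overflow: close the current chunk
            acc, acc_tokens = text, n  # start a new chunk (even if oversized)
    if acc is not None:
        merged.append(acc)
    return merged


chunks = greedy_text_merge(
    ["one two", "three", "four five six seven eight", "nine", "ten"],
    count_tokens=lambda s: len(s.split()),
    max_tokens=4,
)
print(chunks)
```

The five-token element exceeds the limit of four on its own, so it passes through unmerged while its neighbours are packed around it.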

merge(elt1: Element, elt2: Element) → Element
Merge two elements; the new element's fields will be set as:
  • type: "Section"

  • binary_representation: elt1.binary_representation + elt2.binary_representation

  • text_representation: elt1.text_representation + elt2.text_representation

  • bbox: the minimal bbox that contains both elt1's and elt2's bboxes

  • properties: elt1's properties + any of elt2's properties that are not in elt1

Note: if elt1 and elt2 have different values for the same property, elt1's value is kept.

Note: if a field is None on one of the inputs, the other element's field is used as-is, without merge logic.

Parameters:
  • elt1 (Element) -- the first element

  • elt2 (Element) -- the second element

Returns:

a new element merged from the two inputs

Return type:

Element

class sycamore.transforms.merge_elements.MarkedMerger

Bases: ElementMerger

The MarkedMerger merges elements by referencing "marks" placed on the elements by the transforms in sycamore.transforms.bbox_merge and sycamore.transforms.mark_misc. The marks are "_break" and "_drop". The MarkedMerger will merge elements until it hits a "_break" mark, whereupon it will start a new element. It handles elements marked with "_drop" by, well, dropping them entirely.
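A rough sketch of mark-driven merging follows. It is not the library's code: elements are plain dictionaries, marks are assumed to live in a `marks` set, a "_break" mark is assumed to mean that a new merged element starts at the marked element, and dropped elements are assumed to be skipped before any break handling.

```python
from typing import Any, Dict, List


def marked_merge(elements: List[Dict[str, Any]]) -> List[str]:
    """Merge element texts, honoring hypothetical '_drop' and '_break' marks."""
    merged: List[str] = []
    acc: List[str] = []
    for elt in elements:
        marks = elt.get("marks", set())
        if "_drop" in marks:
            continue  # dropped elements never reach the output
        if "_break" in marks and acc:
            merged.append(" ".join(acc))  # close the current merged element
            acc = []
        acc.append(elt["text"])
    if acc:
        merged.append(" ".join(acc))
    return merged


elements = [
    {"text": "Title"},
    {"text": "page 3", "marks": {"_drop"}},
    {"text": "First paragraph."},
    {"text": "Second section", "marks": {"_break"}},
    {"text": "More text."},
]
sections = marked_merge(elements)
print(sections)
```

Separating marking from merging in this way lets the marking transforms decide where boundaries fall, while the merger itself stays a simple single pass.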

merge_elements(document: Document) → Document

Use self.should_merge and self.merge to greedily merge consecutive elements. If the next element should be merged into the last 'accumulation' element, merge it.

Parameters:

document (Document) -- A document with elements to be merged.

Returns:

The same document, with its elements merged

Return type:

Document

class sycamore.transforms.merge_elements.Merge(child: Node, merger: ElementMerger, **kwargs)

Bases: SingleThreadUser, NonGPUUser, Map

Merges a document's elements into fewer, larger elements using the given merger.