Merge Elements

class sycamore.transforms.merge_elements.GreedyTextElementMerger(tokenizer: Tokenizer, max_tokens: int, merge_across_pages: bool = True)

Bases: ElementMerger

merge(elt1: Element, elt2: Element) -> Element
Merge two elements; the new element’s fields will be set as:
  • type: “Section”

  • binary_representation: elt1.binary_representation + elt2.binary_representation

  • text_representation: elt1.text_representation + elt2.text_representation

  • bbox: the minimal bbox that contains both elt1’s and elt2’s bboxes

  • properties: elt1’s properties + any of elt2’s properties that are not in elt1

Note: if elt1 and elt2 have different values for the same property, elt1’s value is kept.

Note: if either input field is None, the other element’s field is used as-is, without any merge logic.

Parameters:
  • element1 (Tuple[Element, int]) – the first element (and number of tokens in it)

  • element2 (Tuple[Element, int]) – the second element (and number of tokens in it)

Returns:

a new merged element from the inputs (and number of tokens in it)

Return type:

Tuple[Element, int]
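
The field rules listed for merge above can be sketched in plain Python. This is an illustration only, not sycamore's implementation: the `Elt` dataclass and the helper names are hypothetical stand-ins, the bbox is assumed to be an `(x1, y1, x2, y2)` tuple, and text is joined with a bare `+` exactly as the description states (the library may insert a separator).

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class Elt:
    """Hypothetical stand-in for an Element; not the sycamore class."""
    type: Optional[str] = None
    text_representation: Optional[str] = None
    binary_representation: Optional[bytes] = None
    bbox: Optional[Tuple[float, float, float, float]] = None  # assumed (x1, y1, x2, y2)
    properties: dict = field(default_factory=dict)


def _concat(a, b):
    # If either side is None, take the other side without merge logic.
    if a is None:
        return b
    if b is None:
        return a
    return a + b


def _union_bbox(a, b):
    # Smallest box that contains both inputs; None falls through to the other box.
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge(elt1: Elt, elt2: Elt) -> Elt:
    # Start from elt1's properties, then add only the keys elt1 lacks,
    # so elt1's value is kept whenever both elements define a property.
    props = dict(elt1.properties)
    for key, value in elt2.properties.items():
        props.setdefault(key, value)
    return Elt(
        type="Section",
        text_representation=_concat(elt1.text_representation, elt2.text_representation),
        binary_representation=_concat(elt1.binary_representation, elt2.binary_representation),
        bbox=_union_bbox(elt1.bbox, elt2.bbox),
        properties=props,
    )
```

For example, merging an element with properties `{"page_number": 1}` into one with `{"page_number": 2, "font": "serif"}` keeps `page_number` as 1 and adds `font`.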

class sycamore.transforms.merge_elements.MarkedMerger

Bases: ElementMerger

merge_elements(document: Document) -> Document

Use self._should_merge and self._merge to greedily merge consecutive elements. If the next element should be merged into the last ‘accumulation’ element, merge it.

Parameters:

document (Document) – A document with elements to be merged.

Returns:

The same document, with its elements merged

Return type:

Document
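
The greedy accumulation described for merge_elements can be sketched as a single pass over a list. This is a minimal illustration under assumptions, not the library's code: `greedy_merge`, `within_budget`, and `join` are hypothetical names, the callbacks stand in for `self._should_merge` and `self._merge`, and plain strings with a whitespace token count stand in for elements and a tokenizer. Starting a new accumulation when the predicate fails is also an assumption of the sketch.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def greedy_merge(
    elements: List[T],
    should_merge: Callable[[T, T], bool],
    merge: Callable[[T, T], T],
) -> List[T]:
    """Fold each element into the running accumulation while should_merge allows it."""
    if not elements:
        return []
    merged: List[T] = []
    acc = elements[0]
    for nxt in elements[1:]:
        if should_merge(acc, nxt):
            acc = merge(acc, nxt)  # grow the current accumulation
        else:
            merged.append(acc)     # close it and start a new one
            acc = nxt
    merged.append(acc)
    return merged


# Hypothetical predicate: merge while the combined whitespace-token count is at most 4.
def within_budget(a: str, b: str) -> bool:
    return len(a.split()) + len(b.split()) <= 4


def join(a: str, b: str) -> str:
    return a + " " + b


chunks = greedy_merge(["a b", "c", "d e f", "g", "h i j k l"], within_budget, join)
print(chunks)  # ['a b c', 'd e f g', 'h i j k l']
```

Because the pass only ever compares the current accumulation with the next element, an element that exceeds the budget on its own (the last one here) is emitted unmerged rather than split.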

class sycamore.transforms.merge_elements.Merge(child: Node, merger: ElementMerger, **kwargs)

Bases: SingleThreadUser, NonGPUUser, Transform

Merges elements into fewer, larger elements.