Merge Elements
- class sycamore.transforms.merge_elements.ElementMerger
Bases: ABC
- class sycamore.transforms.merge_elements.GreedySectionMerger(tokenizer: Tokenizer, max_tokens: int, merge_across_pages: bool = True)
Bases: ElementMerger
The GreedySectionMerger groups elements in a Document according to three rules. All rules are subject to the max_tokens limit and the merge_across_pages flag.
It merges adjacent text elements.
It merges an adjacent Section-header and an image. The new element type is called Section-header+image.
It merges an Image and subsequent adjacent text elements.
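The three rules above can be read as a predicate over a pair of adjacent elements. The following is a minimal sketch of that reading, not the library's implementation: the function name `may_merge`, the type strings "Text", "Section-header" and "Image", the header-before-image ordering, and the use of page numbers to decide what "across pages" means are all assumptions made for illustration.

```python
def may_merge(type1, type2, tokens1, tokens2, page1, page2,
              max_tokens, merge_across_pages=True):
    """Hypothetical check: may two adjacent elements be merged?"""
    # Every rule is subject to the token limit.
    if tokens1 + tokens2 > max_tokens:
        return False
    # Every rule is subject to the merge_across_pages flag.
    if not merge_across_pages and page1 != page2:
        return False
    # Rule 1: adjacent text elements.
    if type1 == "Text" and type2 == "Text":
        return True
    # Rule 2: a Section-header next to an image (assumed: header first).
    if type1 == "Section-header" and type2 == "Image":
        return True
    # Rule 3: an Image followed by a text element.
    if type1 == "Image" and type2 == "Text":
        return True
    return False
```

Under these assumptions, any pairing not named by a rule (for example a table followed by text) is left unmerged.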
- merge(elt1: Element, elt2: Element) -> Element
- Merge two elements; the new element's fields will be set as:
type: "Section"
binary_representation: elt1.binary_representation + elt2.binary_representation
text_representation: elt1.text_representation + elt2.text_representation
bbox: the minimal bbox that contains both elt1's and elt2's bboxes
properties: elt1's properties + any of elt2's properties that are not in elt1
note: if elt1 and elt2 have different values for the same property, we take elt1's value
note: if any input field is None, we take the other element's field without merge logic
- Parameters:
element1 (Tuple[Element, int]) -- the first element (and number of tokens in it)
element2 (Tuple[Element, int]) -- the second element (and number of tokens in it)
- Returns:
a new merged element from the inputs (and number of tokens in it)
- Return type:
Tuple[Element, int]
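The field rules listed for merge can be illustrated with a small stand-alone sketch. It does not use sycamore's Element class: `SketchElement`, `merge_sketch`, the helper functions, and the (x1, y1, x2, y2) bbox layout are hypothetical stand-ins, and token counting is left out.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


@dataclass
class SketchElement:
    """Hypothetical stand-in for an Element, holding only the merged fields."""
    type: Optional[str] = None
    binary_representation: Optional[bytes] = None
    text_representation: Optional[str] = None
    bbox: Optional[BBox] = None
    properties: dict = field(default_factory=dict)


def _concat(a, b):
    # If either field is None, take the other one without merge logic.
    if a is None:
        return b
    if b is None:
        return a
    return a + b


def _union_bbox(a, b):
    # Minimal box that contains both inputs, with the same None handling.
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_sketch(elt1: SketchElement, elt2: SketchElement) -> SketchElement:
    # Start from elt2's properties, then let elt1's values win on conflicts.
    props = dict(elt2.properties)
    props.update(elt1.properties)
    return SketchElement(
        type="Section",
        binary_representation=_concat(elt1.binary_representation,
                                      elt2.binary_representation),
        text_representation=_concat(elt1.text_representation,
                                    elt2.text_representation),
        bbox=_union_bbox(elt1.bbox, elt2.bbox),
        properties=props,
    )
```

For example, merging an element with properties `{"page_number": 1}` into one with `{"page_number": 2, "extra": True}` keeps `page_number` as 1 and carries `extra` over.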
- class sycamore.transforms.merge_elements.GreedyTextElementMerger(tokenizer: Tokenizer, max_tokens: int, merge_across_pages: bool = True)
Bases: ElementMerger
The GreedyTextElementMerger takes a tokenizer and a token limit and greedily merges elements together until adding the next element would overflow the token limit, at which point the merger starts work on a new merged element. If an element is already too big, the GreedyTextElementMerger will leave it alone.
- merge(elt1: Element, elt2: Element) -> Element
- Merge two elements; the new element's fields will be set as:
type: "Section"
binary_representation: elt1.binary_representation + elt2.binary_representation
text_representation: elt1.text_representation + elt2.text_representation
bbox: the minimal bbox that contains both elt1's and elt2's bboxes
properties: elt1's properties + any of elt2's properties that are not in elt1
note: if elt1 and elt2 have different values for the same property, we take elt1's value
note: if any input field is None, we take the other element's field without merge logic
- Parameters:
element1 (Tuple[Element, int]) -- the first element (and number of tokens in it)
element2 (Tuple[Element, int]) -- the second element (and number of tokens in it)
- Returns:
a new merged element from the inputs (and number of tokens in it)
- Return type:
Tuple[Element, int]
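The greedy strategy described for GreedyTextElementMerger can be sketched on plain strings. This is an illustration only, under stated assumptions: `greedy_pack` and `count_tokens` are hypothetical names, the token count of a merged string is assumed to be the sum of its parts, and page boundaries (merge_across_pages) are ignored.

```python
from typing import Callable, List


def greedy_pack(texts: List[str],
                count_tokens: Callable[[str], int],
                max_tokens: int) -> List[str]:
    """Hypothetical greedy merge of strings under a token budget."""
    merged: List[str] = []
    current = None       # text of the merged element being built
    current_tokens = 0   # its running token count
    for text in texts:
        n = count_tokens(text)
        if current is None:
            current, current_tokens = text, n
        elif current_tokens + n <= max_tokens:
            # Still within the limit: keep growing the current element.
            current, current_tokens = current + text, current_tokens + n
        else:
            # Adding this text would overflow: emit and start a new element.
            merged.append(current)
            current, current_tokens = text, n
    if current is not None:
        merged.append(current)
    return merged
```

With a whitespace token counter and a limit of 4, an oversized input is emitted unchanged because nothing can be appended to it:

```python
chunks = greedy_pack(["a b ", "c d ", "e f g h i j ", "k "],
                     lambda s: len(s.split()), 4)
# chunks == ["a b c d ", "e f g h i j ", "k "]
```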
- class sycamore.transforms.merge_elements.MarkedMerger
Bases: ElementMerger
The MarkedMerger merges elements by referencing "marks" placed on the elements by the transforms in sycamore.transforms.bbox_merge and sycamore.transforms.mark_misc. The marks are "_break" and "_drop". The MarkedMerger will merge elements until it hits a "_break" mark, whereupon it will start a new element. It handles elements marked with "_drop" by dropping them entirely.
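The "_break" and "_drop" behaviour can be sketched with plain dictionaries. This is a hedged illustration, not the library's code: `merge_marked` is a hypothetical name, each element is modelled as a dict with a "text" string and an optional "marks" set, and a "_break" mark is assumed to mean that a new merged element starts at the marked element.

```python
def merge_marked(elements):
    """Hypothetical merge driven by "_break" and "_drop" marks."""
    merged = []
    current = None  # text of the merged element being built
    for elt in elements:
        marks = elt.get("marks", set())
        if "_drop" in marks:
            # Dropped elements contribute nothing; in this sketch any
            # "_break" mark on a dropped element is ignored as well.
            continue
        if current is None:
            current = elt["text"]
        elif "_break" in marks:
            # A break closes the current element and starts a new one here.
            merged.append(current)
            current = elt["text"]
        else:
            current += elt["text"]
    if current is not None:
        merged.append(current)
    return merged
```

For instance, the sequence A, B, X (marked "_drop"), C (marked "_break"), D yields two merged texts, "AB" and "CD".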
- class sycamore.transforms.merge_elements.Merge(child: Node, merger: ElementMerger, **kwargs)
Bases: SingleThreadUser, NonGPUUser, Map
Merges Elements into fewer, larger elements.