Merge Elements
- class sycamore.transforms.merge_elements.ElementMerger
Bases: ABC
- class sycamore.transforms.merge_elements.GreedySectionMerger(tokenizer: Tokenizer, max_tokens: int, merge_across_pages: bool = True)
Bases: ElementMerger
The GreedySectionMerger groups together different elements in a Document according to three rules. All rules are subject to the max_tokens limit and the merge_across_pages flag.
1. It merges adjacent text elements.
2. It merges an adjacent Section-header and an image. The new element type is called Section-header+image.
3. It merges an Image and subsequent adjacent text elements.
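The three rules can be read as a pairwise decision on neighbouring elements. The following is a minimal sketch of that decision, not the library's implementation: elements are plain dicts, and the names `can_merge`, `TEXT_TYPES`, `tokens`, and `page` are hypothetical. It also assumes that only "Text" counts as a text element and that the Section-header comes before the image.

```python
# Hypothetical stand-in for the rule check; not Sycamore's actual code.
TEXT_TYPES = {"Text"}  # assumption: which element types count as "text"


def can_merge(prev: dict, nxt: dict, max_tokens: int,
              merge_across_pages: bool = True) -> bool:
    """Decide whether two adjacent elements may be merged."""
    # Every rule is subject to the token limit.
    if prev["tokens"] + nxt["tokens"] > max_tokens:
        return False
    # Every rule is subject to the page flag.
    if not merge_across_pages and prev["page"] != nxt["page"]:
        return False
    # Rule 1: adjacent text elements.
    if prev["type"] in TEXT_TYPES and nxt["type"] in TEXT_TYPES:
        return True
    # Rule 2: a Section-header followed by an image (order is an assumption).
    if prev["type"] == "Section-header" and nxt["type"] == "Image":
        return True
    # Rule 3: an Image followed by text.
    if prev["type"] == "Image" and nxt["type"] in TEXT_TYPES:
        return True
    return False
```

In this sketch a pair that exceeds max_tokens, or that straddles a page boundary when merge_across_pages is False, is rejected before any of the three rules is consulted.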
- merge(elt1: Element, elt2: Element) -> Element
- Merge two elements; the new element's fields will be set as:
type: "Section"
binary_representation: elt1.binary_representation + elt2.binary_representation
text_representation: elt1.text_representation + elt2.text_representation
bbox: the minimal bbox that contains both elt1's and elt2's bboxes
properties: elt1's properties + any of elt2's properties that are not in elt1
note: if elt1 and elt2 have different values for the same property, we take elt1's value
note: if any input field is None we take the other element's field without merge logic
- Parameters:
element1 (Tuple[Element, int]) -- the first element (and number of tokens in it)
element2 (Tuple[Element, int]) -- the second element (and number of tokens in it)
- Returns:
a new merged element from the inputs (and number of tokens in it)
- Return type:
Tuple[Element, int]
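The field rules listed for merge can be sketched on plain dicts. This is an illustration under stated assumptions, not Sycamore's code: `merge_fields`, `_concat`, and `_union_bbox` are hypothetical helpers, and a bbox is assumed to be an (x1, y1, x2, y2) tuple.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def _concat(a, b):
    """Concatenate two fields; if one is None, take the other as-is."""
    if a is None:
        return b
    if b is None:
        return a
    return a + b


def _union_bbox(a: Optional[BBox], b: Optional[BBox]) -> Optional[BBox]:
    """Smallest box containing both inputs; None-safe like _concat."""
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_fields(elt1: dict, elt2: dict) -> dict:
    """Build the merged element's fields from two dict-shaped elements."""
    # Start from elt2's properties, then overlay elt1's so elt1 wins conflicts.
    properties = dict(elt2.get("properties") or {})
    properties.update(elt1.get("properties") or {})
    return {
        "type": "Section",
        "binary_representation": _concat(
            elt1.get("binary_representation"), elt2.get("binary_representation")
        ),
        "text_representation": _concat(
            elt1.get("text_representation"), elt2.get("text_representation")
        ),
        "bbox": _union_bbox(elt1.get("bbox"), elt2.get("bbox")),
        "properties": properties,
    }
```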
- class sycamore.transforms.merge_elements.GreedyTextElementMerger(tokenizer: Tokenizer, max_tokens: int, merge_across_pages: bool = True)
Bases: ElementMerger
The GreedyTextElementMerger takes a tokenizer and a token limit, and greedily merges elements together until adding the next element would overflow the token limit, at which point the merger starts work on a new merged element. If an element is already too big, the GreedyTextElementMerger will leave it alone.
- merge(elt1: Element, elt2: Element) -> Element
- Merge two elements; the new element's fields will be set as:
type: "Section"
binary_representation: elt1.binary_representation + elt2.binary_representation
text_representation: elt1.text_representation + elt2.text_representation
bbox: the minimal bbox that contains both elt1's and elt2's bboxes
properties: elt1's properties + any of elt2's properties that are not in elt1
note: if elt1 and elt2 have different values for the same property, we take elt1's value
note: if any input field is None we take the other element's field without merge logic
- Parameters:
element1 (Tuple[Element, int]) -- the first element (and number of tokens in it)
element2 (Tuple[Element, int]) -- the second element (and number of tokens in it)
- Returns:
a new merged element from the inputs (and number of tokens in it)
- Return type:
Tuple[Element, int]
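The greedy strategy described for GreedyTextElementMerger can be sketched on plain strings. This is an illustration only: `greedy_merge` is a hypothetical helper, the default whitespace token counter stands in for a real Tokenizer, and joining with a single space is an assumption made so that token counts add up.

```python
from typing import Callable, List, Optional


def greedy_merge(
    texts: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack consecutive texts into chunks of at most max_tokens."""
    merged: List[str] = []
    current: Optional[str] = None
    current_tokens = 0
    for text in texts:
        n = count_tokens(text)
        if current is None:
            # First element starts the first chunk.
            current, current_tokens = text, n
        elif current_tokens + n <= max_tokens:
            # Still fits: extend the chunk being built.
            current = current + " " + text
            current_tokens += n
        else:
            # Would overflow: close the chunk and start a new one.
            merged.append(current)
            current, current_tokens = text, n
    if current is not None:
        merged.append(current)
    return merged
```

An element that is already over the limit never fits with a neighbour, so in this sketch it ends up as a chunk of its own, matching the "leave it alone" behaviour described above.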
- class sycamore.transforms.merge_elements.MarkedMerger
Bases: ElementMerger
The MarkedMerger merges elements by referencing "marks" placed on the elements by the transforms in sycamore.transforms.bbox_merge and sycamore.transforms.mark_misc. The marks are "_break" and "_drop". The MarkedMerger will merge elements until it hits a "_break" mark, whereupon it will start a new element. It handles elements marked with "_drop" by dropping them entirely.
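The mark-driven behaviour can be sketched as a single pass over the elements. This is a simplified illustration, not the library's implementation: `merge_marked` is a hypothetical helper, marks are assumed to be boolean keys on dict-shaped elements, and an element carrying "_break" is assumed to be the first member of the new merged element.

```python
from typing import List, Optional


def merge_marked(elements: List[dict]) -> List[str]:
    """Merge element texts, honouring "_break" and "_drop" marks."""
    merged: List[str] = []
    current: Optional[str] = None
    for elt in elements:
        if elt.get("_drop"):
            # Dropped elements contribute nothing to the output.
            continue
        if current is None or elt.get("_break"):
            # Close the element in progress (if any) and start a new one.
            if current is not None:
                merged.append(current)
            current = elt["text"]
        else:
            current = current + " " + elt["text"]
    if current is not None:
        merged.append(current)
    return merged
```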
- class sycamore.transforms.merge_elements.TableMerger(regex_pattern: Pattern | None = None, llm_prompt: str | None = None, llm: LLM | None = None, *args, **kwargs)
Bases: ElementMerger
The TableMerger handles 3 operations:
1. If a text element (Caption, Section-header, Text...) anywhere on a page contains the regex pattern, it is attached to the text_representation of the table on that page.
2. LLMQuery is used to add a table_continuation property to table elements. If the table is a continuation of a previous table, the property is stored as true, else false.
3. After LLMQuery, table elements that are continuations are merged into one element.
Example
    import re

    import sycamore
    from sycamore.llms import OpenAI, OpenAIModels
    from sycamore.transforms.merge_elements import TableMerger
    from sycamore.transforms.partition import ArynPartitioner

    llm = OpenAI(OpenAIModels.GPT_4O, api_key="")

    prompt = (
        "Analyze two CSV tables that may be parts of a single table split across pages. "
        "Determine if the second table is a continuation of the first with 100% certainty. "
        "Check either of the following: "
        "1. Column headers: Must be near identical in terms of text (the ordering/text may "
        "contain minor errors because of OCR quality) in both tables. If the headers are "
        "almost the same check the number of columns, they should be roughly the same. "
        "2. Missing headers: If the header/columns in the second table are missing, then the "
        "first row in the second table should logically be in continuation of the last row "
        "in the first table. "
        "Respond with only 'true' or 'false' based on your certainty that the second table "
        "is a continuation. Certainty is determined if either of the two conditions is true."
    )

    # The constructor takes a compiled Pattern, so the pattern is passed to
    # TableMerger rather than to the reader.
    regex_pattern = re.compile(r"table \d+")

    merger = TableMerger(regex_pattern=regex_pattern, llm_prompt=prompt, llm=llm)

    context = sycamore.init()
    # paths: location(s) of the input PDFs, defined elsewhere
    pdf_docset = (
        context.read.binary(paths, binary_format="pdf")
        .partition(partitioner=ArynPartitioner())
        .merge(merger=merger)
    )
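The regex step (operation 1) can be pictured with a small stand-alone sketch. It is not the TableMerger implementation: `attach_captions` is a hypothetical helper working on dicts, and it assumes that matching text is placed before the table's own text and that the matching text elements are kept in the output. The real merger may differ on both points.

```python
import re
from typing import Dict, List, Pattern


def attach_captions(elements: List[dict], pattern: Pattern) -> List[dict]:
    """Attach page-level text matching `pattern` to tables on the same page."""
    # Collect matching non-table text, grouped by page.
    captions: Dict[int, List[str]] = {}
    for elt in elements:
        text = elt.get("text")
        if elt["type"] != "table" and text and pattern.search(text):
            captions.setdefault(elt["page"], []).append(text)
    # Rebuild the element list, augmenting tables that share a page with a match.
    out: List[dict] = []
    for elt in elements:
        if elt["type"] == "table" and elt["page"] in captions:
            joined = " ".join(captions[elt["page"]] + [elt["text"]])
            elt = {**elt, "text": joined}
        out.append(elt)
    return out


# Usage: the same pattern as in the example above, compiled case-insensitively
# (the flag is an assumption made for this sketch).
pattern = re.compile(r"table \d+", re.IGNORECASE)
```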
- class sycamore.transforms.merge_elements.HeaderAugmenterMerger(tokenizer: Tokenizer, max_tokens: int, merge_across_pages: bool = True)
Bases: ElementMerger
The HeaderAugmenterMerger groups together different elements in a Document and enhances the text representation of the elements by adding the preceding section-header/title.
1. It merges certain elements ("Text", "List-item", "Caption", "Footnote", "Formula", "Page-footer", "Page-header").
2. It merges consecutive ("Section-header", "Title") elements.
3. It adds the preceding section-header/title to the text representation of the elements (including tables/images).
- merge(elt1: Element, elt2: Element) -> Element
- Merge two elements; the new element's fields will be set as:
type: "Section-header", "Text"
binary_representation: elt1.binary_representation + elt2.binary_representation
text_representation: elt1.text_representation + elt2.text_representation
bbox: the minimal bbox that contains both elt1's and elt2's bboxes
properties: elt1's properties + any of elt2's properties that are not in elt1
note: if elt1 and elt2 have different values for the same property, we take elt1's value
note: if any input field is None we take the other element's field without merge logic
- Parameters:
element1 (Element) -- the first element (the number of tokens in it is stored by preprocess_element as element1["token_count"])
element2 (Element) -- the second element (the number of tokens in it is stored by preprocess_element as element2["token_count"])
- Returns:
a new merged element from the inputs (and number of tokens in it)
- Return type:
Element
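The header-augmentation idea can be sketched on dict-shaped elements. This is a simplified illustration rather than the library's code: `augment_with_headers` is a hypothetical helper, the newline separator is an assumption, and the token limit and the merging of body elements are left out to keep the focus on how the preceding header is carried forward.

```python
from typing import List, Optional

HEADER_TYPES = {"Section-header", "Title"}


def augment_with_headers(elements: List[dict]) -> List[dict]:
    """Merge consecutive headers and prefix later elements with the header."""
    out: List[dict] = []
    header: Optional[str] = None
    prev_was_header = False
    for elt in elements:
        if elt["type"] in HEADER_TYPES:
            if prev_was_header:
                # Consecutive headers collapse into the previous header element.
                header = header + "\n" + elt["text"]
                out[-1] = {**out[-1], "text": header}
            else:
                header = elt["text"]
                out.append(dict(elt))
            prev_was_header = True
        else:
            # Body elements (text, tables, images) get the header prepended.
            text = elt["text"] if header is None else header + "\n" + elt["text"]
            out.append({**elt, "text": text})
            prev_was_header = False
    return out
```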
- class sycamore.transforms.merge_elements.Merge(child: Node, merger: ElementMerger, **kwargs)
Bases: SingleThreadUser, NonGPUUser, Map
Merges Elements into fewer, larger elements.