Merge#
The merge transform combines elements into larger chunks, a process also known as chunking.
The merge transform takes a single argument, the merger, which contains the logic defining which elements to merge and how to merge them. The available mergers are listed below. More information can be found in the API documentation.
Greedy Text Element Merger#
The GreedyTextElementMerger takes a tokenizer and a token limit and greedily merges adjacent elements until adding the next element would overflow the token limit, at which point it starts a new merged element. If a single element already exceeds the limit, the GreedyTextElementMerger leaves it unmerged.
For example, using a CharacterTokenizer with max_tokens=4, you would have the following behavior:
A BC D EF GHI J KLMNO -> ABCD EF GHIJ KLMNO
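The greedy behavior above can be reproduced with a minimal standalone sketch. This is not Sycamore's implementation: the function name greedy_merge and its parameters are hypothetical, elements are plain strings, and the token count of a merged chunk is assumed to be the sum of its parts (with len standing in for a character tokenizer).

```python
from typing import Callable, List


def greedy_merge(
    texts: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = len,  # assumption: one token per character
) -> List[str]:
    """Greedily concatenate adjacent strings without exceeding max_tokens."""
    merged: List[str] = []
    current = ""        # text of the chunk being built
    current_tokens = 0  # token count of that chunk
    for text in texts:
        n = count_tokens(text)
        if current and current_tokens + n > max_tokens:
            # Adding this element would overflow: close the chunk, start a new one.
            merged.append(current)
            current, current_tokens = "", 0
        current += text
        current_tokens += n
    if current:
        merged.append(current)
    return merged


chunks = greedy_merge(["A", "BC", "D", "EF", "GHI", "J", "KLMNO"], max_tokens=4)
print(chunks)  # ['ABCD', 'EF', 'GHIJ', 'KLMNO']
```

Note that KLMNO exceeds the limit on its own, so it ends up as its own chunk rather than being split or merged.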
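To add the GreedyTextElementMerger to a script:
from sycamore.functions.tokenizer import HuggingFaceTokenizer
from sycamore.transforms.merge_elements import GreedyTextElementMerger
merger = GreedyTextElementMerger(tokenizer=HuggingFaceTokenizer("sentence-transformers/all-MiniLM-L6-v2"), max_tokens=512)
merged_docset = docset.merge(merger=merger)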
Greedy Section Merger#
The GreedySectionMerger groups together elements in a Document according to three rules. All rules are subject to the max_tokens limit and the merge_across_pages flag.
It merges adjacent text elements.
It merges a Section-header and an adjacent Image. The new element type is called Section-header+image.
It merges an Image and subsequent adjacent text elements.
Use it in much the same way as the text element merger:
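from sycamore.functions.tokenizer import HuggingFaceTokenizer
from sycamore.transforms.merge_elements import GreedySectionMerger
merger = GreedySectionMerger(tokenizer=HuggingFaceTokenizer("sentence-transformers/all-MiniLM-L6-v2"), max_tokens=512)
merged_docset = docset.merge(merger=merger)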
Marked Merger#
The MarkedMerger merges elements by referencing "marks" placed on the elements by earlier marking transforms.
The marks are "_break" and "_drop". The MarkedMerger will merge elements until it hits a "_break" mark, whereupon it will start a new element. It handles elements marked with "_drop" by dropping them entirely. This merger is useful when you have many rules governing how you want to chunk your document.
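The mark-driven behavior can be illustrated with a small standalone sketch. It does not use Sycamore: elements are plain dictionaries, the function name marked_merge is hypothetical, and two details are assumptions rather than documented behavior, namely that the element carrying "_break" becomes the first element of the new chunk, and that "_drop" is checked before "_break".

```python
from typing import Dict, List


def marked_merge(elements: List[Dict]) -> List[str]:
    """Merge element texts, honoring "_break" and "_drop" marks."""
    merged: List[str] = []
    current: List[str] = []  # texts collected for the chunk being built
    for element in elements:
        if element.get("_drop"):
            # Dropped elements never reach the output (assumed to take priority).
            continue
        if element.get("_break") and current:
            # A break mark closes the current chunk; this element starts the next.
            merged.append(" ".join(current))
            current = []
        current.append(element["text"])
    if current:
        merged.append(" ".join(current))
    return merged


elements = [
    {"text": "Intro"},
    {"text": "page footer", "_drop": True},
    {"text": "continues"},
    {"text": "New section", "_break": True},
    {"text": "body"},
]
merged_texts = marked_merge(elements)
print(merged_texts)  # ['Intro continues', 'New section body']
```

Keeping the marking rules separate from the merge step is what lets many independent rules be combined: each rule only sets a mark, and the merger applies them all in a single pass.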
We have found that the MarkedMerger is best used with the DocSet method docset.mark_bbox_preset, which applies a pre-defined series of marking transforms.
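from sycamore.functions.tokenizer import HuggingFaceTokenizer
from sycamore.transforms.merge_elements import MarkedMerger
marked_ds = docset.mark_bbox_preset(tokenizer=HuggingFaceTokenizer("sentence-transformers/all-MiniLM-L6-v2"), token_limit=512)
merged_ds = marked_ds.merge(merger=MarkedMerger())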