Split Elements
- class sycamore.transforms.split_elements.SplitElements(child: Node, tokenizer: Tokenizer, maximum: int, **kwargs)
Bases: SingleThreadUser, NonGPUUser, Map
The SplitElements transform recursively divides elements such that no Element exceeds a maximum number of tokens.
- Parameters:
child -- The source node or component that provides the elements to be split
tokenizer -- The tokenizer used to count tokens; it should match the one used by the embedder
maximum -- Maximum tokens allowed in any Element
Example
    node = ...  # Define a source node or component that provides hierarchical documents.
    xform = SplitElements(child=node, tokenizer=tokenizer, maximum=512)
    dataset = xform.execute()
- static split_doc(parent: Document, tokenizer: Tokenizer, max: int, max_depth: int = 20, add_binary: bool = True) -> Document
- Parameters:
parent -- the document that holds all the elements.
tokenizer -- tokenizer for computing the number of tokens in a chunk.
max -- maximum number of tokens allowed in a chunk as computed by the above tokenizer.
max_depth -- maximum depth of the binary tree that forms as we split each element into two recursively.
add_binary -- legacy feature that also stores text_representation as binary_representation.
Returns: the same parent document with split elements.
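The recursive halving that max and max_depth describe can be illustrated with a minimal, self-contained sketch. This is not Sycamore's implementation: the names split_text and whitespace_count are hypothetical, the token counter simply counts whitespace-separated words, and the split point is assumed to be the middle word, which may differ from how split_doc picks its split points.

```python
from typing import Callable, List


def split_text(
    text: str,
    count_tokens: Callable[[str], int],
    maximum: int,
    max_depth: int = 20,
) -> List[str]:
    """Recursively halve `text` until each piece has at most `maximum` tokens.

    Hypothetical helper for illustration only; not part of Sycamore.
    """
    # Stop when the piece fits or the depth budget is used up.
    if count_tokens(text) <= maximum or max_depth <= 0:
        return [text]
    words = text.split()
    if len(words) < 2:
        # A single word cannot be halved any further.
        return [text]
    mid = len(words) // 2
    left = " ".join(words[:mid])
    right = " ".join(words[mid:])
    # Each half becomes a child in the binary tree of splits.
    return split_text(left, count_tokens, maximum, max_depth - 1) + split_text(
        right, count_tokens, maximum, max_depth - 1
    )


def whitespace_count(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


chunks = split_text(
    "one two three four five six seven eight nine ten",
    whitespace_count,
    maximum=3,
)
```

In this sketch, a small max_depth can leave pieces that still exceed the limit, because recursion stops once the depth budget runs out regardless of size.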