Split Elements

class sycamore.transforms.split_elements.SplitElements(child: Node, tokenizer: Tokenizer, maximum: int, **kwargs)

Bases: SingleThreadUser, NonGPUUser, Transform

The SplitElements transform recursively divides elements such that no Element exceeds a maximum number of tokens.

Parameters:
  • child – The source node or component that provides the elements to be split

  • tokenizer – The tokenizer used to count tokens; it should match the one used by the embedder

  • maximum – The maximum number of tokens allowed in any Element

Example

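from sycamore.transforms.split_elements import SplitElements

node = ...  # Define a source node or component that provides hierarchical documents.
tokenizer = ...  # Define a Tokenizer, ideally the one the embedder uses.
xform = SplitElements(child=node, tokenizer=tokenizer, maximum=512)  # maximum is passed by keyword
dataset = xform.execute()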
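
The recursive division described above can be illustrated with a minimal standalone sketch. This is not Sycamore's implementation: the helper names split_text and whitespace_tokenize are hypothetical, the tokenizer is a plain whitespace split, and the split point is simply the middle word. It only shows the idea of halving a piece of text until every piece fits within the token limit.

```python
from typing import Callable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (hypothetical): one token per whitespace-separated word."""
    return text.split()


def split_text(text: str, tokenize: Callable[[str], List[str]], maximum: int) -> List[str]:
    """Recursively halve `text` until each piece has at most `maximum` tokens.

    A single word that exceeds the limit on its own is returned unsplit,
    because this sketch only splits on whitespace.
    """
    if maximum < 1:
        raise ValueError("maximum must be at least 1")
    words = text.split()
    # Base case: the piece already fits, or it cannot be divided further.
    if len(tokenize(text)) <= maximum or len(words) < 2:
        return [text]
    # Recursive case: cut at the middle word and split each half again.
    middle = len(words) // 2
    left = " ".join(words[:middle])
    right = " ".join(words[middle:])
    return split_text(left, tokenize, maximum) + split_text(right, tokenize, maximum)


pieces = split_text("one two three four five six seven", whitespace_tokenize, 3)
print(pieces)  # ['one two three', 'four five', 'six seven']
```

Counting tokens with the same tokenizer the embedder uses matters here: a piece that fits under one tokenizer's count may exceed the limit under another's.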