Split Elements

class sycamore.transforms.split_elements.SplitElements(child: Node, tokenizer: Tokenizer, maximum: int, **kwargs)

Bases: SingleThreadUser, NonGPUUser, Map

The SplitElements transform recursively divides elements such that no Element exceeds a maximum number of tokens.

Parameters:
  • child -- The source node or component that provides the elements to be split

  • tokenizer -- The tokenizer used to count tokens; it should match the tokenizer used by the embedder

  • maximum -- Maximum tokens allowed in any Element
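
The reason the tokenizer should match the embedder can be illustrated with a small sketch. Both tokenizers below are hypothetical stand-ins written for this illustration, not Sycamore classes: the same text yields different token counts, so a chunk that fits the limit under one tokenizer may exceed it under another.

```python
from typing import Callable, List


def word_tokens(text: str) -> List[str]:
    """Hypothetical tokenizer: one token per whitespace-separated word."""
    return text.split()


def char_chunk_tokens(text: str) -> List[str]:
    """Hypothetical tokenizer: one token per run of three characters."""
    return [text[i:i + 3] for i in range(0, len(text), 3)]


def fits(text: str, tokenize: Callable[[str], List[str]], maximum: int) -> bool:
    """Return True if the text has at most `maximum` tokens under `tokenize`."""
    return len(tokenize(text)) <= maximum


chunk = "tokenization boundaries differ between models"
maximum = 8

# The same chunk passes the limit with one tokenizer and fails with the other.
print(fits(chunk, word_tokens, maximum))
print(fits(chunk, char_chunk_tokens, maximum))
```

If the splitter counts tokens one way and the embedder counts them another way, elements that appear to respect maximum may still be truncated or rejected at embedding time.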

Example

node = ...  # Define a source node or component that provides hierarchical documents.
xform = SplitElements(child=node, tokenizer=tokenizer, maximum=512)
dataset = xform.execute()

static split_doc(parent: Document, tokenizer: Tokenizer, max: int, max_depth: int = 20, add_binary: bool = True) -> Document

Parameters:
  • parent -- the document that holds all the elements.

  • tokenizer -- tokenizer for computing the number of tokens in a chunk.

  • max -- maximum number of tokens allowed in a chunk as computed by the above tokenizer.

  • max_depth -- maximum depth of the binary tree that forms as we split each element into two recursively.

  • add_binary -- legacy feature to add text_representation as binary_representation as well.

Returns: the same parent document with split elements.
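
The roles of max and max_depth can be sketched on plain strings. This is a minimal illustration, not Sycamore's implementation: split_text and whitespace_tokenize are hypothetical names, the tokenizer is a stand-in, and the choice to cut at the space nearest the midpoint is an assumption made for this example.

```python
from typing import Callable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: one token per word."""
    return text.split()


def split_text(
    text: str,
    tokenize: Callable[[str], List[str]],
    max_tokens: int,
    max_depth: int = 20,
    depth: int = 0,
) -> List[str]:
    """Recursively halve `text` until each piece has at most `max_tokens` tokens."""
    # Stop when the piece fits or the binary tree has reached its depth limit.
    if len(tokenize(text)) <= max_tokens or depth >= max_depth:
        return [text]

    # Assumed heuristic: cut at the last space at or before the midpoint,
    # falling back to the first space after it.
    mid = len(text) // 2
    cut = text.rfind(" ", 0, mid + 1)
    if cut <= 0:
        cut = text.find(" ", mid)
    if cut <= 0:
        return [text]  # no usable split point

    left, right = text[:cut].strip(), text[cut:].strip()
    if not left or not right:
        return [text]

    # Each half becomes a child in the binary tree of splits.
    return (
        split_text(left, tokenize, max_tokens, max_depth, depth + 1)
        + split_text(right, tokenize, max_tokens, max_depth, depth + 1)
    )


text = " ".join(f"w{i}" for i in range(100))
chunks = split_text(text, whitespace_tokenize, max_tokens=16)
shallow = split_text(text, whitespace_tokenize, max_tokens=1, max_depth=2)
```

In this sketch a depth limit of d yields at most 2**d pieces from one input, so a small max_depth can leave pieces that still exceed the token limit; the limit bounds recursion rather than guaranteeing the maximum.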