Mark Misc

class sycamore.transforms.mark_misc.MarkBreakByTokens(child: Node, tokenizer: Tokenizer, limit: int = 512, **resource_args)

Bases: SingleThreadUser, NonGPUUser, Transform

MarkBreakByTokens is a transform that adds the ‘_break’ data attribute to each Element when the number of tokens exceeds the limit. It should most likely be the last marking operation before the final merge.

Parameters:
  • child – The source Node or component that provides the Elements

  • tokenizer – The tokenizer that will be used for embedding

  • limit – The maximum permitted number of tokens (default 512)

Example

source_node = ...
tokenizer = ...
marker = MarkBreakByTokens(child=source_node, tokenizer=tokenizer, limit=512)
dataset = marker.execute()
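
The marking rule described above can be illustrated without Sycamore itself. The following is a minimal sketch, not the library's implementation: it assumes elements are plain dicts with a "text" key, and the names `mark_break_by_tokens`, `count_tokens`, and `whitespace_count` are hypothetical stand-ins (a real Tokenizer would replace the whitespace counter). It keeps a running token total and marks the element that would push the total past the limit.

```python
from typing import Callable


def mark_break_by_tokens(
    elements: list[dict],
    count_tokens: Callable[[str], int],
    limit: int = 512,
) -> list[dict]:
    """Set '_break' on each element that would push the running total past `limit`."""
    running = 0
    for elem in elements:
        n = count_tokens(elem.get("text", ""))
        # Only break if the current chunk already holds something;
        # a single oversized element is left for later stages to handle.
        if running > 0 and running + n > limit:
            elem["_break"] = True  # a new chunk starts at this element
            running = 0
        running += n
    return elements


def whitespace_count(text: str) -> int:
    # Hypothetical stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


elements = [
    {"text": "one two three"},
    {"text": "four five"},
    {"text": "six seven eight"},
]
marked = mark_break_by_tokens(elements, whitespace_count, limit=5)
```

With a limit of 5, the first two elements fit together exactly, so only the third element is marked. How the actual transform treats an element that is larger than the limit on its own is not covered by this sketch.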

class sycamore.transforms.mark_misc.MarkBreakPage(child: Node, **resource_args)

Bases: SingleThreadUser, NonGPUUser, Transform

MarkBreakPage is a transform to add the ‘_break’ data attribute to each Element when the ‘page_number’ property changes.

Parameters:
  • child – The source Node or component that provides the Elements

Example

source_node = ...
marker = MarkBreakPage(child=source_node)
dataset = marker.execute()
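
As an illustration of the page-change rule, here is a minimal sketch under stated assumptions rather than the library's own code: elements are modeled as plain dicts with a "properties" dict, the function name `mark_break_page` is hypothetical, and elements without a ‘page_number’ are simply skipped (the real transform may treat them differently).

```python
def mark_break_page(elements: list[dict]) -> list[dict]:
    """Set '_break' on each element whose page_number differs from the previous one."""
    last_page = None
    for elem in elements:
        page = elem.get("properties", {}).get("page_number")
        if page is None:
            continue  # assumption: elements without a page number never trigger a break
        if last_page is not None and page != last_page:
            elem["_break"] = True  # first element on a new page
        last_page = page
    return elements


pages = [1, 1, 2, 2, 3]
page_elements = [{"properties": {"page_number": p}} for p in pages]
page_marked = mark_break_page(page_elements)
break_indices = [i for i, e in enumerate(page_marked) if e.get("_break")]
```

Here `break_indices` contains the positions of the first element on pages 2 and 3; the first element of the document is never marked because there is no earlier page to compare against.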

class sycamore.transforms.mark_misc.MarkDropTiny(child: Node, minimum: int = 2, **resource_args)

Bases: SingleThreadUser, NonGPUUser, Transform

MarkDropTiny is a transform to add the ‘_drop’ data attribute to each Element smaller than a certain size.

Parameters:
  • child – The source Node or component that provides the Elements

  • minimum – The smallest Element to keep (default 2)

Example

source_node = ...
marker = MarkDropTiny(child=source_node, minimum=2)
dataset = marker.execute()
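
The drop rule can be sketched in plain Python as well. This is an illustration only: it assumes that an Element's size is the character length of its text, which the description above does not state, and it models elements as dicts with a "text" key. The name `mark_drop_tiny` is used here for a hypothetical standalone function.

```python
def mark_drop_tiny(elements: list[dict], minimum: int = 2) -> list[dict]:
    """Set '_drop' on each element whose text is shorter than `minimum` characters."""
    for elem in elements:
        text = elem.get("text") or ""  # treat missing or None text as empty
        if len(text) < minimum:  # assumption: size is measured in characters
            elem["_drop"] = True
    return elements


tiny_elements = [{"text": ""}, {"text": "a"}, {"text": "ab"}, {"text": "hello"}]
tiny_marked = mark_drop_tiny(tiny_elements, minimum=2)
dropped = [e["text"] for e in tiny_marked if e.get("_drop")]
```

With `minimum=2`, the empty and single-character elements are marked, while an element of exactly two characters is kept.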