Mark Misc
- class sycamore.transforms.mark_misc.MarkBreakByTokens(child: Node, tokenizer: Tokenizer, limit: int = 512, **resource_args)
Bases: SingleThreadUser, NonGPUUser, Map
MarkBreakByTokens is a transform that adds the '_break' data attribute to an Element when the number of tokens exceeds the limit. It should usually be the last marking operation before the final merge.
- Parameters:
child -- The source Node or component that provides the Elements
tokenizer -- The tokenizer that will be used for embedding
limit -- Maximum permitted number of tokens (default: 512)
Example
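    source_node = ...
    tokenizer = ...  # a Tokenizer instance; required, since the signature gives it no default
    marker = MarkBreakByTokens(child=source_node, tokenizer=tokenizer, limit=512)
    dataset = marker.execute()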
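The marking rule can be illustrated without Sycamore at all. The following is a minimal sketch, not the library's implementation: it assumes Elements are plain dicts with a "text" key, that the tokenizer is any callable returning a list of tokens, and that "exceeds the limit" refers to a running token count that resets at each break. The names `mark_break_by_tokens`, `tokenize`, and `token_elements` are hypothetical.

```python
from typing import Callable


def mark_break_by_tokens(elements: list, tokenize: Callable[[str], list], limit: int = 512) -> list:
    """Set '_break' on each element that would push the running token count past limit."""
    running = 0
    for elem in elements:
        n = len(tokenize(elem.get("text") or ""))
        # Assumption: an element that is already marked also restarts the count.
        if elem.get("_break") or running + n > limit:
            elem["_break"] = True
            running = 0  # the marked element starts a new chunk
        running += n
    return elements


# Whitespace splitting stands in for a real tokenizer here.
token_elements = [{"text": "a b c"}, {"text": "d e"}, {"text": "f g h i"}]
mark_break_by_tokens(token_elements, str.split, limit=5)
token_breaks = [bool(e.get("_break")) for e in token_elements]
print(token_breaks)
```

With a limit of 5, the first two elements (3 + 2 tokens) fit in one chunk, and the third is marked because adding its 4 tokens would exceed the budget.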
- class sycamore.transforms.mark_misc.MarkBreakPage(child: Node, **resource_args)
Bases: SingleThreadUser, NonGPUUser, Map
MarkBreakPage is a transform that adds the '_break' data attribute to each Element whose 'page_number' property differs from that of the preceding Element.
- Parameters:
child -- The source Node or component that provides the Elements
Example
    source_node = ...
    marker = MarkBreakPage(child=source_node)
    dataset = marker.execute()
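As a rough illustration of page-change marking, here is a minimal sketch that does not use Sycamore. It assumes Elements are plain dicts carrying a "properties" dict with a 'page_number' entry, and that the first Element is never marked. The names `mark_break_page` and `page_elements` are hypothetical, and the real transform may differ in its details.

```python
_UNSET = object()  # sentinel meaning "no previous element seen yet"


def mark_break_page(elements: list) -> list:
    """Set '_break' on each element whose page_number differs from the previous one."""
    last_page = _UNSET
    for elem in elements:
        page = elem["properties"].get("page_number")
        if last_page is not _UNSET and page != last_page:
            elem["_break"] = True
        last_page = page
    return elements


page_elements = [{"properties": {"page_number": p}} for p in (1, 1, 2, 2, 3)]
mark_break_page(page_elements)
page_breaks = [bool(e.get("_break")) for e in page_elements]
print(page_breaks)
```

Only the first Element on each new page is marked, so a downstream merge step could treat every marked Element as the start of a new chunk.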
- class sycamore.transforms.mark_misc.MarkDropTiny(child: Node, minimum: int = 2, **resource_args)
Bases: SingleThreadUser, NonGPUUser, Map
MarkDropTiny is a transform that adds the '_drop' data attribute to each Element smaller than the given minimum size.
- Parameters:
child -- The source Node or component that provides the Elements
minimum -- The size of the smallest Element to keep (default: 2)
Example
    source_node = ...
    marker = MarkDropTiny(child=source_node, minimum=2)
    dataset = marker.execute()
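The documentation does not say how an Element's size is measured. The sketch below assumes it is the character length of the Element's text, and models Elements as plain dicts with a "text" key. It is an illustration under those assumptions rather than the library's code, and `mark_drop_tiny` and `tiny_elements` are hypothetical names.

```python
def mark_drop_tiny(elements: list, minimum: int = 2) -> list:
    """Set '_drop' on each element whose text is shorter than minimum characters."""
    for elem in elements:
        text = elem.get("text") or ""  # treat missing or None text as empty
        if len(text) < minimum:
            elem["_drop"] = True
    return elements


tiny_elements = [{"text": ""}, {"text": "a"}, {"text": "ab"}, {"text": "hello"}, {"text": None}]
mark_drop_tiny(tiny_elements, minimum=2)
tiny_drops = [bool(e.get("_drop")) for e in tiny_elements]
print(tiny_drops)
```

Because `minimum` is described as the smallest Element to keep, an Element whose size equals `minimum` is left unmarked in this sketch.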