Sketcher
- class sycamore.transforms.sketcher.SketchDebug(child: Node, threshold: float = 0.4, **kwargs)
Bases: SingleThreadUser, NonGPUUser, FlatMap
Removes each Document that is a near-duplicate of a previously seen Document, and prints the duplicate pairs and a histogram of distances. Relies on the shingles calculated by the Sketcher transform. This approach materializes the entire docset on a single node and keeps all sketches in memory, so it is not suitable for large docsets.
- Parameters:
child -- The source node or component that provides the documents
threshold -- Largest distance at which two Documents are still considered duplicates (default: 0.4)
Example
    node = ...  # source node
    xform = SketchDebug(child=node)
    dataset = xform.execute()
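The kind of report SketchDebug describes (duplicate pairs plus a histogram of distances) can be illustrated with a small standalone sketch. This is not Sycamore's code: the `overlap_distance` metric, the all-pairs comparison, the bucket count, and the `debug_report` helper are assumptions chosen for illustration, and the sketches are hand-written integer lists rather than real shingles.

```python
from collections import Counter
from itertools import combinations


def overlap_distance(a, b):
    """Hypothetical metric: 0.0 for identical sketches, 1.0 for disjoint ones."""
    if not a or not b:
        return 1.0
    shared = len(set(a) & set(b))
    return 1.0 - shared / max(len(a), len(b))


def debug_report(sketches, threshold=0.4, buckets=10):
    """Return (duplicate pairs, histogram of pairwise distances).

    `sketches` maps a document id to its list of hashes. Each histogram
    key is a bucket index; bucket i covers [i/buckets, (i+1)/buckets),
    with a distance of exactly 1.0 counted in the last bucket.
    """
    pairs = []
    histogram = Counter()
    for (id_a, sk_a), (id_b, sk_b) in combinations(sketches.items(), 2):
        d = overlap_distance(sk_a, sk_b)
        histogram[min(int(d * buckets), buckets - 1)] += 1
        if d <= threshold:
            pairs.append((id_a, id_b, d))
    return pairs, histogram


sketches = {
    "a": list(range(1, 11)),
    "b": list(range(1, 10)) + [99],  # shares 9 of 10 hashes with "a"
    "c": list(range(101, 111)),      # shares nothing with "a" or "b"
}
pairs, histogram = debug_report(sketches)
for id_a, id_b, d in pairs:
    print(f"duplicate: {id_a} ~ {id_b} (distance {d:.2f})")
for bucket in sorted(histogram):
    print(f"[{bucket / 10:.1f}, {(bucket + 1) / 10:.1f}): {histogram[bucket]}")
```

A histogram like this is mainly useful for choosing `threshold`: a clear gap between the low-distance and high-distance buckets suggests where to place the cutoff.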
- class sycamore.transforms.sketcher.SketchUniquify(child: Node, threshold: float = 0.4, **kwargs)
Bases: SingleThreadUser, NonGPUUser, FlatMap
Removes each Document that is a near-duplicate of a previously seen Document. Relies on the shingles calculated by the Sketcher transform. This approach materializes the entire docset on a single node and keeps all sketches in memory, so it is not suitable for large docsets.
- Parameters:
child -- The source node or component that provides the documents
threshold -- Largest distance at which two Documents are still considered duplicates (default: 0.4)
Example
    node = ...  # source node
    xform = SketchUniquify(child=node)
    dataset = xform.execute()
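The "seen before" behavior and the memory cost noted above can be illustrated with a minimal greedy de-duplication sketch. It is an illustration under assumptions, not Sycamore's implementation: documents are plain dicts, the `"shingles"` key, the `overlap_distance` metric, and the `uniquify` helper are all hypothetical. Only the meaning of `threshold` (largest distance still counted as a duplicate) is taken from the description.

```python
def overlap_distance(a, b):
    """Hypothetical metric: 0.0 for identical sketches, 1.0 for disjoint ones."""
    if not a or not b:
        return 1.0
    shared = len(set(a) & set(b))
    return 1.0 - shared / max(len(a), len(b))


def uniquify(docs, threshold=0.4):
    """Keep a document only if no previously kept sketch is within `threshold`.

    Every kept sketch stays in `seen`, so memory grows with the number of
    unique documents and each new document is compared against all of them.
    """
    seen = []
    kept = []
    for doc in docs:
        sketch = doc["shingles"]
        if any(overlap_distance(sketch, prev) <= threshold for prev in seen):
            continue  # near-duplicate of an earlier document: drop it
        seen.append(sketch)
        kept.append(doc)
    return kept


docs = [
    {"id": "a", "shingles": list(range(1, 11))},
    {"id": "b", "shingles": list(range(1, 10)) + [99]},  # near-duplicate of "a"
    {"id": "c", "shingles": list(range(101, 111))},      # unrelated content
]
unique_ids = [doc["id"] for doc in uniquify(docs)]
print(unique_ids)
```

Because the result depends on which document is seen first, the earliest copy of each near-duplicate group is the one that survives in this sketch.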
- class sycamore.transforms.sketcher.Sketcher(child: Node, window: int = 17, number: int = 16, **kwargs)
Bases: SingleThreadUser, NonGPUUser, Map
For each Document, uses shingling to hash sliding windows of the text_representation. The set of shingles is called the sketch. Documents' sketches can be compared to determine whether they have near-duplicate content. The SketchUniquify transform can be used to de-duplicate small docsets in Sycamore. De-duplicating at retrieval time is more scalable and avoids some relevance problems.
- Parameters:
child -- The source node or component that provides the documents
window -- Number of bytes in the sliding window that is hashed (default: 17)
number -- Count of hashes comprising a shingle (default: 16)
Example
    node = ...  # source node or component that provides hierarchical documents
    xform = Sketcher(child=node)
    dataset = xform.execute()
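The shingling idea described for Sketcher can be illustrated with a minimal standalone sketch. It is not Sycamore's implementation: the hash function (BLAKE2b truncated to 64 bits), the choice to keep the `number` smallest hashes, and the `make_sketch` and `overlap_distance` helpers are assumptions made for illustration. Only `window`, `number`, and their defaults come from the parameter list above.

```python
import hashlib
import heapq


def make_sketch(text, window=17, number=16):
    """Hash every `window`-byte sliding window and keep the `number` smallest hashes."""
    data = text.encode("utf-8")
    if len(data) <= window:
        windows = [data]  # text shorter than one window: hash it whole
    else:
        windows = (data[i:i + window] for i in range(len(data) - window + 1))
    hashes = {
        int.from_bytes(hashlib.blake2b(w, digest_size=8).digest(), "big")
        for w in windows
    }
    return heapq.nsmallest(number, hashes)


def overlap_distance(a, b):
    """Hypothetical metric: 0.0 for identical sketches, 1.0 for disjoint ones."""
    if not a or not b:
        return 1.0
    shared = len(set(a) & set(b))
    return 1.0 - shared / max(len(a), len(b))


original = "".join(f"Sentence number {i} of the sample document. " for i in range(20))
edited = original + " trailing edit"
unrelated = "".join(f"Row {i}: completely different wording here! " for i in range(20))

sk_original = make_sketch(original)
print(len(sk_original))
print(overlap_distance(sk_original, make_sketch(edited)))
print(overlap_distance(sk_original, make_sketch(unrelated)))
```

Keeping only the smallest hashes gives every document a fixed-size sketch regardless of its length, which is what makes comparing two documents cheap. A small edit changes only the windows that overlap it, so most of the retained hashes usually survive.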