Sketcher

class sycamore.transforms.sketcher.SketchDebug(child: Node, threshold: float = 0.4, **kwargs)

Bases: SingleThreadUser, NonGPUUser, Transform

Removes each Document that is a near-duplicate of a Document seen before. Prints out duplicate pairs and a histogram of distances. Uses the shingles calculated by the Sketcher transform. This approach materializes the entire docset on a single node and keeps every sketch in memory, so it is not suitable for large docsets.

Parameters:
  • child – The source node or component that provides the documents

  • threshold – Largest distance to be considered a duplicate (default: 0.4)

Example

node = ...  # source node
xform = SketchDebug(child=node)
dataset = xform.execute()
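
The reporting described above can be illustrated with a small standalone sketch. It does not use Sycamore and is not its implementation: the distance metric (1 minus the shared fraction of sketch entries), the helper names `overlap_distance` and `debug_report`, and the bucket count are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def overlap_distance(a: list[int], b: list[int]) -> float:
    """Assumed metric: 1 minus the shared fraction of sketch entries."""
    return 1.0 - len(set(a) & set(b)) / max(len(a), len(b), 1)


def debug_report(sketches: dict[str, list[int]], threshold: float = 0.4, bins: int = 5):
    """Return (duplicate pairs, histogram of pairwise distances by bucket index)."""
    pairs = []
    histogram = Counter()
    for (id_a, sk_a), (id_b, sk_b) in combinations(sketches.items(), 2):
        d = overlap_distance(sk_a, sk_b)
        # A distance of exactly 1.0 is folded into the last bucket.
        histogram[min(int(d * bins), bins - 1)] += 1
        if d <= threshold:
            pairs.append((id_a, id_b, round(d, 3)))
    return pairs, histogram


# Hypothetical, hand-written sketches standing in for real shingle hashes.
sketches = {
    "doc-1": [1, 2, 3, 4],
    "doc-2": [1, 2, 3, 9],  # shares 3 of 4 entries with doc-1
    "doc-3": [5, 6, 7, 8],  # shares nothing with the others
}
pairs, histogram = debug_report(sketches)
for bucket in range(5):
    print(f"{bucket / 5:.1f}-{(bucket + 1) / 5:.1f}: {'#' * histogram[bucket]}")
print(pairs)
```

Comparing every pair is quadratic in the number of documents, which is one reason this kind of report only makes sense on small docsets.
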

class sycamore.transforms.sketcher.SketchUniquify(child: Node, threshold: float = 0.4, **kwargs)

Bases: SingleThreadUser, NonGPUUser, Transform

Removes each Document that is a near-duplicate of a Document seen before. Uses the shingles calculated by the Sketcher transform. This approach materializes the entire docset on a single node and keeps every sketch in memory, so it is not suitable for large docsets.

Parameters:
  • child – The source node or component that provides the documents

  • threshold – Largest distance to be considered a duplicate (default: 0.4)

Example

node = ...  # source node
xform = SketchUniquify(child=node)
dataset = xform.execute()
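
To make the role of threshold concrete, here is a minimal standalone sketch of greedy near-duplicate removal. It is not Sycamore's code: the distance metric, the choice to compare only against previously kept sketches, and the names `sketch_distance` and `uniquify` are assumptions for illustration.

```python
def sketch_distance(a: list[int], b: list[int]) -> float:
    """Assumed metric: 0.0 for identical sketches, 1.0 for disjoint ones."""
    if not a or not b:
        return 1.0
    shared = len(set(a) & set(b))
    return 1.0 - shared / max(len(a), len(b))


def uniquify(sketches: dict[str, list[int]], threshold: float = 0.4) -> list[str]:
    """Keep an id only if its sketch is farther than `threshold` from every kept sketch."""
    kept: list[str] = []
    for doc_id, sk in sketches.items():
        if all(sketch_distance(sk, sketches[k]) > threshold for k in kept):
            kept.append(doc_id)
    return kept


# Hypothetical, hand-written sketches standing in for real shingle hashes.
sketches = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "b": [1, 2, 3, 4, 5, 6, 7, 8, 90, 100],  # 8 of 10 entries shared with "a"
    "c": [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],  # nothing shared
}
unique_ids = uniquify(sketches)
print(unique_ids)
```

Because every kept sketch stays in the `kept` list and each new document is checked against all of them, memory and time grow with the docset, which matches the scaling caveat above.
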

class sycamore.transforms.sketcher.Sketcher(child: Node, window: int = 17, number: int = 16, **kwargs)

Bases: SingleThreadUser, NonGPUUser, Transform

For each Document, uses shingling to hash sliding windows of the text_representation. The set of shingles is called the sketch. Documents’ sketches can be compared to determine if they have near-duplicate content. The SketchUniquify transform can be used to de-duplicate small docsets in Sycamore. De-duplicating at retrieval-time is more scalable and avoids some relevance problems.

Parameters:
  • child – The source node or component that provides the documents

  • window – Number of bytes in the sliding window that is hashed (default: 17)

  • number – Count of hashes comprising a shingle (default: 16)

Example

node = ...  # source node or component that provides hierarchical documents.
xform = Sketcher(child=node)
dataset = xform.execute()
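
The idea of hashing sliding windows can be sketched in plain Python. This is one common approach (keep the smallest hashes, sometimes called bottom-k min-hashing) and is only an assumption about the general technique; Sycamore's actual hash function and selection rule may differ. The function name `sketch` and the use of BLAKE2b are illustrative choices.

```python
import hashlib
import heapq


def sketch(text: str, window: int = 17, number: int = 16) -> list[int]:
    """Hash every `window`-byte slice of the text and keep the `number` smallest hashes."""
    data = text.encode("utf-8")
    if len(data) <= window:
        windows = [data]
    else:
        windows = (data[i:i + window] for i in range(len(data) - window + 1))
    # A set removes repeated windows so each distinct hash counts once.
    hashes = {
        int.from_bytes(hashlib.blake2b(w, digest_size=8).digest(), "big")
        for w in windows
    }
    return heapq.nsmallest(number, hashes)


a = sketch("The quick brown fox jumps over the lazy dog near the river bank.")
b = sketch("The quick brown fox jumps over the lazy dog near the river bend.")
c = sketch("An entirely unrelated sentence about database index maintenance.")

shared_ab = len(set(a) & set(b))  # near-duplicates share most sketch entries
shared_ac = len(set(a) & set(c))  # unrelated texts share none
print(shared_ab, shared_ac)
```

Keeping only a fixed number of hashes bounds the size of each sketch regardless of document length, while texts that share most of their windows still end up with mostly the same entries.
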