Sketch
The sketch transform adds metadata to each Document in the form of a sketch that can be used to identify near-duplicate documents. This step is the prerequisite for later removing or collapsing near-duplicate documents. Currently, the sketch consists of a set of hash values called shingles. These are relatively inexpensive to calculate and can safely be a default part of any ingestion pipeline. Using sketch in a Sycamore data prep pipeline is relatively easy:
docset = (context.read.binary(...)
          .partition(...)
          .explode()
          .sketch()
          .embed(...))
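To give a feel for how a set of shingle hashes can flag near-duplicates, here is a small standalone illustration in plain Python. It is not Sycamore's implementation: the function names shingles and overlap, the 3-word window, the 16-hash sketch size, and the use of SHA-256 are all assumptions made for this example, and the real Sketcher may tokenize, hash, and size its sketches differently.

```python
import hashlib


def shingles(text: str, window: int = 3, count: int = 16) -> set[int]:
    """Return the `count` smallest hashes over all `window`-word runs of `text`.

    Hypothetical helper for illustration only; not part of Sycamore.
    """
    words = text.lower().split()
    # Every run of `window` consecutive words; a short text yields a single run.
    runs = [" ".join(words[i:i + window]) for i in range(max(1, len(words) - window + 1))]
    hashes = {
        int.from_bytes(hashlib.sha256(run.encode("utf-8")).digest()[:8], "big")
        for run in runs
    }
    # Keeping only the smallest hashes bounds the sketch size regardless of text length.
    return set(sorted(hashes)[:count])


def overlap(a: set[int], b: set[int]) -> float:
    """Jaccard similarity of two sketches: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    original = "The quick brown fox jumps over the lazy dog near the quiet river bank"
    edited = "The quick brown fox jumps over the lazy dog near the quiet river shore"
    unrelated = "Quarterly revenue grew because subscription renewals exceeded the forecast"

    print("edited copy:", overlap(shingles(original), shingles(edited)))
    print("unrelated:  ", overlap(shingles(original), shingles(unrelated)))
```

Two texts that differ by a single word share most of their shingles and score close to 1.0, while unrelated texts share essentially none. A de-duplication step can then treat any pair scoring above a chosen threshold as near-duplicates.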
Query-time de-duplication is explained here. For more information, see the documentation for Sketcher.