Filter#
The filter transform lets you retain or discard documents from a DocSet based on a predicate. For example, the following code filters a DocSet to retain only those Documents containing at least 2000 characters.
docset.filter(lambda doc: sum(len(el.text_representation)
for el in doc.elements
if el.text_representation is not None) >= 2000)
We used a lambda function here, which works well for simple funtions, but we can also use any Python callable. For instance, if we frequently want to filter by different document lengths, we can create a LengthFilter
class, and pass that to the filter
function.
class LengthFilter:
def __init__(self, length: int):
self.length = length
def __call__(self, doc: Document) -> Document:
total = 0
for el in doc.elements:
if el.text_representation is not None:
total += len(el.text_representation)
return total >= self.length
docset.filter(LengthFilter(2000))
Note that __call__
must take a single argument of type Document
and return a Document
, but the __init__
method can take additional parameters.
FilterElements#
In addition to filtering entire documents, we also provide a convenience method for filtering elements from each Document in a DocSet. In this case, we supply a predicate that takes in elements and returns whether the element should be retained in the document. For example, if we aren't interested in processing images in our documents, we could filter them out with
docset.filter_elements(lambda el: el.type != "Image")