Basics

class sycamore.transforms.basics.Filter(child: Node, *, f: Callable[[Document], bool], **resource_args)

Bases: MapBatch

Filter is a transformation that applies a user-defined filter function to a dataset, keeping only the documents for which the function returns True.

Parameters:
  • child -- The source node or component that provides the dataset to be filtered.

  • f -- A callable that takes a Document object and returns a boolean: True if the document should be included in the filtered dataset, False otherwise.

  • resource_args -- Additional resource-related arguments that can be passed to the filtering operation.

Example

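from sycamore.data import Document
from sycamore.transforms.basics import Filter

source_node = ...  # Define a source node or component that provides a dataset.

def custom_filter(doc: Document) -> bool:
    # Define your custom filtering logic here.
    # some_property and some_value are placeholders for your own field and value.
    return doc.some_property == some_value

filter_transform = Filter(child=source_node, f=custom_filter)
filtered_dataset = filter_transform.execute()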
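
To make the contract of f concrete without a running pipeline, the following is a minimal plain-Python sketch of predicate filtering. It is not Sycamore's implementation: the dict-based Doc stand-in, the keep_matching helper, the is_english predicate, and the "language" property are all hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical stand-in for a Document: a plain dict with a "properties" mapping.
Doc = dict


def keep_matching(docs: Iterable[Doc], f: Callable[[Doc], bool]) -> Iterator[Doc]:
    """Yield only the documents for which the predicate f returns True."""
    for doc in docs:
        if f(doc):
            yield doc


def is_english(doc: Doc) -> bool:
    # Use .get so documents missing the key are dropped instead of raising KeyError.
    return doc.get("properties", {}).get("language") == "en"


docs = [
    {"doc_id": "a", "properties": {"language": "en"}},
    {"doc_id": "b", "properties": {"language": "fr"}},
    {"doc_id": "c", "properties": {}},
]

kept_ids = [d["doc_id"] for d in keep_matching(docs, is_english)]
```

A predicate written this way returns a plain bool for every document, including those that lack the field being tested, which is generally what a filter function passed as f should do.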

class sycamore.transforms.basics.Limit(child: Node, limit: int)

Bases: NonCPUUser, NonGPUUser, Transform

Limit is a transformation that restricts the size of a dataset to a specified number of records.

Parameters:
  • child -- The source node or component that provides the dataset to be limited.

  • limit -- The maximum number of records to include in the resulting dataset.

Example

from sycamore.transforms.basics import Limit

source_node = ...  # Define a source node or component that provides a dataset.
limit_transform = Limit(child=source_node, limit=100)
limited_dataset = limit_transform.execute()
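
Because limit is a maximum, an input with fewer records than the limit would be expected to pass through unchanged. The plain-Python sketch below illustrates that behavior on ordinary iterables. It is not Sycamore's implementation: take_at_most is a hypothetical helper, and rejecting negative limits is an assumption of this sketch rather than documented behavior of Limit.

```python
from itertools import islice
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def take_at_most(records: Iterable[T], limit: int) -> List[T]:
    """Return up to `limit` records, consuming no more input than needed."""
    if limit < 0:
        # Assumption of this sketch: a negative cap is treated as a caller error.
        raise ValueError("limit must be non-negative")
    return list(islice(records, limit))


first_three = take_at_most(range(10), 3)     # input larger than the cap
short_input = take_at_most(["x", "y"], 100)  # input smaller than the cap
```

Here first_three holds the first three records of the input, while short_input holds both of its records, since the cap of 100 is never reached.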