Standardizer

class sycamore.transforms.standardizer.Standardizer

Bases: ABC

An abstract base class for implementing standardizers, which are responsible for transforming specific fields within a document according to certain rules.

abstract fixer(text: str) -> Any

Abstract method to be implemented by subclasses to define how the relevant values should be standardized.

Parameters:

text (str) -- The text or date string to be standardized.

Returns:

A standardized value.

abstract standardize(doc: Document, key_path: List[str]) -> Document

Abstract method that applies the fixer method to a specific field in the document, as identified by key_path.

Parameters:
  • doc (Document) -- The document to be standardized.

  • key_path (List[str]) -- The path to the field within the document that should be standardized.

Returns:

The document with the standardized field.

Return type:

Document

Raises:

KeyError -- If any of the keys in key_path are not found in the document.
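
As a rough illustration of the contract described above, the sketch below pairs a fixer with a standardize that walks key_path. It deliberately does not import sycamore: a plain nested dict stands in for Document, and SketchStandardizer and WhitespaceStandardizer are hypothetical names, not part of the library. For brevity the sketch gives standardize a shared default implementation, whereas the class documented here declares it abstract.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SketchStandardizer(ABC):
    """Hypothetical stand-in mirroring the fixer/standardize split."""

    @staticmethod
    @abstractmethod
    def fixer(text: str) -> Any:
        """Return the standardized form of a single value."""

    @classmethod
    def standardize(cls, doc: Dict[str, Any], key_path: List[str]) -> Dict[str, Any]:
        # Walk to the dict holding the target field; a missing key raises KeyError.
        target = doc
        for key in key_path[:-1]:
            target = target[key]
        leaf = key_path[-1]
        target[leaf] = cls.fixer(target[leaf])
        return doc


class WhitespaceStandardizer(SketchStandardizer):
    """Hypothetical subclass: collapses runs of whitespace."""

    @staticmethod
    def fixer(text: str) -> str:
        return " ".join(text.split())


# A nested dict plays the role of a document (assumption for this sketch).
doc = {"properties": {"title": "  Annual   Report \n 2023 "}}
result = WhitespaceStandardizer.standardize(doc, ["properties", "title"])
print(result["properties"]["title"])  # Annual Report 2023
```

Keeping fixer free of any document knowledge means the same value-level rule can be reused wherever the field happens to live in the document.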

class sycamore.transforms.standardizer.DateTimeStandardizer

Bases: Standardizer

A standardizer for transforming date and time strings into a consistent format.

Example

source_docset = ...  # Define a source node or component that provides hierarchical documents.
transformed_docset = source_docset.map(
    lambda doc: DateTimeStandardizer.standardize(
        doc,
        key_path=["path", "to", "datetime"]))

static fixer(raw_dateTime: str) -> datetime

Standardize a date-time string by parsing it into a datetime object.

Parameters:
  • raw_dateTime (str) -- The raw date-time string to be standardized.

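  • format (Optional[str]) -- strftime-compatible format string to render the datetime. This parameter does not appear in the signature above; standardize takes date_format for this purpose.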

Returns:

The datetime object parsed from the input string.

Return type:

datetime

Raises:
  • ValueError -- If the input string cannot be parsed into a valid date-time.

  • RuntimeError -- For any other unexpected errors during the processing.
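
To make the parse-or-raise behavior concrete, here is a minimal sketch using only the standard library. It is not the library's parser: parse_datetime_sketch and CANDIDATE_FORMATS are hypothetical, and trying a fixed list of strptime layouts is an assumption chosen for simplicity. The real fixer may accept many more input styles.

```python
from datetime import datetime

# Hypothetical list of accepted layouts; the real parser may accept far more.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%m/%d/%Y %H:%M",
]


def parse_datetime_sketch(raw: str) -> datetime:
    """Try each candidate layout in turn and return the first successful parse."""
    cleaned = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    # Mirrors the documented contract: unparseable input raises ValueError.
    raise ValueError(f"Invalid date format: {raw!r}")


parsed = parse_datetime_sketch(" 03/15/2023 14:30 ")
print(parsed.isoformat())  # 2023-03-15T14:30:00
```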

static standardize(doc: Document, key_path: List[str], add_day: bool = True, add_dateTime: bool = True, date_format: str | None = None) -> Document

Applies the fixer method to a specific date-time field in the document as defined by the key_path.

Parameters:
  • doc (Document) -- The document to be standardized.

  • key_path (List[str]) -- The path to the date-time field within the document that should be standardized.

  • add_day (bool) -- Whether to add a "day" field to the document with the date extracted from the standardized date-time field. Will not overwrite an existing "day" field.

  • add_dateTime (bool) -- Whether to add a "dateTime" field to the document with the standardized date-time value. Will not overwrite an existing "dateTime" field.

  • date_format (Optional[str]) -- strftime-compatible format string to render the datetime.

Returns:

The document with the standardized date-time field and, when add_day or add_dateTime is set, the additional "day" and "dateTime" fields.

Return type:

Document

Raises:

KeyError -- If any of the keys in key_path are not found in the document.
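
The interaction between add_day, add_dateTime, and date_format can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: a nested dict stands in for Document, standardize_datetime_sketch is a hypothetical name, only ISO 8601 input is parsed, the default output layout is a guess, and the extra fields are assumed to be written next to the target field.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def standardize_datetime_sketch(
    doc: Dict[str, Any],
    key_path: List[str],
    add_day: bool = True,
    add_dateTime: bool = True,
    date_format: Optional[str] = None,
) -> Dict[str, Any]:
    # Walk to the dict holding the target field; missing keys raise KeyError.
    target = doc
    for key in key_path[:-1]:
        target = target[key]
    leaf = key_path[-1]

    # Stand-in for the fixer: only ISO 8601 input is handled in this sketch.
    parsed = datetime.fromisoformat(target[leaf].strip())

    # Render with the caller's format, or an assumed default layout.
    target[leaf] = parsed.strftime(date_format or "%Y-%m-%d %H:%M:%S")

    # Assumption: extra fields sit beside the target field; existing ones are kept.
    if add_dateTime and "dateTime" not in target:
        target["dateTime"] = parsed
    if add_day and "day" not in target:
        target["day"] = parsed.date()
    return doc


doc = {"properties": {"timestamp": "2023-03-15T14:30:00", "day": "already set"}}
standardize_datetime_sketch(doc, ["properties", "timestamp"], date_format="%Y/%m/%d")
print(doc["properties"]["timestamp"])  # 2023/03/15
print(doc["properties"]["day"])        # already set
```

In this run the pre-existing "day" value is left untouched, matching the "will not overwrite" rule stated for both flags.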

class sycamore.transforms.standardizer.USStateStandardizer

Bases: Standardizer

A standardizer that expands US state abbreviations in text: every substring matching a state abbreviation is replaced with the full state name.

Example

source_docset = ...  # Define a source node or component that provides hierarchical documents.
transformed_docset = source_docset.map(
    lambda doc: USStateStandardizer.standardize(
        doc,
        key_path=["path", "to", "location"]))

static fixer(text: str) -> str

Replaces any US state abbreviations in the text with their full state names.

Parameters:

text (str) -- The text containing US state abbreviations.

Returns:

The text with state abbreviations replaced by full state names.

Return type:

str
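
One plausible way to perform this kind of substitution is a single regular expression built from a lookup table. The sketch below is an assumption about the approach, not the library's implementation: STATE_NAMES is a deliberately partial, hypothetical table, and the word-boundary, case-sensitive matching is a design choice of this example.

```python
import re

# Hypothetical, deliberately partial lookup table.
STATE_NAMES = {
    "CA": "California",
    "NY": "New York",
    "TX": "Texas",
    "WA": "Washington",
}

# \b keeps "CA" from matching inside longer tokens such as "CAT".
_STATE_PATTERN = re.compile(r"\b(" + "|".join(STATE_NAMES) + r")\b")


def expand_states_sketch(text: str) -> str:
    """Replace each standalone abbreviation with its full state name."""
    return _STATE_PATTERN.sub(lambda m: STATE_NAMES[m.group(1)], text)


print(expand_states_sketch("Flights from Seattle, WA to Austin, TX"))
# Flights from Seattle, Washington to Austin, Texas
```

Matching is case-sensitive here, so lowercase words such as "ca" are left alone; whether the real fixer behaves the same way is not stated in this reference.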

static standardize(doc: Document, key_path: List[str]) -> Document

Applies the fixer method to a specific field in the document as defined by the key_path.

Parameters:
  • doc (Document) -- The document to be standardized.

  • key_path (List[str]) -- The path to the field within the document that should be standardized.

Returns:

The document with the standardized field.

Return type:

Document

Raises:

KeyError -- If any of the keys in key_path are not found in the document.