Standardizer
- class sycamore.transforms.standardizer.Standardizer
Bases: ABC
An abstract base class for implementing standardizers, which transform specific fields within a document according to defined rules.
- abstract fixer(text: str) -> Any
Abstract method to be implemented by subclasses to define how the relevant values should be standardized.
- Parameters:
text (str) -- The text or date string to be standardized.
- Returns:
A standardized value.
- abstract standardize(doc: Document, key_path: List[str]) -> Document
Abstract method that applies the fixer method to a specific field in the document, as identified by key_path.
- Parameters:
doc (Document) -- The document to be standardized.
key_path (List[str]) -- The path to the field within the document that should be standardized.
- Returns:
The document with the standardized field.
- Return type:
Document
- Raises:
KeyError -- If any of the keys in key_path are not found in the document.
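The split between fixer (transform one value) and standardize (locate the field and apply fixer) can be illustrated without Sycamore itself. The following is a minimal sketch, not the library's implementation: a plain nested dict stands in for Document, standardize is made concrete in the base class for brevity, and the names BaseStandardizer and WhitespaceStandardizer are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseStandardizer(ABC):
    """Hypothetical stand-in for the abstract Standardizer interface."""

    @staticmethod
    @abstractmethod
    def fixer(text: str) -> Any:
        """Return the standardized form of a single value."""

    @classmethod
    def standardize(cls, doc: Dict[str, Any], key_path: List[str]) -> Dict[str, Any]:
        """Walk key_path through the nested dict and apply fixer to the final field."""
        current = doc
        for key in key_path[:-1]:
            current = current[key]  # raises KeyError if an intermediate key is missing
        last = key_path[-1]
        current[last] = cls.fixer(current[last])  # raises KeyError if the field is missing
        return doc


class WhitespaceStandardizer(BaseStandardizer):
    """Hypothetical subclass: collapses runs of whitespace into single spaces."""

    @staticmethod
    def fixer(text: str) -> str:
        return " ".join(text.split())


# A nested dict used here in place of a Sycamore Document (assumption).
doc = {"properties": {"title": "  Annual   report \n 2023 "}}
result = WhitespaceStandardizer.standardize(doc, ["properties", "title"])
```

Here key_path names each level of nesting in order, so ["properties", "title"] addresses doc["properties"]["title"]; a missing key at any level surfaces as the KeyError documented above.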
- class sycamore.transforms.standardizer.DateTimeStandardizer
Bases: Standardizer
A standardizer for transforming date and time strings into a consistent format.
Example
source_docset = ...  # Define a source node or component that provides hierarchical documents.
transformed_docset = source_docset.map(
    lambda doc: DateTimeStandardizer.standardize(
        doc, key_path=["path", "to", "datetime"]))
- static fixer(raw_dateTime: str) -> datetime
Standardize a date-time string by parsing it into a datetime object.
- Parameters:
raw_dateTime (str) -- The raw date-time string to be standardized.
format (Optional[str]) -- strftime-compatible format string to render the datetime. The signature above does not accept this argument; standardize takes the equivalent date_format.
- Returns:
The datetime object parsed from the input string.
- Return type:
datetime
- Raises:
ValueError -- If the input string cannot be parsed into a valid date-time.
RuntimeError -- For any other unexpected errors during the processing.
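The parse-then-render idea behind fixer can be sketched with the standard library alone. This is an illustration under stated assumptions, not the parser Sycamore uses: parse_datetime and CANDIDATE_FORMATS are hypothetical names, and the list of accepted formats is chosen only for the example.

```python
from datetime import datetime

# Assumed candidate formats, chosen only for illustration.
CANDIDATE_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"]


def parse_datetime(raw: str) -> datetime:
    """Hypothetical helper: try each candidate format and return the first successful parse."""
    cleaned = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"Unrecognized date-time string: {raw!r}")


parsed = parse_datetime("March 17, 2023")
# Rendering with a strftime-compatible format string yields one consistent representation.
rendered = parsed.strftime("%Y-%m-%d %H:%M:%S")
```

As in the documented behaviour, input that cannot be parsed raises ValueError rather than passing through unchanged.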
- static standardize(doc: Document, key_path: List[str], add_day: bool = True, add_dateTime: bool = True, date_format: str | None = None) -> Document
Applies the fixer method to a specific date-time field in the document as defined by the key_path.
- Parameters:
doc (Document) -- The document to be standardized.
key_path (List[str]) -- The path to the date-time field within the document that should be standardized.
add_day (bool) -- Whether to add a "day" field to the document with the date extracted from the standardized date-time field. Will not overwrite an existing "day" field.
add_dateTime (bool) -- Whether to add a "dateTime" field to the document with the standardized date-time value. Will not overwrite an existing "dateTime" field.
date_format (Optional[str]) -- strftime-compatible format string to render the datetime.
- Returns:
The document with the standardized date-time field and, when add_day or add_dateTime is set, the additional "day" and "dateTime" fields.
- Return type:
Document
- Raises:
KeyError -- If any of the keys in key_path are not found in the document.
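How add_day, add_dateTime, and date_format interact can be sketched as follows. This is a hypothetical stand-in, not the library's code: it operates on a nested dict rather than a Document, it only parses ISO 8601 input, and both the default output format and the placement of the extra fields next to the standardized field are assumptions.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def standardize_datetime(
    doc: Dict[str, Any],
    key_path: List[str],
    add_day: bool = True,
    add_dateTime: bool = True,
    date_format: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical stand-in operating on a nested dict rather than a Document."""
    parent = doc
    for key in key_path[:-1]:
        parent = parent[key]  # KeyError if the path does not exist
    field = key_path[-1]
    # Assumption: ISO 8601 input only, to keep the sketch short.
    parsed = datetime.fromisoformat(parent[field])
    # Assumption: this default format is illustrative only.
    parent[field] = parsed.strftime(date_format or "%Y-%m-%d %H:%M:%S")
    # Assumption: the extra fields are stored next to the standardized field.
    if add_dateTime:
        parent.setdefault("dateTime", parsed)  # setdefault never overwrites
    if add_day:
        parent.setdefault("day", parsed.date())
    return doc


doc = {"properties": {"created": "2023-03-17T14:30:00", "day": "already set"}}
standardize_datetime(doc, ["properties", "created"], date_format="%d %b %Y %H:%M")
```

In this run the pre-existing "day" value is left untouched, while "dateTime" is added because no such field existed, mirroring the "will not overwrite" rule described for both flags.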
- class sycamore.transforms.standardizer.USStateStandardizer
Bases: Standardizer
A standardizer that replaces US state abbreviations found in text with the corresponding full state names.
Example
source_docset = ...  # Define a source node or component that provides hierarchical documents.
transformed_docset = source_docset.map(
    lambda doc: USStateStandardizer.standardize(
        doc, key_path=["path", "to", "location"]))
- static fixer(text: str) -> str
Replaces any US state abbreviations in the text with their full state names.
- Parameters:
text (str) -- The text containing US state abbreviations.
- Returns:
The text with state abbreviations replaced by full state names.
- Return type:
str
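Substring replacement of this kind is commonly done with a word-boundary regular expression. The sketch below shows that general technique and is not Sycamore's implementation: STATE_NAMES is a deliberately tiny, illustrative mapping and expand_states is a hypothetical helper.

```python
import re

# Deliberately tiny, illustrative mapping; a real table would cover every state.
STATE_NAMES = {"CA": "California", "NY": "New York", "TX": "Texas"}

# \b anchors keep the pattern from matching inside longer words such as "CAT".
_ABBREVIATION = re.compile(r"\b(" + "|".join(STATE_NAMES) + r")\b")


def expand_states(text: str) -> str:
    """Hypothetical helper: replace each known abbreviation with its full name."""
    return _ABBREVIATION.sub(lambda m: STATE_NAMES[m.group(1)], text)


expanded = expand_states("Flight from Oakland, CA to Buffalo, NY; CAT scan unaffected.")
```

Matching is case-sensitive here, so lowercase words are left alone, but with a full mapping uppercase words that coincide with an abbreviation (for example "OR" or "IN") would still be expanded by this simple approach.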
- static standardize(doc: Document, key_path: List[str]) -> Document
Applies the fixer method to a specific field in the document as defined by the key_path.
- Parameters:
doc (Document) -- The document to be standardized.
key_path (List[str]) -- The path to the field within the document that should be standardized.
- Returns:
The document with the standardized field.
- Return type:
Document
- Raises:
KeyError -- If any of the keys in key_path are not found in the document.