Augment Text
- class sycamore.transforms.augment_text.AugmentText(child: Node, text_augmentor: TextAugmentor, **kwargs)
Bases: NonCPUUser, NonGPUUser, Map
The AugmentText transform inserts document metadata into each document's text representation to improve embedding and search quality.
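As a rough illustration of the idea, the sketch below folds a metadata field into the text of each document using plain Python. It does not use Sycamore itself: `StubDocument`, `augment_all`, and `title_prefix` are hypothetical names invented for this example, and the real transform operates on a DocSet rather than a list.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class StubDocument:
    """Hypothetical stand-in for a sycamore Document (not the real class)."""
    text_representation: Optional[str] = None
    properties: dict[str, Any] = field(default_factory=dict)


def augment_all(docs: list[StubDocument],
                augment: Callable[[StubDocument], str]) -> list[StubDocument]:
    """Return new documents whose text_representation is augment(doc)."""
    return [
        StubDocument(text_representation=augment(d), properties=dict(d.properties))
        for d in docs
    ]


def title_prefix(doc: StubDocument) -> str:
    # Prepend the title so it becomes part of the text that gets embedded.
    return f"Title: {doc.properties['title']}. {doc.text_representation}"


docs = [StubDocument("Torque specs for the M8 bolt.", {"title": "Fastener Guide"})]
augmented = augment_all(docs, title_prefix)
print(augmented[0].text_representation)
```

The input documents are left unchanged here; only the returned copies carry the augmented text.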
- class sycamore.transforms.augment_text.JinjaTextAugmentor(template: str, modules: dict[str, Any] = {})
Bases: TextAugmentor
JinjaTextAugmentor renders a jinja template in a SandboxedEnvironment to build a new text representation that includes metadata from the document.
- Parameters:
template (str) -- A jinja2 template for the new text representation. It can reference doc and any module passed in the modules parameter.
modules (dict[str, Any]) -- A mapping of module names to module objects.
Example
    from sycamore.transforms.augment_text import JinjaTextAugmentor
    from sycamore.transforms.regex_replace import COALESCE_WHITESPACE
    import pathlib

    template = '''This document is from {{ pathlib.Path(doc.properties['path']).name }}.
    The title is {{ doc.properties['title'] }}.
    The authors are {{ doc.properties['authors'] }}.
    {% if doc.text_representation %}
    {{ doc.text_representation }}
    {% else %}
    There is no text representation for this
    {% endif %}
    '''
    aug = JinjaTextAugmentor(template=template, modules={"pathlib": pathlib})
    aug_docset = exp_docset.augment_text(aug).regex_replace(COALESCE_WHITESPACE)
    aug_docset.show(show_binary=False, truncate_content=False)
- class sycamore.transforms.augment_text.UDFTextAugmentor(fn: Callable[[Document], str])
Bases: TextAugmentor
UDFTextAugmentor augments text by calling a user-defined function (UDF) that maps documents to strings.
- Parameters:
fn (Callable[[Document], str]) -- A function that maps a document to the string to use as the new text_representation
Example
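    import sycamore
    from sycamore.data import Document
    from sycamore.transforms.augment_text import UDFTextAugmentor

    def aug_text_fn(doc: Document) -> str:
        # Build the new text by prefixing a sentence derived from metadata.
        return " ".join([
            f"This pertains to the part {doc.properties['part_name']}.",
            f"{doc.text_representation}"
        ])

    augmentor = UDFTextAugmentor(aug_text_fn)
    context = sycamore.init()
    # `paths` is assumed to be defined elsewhere (the PDF locations to read).
    # Parentheses allow the method chain to span multiple lines.
    pdf_docset = (
        context.read.binary(paths, binary_format="pdf")
        .augment_text(augmentor)
    )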
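The function in the example above indexes `doc.properties['part_name']` directly and formats `doc.text_representation` even when it is None. Assuming properties behaves like a dict, a missing key would raise KeyError and a None text would be rendered as the string "None". A more defensive variant might look like the sketch below, where `safe_aug_text_fn` is a hypothetical name and `SimpleNamespace` objects stand in for real Documents so the snippet runs without Sycamore.

```python
from types import SimpleNamespace


def safe_aug_text_fn(doc) -> str:
    """UDF-style function: document in, new text_representation out."""
    parts = []
    # Only mention the part when the metadata is actually present.
    part_name = doc.properties.get("part_name")
    if part_name:
        parts.append(f"This pertains to the part {part_name}.")
    # Skip an empty or missing text instead of emitting the string "None".
    if doc.text_representation:
        parts.append(doc.text_representation)
    return " ".join(parts)


# SimpleNamespace objects are hypothetical stand-ins for sycamore Documents.
with_part = SimpleNamespace(properties={"part_name": "rotor"},
                            text_representation="Inspect monthly.")
no_meta = SimpleNamespace(properties={}, text_representation=None)

print(safe_aug_text_fn(with_part))
print(repr(safe_aug_text_fn(no_meta)))
```

A function with this shape could be passed to UDFTextAugmentor in the same way as `aug_text_fn`.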