Augment Text#

class sycamore.transforms.augment_text.AugmentText(child: Node, text_augmentor: TextAugmentor, **kwargs)[source]#

Bases: NonCPUUser, NonGPUUser, Transform

The AugmentText transform adds metadata to the text representation of documents to improve embedding and search quality.

class sycamore.transforms.augment_text.JinjaTextAugmentor(template: str, modules: dict[str, Any] = {})[source]#

Bases: TextAugmentor

JinjaTextAugmentor renders a Jinja template in a SandboxedEnvironment to build a new text representation that incorporates metadata from the document.

Parameters:
  • template (str) – A jinja2 template for the new text representation. It can contain references to doc and to any modules passed in the modules parameter.

  • modules (dict[str, Any]) – A mapping of module names to module objects

Example

from sycamore.transforms.augment_text import JinjaTextAugmentor
from sycamore.transforms.regex_replace import COALESCE_WHITESPACE
import pathlib
template = '''This document is from {{ pathlib.Path(doc.properties['path']).name }}.
The title is {{ doc.properties['title'] }}.
The authors are {{ doc.properties['authors'] }}.
{% if doc.text_representation %}
    {{ doc.text_representation }}
{% else %}
    There is no text representation for this
{% endif %}
'''
aug = JinjaTextAugmentor(template=template, modules={"pathlib": pathlib})
aug_docset = exp_docset.augment_text(aug).regex_replace(COALESCE_WHITESPACE)
aug_docset.show(show_binary=False, truncate_content=False)
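
To make the roles of doc and modules in the template concrete, here is a minimal sketch of how such a template could be rendered with jinja2's SandboxedEnvironment. It is not sycamore's actual implementation: the helper render_augmented_text and the SimpleNamespace stand-in for a document are hypothetical, and the sketch assumes jinja2 is installed.

```python
import pathlib
from types import SimpleNamespace

from jinja2.sandbox import SandboxedEnvironment


def render_augmented_text(template: str, doc, modules: dict) -> str:
    """Hypothetical helper: render `template` with `doc` and each module exposed by name."""
    env = SandboxedEnvironment()
    return env.from_string(template).render(doc=doc, **modules)


# Stand-in for a document; a real sycamore Document has more fields than this.
doc = SimpleNamespace(
    properties={"path": "/data/reports/q3.pdf", "title": "Q3 Report"},
    text_representation="Revenue grew.",
)

template = (
    "From {{ pathlib.Path(doc.properties['path']).name }}. "
    "Title: {{ doc.properties['title'] }}. "
    "{{ doc.text_representation }}"
)

result = render_augmented_text(template, doc, {"pathlib": pathlib})
print(result)  # From q3.pdf. Title: Q3 Report. Revenue grew.
```

Passing modules explicitly keeps the template's capabilities narrow: only the names supplied in the mapping, plus doc, are available inside the sandbox.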

class sycamore.transforms.augment_text.UDFTextAugmentor(fn: Callable[[Document], str])[source]#

Bases: TextAugmentor

UDFTextAugmentor augments text by calling a user-defined function (UDF) that maps documents to strings.

Parameters:

fn (Callable[[Document], str]) – A function that maps a document to the string to use as the new text_representation

Example

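import sycamore
from sycamore.data import Document
from sycamore.transforms.augment_text import UDFTextAugmentor

def aug_text_fn(doc: Document) -> str:
    return " ".join([
        f"This pertains to the part {doc.properties['part_name']}.",
        f"{doc.text_representation}"
    ])

augmentor = UDFTextAugmentor(aug_text_fn)
context = sycamore.init()
# `paths` is assumed to be defined elsewhere (locations of the PDFs to read)
pdf_docset = (
    context.read.binary(paths, binary_format="pdf")
    .augment_text(augmentor)
)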
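
Because a UDF is just a function from a document to a string, it can be exercised without a running pipeline. The sketch below uses a hypothetical FakeDocument stand-in and a hypothetical apply_augmentor helper (neither is part of sycamore) to show one way to make a UDF tolerant of missing properties or an empty text representation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class FakeDocument:
    """Hypothetical stand-in for sycamore's Document, for illustration only."""
    text_representation: Optional[str] = None
    properties: dict[str, Any] = field(default_factory=dict)


def aug_text_fn(doc: FakeDocument) -> str:
    # Fall back to defaults instead of raising KeyError or emitting "None".
    part = doc.properties.get("part_name", "unknown")
    body = doc.text_representation or ""
    return f"This pertains to the part {part}. {body}".strip()


def apply_augmentor(doc: FakeDocument, fn: Callable[[FakeDocument], str]) -> FakeDocument:
    """Hypothetical helper: replace the text representation with the UDF's output."""
    doc.text_representation = fn(doc)
    return doc


doc = apply_augmentor(FakeDocument("Torque to 12 Nm.", {"part_name": "flange"}), aug_text_fn)
empty = apply_augmentor(FakeDocument(), aug_text_fn)
print(doc.text_representation)    # This pertains to the part flange. Torque to 12 Nm.
print(empty.text_representation)  # This pertains to the part unknown.
```

Checking a UDF against a few hand-built documents like this can catch missing-key and None-handling problems before the function is applied to a whole docset.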