Augment Text

class sycamore.transforms.augment_text.AugmentText(child: Node, text_augmentor: TextAugmentor, **kwargs)

Bases: NonCPUUser, NonGPUUser, Transform

The AugmentText transform adds metadata to the text representation of each document to improve embedding and search quality.
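To illustrate why adding metadata to the text can help retrieval, here is a toy sketch that uses only the standard library. The token_overlap function, the sample chunk, and the metadata are all hypothetical; real pipelines would use an embedding model rather than token counting, so treat this only as an intuition aid.

```python
def token_overlap(query: str, text: str) -> int:
    """Toy relevance score: how many distinct query tokens appear in the text."""
    text_tokens = set(text.lower().split())
    return sum(1 for tok in set(query.lower().split()) if tok in text_tokens)


# Hypothetical chunk and metadata, for illustration only.
chunk = "revenue grew 12% compared with the previous quarter"
metadata = {"title": "acme q3 earnings report"}

# Augmentation: put the metadata in front of the original text.
augmented = f"title: {metadata['title']}. {chunk}"

query = "acme q3 revenue"
plain_score = token_overlap(query, chunk)
augmented_score = token_overlap(query, augmented)
print(plain_score, augmented_score)
```

The augmented text matches more of the query because the title terms are now part of the text being searched.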

class sycamore.transforms.augment_text.JinjaTextAugmentor(template: str, modules: dict[str, Any] = {})

Bases: TextAugmentor

JinjaTextAugmentor renders a jinja template in a SandboxedEnvironment to build a new text representation from the document's metadata.

  • template (str) – A jinja2 template for the new text representation. It can reference doc and any module passed in the modules parameter

  • modules (dict[str, Any]) – A mapping of module names to module objects to make available inside the template
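As a rough sketch of how the two parameters interact, the snippet below renders a template directly with jinja2's SandboxedEnvironment. It is not JinjaTextAugmentor's actual implementation: the SimpleNamespace object is a hypothetical stand-in for a document, and passing doc and pathlib as render variables is an assumption about how the template sees them.

```python
import pathlib
from types import SimpleNamespace

from jinja2.sandbox import SandboxedEnvironment

# Hypothetical stand-in for a document with properties and text.
doc = SimpleNamespace(
    properties={"path": "/data/reports/q3.pdf", "title": "Q3 Report"},
    text_representation="Revenue grew.",
)

# The template refers to `doc` and to the `pathlib` module by name.
template = (
    "From {{ pathlib.Path(doc.properties['path']).name }}: "
    "{{ doc.properties['title'] }}. {{ doc.text_representation }}"
)

env = SandboxedEnvironment()
rendered = env.from_string(template).render(doc=doc, pathlib=pathlib)
print(rendered)
```

Only names supplied at render time are visible to the template, which is why any module the template uses has to be passed in explicitly.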


from sycamore.transforms.augment_text import JinjaTextAugmentor
from sycamore.transforms.regex_replace import COALESCE_WHITESPACE
import pathlib

# The template can reference `doc` and any module passed via `modules`.
template = '''This document is from {{ pathlib.Path(doc.properties['path']).name }}.
The title is {{ doc.properties['title'] }}.
The authors are {{ doc.properties['authors'] }}.
{% if doc.text_representation %}
    {{ doc.text_representation }}
{% else %}
    There is no text representation for this
{% endif %}
'''
aug = JinjaTextAugmentor(template=template, modules={"pathlib": pathlib})
# exp_docset is an existing DocSet; COALESCE_WHITESPACE tidies the whitespace left by the template.
aug_docset = exp_docset.augment_text(aug).regex_replace(COALESCE_WHITESPACE)
aug_docset.show(truncate_content=False)

class sycamore.transforms.augment_text.UDFTextAugmentor(fn: Callable[[Document], str])

Bases: TextAugmentor

UDFTextAugmentor augments text by calling a user-defined function (UDF) that maps documents to strings.


fn (Callable[[Document], str]) – A function that maps a document to the string to use as the new text_representation
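The contract for fn can be sketched without Sycamore installed. In the snippet below, FakeDocument and apply_udf are hypothetical stand-ins written for illustration; they only mimic the idea that the returned string becomes the new text_representation and do not reflect how UDFTextAugmentor is implemented.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a Sycamore Document (illustration only)."""
    text_representation: str
    properties: dict[str, Any] = field(default_factory=dict)


def apply_udf(doc: FakeDocument, fn: Callable[[FakeDocument], str]) -> FakeDocument:
    # The string returned by fn replaces the document's text.
    doc.text_representation = fn(doc)
    return doc


def part_prefix(doc: FakeDocument) -> str:
    # Example UDF: prepend a property to the existing text.
    return f"Part: {doc.properties['part_name']}. {doc.text_representation}"


doc = FakeDocument("Torque to 12 Nm.", {"part_name": "flange bolt"})
result = apply_udf(doc, part_prefix)
print(result.text_representation)
```

Because fn receives the whole document, it can combine any properties with the existing text; the only requirement shown in the signature is that it returns a string.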


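import sycamore
from sycamore.data import Document
from sycamore.transforms.augment_text import UDFTextAugmentor

def aug_text_fn(doc: Document) -> str:
    return " ".join([
        f"This pertains to the part {doc.properties['part_name']}.",
        f"{doc.text_representation}",
    ])

augmentor = UDFTextAugmentor(aug_text_fn)
context = sycamore.init()
# `paths` is a placeholder for the location(s) of the input PDFs.
pdf_docset = context.read.binary(paths, binary_format="pdf")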