Summarize

class sycamore.transforms.summarize.Summarizer[source]

Bases: ABC

class sycamore.transforms.summarize.LLMElementTextSummarizer(llm: LLM, element_filter: Callable[[Element], bool] | None = None)[source]

Bases: Summarizer

LLMElementTextSummarizer uses a specified LLM to summarize text data within elements of a document.

Parameters:
  • llm -- An instance of an LLM class to use for text summarization.

  • element_filter -- A callable that takes an element and returns True if that element should be summarized. Default is None.

Example

import sycamore
from sycamore.transforms.summarize import LLMElementTextSummarizer

llm_model = OpenAILanguageModel("gpt-3.5-turbo")
element_filter = my_element_selector  # A custom predicate: Element -> bool
summarizer = LLMElementTextSummarizer(llm_model, element_filter)

context = sycamore.init()
# Parentheses let the method chain span several lines.
pdf_docset = (
    context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .summarize(summarizer=summarizer)
)
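
Because element_filter is typed as Callable[[Element], bool], a filter can be written as an ordinary predicate. The following is a minimal sketch only: FakeElement is a local stand-in rather than Sycamore's Element class, and the attribute names type and text_representation, the "Text" label, and the 50-character threshold are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeElement:
    """Stand-in for an element; the field names are assumptions."""
    type: Optional[str] = None
    text_representation: Optional[str] = None


def long_text_only(element: FakeElement, min_chars: int = 50) -> bool:
    """Return True for text elements long enough to be worth summarizing."""
    text = element.text_representation or ""
    return element.type == "Text" and len(text) >= min_chars


elements = [
    FakeElement("Title", "Quarterly report"),
    FakeElement("Text", "x" * 120),
    FakeElement("Text", "too short"),
]
# Only the 120-character text element passes the predicate.
selected = [e for e in elements if long_text_only(e)]
```

A predicate of this shape could then be passed as the second argument of LLMElementTextSummarizer, in place of my_element_selector in the example above.
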

class sycamore.transforms.summarize.MultiStepDocumentSummarizer(llm: LLM, llm_mode: LLMMode | None = None, question: str | None = None, data_description: str | None = None, prompt: SycamorePrompt = <sycamore.llms.prompts.prompts.JinjaPrompt object>, fields: list[str | Type[EtCetera]] = [], tokenizer: Tokenizer = <sycamore.functions.tokenizer.CharacterTokenizer object>)[source]

Bases: Summarizer

Summarizes a document by constructing a tree of summaries. Each leaf contains as many consecutive elements as possible within the token limit, and each vertex of the tree contains as many sub-summaries as possible within the token limit. For example, with max_tokens=10:

Elements:  (3 tokens) - (3 tokens) - (5 tokens)    -     (8 tokens)
                 \         /              |                   |
Round 1:      (4 token summary) - (3 token summary) - (2 token summary)
                           \              |              /
Round 2:                          (5 token summary)

Parameters:
  • llm -- LLM to use for summarization

  • llm_mode -- How to call the LLM: SYNC, ASYNC, or BATCH. ASYNC is faster, but not all LLMs support it.

  • question -- Optional question to use as context for the summarization. If set, the LLM will attempt to answer the question with the data provided.

  • data_description -- Optional string describing the input documents.

  • prompt -- Prompt to use for each summarization. Caution: the default (MaxTokensHierarchyPrompt) encodes some fairly complicated logic that makes the tree construction work correctly.

  • fields -- List of fields to include in each element's representation in the prompt. Specify with dotted notation (e.g. properties.title). End the list with EtCetera to add all fields (previously specified fields go first). Default is [] which includes no fields.

  • tokenizer -- Tokenizer to use when computing how many tokens a prompt will take. Default is CharacterTokenizer.

batch_elements(baseline_tokens: int, elements: list[Element], etk_prompt: SycamorePrompt, document: Document) list[list[Element]][source]

Split the elements into consecutive batches, keeping the total token count of each batch within the summarizer's token limit. Returns the list of batches.
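
The batching idea can be sketched as a greedy pass over per-element token counts. This is an illustrative sketch, not the library's implementation: batch_by_token_limit is a hypothetical helper that works on plain integers instead of Element objects, and it assumes an element larger than the limit is placed in a batch of its own.

```python
def batch_by_token_limit(
    token_counts: list[int], max_tokens: int, baseline_tokens: int = 0
) -> list[list[int]]:
    """Greedily group consecutive token counts so each batch fits the limit."""
    batches: list[list[int]] = []
    current: list[int] = []
    used = baseline_tokens  # tokens the prompt costs before any element is added
    for count in token_counts:
        # Close the current batch when the next element would overflow it.
        if current and used + count > max_tokens:
            batches.append(current)
            current = []
            used = baseline_tokens
        current.append(count)
        used += count
    if current:
        batches.append(current)
    return batches


# The four elements from the max_tokens=10 example above.
batches = batch_by_token_limit([3, 3, 5, 8], max_tokens=10)
print(batches)  # [[3, 3], [5], [8]]
```

The three resulting batches correspond to the three first-round summaries in the tree diagram.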

summarize(document: Document) Document[source]

Summarize a document by summarizing groups of elements iteratively, in rounds, until only one element remains. That element holds the new summary.

summarize_one_round(document: Document, elements: list[Element], base_prompt: SycamorePrompt, etk_prompt: SycamorePrompt) list[Element][source]

Perform one round of element summarization: assemble batches containing as many elements as fit within the token limit, summarize each batch, attach the resulting summary to the first element of the batch, and return only those first elements.
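
Repeating such rounds until one item remains gives the overall loop. The sketch below is an assumption-laden illustration rather than Sycamore code: it works on plain strings instead of elements, counts one token per word, and uses a fake summarizer (first_words) in place of an LLM call. The names summarize_in_rounds, batch_texts, and max_rounds are hypothetical.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Toy tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def batch_texts(texts: list[str], max_tokens: int) -> list[list[str]]:
    """Greedily group consecutive texts so each batch fits in max_tokens."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        if current and used + n > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        batches.append(current)
    return batches


def summarize_in_rounds(
    texts: list[str],
    max_tokens: int,
    summarize_batch: Callable[[list[str]], str],
    max_rounds: int = 10,
) -> tuple[str, int]:
    """Summarize batches round by round until a single text remains."""
    if not texts:
        raise ValueError("nothing to summarize")
    rounds = 0
    while len(texts) > 1:
        if rounds >= max_rounds:
            # Guard against summaries that never shrink enough to merge.
            raise RuntimeError("no single summary after max_rounds rounds")
        texts = [summarize_batch(b) for b in batch_texts(texts, max_tokens)]
        rounds += 1
    return texts[0], rounds


def first_words(batch: list[str]) -> str:
    """Fake 'LLM': keep only the first word of each text in the batch."""
    return " ".join(text.split()[0] for text in batch)


docs = ["a b c", "d e f", "g h i j k", "l m n o p q r s"]  # 3, 3, 5, 8 tokens
summary, rounds = summarize_in_rounds(docs, max_tokens=10, summarize_batch=first_words)
print(summary, rounds)  # a g l 2
```

With these inputs the first round produces three summaries and the second round merges them into one, matching the two-level tree in the class description.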

sycamore.transforms.summarize.MaxTokensHierarchyPrompt

class sycamore.transforms.summarize.OneStepDocumentSummarizer(llm: LLM, question: str, tokenizer: Tokenizer = <sycamore.functions.tokenizer.CharacterTokenizer object>, fields: list[str | Type[EtCetera]] = [])[source]

Bases: Summarizer

Summarizes a document in a single LLM call by taking as much data as possible from every element, spread across them evenly. Intended for use with summarize_data, where a summarizer is used to summarize an entire docset.

Parameters:
  • llm -- LLM to use for summarization

  • question -- Question to use as context for the summary. The llm will attempt to use the data provided to answer the question.

  • tokenizer -- Tokenizer to use to count tokens (to not exceed the token limit). Default is CharacterTokenizer

  • fields -- List of fields to include from every element. To include any additional fields (after the ones specified), end the list with EtCetera. Default is empty list, which stands for 'no properties'

maximize_elements(doc: Document, data_independent_ntk: int, curr_ntk: int, prompt: SycamorePrompt) tuple[bool, int, int][source]

Stuff as many elements as possible into the prompt.

Parameters:
  • doc -- The document to operate on

  • data_independent_ntk -- How many tokens are in the prompt regardless of data

  • curr_ntk -- Current token count before adding elements

  • prompt -- the sycamore prompt to use to render and count tokens

Returns:

Whether we filled up the token limit, the total tokens after adding fields, and the number of elements to use

Return type:

(bool, int, int)
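
One way to read "taking evenly" is to add one more element from every sub-document at a time until the next layer would exceed the limit. The sketch below illustrates that reading under stated assumptions; it is not the library's code. take_evenly is a hypothetical helper, it takes precomputed per-element token counts instead of a Document and a prompt, and token_limit is an assumed parameter.

```python
def take_evenly(
    per_doc_token_counts: list[list[int]], curr_ntk: int, token_limit: int
) -> tuple[bool, int, int]:
    """Return (filled, total_tokens, n), where n elements are taken per document."""
    total = curr_ntk
    n_elements = 0
    longest = max((len(counts) for counts in per_doc_token_counts), default=0)
    for i in range(longest):
        # Cost of adding the i-th element of every document that has one.
        layer = sum(counts[i] for counts in per_doc_token_counts if i < len(counts))
        if total + layer > token_limit:
            return True, total, n_elements  # limit reached before all data fit
        total += layer
        n_elements = i + 1
    return False, total, n_elements  # everything fit with room to spare


docs = [[4, 4, 4], [2, 2], [5]]
print(take_evenly(docs, curr_ntk=10, token_limit=30))   # (True, 27, 2)
print(take_evenly(docs, curr_ntk=10, token_limit=100))  # (False, 31, 3)
```

The returned tuple mirrors the (bool, int, int) shape documented above: whether the limit was hit, the running token total, and how many elements to use.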

preprocess(doc: Document) Document[source]

Compute which fields and how many elements to include in the prompt.

  • First: If the specified fields include an EtCetera, add as many fields as possible.

  • Second: Add as many elements as possible, taking evenly from each document.

  • Third: If we can add all the elements and the specified fields include an EtCetera, add as many element fields as possible.

sycamore.transforms.summarize.OneStepSummarizerPrompt

class sycamore.transforms.summarize.EtCetera[source]

Sentinel value to sit at the end of a list of fields, signifying 'add as many additional properties as you can within the token limit'
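
The ordering rule for a fields list ending in the sentinel (named fields first, everything else after) can be sketched as follows. This is a simplified illustration: the EtCetera class here is a local stand-in, expand_fields and lookup are hypothetical helpers, and the token-limit cutoff that the real sentinel implies is deliberately left out.

```python
from typing import Any


class EtCetera:
    """Local stand-in for the sentinel class; used as a marker, never instantiated."""


def expand_fields(fields: list, available: list[str]) -> list[str]:
    """Named fields first; if the list ends with EtCetera, append the rest."""
    if fields and fields[-1] is EtCetera:
        named = list(fields[:-1])
        extras = [name for name in available if name not in named]
        return named + extras
    return list(fields)


def lookup(record: dict, dotted: str) -> Any:
    """Resolve a dotted field name such as 'properties.title' in nested dicts."""
    value: Any = record
    for part in dotted.split("."):
        value = value[part]
    return value


available = ["properties.title", "properties.author", "text_representation"]
ordered = expand_fields(["text_representation", EtCetera], available)
print(ordered)  # ['text_representation', 'properties.title', 'properties.author']

record = {"properties": {"title": "Report", "author": "A. Writer"}, "text_representation": "..."}
print(lookup(record, "properties.title"))  # Report
```

An empty list expands to no fields at all, matching the documented default.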

class sycamore.transforms.summarize.Summarize(child: Node, summarizer: Summarizer, **kwargs)[source]

Bases: NonCPUUser, NonGPUUser, Map

The summarize transform generates summaries of documents or elements.