Functions

class sycamore.functions.CharacterTokenizer(max_tokens: int | None = None)

class sycamore.functions.DrawBoxes(font_path: str | None = None, default_color: str = 'blue', draw_table_cells: bool = True)

DrawBoxes is a class for drawing bounding boxes around elements within images represented as Document objects.

This class is designed to enhance Document objects representing images with elements (e.g., text boxes, tables) by drawing bounding boxes around each element. It also allows you to customize the color mapping for different element types.

Parameters:

font_path -- The path to the TrueType font file to be used for labeling.
default_color -- The default color for bounding boxes when the element type is unknown.

Example

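import sycamore
from sycamore.functions import DrawBoxes, split_and_convert_to_image
from sycamore.transforms.partition import ArynPartitioner

context = sycamore.init()

font_path = "path/to/font.ttf"

# `paths` holds the PDF locations to read; the parentheses let the chained calls span lines.
pdf_docset = (
    context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .flat_map(split_and_convert_to_image)
    .map_batch(DrawBoxes, f_constructor_args=[font_path])
)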

class sycamore.functions.HuggingFaceTokenizer(model_name: str)

class sycamore.functions.OpenAITokenizer(model_name: str, max_tokens: int | None = None, lazy_load: bool = True)

class sycamore.functions.TextOverlapChunker(chunk_token_count: int = 1000, chunk_overlap_token_count: int = 100)

TextOverlapChunker is a class for chunking text into smaller segments while allowing for token overlap.

This class inherits from the Chunker class and divides a long sequence of text tokens into chunks, each containing at most a specified number of tokens. It allows for a controlled overlap of tokens between adjacent chunks.

Parameters:

chunk_token_count -- The maximum number of tokens to include in each chunk.
chunk_overlap_token_count -- The number of tokens that can overlap between adjacent chunks. This value must be less than the chunk_token_count to ensure meaningful chunking.

Example

from sycamore.functions import TextOverlapChunker

chunker = TextOverlapChunker(chunk_token_count=1000, chunk_overlap_token_count=100)
chunks = chunker.chunk(data)  # data: the tokens to split into chunks
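
To make the overlap behavior concrete, here is a minimal pure-Python sketch of overlapping chunking. It is an illustration under stated assumptions, not Sycamore's implementation: the helper name overlap_chunks is hypothetical, and it assumes each new chunk starts chunk_token_count - chunk_overlap_token_count tokens after the previous one.

```python
from typing import Any


def overlap_chunks(
    tokens: list[Any],
    chunk_token_count: int = 1000,
    chunk_overlap_token_count: int = 100,
) -> list[list[Any]]:
    """Hypothetical helper: split tokens into chunks that share some tokens."""
    # The overlap must be smaller than the chunk size, otherwise the
    # window would never advance.
    if not 0 <= chunk_overlap_token_count < chunk_token_count:
        raise ValueError("overlap must be non-negative and less than the chunk size")
    # Assumption: each chunk starts this many tokens after the previous one.
    step = chunk_token_count - chunk_overlap_token_count
    return [
        tokens[start:start + chunk_token_count]
        for start in range(0, len(tokens), step)
    ]


# Small numbers make the overlap easy to see.
chunks = overlap_chunks(list(range(10)), chunk_token_count=4, chunk_overlap_token_count=1)
print(chunks)
```

With a chunk size of 4 and an overlap of 1, every chunk begins with the last token of the chunk before it, and the final chunk may be shorter than the rest.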

class sycamore.functions.Tokenizer(max_tokens: int | None = None)

sycamore.functions.filter_elements(document: Document, filter_function: Callable[[Element], bool]) → list[Element]

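Filters the elements of a document, returning those for which filter_function returns True.

Parameters:

document -- Document for which the elements need to be filtered
filter_function -- A predicate that takes an Element and returns True for each element to keep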

Returns:

List of filtered elements
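
The filtering described above can be sketched with plain Python stand-ins. The Element and Document classes below are hypothetical simplifications defined only for this example (they are not Sycamore's classes), and filter_elements_sketch is an illustrative name, not the library function.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Element:
    """Hypothetical stand-in for an element; not sycamore's Element."""
    type: str
    text: Optional[str] = None


@dataclass
class Document:
    """Hypothetical stand-in for a document; not sycamore's Document."""
    elements: list[Element] = field(default_factory=list)


def filter_elements_sketch(
    document: Document, filter_function: Callable[[Element], bool]
) -> list[Element]:
    # Keep only the elements for which the predicate returns True,
    # preserving their original order.
    return [e for e in document.elements if filter_function(e)]


doc = Document(
    elements=[
        Element("Title", "Annual Report"),
        Element("table", "Revenue by quarter"),
        Element("Text", "Some body text"),
    ]
)
tables = filter_elements_sketch(doc, lambda e: e.type == "table")
print([e.text for e in tables])
```

Note that the result is a plain list of elements, matching the list[Element] return type in the signature; the input document is not modified in this sketch.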

sycamore.functions.reorder_elements(document: Document, *, comparator: Callable[[Element, Element], int] | None = None, key: Callable[[Element], Any] | None = None) → Document

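Reorders the elements of a document. Either comparator or key must be supplied.

Parameters:

document -- Document for which the elements need to be re-ordered
comparator -- A comparison function that takes two Elements and returns a negative, zero, or positive integer
key -- A key function that takes an Element, as per sorted()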

Returns:

Document with elements re-ordered
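
The difference between the comparator and key styles can be shown with the standard library alone. This is a sketch, not Sycamore's code: reorder_sketch is a hypothetical helper that works on plain dictionaries, and requiring exactly one of the two arguments is an assumption of the example.

```python
from functools import cmp_to_key
from typing import Any, Callable, Optional

Item = dict[str, Any]


def reorder_sketch(
    elements: list[Item],
    *,
    comparator: Optional[Callable[[Item, Item], int]] = None,
    key: Optional[Callable[[Item], Any]] = None,
) -> list[Item]:
    # Assumption of this sketch: exactly one ordering must be given.
    if (comparator is None) == (key is None):
        raise ValueError("supply exactly one of comparator or key")
    # A comparator is adapted into a key function for sorted().
    sort_key = key if key is not None else cmp_to_key(comparator)
    return sorted(elements, key=sort_key)


def reading_order(a: Item, b: Item) -> int:
    # Negative if a comes first, positive if b comes first, zero if tied.
    ka, kb = (a["page"], a["top"]), (b["page"], b["top"])
    return (ka > kb) - (ka < kb)


elements = [
    {"page": 2, "top": 0.1, "text": "c"},
    {"page": 1, "top": 0.7, "text": "b"},
    {"page": 1, "top": 0.2, "text": "a"},
]

by_key = reorder_sketch(elements, key=lambda e: (e["page"], e["top"]))
by_cmp = reorder_sketch(elements, comparator=reading_order)
print([e["text"] for e in by_key], [e["text"] for e in by_cmp])
```

A key function is usually the simpler choice when the order can be expressed as a tuple of fields; a comparator is useful when the ordering rule depends on both elements at once.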

sycamore.functions.split_and_convert_to_image(doc: Document) → list[Document]

Split a document into individual pages as images and convert them into Document objects.

This function takes a Document object, which may represent a multi-page document, and splits it into individual pages. Each page is converted into an image, and a new Document object is created for each page. The resulting list contains these new Document objects, each holding one page of the original document together with the elements that make up that page.

The input Document object should have a binary_representation attribute containing the binary data of the PDF document. Each page's elements are preserved in the new Document objects, and page-specific properties are updated to reflect the image's size, mode, and page number.

Parameters:

doc -- The input Document to split and convert.

Returns:

A list of Document objects, each representing a single page of the original document as an image and elements making up the page.

Example

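from sycamore.data import Document
from sycamore.functions import split_and_convert_to_image

# pdf_bytes holds the raw PDF data; elements is the list of elements already extracted from it.
input_doc = Document(binary_representation=pdf_bytes, elements=elements, properties={"author": "John Doe"})
page_docs = split_and_convert_to_image(input_doc)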
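
The per-page grouping of elements described above can be sketched without any PDF or imaging library. This illustrates only the grouping step, under a loudly stated assumption: that each element records the page it came from under a page_number key. The source does not say how elements are matched to pages, and group_elements_by_page is a hypothetical helper, not part of Sycamore.

```python
from collections import defaultdict
from typing import Any

Item = dict[str, Any]


def group_elements_by_page(elements: list[Item]) -> dict[int, list[Item]]:
    """Hypothetical helper: bucket elements by an assumed page_number key."""
    pages: dict[int, list[Item]] = defaultdict(list)
    for element in elements:
        # Assumption: every element carries the number of its page.
        pages[element["page_number"]].append(element)
    # Return the pages in ascending order, keeping element order within a page.
    return {number: pages[number] for number in sorted(pages)}


elements = [
    {"page_number": 2, "text": "B1"},
    {"page_number": 1, "text": "A1"},
    {"page_number": 2, "text": "B2"},
]
pages = group_elements_by_page(elements)
for number, page_elements in pages.items():
    print(number, [e["text"] for e in page_elements])
```

In a full pipeline, each group would be paired with the rendered image of its page to form one output Document per page.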