Document#

class sycamore.data.document.Document(document=None, /, **kwargs)[source]#

A Document is a generic representation of an unstructured document in a format such as PDF or HTML. Although different types of documents may have different properties, they all share the following common fields in Sycamore:

property bbox: BoundingBox | None#

Get the bounding box for this document.

property binary_representation: bytes | None#

The raw content of the document stored in the appropriate format. For example, the content of a PDF document will be stored as the binary_representation.

static deserialize(raw: bytes) Document[source]#

Deserialize from bytes to a Document.

property doc_id: str | None#

A unique identifier for the document. Defaults to None.

property elements: list[Element]#

A list of elements belonging to this document. A document does not necessarily have elements; for example, it may have none before it is chunked.

property embedding: list[float] | None#

Get the embedding for this document.

field_to_value(field: str) Any[source]#

Extracts the value for a particular document field.

Parameters:

field -- The name of the field, using dotted notation to indicate nesting, e.g. properties.schema

Returns:

The value associated with the document field. Returns None if the field does not exist in the document.
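
The dotted-notation lookup described above can be illustrated with a minimal sketch over nested dictionaries. The function name field_to_value_sketch and the sample data are hypothetical and exist only for illustration; this is not Sycamore's implementation, only one way such a lookup could behave under the contract stated here (nested access, None when the field is missing).

```python
from typing import Any


def field_to_value_sketch(data: dict[str, Any], field: str) -> Any:
    """Resolve a dotted field path against nested dictionaries.

    Hypothetical stand-in for illustration; not Sycamore's implementation.
    """
    current: Any = data
    for key in field.split("."):
        # Give up as soon as the path leaves a dictionary or a key is missing.
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical document contents, stored as a plain dictionary.
doc_data = {
    "doc_id": "doc-1",
    "properties": {"schema": {"title": "str"}, "author": "A. Writer"},
}

schema = field_to_value_sketch(doc_data, "properties.schema")
missing = field_to_value_sketch(doc_data, "properties.publisher")
```

Here schema holds the nested dictionary stored under properties, and missing is None because the path does not exist.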

static from_row(row: dict[str, bytes]) Document[source]#

Deserialize a Ray row back into a Document.

property lineage_id: str#

A unique identifier for the document in its lineage.

property parent_id: str | None#

In Sycamore, certain operations create parent-child relationships between documents. For example, the explode transform promotes elements to be top-level documents, and these documents retain a pointer to the document from which they were created using the parent_id field. For those documents which have no parent, parent_id is None.
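
The parent-child relationship can be sketched with plain dictionaries. The helper explode_sketch below is a hypothetical stand-in, not the real explode transform: it only shows promoted elements receiving their own identifier and a parent_id that points back to the originating document. Generating identifiers with uuid4 is an assumption made for the example.

```python
import uuid
from typing import Any


def explode_sketch(doc: dict[str, Any]) -> list[dict[str, Any]]:
    """Promote each element to a top-level document that points to its parent.

    Hypothetical illustration only; not Sycamore's explode transform.
    """
    return [
        # Copy the element's fields, then add a fresh ID and the parent pointer.
        {**element, "doc_id": str(uuid.uuid4()), "parent_id": doc["doc_id"]}
        for element in doc["elements"]
    ]


# A hypothetical source document with no parent of its own.
original = {
    "doc_id": "doc-1",
    "parent_id": None,
    "elements": [
        {"type": "Title", "text_representation": "Quarterly Report"},
        {"type": "Text", "text_representation": "Revenue grew."},
    ],
}

promoted = explode_sketch(original)
```

Every entry in promoted carries parent_id "doc-1", while the original document keeps parent_id None.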

property properties: dict[str, Any]#

A collection of system- or customer-defined properties. For instance, a PDF document might have title and author properties.

serialize() bytes[source]#

Serialize this document to bytes.
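
A round trip through serialize() and deserialize() can be sketched with a small stand-in class. SketchDocument is hypothetical, and the use of pickle as the byte format is an assumption made for the example; this page does not specify how Sycamore encodes the bytes.

```python
import pickle
from typing import Any


class SketchDocument:
    """Hypothetical dictionary-backed stand-in for a document."""

    def __init__(self, **fields: Any) -> None:
        self.data: dict[str, Any] = dict(fields)

    def serialize(self) -> bytes:
        # Assumption: pickle is used here purely for illustration.
        return pickle.dumps(self.data)

    @staticmethod
    def deserialize(raw: bytes) -> "SketchDocument":
        # Rebuild a new instance from the bytes produced by serialize().
        return SketchDocument(**pickle.loads(raw))


doc = SketchDocument(
    doc_id="doc-1",
    type="pdf",
    text_representation="hello",
    properties={"title": "Example"},
)
raw = doc.serialize()
restored = SketchDocument.deserialize(raw)
```

The restored object is a distinct instance whose fields equal those of the original. As with any pickle-based format, bytes from untrusted sources should not be deserialized.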

property text_representation: str | None#

The text representation of the document.

to_row() dict[str, bytes][source]#

Serialize this document into a row for use with Ray.
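
The to_row() and from_row() pair can be pictured as wrapping the serialized document in a single-column row. In the sketch below, the column name "doc", the helper names, and the pickle encoding are all assumptions made for illustration; only the dict[str, bytes] row shape comes from the signatures on this page.

```python
import pickle
from typing import Any

ROW_KEY = "doc"  # hypothetical column name, chosen for this example


def to_row_sketch(data: dict[str, Any]) -> dict[str, bytes]:
    """Pack document fields into a one-column row of bytes."""
    return {ROW_KEY: pickle.dumps(data)}


def from_row_sketch(row: dict[str, bytes]) -> dict[str, Any]:
    """Unpack the document fields from a row built by to_row_sketch."""
    return pickle.loads(row[ROW_KEY])


# Hypothetical document fields.
fields = {"doc_id": "doc-1", "type": "html", "properties": {"author": "A. Writer"}}

row = to_row_sketch(fields)
recovered = from_row_sketch(row)
```

Keeping the whole document in one bytes column means the row schema stays fixed regardless of which properties a given document carries.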

property type: str | None#

The type of the document, e.g. pdf, html.

update_lineage_id()[source]#

Update the lineage ID with a new identifier.

class sycamore.data.document.DocumentSource(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

class sycamore.data.document.HierarchicalDocument(document=None, **kwargs)[source]#

property children: list[HierarchicalDocument]#

Returns this document's children.
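
Because each child is itself a hierarchical document, the children property describes a tree. A depth-first walk over such a tree can be sketched as follows; SketchNode and walk are hypothetical names used only for this illustration and are not part of Sycamore.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class SketchNode:
    """Hypothetical stand-in for a document that holds child documents."""

    doc_id: str
    children: list["SketchNode"] = field(default_factory=list)


def walk(node: SketchNode) -> Iterator[SketchNode]:
    """Yield the node first, then each subtree in order (pre-order traversal)."""
    yield node
    for child in node.children:
        yield from walk(child)


root = SketchNode(
    "report",
    [
        SketchNode("section-1", [SketchNode("section-1.1")]),
        SketchNode("section-2"),
    ],
)

visited = [node.doc_id for node in walk(root)]
# visited == ["report", "section-1", "section-1.1", "section-2"]
```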

property elements: list[Element]#

A list of elements belonging to this document. A document does not necessarily have elements; for example, it may have none before it is chunked.

class sycamore.data.document.MetadataDocument(document=None, **kwargs)[source]#

property binary_representation#

The raw content of the document stored in the appropriate format. For example, the content of a PDF document will be stored as the binary_representation.

property elements: list[Element]#

A list of elements belonging to this document. A document does not necessarily have elements; for example, it may have none before it is chunked.

property lineage_id: str#

A unique identifier for the document in its lineage.

property metadata: dict[str, Any]#

Internal metadata about processing.

property properties#

A collection of system- or customer-defined properties. For instance, a PDF document might have title and author properties.

property text_representation#

The text representation of the document.

class sycamore.data.document.OpenSearchQuery(document=None, **kwargs)[source]#

static deserialize(raw: bytes) OpenSearchQuery[source]#

Deserialize from bytes to an OpenSearchQuery.

property headers: dict[str, Any] | None#

Dict of additional headers to send to the OpenSearch endpoint.

property index: str | None#

OpenSearch index.

property params: dict[str, Any] | None#

Dict of additional parameters to send to the OpenSearch endpoint.

property query: dict[str, Any] | None#

OpenSearch query body.

class sycamore.data.document.OpenSearchQueryResult(document=None, **kwargs)[source]#

static deserialize(raw: bytes) OpenSearchQueryResult[source]#

Deserialize from bytes to an OpenSearchQueryResult.

property generated_answer: str | None#

The RAG-generated answer.

property hits: list[Element]#

List of documents retrieved by the query.

property query: dict[str, Any] | None#

The unmodified query used.

property result: Any | None#

Raw result from OpenSearch.