Document

class sycamore.data.document.Document(document=None, /, **kwargs)[source]

A Document is a generic representation of an unstructured document in a format such as PDF or HTML. Although different types of documents may have different properties, they all share the following common fields in Sycamore:

property bbox

Get the bounding box for this document.

property binary_representation

The raw content of the document stored in the appropriate format. For example, the content of a PDF document will be stored as the binary_representation.

static deserialize(raw: bytes) Document[source]

Deserialize a Document from bytes.

property doc_id

A unique identifier for the document. Defaults to None.

property elements

A list of elements belonging to this document. A document does not necessarily have elements, for instance before it has been chunked.

property embedding

Get the embedding for this document.

field_to_value(field: str) Any[source]

Extracts the value for a particular document field.

Parameters:

field -- The field name, using dotted notation to indicate nesting, e.g. properties.schema

Returns:

The value associated with the document field. Returns None if the field does not exist in the document.
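
The dotted-notation lookup described above can be sketched on a plain nested dict. This is an illustrative stand-in, not Sycamore's implementation: the helper below, the sample dict, and its values are all assumptions made for the example. With a real Document, the corresponding call would be doc.field_to_value("properties.schema").

```python
from typing import Any


def field_to_value(data: dict, field: str) -> Any:
    """Hypothetical stand-in: walk a nested dict along a dotted path."""
    current: Any = data
    for key in field.split("."):
        # Give up as soon as the path leaves the nested-dict structure.
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Plain dict standing in for a Document's fields (illustrative values only).
doc = {
    "doc_id": "doc-1",
    "type": "pdf",
    "properties": {"title": "Annual Report", "schema": {"author": "string"}},
}

schema = field_to_value(doc, "properties.schema")      # nested value
missing = field_to_value(doc, "properties.publisher")  # None: no such field
```

Returning None for a missing field mirrors the documented behavior, which also means a caller cannot distinguish an absent field from one explicitly set to None.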

static from_row(row: dict[str, bytes]) Document[source]

Deserialize a Ray row back into a Document.

property lineage_id

A unique identifier for the document in its lineage.

property parent_id

In Sycamore, certain operations create parent-child relationships between documents. For example, the explode transform promotes elements to be top-level documents, and these documents retain a pointer to the document from which they were created using the parent_id field. For those documents which have no parent, parent_id is None.
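
The parent-child relationship can be illustrated with plain dicts. The explode function below is a hypothetical stand-in for the transform of the same name, written only to show how parent_id is carried over; the way it generates child doc_id values with uuid4 is an assumption, not a statement about Sycamore's behavior.

```python
import uuid


def explode(parent: dict) -> list[dict]:
    """Hypothetical stand-in: promote each element to a top-level document."""
    children = []
    for element in parent.get("elements", []):
        child = dict(element)                  # shallow copy of the element's fields
        child["doc_id"] = str(uuid.uuid4())    # assumed ID scheme, for illustration
        child["parent_id"] = parent["doc_id"]  # pointer back to the source document
        children.append(child)
    return children


# A top-level document has no parent, so its parent_id is None.
parent = {
    "doc_id": "doc-1",
    "parent_id": None,
    "elements": [
        {"type": "Title", "text_representation": "Annual Report"},
        {"type": "Text", "text_representation": "Revenue grew this year."},
    ],
}

children = explode(parent)
```

Each child keeps the fields of the element it came from and can be traced back to its source through parent_id.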

property properties

A collection of system- or customer-defined properties. For instance, a PDF document might have title and author properties.

serialize() bytes[source]

Serialize this document to bytes.

property text_representation

The text representation of the document.

to_row() dict[str, bytes][source]

Serialize this document into a row for use with Ray.
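
The serialize/deserialize and to_row/from_row pairs are inverses of each other, which the following sketch shows on a plain dict. It assumes a pickle-based encoding and a single row key named "doc"; both are assumptions chosen for illustration, and the functions are stand-ins rather than the library's methods.

```python
import pickle


def serialize(data: dict) -> bytes:
    """Stand-in: encode document data as bytes (pickle is an assumption)."""
    return pickle.dumps(data)


def deserialize(raw: bytes) -> dict:
    """Stand-in: decode bytes produced by serialize()."""
    return pickle.loads(raw)


def to_row(data: dict) -> dict[str, bytes]:
    """Stand-in: wrap the serialized bytes in a one-column row.

    The column name "doc" is hypothetical.
    """
    return {"doc": serialize(data)}


def from_row(row: dict[str, bytes]) -> dict:
    """Stand-in: recover document data from a row built by to_row()."""
    return deserialize(row["doc"])


doc = {"doc_id": "doc-1", "type": "pdf", "properties": {"title": "Annual Report"}}

raw = serialize(doc)
restored = deserialize(raw)  # equal to the original data
row = to_row(doc)
from_ray = from_row(row)     # equal to the original data
```

Keeping the row as a mapping from string to bytes matches the documented dict[str, bytes] signature and leaves the document opaque to the execution engine.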

property type

The type of the document, e.g. pdf, html.

update_lineage_id()[source]

Update the lineage ID with a new identifier.

class sycamore.data.document.HierarchicalDocument(document=None, **kwargs)[source]

property children

Returns this document's children.

property elements

class sycamore.data.document.MetadataDocument(document=None, **kwargs)[source]

property binary_representation

property elements

property lineage_id

A unique identifier for the document in its lineage.

property metadata

Internal metadata about processing.

property properties

property text_representation

class sycamore.data.document.OpenSearchQuery(document=None, **kwargs)[source]

static deserialize(raw: bytes) OpenSearchQuery[source]

Deserialize from bytes to an OpenSearchQuery.

property headers

Dict of additional headers to send to the OpenSearch endpoint.

property index

OpenSearch index.

property params

Dict of additional parameters to send to the OpenSearch endpoint.

property query

OpenSearch query body.
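
As a rough illustration of how the four fields fit together, the sketch below assembles them in a plain dict. The index name, the match clause, and the parameter and header values are all hypothetical; only the field names (index, query, params, headers) come from the class above.

```python
import json

# Plain dict standing in for an OpenSearchQuery's fields (illustrative values).
query_fields = {
    "index": "example-index",                          # hypothetical index name
    "query": {                                         # OpenSearch query body
        "query": {"match": {"text_representation": "revenue"}},
        "size": 5,
    },
    "params": {"timeout": "30s"},                      # extra request parameters
    "headers": {"Content-Type": "application/json"},   # extra request headers
}

# The query body is what would be sent as the JSON payload of a search request.
body = json.dumps(query_fields["query"])
```

Keeping params and headers separate from the query body reflects the split in the field descriptions: the body describes what to search for, while the other two adjust how the request is sent.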

class sycamore.data.document.OpenSearchQueryResult(document=None, **kwargs)[source]

static deserialize(raw: bytes) OpenSearchQueryResult[source]

Deserialize from bytes to an OpenSearchQueryResult.

property generated_answer

RAG-generated answer.

property hits

List of documents retrieved by the query.

property query

The unmodified query used.

property result

Raw result from OpenSearch.