Document
- class sycamore.data.document.Document(document=None, /, **kwargs)
A Document is a generic representation of an unstructured document in a format such as PDF or HTML. Although different types of documents may have different properties, they all contain the following common fields in Sycamore:
- property bbox: BoundingBox | None
Get the bounding box for this document.
- property binary_representation: bytes | None
The raw content of the document stored in the appropriate format. For example, the content of a PDF document will be stored as the binary_representation.
- property doc_id: str | None
A unique identifier for the document. Defaults to None.
- property elements: list[Element]
A list of elements belonging to this document. A document does not necessarily have elements; for instance, it has none before it is chunked.
- property embedding: list[float] | None
Get the embedding for this document.
- field_to_value(field: str) -> Any
Extracts the value for a particular document field.
- Parameters:
field -- The field name in dotted notation, where dots indicate nesting, e.g. properties.schema
- Returns:
The value associated with the document field. Returns None if the field does not exist in the document.
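The dotted-notation lookup described for field_to_value can be pictured with a small stand-alone sketch. It uses a plain nested dict rather than a Sycamore Document, and lookup_dotted is a hypothetical helper written for illustration only; it is not the library's implementation.

```python
from typing import Any


def lookup_dotted(data: dict[str, Any], field: str) -> Any:
    """Hypothetical helper: resolve a dotted field path against nested dicts."""
    current: Any = data
    for key in field.split("."):
        # Give up as soon as a level is missing or is not a mapping.
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Illustrative data only; the keys mirror the fields documented above.
doc = {
    "doc_id": "doc-1",
    "properties": {"schema": {"title": "str"}, "author": "A. Writer"},
}

schema = lookup_dotted(doc, "properties.schema")
missing = lookup_dotted(doc, "properties.no_such_field")
```

In this sketch, a path that cannot be fully resolved yields None, which matches the documented behavior for a field that does not exist.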
- static from_row(row: dict[str, bytes]) -> Document
Deserialize a Ray row back into a Document.
- property lineage_id: str
A unique identifier for the document in its lineage.
- property parent_id: str | None
In Sycamore, certain operations create parent-child relationships between documents. For example, the explode transform promotes elements to top-level documents, and these documents retain a pointer to the document from which they were created in the parent_id field. For documents that have no parent, parent_id is None.
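The parent-child relationship described for parent_id can be sketched with plain dicts. This is an assumption-laden illustration, not Sycamore's explode transform: explode_sketch is a hypothetical function, the element types and text are made up, and the real transform may shape its output differently.

```python
import uuid
from typing import Any


def explode_sketch(parent: dict[str, Any]) -> list[dict[str, Any]]:
    """Hypothetical stand-in for explode: promote elements to documents."""
    children = []
    for element in parent.get("elements", []):
        child = dict(element)  # shallow copy, so the parent's element is untouched
        child["doc_id"] = str(uuid.uuid4())
        child["parent_id"] = parent["doc_id"]  # pointer back to the source document
        children.append(child)
    return children


# Illustrative data only.
parent = {
    "doc_id": "doc-1",
    "parent_id": None,  # a document with no parent
    "elements": [
        {"type": "Title", "text_representation": "Report"},
        {"type": "Text", "text_representation": "Body text."},
    ],
}

children = explode_sketch(parent)
```

Each promoted document carries the parent's doc_id in its parent_id field, while the original document's parent_id stays None.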
- property properties: dict[str, Any]
A collection of system- or customer-defined properties. For instance, a PDF document might have title and author properties.
- property text_representation: str | None
The text representation of the document.
- property type: str | None
The type of the document, e.g. pdf, html.
- class sycamore.data.document.DocumentSource(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
- class sycamore.data.document.HierarchicalDocument(document=None, **kwargs)
- property children: list[HierarchicalDocument]
Returns this document's children.
- property elements: list[Element]
A list of elements belonging to this document. A document does not necessarily have elements; for instance, it has none before it is chunked.
- class sycamore.data.document.MetadataDocument(document=None, **kwargs)
- property binary_representation
The raw content of the document stored in the appropriate format. For example, the content of a PDF document will be stored as the binary_representation.
- property elements: list[Element]
A list of elements belonging to this document. A document does not necessarily have elements; for instance, it has none before it is chunked.
- property lineage_id: str
A unique identifier for the document in its lineage.
- property metadata: dict[str, Any]
Internal metadata about processing.
- property properties
A collection of system- or customer-defined properties. For instance, a PDF document might have title and author properties.
- property text_representation
The text representation of the document.
- class sycamore.data.document.OpenSearchQuery(document=None, **kwargs)
- static deserialize(raw: bytes) -> OpenSearchQuery
Deserialize from bytes to an OpenSearchQuery.
- property headers: dict[str, Any] | None
Dict of additional headers to send to the OpenSearch endpoint.
- property index: str | None
OpenSearch index.
- property params: dict[str, Any] | None
Dict of additional parameters to send to the OpenSearch endpoint.
- property query: dict[str, Any] | None
OpenSearch query body.
- class sycamore.data.document.OpenSearchQueryResult(document=None, **kwargs)
- static deserialize(raw: bytes) -> OpenSearchQueryResult
Deserialize from bytes to an OpenSearchQueryResult.
- property generated_answer: str | None
RAG-generated answer.
- property hits: list[Element]
List of documents retrieved by the query.
- property query: dict[str, Any] | None
The unmodified query used.
- property result: Any | None
Raw result from OpenSearch.