DocSetWriter

class sycamore.writer.DocSetWriter(context: Context, plan: Node)

Contains interfaces for writing to external storage systems, most notably OpenSearch.

Users should not instantiate this class directly. Instead, access an instance through sycamore.docset.DocSet.write.

files(path: str, filesystem: FileSystem | None = None, filename_fn: Callable[[Document], str] = default_filename, doc_to_bytes_fn: Callable[[Document], bytes] = default_doc_to_bytes, **resource_args) -> None

Writes the content of each Document to a separate file.

Parameters:
  • path – The path prefix to write to. Should include the scheme if not local.

  • filesystem – The pyarrow.fs FileSystem to use.

  • filename_fn – A function for generating a file name. Takes a Document and returns a unique name that will be appended to path.

  • doc_to_bytes_fn – A function from a Document to bytes for generating the data to write. Defaults to using text_representation if available, or binary_representation if not.

  • resource_args – Arguments to pass to the underlying execution environment.
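The two callbacks described above can be sketched as plain functions. This is a minimal illustration, not the library's own defaults: it assumes a Document exposes doc_id, text_representation, and binary_representation attributes, and it uses SimpleNamespace stand-ins so that it runs without Sycamore installed. The function names and the ".txt" suffix are hypothetical.

```python
from types import SimpleNamespace
from typing import Any


def filename_by_doc_id(doc: Any) -> str:
    """Build a file name from the document's ID (assumed attribute: doc_id)."""
    return f"{doc.doc_id}.txt"


def text_or_binary(doc: Any) -> bytes:
    """Prefer text_representation, then fall back to binary_representation."""
    text = getattr(doc, "text_representation", None)
    if text is not None:
        return text.encode("utf-8")
    binary = getattr(doc, "binary_representation", None)
    if binary is not None:
        return binary
    # Assumption: failing loudly is preferable to writing an empty file.
    raise ValueError("document has neither text nor binary content")


# Stand-in objects that mimic the assumed Document attributes.
text_doc = SimpleNamespace(
    doc_id="doc-1", text_representation="hello", binary_representation=None
)
binary_doc = SimpleNamespace(
    doc_id="doc-2", text_representation=None, binary_representation=b"\x00\x01"
)

print(filename_by_doc_id(text_doc), text_or_binary(text_doc))
print(filename_by_doc_id(binary_doc), text_or_binary(binary_doc))
```

With a real DocSet, functions like these would be passed as the filename_fn and doc_to_bytes_fn keyword arguments of files(). Because each name is appended to path, filename_fn should return a distinct value per Document.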

json(path: str, filesystem: FileSystem | None = None, **resource_args) -> None

Writes Documents in JSONL format to files, one file per block. Typically, a block corresponds to a single pre-explode source document.

Parameters:
  • path – The path prefix to write to. Should include the scheme if not local.

  • filesystem – The pyarrow.fs FileSystem to use.

  • resource_args – Arguments to pass to the underlying execution environment.
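Since JSONL stores one JSON object per line, the output can be read back with the standard library alone. The sketch below assumes the files were written to a local directory and makes no assumption about their names or extensions. The helper names, the sample file names, and the doc_id field are hypothetical, and the fields Sycamore actually serializes are not specified here.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List


def iter_jsonl(path: Path) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of a file."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def load_jsonl_dir(directory: Path) -> List[Dict]:
    """Read every file in a directory (sorted by name) as JSONL."""
    records: List[Dict] = []
    for file_path in sorted(directory.iterdir()):
        if file_path.is_file():
            records.extend(iter_jsonl(file_path))
    return records


# Simulate two per-block output files, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    (out_dir / "block_0.json").write_text(
        '{"doc_id": "a"}\n{"doc_id": "b"}\n', encoding="utf-8"
    )
    (out_dir / "block_1.json").write_text('{"doc_id": "c"}\n', encoding="utf-8")
    records = load_jsonl_dir(out_dir)

print([record["doc_id"] for record in records])
```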

opensearch(*, os_client_args: dict, index_name: str, index_settings: dict | None = None, **resource_args) -> None

Writes the content of the DocSet into the specified OpenSearch index.

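Parameters:
  • os_client_args – Keyword arguments used to configure the OpenSearch client, such as hosts and http_auth.

  • index_name – The name of the OpenSearch index to write to.

  • index_settings – Optional settings and mappings for the index, given as a dict.

  • resource_args – Arguments to pass to the underlying execution environment.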

Example

The following code shows how to read a PDF dataset into a DocSet and write it to a local OpenSearch index called my_index.

import sycamore
from sycamore.transforms.partition import UnstructuredPdfPartitioner

os_client_args = {
    "hosts": [{"host": "localhost", "port": 9200}],
    "http_auth": ("user", "password"),
}

index_settings = {
    "body": {
        "settings": {
            "index.knn": True,
        },
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {"name": "hnsw", "engine": "faiss"},
                },
            },
        },
    },
}

context = sycamore.init()

# `paths` is the location of the input PDFs and must be defined beforehand.
pdf_docset = (
    context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
)

pdf_docset.write.opensearch(
    os_client_args=os_client_args,
    index_name="my_index",
    index_settings=index_settings,
)
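In OpenSearch, the dimension of a knn_vector field has to equal the length of the vectors stored in it, so it can help to build index_settings from a single parameter instead of hard-coding 384. The helper below is a hypothetical convenience, not part of Sycamore. It reproduces the dict shape used in the example above and assumes the same hnsw and faiss method settings.

```python
from typing import Any, Dict


def knn_index_settings(dimension: int, field: str = "embedding") -> Dict[str, Any]:
    """Build index settings with one k-NN vector field of the given dimension."""
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    return {
        "body": {
            "settings": {"index.knn": True},
            "mappings": {
                "properties": {
                    field: {
                        "type": "knn_vector",
                        "dimension": dimension,
                        "method": {"name": "hnsw", "engine": "faiss"},
                    },
                },
            },
        },
    }


# Same settings as the example above, for 384-dimensional embeddings.
index_settings = knn_index_settings(384)
print(index_settings["body"]["mappings"]["properties"]["embedding"])
```

The resulting dict can be passed as the index_settings argument of opensearch().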