DocSet#
- class sycamore.docset.DocSet(context: Context, plan: Node)[source]#
A DocSet, short for “Document Set”, is a distributed collection of documents bundled together for processing. Sycamore provides a variety of transformations on DocSets to help customers handle unstructured data easily.
- augment_text(augmentor: TextAugmentor, **resource_args) DocSet [source]#
Augments text_representation with external information.
- Parameters:
augmentor (TextAugmentor) -- A TextAugmentor instance that defines how to augment the text
Example
augmentor = FStringTextAugmentor(sentences = [
    "This pertains to the part {doc.properties['part_name']}.",
    "{doc.text_representation}"
])
entity_extractor = OpenAIEntityExtractor("part_name", llm=openai_llm,
                                         prompt_template=part_name_template)
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_entity(entity_extractor)
    .explode()
    .augment_text(augmentor)
- clear_materialize(path: Path | str | None = None, *, clear_non_local=False) None [source]#
Deletes all of the materialized files referenced by the docset.
If path is specified, PurePath.match is used to check whether it matches the directory used by each materialize transform; only matching directories will be cleared.
Set clear_non_local=True to clear non-local filesystems. Note that filesystems like NFS/CIFS count as local, while pyarrow.fs.SubTreeFileSystem is treated as non-local.
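Example (a minimal sketch; the preceding materialize step and the local path "/tmp/materialized" are illustrative assumptions)
context = sycamore.init()
ds = (context.read.binary(paths, binary_format="pdf")
      .partition(partitioner=ArynPartitioner())
      .materialize(path="/tmp/materialized"))
ds.execute()
# Delete the cached files in the matching materialize directory so the
# next execution recomputes everything from scratch.
ds.clear_materialize(path="/tmp/materialized")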
- count(include_metadata=False, **kwargs) int [source]#
Counts the number of documents in the resulting dataset. It is a convenient way to determine the size of the dataset generated by the plan.
- Parameters:
include_metadata -- Determines whether or not to count MetaDataDocuments
**kwargs --
- Returns:
The number of documents in the docset.
Example
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .count()
- count_distinct(field: str, **kwargs) int [source]#
Counts the number of documents in the resulting dataset with a unique value for field.
- Parameters:
field -- Field (in dotted notation) to perform a unique count based on.
**kwargs --
- Returns:
The number of documents with a unique value for field.
Example
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .count_distinct("doc_id")
- embed(embedder: Embedder, **kwargs) DocSet [source]#
Applies the Embed transform on the Docset.
- Parameters:
embedder -- An instance of an Embedder class that defines the embedding method to be applied.
Example
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformerEmbedder(batch_size=100, model_name=model_name)
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .explode()
    .embed(embedder=embedder)
- execute(**kwargs) None [source]#
Execute the pipeline, discard the results. Useful for side effects.
- explode(**resource_args) DocSet [source]#
Applies the Explode transform on the Docset.
Example
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .explode()
- extract_batch_schema(schema_extractor: SchemaExtractor, **kwargs) DocSet [source]#
Extracts a common schema from the documents in this DocSet.
This transform is similar to extract_schema, except that it will add the same schema to each document in the DocSet rather than inferring a separate schema per Document. This is most suitable for document collections that share a common format. If you have a heterogeneous document collection and want a different schema for each type, consider using extract_schema instead.
- Parameters:
schema_extractor -- A SchemaExtractor instance to extract the schema for each document.
Example
openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
schema_extractor = OpenAISchemaExtractor("Corporation", llm=openai_llm, num_of_elements=35)
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_batch_schema(schema_extractor=schema_extractor)
- extract_document_structure(structure: DocumentStructure, **kwargs)[source]#
Represents documents as hierarchical documents organized by their structure.
- Parameters:
structure -- An instance of DocumentStructure that determines how documents are organized
Example
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_document_structure(structure=StructureBySection)
    .explode()
- extract_entity(entity_extractor: EntityExtractor, **kwargs) DocSet [source]#
Applies the ExtractEntity transform on the Docset.
- Parameters:
entity_extractor -- An instance of an EntityExtractor class that defines the entity extraction method to be applied.
Example
title_context_template = "template"
openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
entity_extractor = OpenAIEntityExtractor("title", llm=openai_llm,
                                         prompt_template=title_context_template)
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_entity(entity_extractor=entity_extractor)
- extract_graph_entities(extractors: list[GraphEntityExtractor] = [], **kwargs) DocSet [source]#
Extracts entities from document children. Entities are stored as nodes within each child of a document.
- Parameters:
extractors -- A list of GraphEntityExtractor objects that determine how entities are extracted
Example
- extract_graph_relationships(extractors: list[GraphRelationshipExtractor] = [], **kwargs) DocSet [source]#
Extracts relationships from document children. Relationships are stored within the nodes they reference within each child of a document.
- Parameters:
extractors -- A list of GraphRelationshipExtractor objects that determine how relationships are extracted
Example
- extract_properties(property_extractor: PropertyExtractor, **kwargs) DocSet [source]#
Extracts properties from each Document in this DocSet based on the _schema property.
The schema can be computed using extract_schema or extract_batch_schema or can be provided manually in JSON-schema format in the _schema field under Document.properties.
Example
openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
property_extractor = OpenAIPropertyExtractor(llm=openai_llm)
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_properties(property_extractor)
- extract_schema(schema_extractor: SchemaExtractor, **kwargs) DocSet [source]#
Extracts a JSON schema of extractable properties from each document in this DocSet.
Each schema is a mapping of names to types that corresponds to fields that are present in the document. For example, calling this method on a financial document containing information about companies might yield a schema like
{ "company_name": "string", "revenue": "number", "CEO": "string" }
This method will extract a unique schema for each document in the DocSet independently. If the documents in the DocSet represent instances with a common schema, consider ExtractBatchSchema which will extract a common schema for all documents.
The dataset is returned with an additional _schema property that contains the JSON-encoded schema, if one is detected.
- Parameters:
schema_extractor -- A SchemaExtractor instance to extract the schema for each document.
Example
openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
schema_extractor = OpenAISchemaExtractor("Corporation", llm=openai_llm, num_of_elements=35)
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .extract_schema(schema_extractor=schema_extractor)
- field_in(docset2: DocSet, field1: str, field2: str, **kwargs) DocSet [source]#
Joins two DocSets based on the specified fields; this DocSet (self) is filtered based on the values in docset2.
SQL Equivalent: SELECT * FROM docset1 WHERE field1 IN (SELECT field2 FROM docset2);
- Parameters:
docset2 -- DocSet to filter against.
field1 -- Field in this DocSet (docset1) to filter based on.
field2 -- Field in docset2 containing the values to match against.
- Returns:
A left semi-join between docset (self) and docset2.
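Example (a minimal sketch; the field names are hypothetical)
# Keep only documents in docset1 whose properties.company value also
# appears as properties.name in docset2.
filtered = docset1.field_in(docset2, field1="properties.company", field2="properties.name")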
- filter(f: Callable[[Document], bool], **kwargs) DocSet [source]#
Applies the Filter transform on the Docset.
- Parameters:
f -- A callable function that takes a Document object and returns a boolean indicating whether the document should be included in the filtered Docset.
Example
def custom_filter(doc: Document) -> bool:
    # Define your custom filtering logic here.
    return doc.some_property == some_value

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .filter(custom_filter)
- filter_elements(f: Callable[[Element], bool], **resource_args) DocSet [source]#
Applies the given filter function to each element in each Document in this DocSet.
- Parameters:
f -- A Callable that takes an Element and returns True if the element should be retained.
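Example (a minimal sketch; the filtering criterion is illustrative)
def keep_text_elements(element: Element) -> bool:
    # Retain only elements that have a text representation.
    return element.text_representation is not None

pdf_docset = (context.read.binary(paths, binary_format="pdf")
              .partition(partitioner=ArynPartitioner())
              .filter_elements(keep_text_elements))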
- flat_map(f: Callable[[Document], list[Document]], **resource_args) DocSet [source]#
Applies the FlatMap transformation on the Docset.
- Parameters:
f -- The function to apply to each document.
See the FlatMap documentation for advanced features.
Example
def custom_flat_mapping_function(document: Document) -> list[Document]:
    # Custom logic to transform the document and return a list of documents
    return [transformed_document_1, transformed_document_2]

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .flat_map(custom_flat_mapping_function)
- groupby_count(field: str, unique_field: str | None = None, **kwargs) DocSet [source]#
Performs a count aggregation on a DocSet.
- Parameters:
field -- Field to aggregate based on.
unique_field -- Determines what makes a unique document.
**kwargs --
- Returns:
A DocSet with "properties.key" (unique values of document field) and "properties.count" (frequency counts for unique values).
- limit(limit: int = 20, **kwargs) DocSet [source]#
Applies the Limit transform on the Docset.
- Parameters:
limit -- The maximum number of documents to include in the resulting Docset.
Example
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .explode()
    .limit()
- llm_cluster_entity(llm: LLM, instruction: str, field: str, **kwargs) DocSet [source]#
Normalizes a particular field of a DocSet by identifying "groups" and assigning each document to one.
- Parameters:
llm -- LLM client.
instruction -- Instruction about groups to form, e.g. 'Form groups for different types of food'
field -- Field to make/assign groups based on, e.g. 'properties.entity.food'
- Returns:
A DocSet with an additional field "properties._autogen_ClusterAssignment" that contains the assigned group. For example, if "properties.entity.food" has values 'banana', 'milk', 'yogurt', 'chocolate', 'orange', "properties._autogen_ClusterAssignment" would contain values like 'fruit', 'dairy', and 'dessert'.
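Example (a minimal sketch reusing the grouping described above)
openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
clustered = docset.llm_cluster_entity(
    llm=openai_llm,
    instruction="Form groups for different types of food",
    field="properties.entity.food",
)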
- llm_filter(llm: LLM, new_field: str, prompt: list[dict] | str, field: str = 'text_representation', threshold: int = 3, keep_none: bool = False, use_elements: bool = False, similarity_query: str | None = None, similarity_scorer: SimilarityScorer | None = None, **resource_args) DocSet [source]#
Filters the DocSet to keep only documents whose score (determined by the LLM) is greater than or equal to the specified threshold value.
- Parameters:
llm -- LLM to use.
new_field -- The field that will be added to the DocSet with the outputs.
prompt -- LLM prompt.
field -- Document field to filter based on.
threshold -- If the value of the computed result is an integer value greater than or equal to this threshold, the document will be kept.
keep_none -- keep records with a None value for the provided field to filter on. Warning: using this might hide data corruption issues.
use_elements -- use contents of a document's elements to filter as opposed to document level contents.
similarity_query -- query string to compute similarity against. Also requires a 'similarity_scorer'.
similarity_scorer -- scorer used to generate similarity scores used in element sorting. Also requires a 'similarity_query'.
**resource_args --
- Returns:
A filtered DocSet.
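Example (a minimal sketch; the prompt, field name, and threshold are illustrative)
openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
filtered = docset.llm_filter(
    llm=openai_llm,
    new_field="relevance_score",
    prompt="Score from 0 to 5 how relevant this text is to aircraft incidents.",
    field="text_representation",
    threshold=4,
)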
- llm_query(query_agent: LLMTextQueryAgent, **kwargs) DocSet [source]#
Executes an LLM Query on a specified field (element or document), and returns the response
- Parameters:
prompt -- A prompt to be passed into the underlying LLM execution engine
llm -- The LLM Client to be used here. It is defined as an instance of the LLM class in Sycamore.
output_property -- (Optional, default="llm_response") The output property of the document or element to add results in.
format_kwargs -- (Optional, default="None") If passed in, details the formatting details that must be passed into the underlying Jinja Sandbox.
number_of_elements -- (Optional, default="None") When "per_element" is true, limits the number of elements to add an "output_property". Otherwise, the response is added to the entire document using a limited prefix subset of the elements.
llm_kwargs -- (Optional, default={}) Keyword arguments to be passed into the underlying LLM execution engine.
per_element -- (Optional) Whether to execute the LLM query on each element individually rather than once per document.
element_type -- (Optional) Parameter to only execute the LLM query on a particular element type. If not specified, the query will be executed on all elements.
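Example (a minimal sketch; it assumes LLMTextQueryAgent accepts the parameters documented above)
openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
query_agent = LLMTextQueryAgent(
    prompt="Summarize the key findings in this text.",
    llm=openai_llm,
    output_property="summary",
)
queried_docset = docset.llm_query(query_agent=query_agent)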
- map(f: Callable[[Document], Document], **resource_args) DocSet [source]#
Applies the Map transformation on the Docset.
- Parameters:
f -- The function to apply to each document.
See the Map documentation for advanced features.
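Example (a minimal sketch; the added property is illustrative)
def add_filetype(doc: Document) -> Document:
    # Attach a property to every document.
    doc.properties["filetype"] = "pdf"
    return doc

mapped_docset = docset.map(add_filetype)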
- map_batch(f: Callable[[list[Document]], list[Document]], f_args: Iterable[Any] | None = None, f_kwargs: dict[str, Any] | None = None, f_constructor_args: Iterable[Any] | None = None, f_constructor_kwargs: dict[str, Any] | None = None, **resource_args) DocSet [source]#
The map_batch transform is similar to map, except that it processes a list of documents and returns a list of documents. map_batch is ideal for transformations that get performance benefits from batching.
See the MapBatch documentation for advanced features.
Example
def custom_map_batch_function(documents: list[Document]) -> list[Document]:
    # Custom logic to transform the documents
    return transformed_documents

map_ds = input_ds.map_batch(f=custom_map_batch_function)

class CustomMappingClass:
    def __init__(self, arg1, arg2, *, kw_arg1=None, kw_arg2=None):
        self.arg1 = arg1
        # ...

    def _process(self, doc: Document) -> Document:
        doc.properties["arg1"] = self.arg1
        return doc

    def __call__(self, docs: list[Document], fnarg1, *, fnkwarg1=None) -> list[Document]:
        return [self._process(d) for d in docs]

map_ds = input_ds.map_batch(f=CustomMappingClass,
                            f_args=["fnarg1"],
                            f_kwargs={"fnkwarg1": "stuff"},
                            f_constructor_args=["arg1", "arg2"],
                            f_constructor_kwargs={"kw_arg1": 1, "kw_arg2": 2})
- map_elements(f: Callable[[Element], Element], **resource_args) DocSet [source]#
Applies the given mapping function to each element in each Document in this DocSet.
- Parameters:
f -- A Callable that takes an Element and returns an Element. Elements for which f evaluates to None are dropped.
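Example (a minimal sketch; the mapping logic is illustrative)
def tag_short_elements(element: Element) -> Element:
    # Mark elements whose text representation is short.
    if element.text_representation is not None and len(element.text_representation) < 20:
        element.properties["short"] = True
    return element

mapped_docset = docset.map_elements(tag_short_elements)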
- mark_bbox_preset(tokenizer: Tokenizer, token_limit: int = 512, **kwargs) DocSet [source]#
- Convenience composition of:
SortByPageBbox
MarkDropTiny minimum=2
MarkDropHeaderFooter top=0.05 bottom=0.05
MarkBreakPage
MarkBreakByColumn
MarkBreakByTokens limit=512
Meant to work in concert with MarkedMerger.
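Example (a minimal sketch; the tokenizer choice is illustrative and the no-argument MarkedMerger constructor is an assumption)
tokenizer = OpenAITokenizer("gpt-3.5-turbo")
marked_docset = (context.read.binary(paths, binary_format="pdf")
                 .partition(partitioner=ArynPartitioner())
                 .mark_bbox_preset(tokenizer=tokenizer, token_limit=512)
                 .merge(merger=MarkedMerger()))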
- markdown(**kwargs) DocSet [source]#
Modifies Document to have a single Element containing the Markdown representation of all the original elements.
Example
context = sycamore.init()
ds = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .markdown()
    .explode()
- materialize(path: Path | str | dict | None = None, source_mode: MaterializeSourceMode = MaterializeSourceMode.RECOMPUTE) DocSet [source]#
The materialize transform writes out documents up to that point, marks the materialized path as successful if execution is successful, and allows for reading from the materialized data as a source. This transform is helpful if you are using show and take() as part of a notebook to incrementally inspect output. You can use materialize to avoid re-computation.
- path: a Path or string that represents the "directory" for the materialized elements. The filesystem and naming convention will be inferred. The dictionary variant allows finer control and supports {root=Path|str, fs=pyarrow.fs, name=lambda Document -> str, clean=True, tobin=Document.serialize()}; root is required.
- source_mode: how this materialize step should be used as an input:
- RECOMPUTE: (default) the transform does not act as a source; previous transforms will be recomputed.
- USE_STORED: if the materialize step has successfully run to completion, or if it has no prior step, use the stored contents of the directory as the inputs. No previous transform will be computed. WARNING: if you change the input files or any of the steps before the materialize step, you need to use clear_materialize() or change the source_mode to force re-execution.
Note: you can write the source mode as MaterializeSourceMode.SOMETHING after importing MaterializeSourceMode, or as sycamore.MATERIALIZE_SOMETHING after importing sycamore.
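Example (a minimal sketch; the local materialize directory is hypothetical)
context = sycamore.init()
ds = (context.read.binary(paths, binary_format="pdf")
      .partition(partitioner=ArynPartitioner())
      .materialize(path="/tmp/partitioned", source_mode=MaterializeSourceMode.USE_STORED))
# The first execution writes the partitioned documents to the directory;
# later executions read them back instead of re-running the partitioner.
ds.execute()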
- merge(merger: ElementMerger, **kwargs) DocSet [source]#
Applies the merge operation on each Document's list of elements in the Docset.
Example
from transformers import AutoTokenizer
tk = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
merger = GreedyElementMerger(tk, 512)
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .merge(merger=merger)
- partition(partitioner: Partitioner, table_extractor: TableExtractor | None = None, **kwargs) DocSet [source]#
Applies the Partition transform on the Docset.
More information can be found in the Partition documentation.
Example
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
- query(query_executor: QueryExecutor, **resource_args) DocSet [source]#
Applies a query execution transform on a DocSet of queries.
- Parameters:
query_executor -- Implementation for the query execution.
- random_sample(fraction: float, seed: int | None = None) DocSet [source]#
Retain a random sample of documents from this DocSet.
The number of documents in the output will be approximately fraction * self.count()
- Parameters:
fraction -- The fraction of documents to retain.
seed -- Optional seed to use for the RNG.
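Example (a minimal sketch)
# Keep roughly 10% of the documents, using a fixed seed for reproducibility.
sampled_docset = docset.random_sample(fraction=0.1, seed=42)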
- regex_replace(spec: list[tuple[str, str]], **kwargs) DocSet [source]#
Performs regular expression replacement (using re.sub()) on the text_representation of every Element in each Document.
Example
from sycamore.transforms import COALESCE_WHITESPACE
ds = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .regex_replace(COALESCE_WHITESPACE)
    .regex_replace([(r"\d+", "1313"), (r"old", "new")])
    .explode()
- rerank(similarity_scorer: SimilarityScorer, query: str, score_property_name: str = '_rerank_score', limit: int | None = None, **kwargs) DocSet [source]#
Sorts a DocSet based on the scores computed by a similarity scoring class.
- Parameters:
similarity_scorer -- An instance of a SimilarityScorer class that executes the scoring function.
query -- The query string to compute similarity against.
score_property_name -- The name of the key where the score will be stored in document.properties.
limit -- Limit scoring and sorting to a fixed size.
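Example (a minimal sketch; SentenceTransformerSimilarityScorer is an assumed scorer implementation and its constructor arguments may differ)
similarity_scorer = SentenceTransformerSimilarityScorer()
reranked_docset = docset.rerank(
    similarity_scorer=similarity_scorer,
    query="Deadliest aviation accidents",
    limit=20,
)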
- resolve_graph_entities(resolvers: list[EntityResolver] = [], resolve_duplicates=True, **kwargs) DocSet [source]#
Resolves graph entities across documents so that duplicate entities can be resolved to the same entity based on the criteria of EntityResolver objects.
- Parameters:
resolvers -- A list of EntityResolvers that are used to determine which entities are duplicates
resolve_duplicates -- Whether exact duplicate entities and relationships should be merged. Defaults to True
Example
- show(limit: int = 20, show_elements: bool = True, num_elements: int = -1, show_binary: bool = False, show_embedding: bool = False, truncate_content: bool = True, truncate_length: int = 100, stream=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>) None [source]#
Prints the content of the docset in a human-readable format. It is useful for debugging and inspecting the contents of objects during development.
- Parameters:
limit -- The maximum number of items to display.
show_elements -- Whether to display individual elements or not.
num_elements -- The number of elements to display. Use -1 to show all elements.
show_binary -- Whether to display binary data or not.
show_embedding -- Whether to display embedding information or not.
truncate_content -- Whether to truncate long content when displaying.
truncate_length -- The maximum length of content to display when truncating.
stream -- The output stream where the information will be displayed.
Example
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .show()
- sketch(window: int = 17, number: int = 16, **kwargs) DocSet [source]#
For each Document, uses shingling to hash sliding windows of the text_representation. The set of shingles is called the sketch. Documents' sketches can be compared to determine if they have near-duplicate content.
- Parameters:
window -- Number of bytes in the sliding window that is hashed (default: 17)
number -- Count of hashes comprising a shingle (default: 16)
Example
ds = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .explode()
    .sketch(window=17)
- sort(descending: bool, field: str, default_val: Any | None = None) DocSet [source]#
Sort DocSet by specified field.
- Parameters:
descending -- Whether to sort in descending order (largest values first).
field -- Document field to sort by, in dotted notation, e.g. properties.filetype
default_val -- Default value to use if the field does not exist in a Document
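Example (a minimal sketch; the property name is hypothetical)
# Sort documents by page count, largest first; documents missing the
# property are treated as having a value of 0.
sorted_docset = docset.sort(descending=True, field="properties.page_count", default_val=0)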
- split_elements(tokenizer: Tokenizer, max_tokens: int = 512, **kwargs) DocSet [source]#
Splits elements if they are larger than the maximum number of tokens.
Example
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .split_elements(tokenizer=tokenizer, max_tokens=512)
    .explode()
- spread_properties(props: list[str], **resource_args) DocSet [source]#
Copies listed properties from parent document to child elements.
Example
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .spread_properties(["title"])
    .explode()
- summarize(summarizer: Summarizer, **kwargs) DocSet [source]#
Applies the Summarize transform on the Docset.
Example
llm_model = OpenAILanguageModel("gpt-3.5-turbo")
element_operator = my_element_selector  # A custom element selection function
summarizer = LLMElementTextSummarizer(llm_model, element_operator)
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .summarize(summarizer=summarizer)
- take(limit: int = 20, include_metadata: bool = False, **kwargs) list[Document] [source]#
Returns up to limit documents from the dataset.
- Parameters:
limit -- The maximum number of Documents to return.
- Returns:
A list of up to limit Documents from the Docset.
Example
context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .take()
- take_all(limit: int | None = None, include_metadata: bool = False, **kwargs) list[Document] [source]#
Returns all of the rows in this DocSet.
If limit is set, this method will raise an error if this Docset has more than limit Documents.
- Parameters:
limit -- The number of Documents above which this method will raise an error.
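Example (a minimal sketch)
# Materialize every document, but raise an error if there are more than 1000.
docs = pdf_docset.take_all(limit=1000)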
- term_frequency(tokenizer: Tokenizer, with_token_ids: bool = False, **kwargs) DocSet [source]#
For each document, computes a frequency table over the text representation, as tokenized by tokenizer. Useful for enabling hybrid search in Pinecone.
Example
tk = OpenAITokenizer("gpt-3.5-turbo")
context = sycamore.init()
context.read.binary(paths, binary_format="pdf")
    .partition(ArynPartitioner())
    .explode()
    .term_frequency(tokenizer=tk)
    .show()
- top_k(llm: LLM | None, field: str, k: int | None, descending: bool = True, llm_cluster: bool = False, unique_field: str | None = None, llm_cluster_instruction: str | None = None, **kwargs) DocSet [source]#
Determines the top k occurrences for a document field.
- Parameters:
llm -- LLM client.
field -- Field to determine top k occurrences of.
k -- Number of top occurrences. If k is not specified, all occurrences are returned.
llm_cluster_instruction -- Instruction about the operation's purpose, e.g. 'Find most common cities'
descending -- Indicates whether to return most or least frequent occurrences.
llm_cluster -- Indicates whether an LLM should be used to normalize values of document field.
unique_field -- Determines what makes a unique document.
**kwargs --
- Returns:
A DocSet with "properties.key" (unique values of document field) and "properties.count" (frequency counts for unique values) which is sorted based on descending and contains k records.
- transform(cls: Type[Transform], **kwargs) DocSet [source]#
Add specified transform class to pipeline. See the API reference section on transforms.
- Parameters:
cls -- Class of transform to instantiate into pipeline
... -- Other keyword arguments are passed to class constructor
Example
from sycamore.transforms import FooBar
ds = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .transform(cls=FooBar, arg=123)
- with_properties(property_map: Mapping[str, Callable[[Document], Any]], **resource_args) DocSet [source]#
Adds multiple properties to each Document.
- Parameters:
property_map -- A mapping of property names to functions to generate those properties
Example
docset.with_properties({
    "text_size": lambda doc: len(doc.text_representation),
    "truncated_text": lambda doc: doc.text_representation[0:256]
})
- with_property(name, f: Callable[[Document], Any], **resource_args) DocSet [source]#
Applies a function to each document and adds the result as a property.
- Parameters:
name -- The name of the property to add to each Document.
f -- The function to apply to each Document.
Example
To add a property that contains the length of the text representation:
docset.with_property("text_size", lambda doc: len(doc.text_representation))
- property write: DocSetWriter#
Exposes an interface for writing a DocSet to OpenSearch or other external storage. See DocSetWriter for more information about writers and their arguments.
Example
The following example shows reading a DocSet from a collection of PDFs, partitioning it using the ArynPartitioner, and then writing it to a new OpenSearch index.
os_client_args = {
    "hosts": [{"host": "localhost", "port": 9200}],
    "http_auth": ("user", "password"),
}

index_settings = {
    "body": {
        "settings": {
            "index.knn": True,
        },
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {"name": "hnsw", "engine": "faiss"},
                },
            },
        },
    },
}

context = sycamore.init()
pdf_docset = context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())

pdf_docset.write.opensearch(
    os_client_args=os_client_args,
    index_name="my_index",
    index_settings=index_settings)