OpenSearch#

OpenSearch is an open-source flexible, scalable full-text search engine that is based off a 2021 fork of Elasticsearch. OpenSearch makes it easy to build hybrid search applications with clear in-built functionality and strucutre.

Configuration for OpenSearch#

Please see OpenSearch's installation page for more in-depth information on installing, configuring, and running OpenSearch. We specify the setup required to run a simple demo app.

For local development and testing, we recommend running OpenSearch through docker compose. The provided compose.yml file runs OpenSearch, which has an associated low-level Python library that makes querying easier.

compose.yml
version: '3'
services:
  opensearch:
    image: opensearchproject/opensearch:2.10.0
    container_name: opensearch
    environment:
      - discovery.type=single-node
      - bootstrap.memory_lock=true # Disable JVM heap memory swapping
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
    ports:
      - 9200:9200 # REST API

With this you can run OpenSearch with a simple docker compose up.

Writing to OpenSearch#

To write a DocSet to a OpenSearch index from Sycamore, use the docset.write.opensearch(...) function. The OpenSearch writer takes the following arguments:

  • os_client_args: Keyword parameters that are passed to the opensearch-py OpenSearch client constructor.

  • index_name: The name of the OpenSearch index into which to load this DocSet.

  • index_settings: Settings and mappings to pass when creating a new index. Specified as a Python dict corresponding to the JSON paramters taken by the OpenSearch CreateIndex API, more information is given here.

  • execute: (optional, default=True) Whether to execute this sycamore pipeline now, or return a docset to add more transforms.

To write a docset to the OpenSearch index run by the Docker compose above, we can write the following:

index_name = "test_index-other"

os_client_args = {
    "hosts": [{"host": "localhost", "port": 9200}],
    "http_auth": ("user", "password"),
}

index_settings = {
    "body": {
        "settings": {
            "index.knn": True,
        },
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {"name": "hnsw", "engine": "faiss"},
                },
            },
        },
    },
}
docset.write.opensearch(
    os_client_args=os_client_args,
    index_name=index_name,
    index_settings=index_settings,
)

More information can be found in the API documentation. A demo of the writer can also be found in the demo notebook.

Reading from OpenSearch#

In addition to the os_client_args and index_name arguments above, reading from OpenSearch takes in an optional query parameter, which takes in a dictionary using the OpenSearch query DSL (further information is given here). Note that if the parameter is not specified, the function will return a full scan of all documents in the index.

ctx = sycamore.init()
ctx.read.opensearch(os_client_args=os_client_args, index_name=index_name, query={"query": {"term": {"_id": "SAMPLE-DOC-ID"}}})

More information can be found in the API documentation.