Elasticsearch
Elasticsearch is a full-featured, multitenant-capable full-text search engine. Elasticsearch makes it easy to build hybrid search applications with extensive in-built functionality and support for RAG tooling.
Configuration for Elasticsearch
Please see Elasticsearch's installation page for more in-depth information on installing, configuring, and running Elasticsearch. Here we describe only the setup required to run a simple demo app.
For local development and testing, we recommend running Elasticsearch through Docker Compose. The provided compose.yml file runs Elasticsearch, which has an associated low-level Python client library that makes querying easier.
compose.yml
version: "3.8"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.14.2
    ports:
      - 9200:9200
    restart: on-failure
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms4g -Xmx4g
    ulimits:
      memlock:
        soft: -1
        hard: -1
With this you can run Elasticsearch with a simple docker compose up.
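Once the container is up, you can verify that the instance is reachable before pointing Sycamore at it. Below is a minimal sketch using the low-level elasticsearch Python client; it assumes the single-node, security-disabled setup from the compose file above.

from elasticsearch import Elasticsearch

# Assumes the instance from compose.yml is listening on localhost:9200
# with xpack security disabled, as configured above.
client = Elasticsearch("http://localhost:9200")

# info() returns basic cluster metadata; a successful call confirms the instance is reachable.
print(client.info()["version"]["number"])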
Writing to Elasticsearch
To write a DocSet to an Elasticsearch index from Sycamore, use the docset.write.elasticsearch(...) function. The Elasticsearch writer takes the following arguments:
url: Connection endpoint for the Elasticsearch instance. Note that this must be paired with the necessary client arguments below.
index_name: Index name to write to in the Elasticsearch instance.
es_client_args: (optional) Authentication arguments to be specified (if needed). See more information here.
wait_for_completion: (optional, default="false") Whether to wait for completion of the write before proceeding with next steps. See more information and valid values [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-refresh.html).
mappings: (optional) Mappings of the Elasticsearch index.
settings: (optional) Settings of the Elasticsearch index.
execute: (optional, default=True) Whether to execute this Sycamore pipeline now, or return a DocSet to which more transforms can be added.
To write a DocSet to the local Elasticsearch index run by the Docker Compose file above, we can write the following:
# ds is an existing Sycamore DocSet, e.g. the output of an earlier pipeline.
url = "http://localhost:9200"
index_name = "test_index-write"
wait_for_completion = "wait_for"
ds.write.elasticsearch(url=url, index_name=index_name, wait_for_completion=wait_for_completion)
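If the target cluster requires authentication, or you want to control the index mapping, the optional arguments can be passed alongside the connection details. The sketch below is illustrative only: the cluster URL, API key, and mapping fields are assumptions, not values required by Sycamore.

# Hypothetical secured cluster: the URL, API key, and mapping below are placeholders.
url = "https://my-es-cluster.example.com:9243"
es_client_args = {"api_key": "YOUR_API_KEY"}
mappings = {
    "properties": {
        "embedding": {"type": "dense_vector"},
        "text_representation": {"type": "text"},
    }
}
ds.write.elasticsearch(
    url=url,
    index_name="test_index-write",
    es_client_args=es_client_args,
    mappings=mappings,
    wait_for_completion="wait_for",
)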
More information can be found in the API documentation. A demo of the writer can also be found in the demo notebook.
Reading from Elasticsearch
Reading from an Elasticsearch index takes the index_name, url, and es_client_args arguments, with the same specification and defaults as above. The reader paginates through all search results. It also takes the arguments below:
query: (optional) Query to perform on the index. Note that this must be specified in the Elasticsearch Query DSL as a dictionary. If not specified, it defaults to a full scan of the index. See more information [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).
keep_alive: (optional) Keep-alive time for the point-in-time search context. Defaults to 1 minute if not specified.
kwargs: (optional) Parameters to configure the underlying Elasticsearch search query. See more information [here](https://elasticsearch-py.readthedocs.io/en/v8.14.0/api/elasticsearch.html#elasticsearch.Elasticsearch.search).
To read from an Elasticsearch index into a Sycamore DocSet, use the following code:
import sycamore

ctx = sycamore.init()
url = "http://localhost:9200"
index_name = "test_index-read"
# Match a single document by its _id using an Elasticsearch term query.
target_doc_id = "target"
query_params = {"term": {"_id": target_doc_id}}
query_docs = ctx.read.elasticsearch(url=url, index_name=index_name, query=query_params).take_all()
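To read an entire index rather than the results of a specific query, omit the query argument; keep_alive and the pass-through kwargs can be tuned for larger scans. The sketch below is a rough example under those assumptions; the index name and parameter values are illustrative.

# Full scan of the index, with a longer point-in-time keep-alive and a larger
# page size forwarded to the underlying Elasticsearch search call via kwargs.
all_docs = ctx.read.elasticsearch(
    url="http://localhost:9200",
    index_name="test_index-read",
    keep_alive="5m",
    size=500,
).take_all()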
More information can be found in the API documentation.