Querying Data with Sycamore
In addition to processing documents and loading data stores, Sycamore can implement sophisticated query pipelines over unstructured, semi-structured, and structured data. These query pipelines can use the full range of Sycamore operations to transform, filter, and aggregate data. This provides a powerful abstraction that goes beyond conventional query languages (like SQL) and LLM-based data retrieval techniques (like RAG).
Overview
Sycamore Query consists of a few components, which are found in the
sycamore.query package, {doc}`documented here </sycamore/APIs/query>`.
The SycamoreQueryClient class is the primary interface to the Sycamore
Query engine. It is configured with a pointer to an underlying data source
(currently, an OpenSearch index). The SycamoreQueryClient.query() method
allows one to query the data source using a natural-language query, getting
a Sycamore DocSet as a result.
Here is a simple example:
from sycamore.query.client import SycamoreQueryClient
# If no data source is specified, the OpenSearch server running on localhost:9200 is used.
client = SycamoreQueryClient()
# Generate a query plan and run it for the given OpenSearch index.
result = client.query(
    query="How many incidents were reported in bad weather conditions?",
    index="const_ntsb")
Under the hood, Sycamore Query uses an LLM to turn the natural-language query
into a query plan: a pipeline of Sycamore operators that retrieve, transform,
and aggregate data. You can also use the LogicalPlan class to build query
plans directly, without the help of an LLM.
The range of operators supported by Sycamore Query is quite broad, including
filtering, aggregation, group-by, count, sorting, and mathematical operations.
Sycamore Query also supports LLM-powered query operators, such as
LlmFilter, which uses an LLM to filter data, and SummarizeData, which
takes a Sycamore DocSet as input, and uses the LLM to produce a natural language
response given a prompt.
One can think of Sycamore Query as a much more sophisticated form of RAG, using the power of Sycamore's data-processing operations alongside the semantic power of an LLM, allowing you to run queries that would be impossible for a RAG system to answer reliably.
The Sycamore Query UI
The directory apps/query-ui in the Sycamore tree contains a web-based UI to
Sycamore Query, making it easy to experiment with queries, inspect query plans,
and debug the results. To start it, run the following in a checkout of
the Sycamore source tree:
cd apps/query-ui
poetry install
poetry run queryui/main.py
By default, the UI will query OpenSearch running locally on port 9200.
Sycamore Query Plans
The query plans generated by Sycamore Query are instances of the LogicalPlan
class and represent a tree of operators that fetch, filter, aggregate, or process data
to produce a final result. You can inspect the query plan generated for a given query
in the UI (as described above) or by using the SycamoreQueryClient.generate_plan()
method.
For example, a query plan for a query such as "What is the breakdown of aircraft types for incidents with substantial damage?" might be as follows:
{
    "nodes": {
        "0": QueryDatabase(
            node_id=0,
            description="Get all the incident reports with substantial aircraft damage",
            input=None,
            index="const_ntsb",
            query={"match": {"properties.entity.aircraftDamage": "Substantial"}}
        ),
        "1": TopK(
            node_id=1,
            description="Get the breakdown of aircraft types",
            input=[0],
            field="properties.entity.aircraft",
            primary_field="properties.entity.accidentNumber",
            K=100,
            descending=False)
    }
}
This plan queries OpenSearch for records matching the given OpenSearch query
and feeds the results to a TopK operation that performs a group-by on the
aircraft type.
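To make the data flow concrete, here is a minimal plain-Python sketch of what a plan like this computes. It does not use the Sycamore API: the functions query_database and top_k, the flat record layout, and the sample values are all hypothetical stand-ins, and it assumes that primary_field identifies distinct records so that repeated entries for the same accident are counted once.

```python
from collections import defaultdict

# Toy records standing in for OpenSearch hits; all values are made up for illustration.
records = [
    {"accidentNumber": "A1", "aircraft": "Cessna 172", "aircraftDamage": "Substantial"},
    {"accidentNumber": "A2", "aircraft": "Piper PA-28", "aircraftDamage": "Substantial"},
    {"accidentNumber": "A3", "aircraft": "Cessna 172", "aircraftDamage": "Substantial"},
    {"accidentNumber": "A4", "aircraft": "Cessna 172", "aircraftDamage": "Minor"},
    # A second entry for accident A3, to show de-duplication by primary_field.
    {"accidentNumber": "A3", "aircraft": "Cessna 172", "aircraftDamage": "Substantial"},
]


def query_database(rows, field, value):
    """Stand-in for node 0: keep rows whose field equals the given value."""
    return [row for row in rows if row[field] == value]


def top_k(rows, field, primary_field, k, descending):
    """Stand-in for node 1: group by field and count distinct primary_field values."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[field]].add(row[primary_field])
    counts = [(name, len(ids)) for name, ids in groups.items()]
    counts.sort(key=lambda item: item[1], reverse=descending)
    return counts[:k]


substantial = query_database(records, "aircraftDamage", "Substantial")
breakdown = top_k(substantial, "aircraft", "accidentNumber", k=100, descending=False)
print(breakdown)
```

The output of the first step is the input of the second, which mirrors the input=[0] link between the two nodes in the plan.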
SycamoreQueryClient uses an LLM and knowledge of the OpenSearch
index schema to generate these query plans automatically from natural-language queries.
However, you can also construct a LogicalPlan directly, in code, and pass it to
SycamoreQueryClient.run_plan() to run it.
Caching and performance
If you are running multiple queries that use the same intermediate results, Sycamore Query can cache those intermediate results to avoid recomputing them for subsequent queries. This improves performance and reduces LLM costs.
To use this feature, pass the cache_dir option to SycamoreQueryClient:
client = SycamoreQueryClient(cache_dir="/path/to/cache/dir")
Intermediate query results will be written to this directory and reused for
subsequent queries using the same cache_dir setting. If you wish to invalidate
the cache, simply remove the contents of your cache_dir.
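Clearing the cache can be scripted. The following is a minimal sketch using only the Python standard library; the helper name clear_local_cache is hypothetical and not part of Sycamore, and it assumes cache_dir is a local filesystem path. It removes everything inside the directory while leaving the directory itself in place.

```python
import shutil
from pathlib import Path


def clear_local_cache(cache_dir: str) -> int:
    """Delete everything inside cache_dir but keep the directory itself.

    Returns the number of top-level entries removed. Handles local paths only;
    anything that is not an existing local directory is left untouched.
    """
    root = Path(cache_dir)
    removed = 0
    if not root.is_dir():
        return removed
    for entry in root.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subdirectory and its contents
        else:
            entry.unlink()  # remove a regular file or a symlink
        removed += 1
    return removed
```

Calling clear_local_cache("/path/to/cache/dir") between runs forces the next query to recompute its intermediate results.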
Sycamore Query can also cache the results of LLM calls, saving time and money when
many LLM operations are being performed. To use this feature, pass the
s3_cache_path option to SycamoreQueryClient:
client = SycamoreQueryClient(s3_cache_path="/path/to/llm_cache/dir")
(Note that the name of this flag is a misnomer; it need not be an S3 path.)
The cache_dir and s3_cache_path settings can either be local filesystem
paths, or locations of S3 buckets (e.g., s3://your-bucket/query-cache).
Debugging query execution
Sycamore Query will write the output of each node of the query plan as it runs
to a trace directory that you specify, allowing you to inspect the results as they
flow through the query plan. To use this feature, pass the trace_dir option
to SycamoreQueryClient:
client = SycamoreQueryClient(trace_dir="/path/to/trace/dir")
The contents of the trace_dir will be populated with files containing
the output of each query operator (note that these can be quite large, depending
on the amount of data you are querying). The layout of the directory will be:
trace_dir/
    <query_id>/
        <node_id>/
            doc-<doc_uuid_1>.pickle
            doc-<doc_uuid_2>.pickle
            ...
where <query_id> is a unique ID representing the query that was executed,
<node_id> is the node ID in the query plan, and <doc_uuid_N> is a unique
ID for each document in the DocSet that was emitted by that node in the query plan.
These are Python pickle files containing the contents of each Sycamore Document emitted by
the corresponding query node. You can read them back with code like the following:
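import os
from sycamore.data import Document
# trace_dir is the path passed to SycamoreQueryClient; query_id is the name
# of one of its subdirectories (one per executed query).
query_dir = os.path.join(trace_dir, query_id)
docs = {}
for node_id in os.listdir(query_dir):
    docs[node_id] = []
    for filename in os.listdir(os.path.join(query_dir, node_id)):
        path = os.path.join(query_dir, node_id, filename)
        # Read the pickled bytes from the open file handle, not from the path string.
        with open(path, "rb") as file:
            doc = Document.deserialize(file.read())
        docs[node_id].append(doc)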
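Because trace files can be large, it can help to check how many documents each node emitted before deserializing anything. The following is a minimal sketch that relies only on the directory layout described above and the standard library; the helper name summarize_trace is hypothetical and not part of Sycamore. It counts files without opening them.

```python
import os


def summarize_trace(trace_dir: str) -> dict:
    """Map each query ID to a dict of {node_id: number of doc-*.pickle files}."""
    summary = {}
    for query_id in sorted(os.listdir(trace_dir)):
        query_path = os.path.join(trace_dir, query_id)
        if not os.path.isdir(query_path):
            continue  # skip stray files at the top level
        node_counts = {}
        for node_id in sorted(os.listdir(query_path)):
            node_path = os.path.join(query_path, node_id)
            if not os.path.isdir(node_path):
                continue
            # Count only files that follow the doc-<uuid>.pickle naming pattern.
            node_counts[node_id] = sum(
                1
                for name in os.listdir(node_path)
                if name.startswith("doc-") and name.endswith(".pickle")
            )
        summary[query_id] = node_counts
    return summary
```

A node whose count drops to zero is often the first place to look when a query returns an unexpectedly empty result.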