Extract Schema

class sycamore.transforms.extract_schema.ExtractBatchSchema(child: Node, schema_extractor: SchemaExtractor, **resource_args)

Bases: Map

ExtractBatchSchema is a transformation class for extracting a schema from a dataset using a SchemaExtractor. It assumes that all documents in the dataset share a common schema.

If each document needs its own schema (for example, in a heterogeneous PDF collection), consider using ExtractSchema instead.

The dataset is returned with an additional _schema property that contains a JSON-encoded schema, if one is detected. This schema is the same for all elements of the dataset.
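As a rough illustration of reading that property back, the sketch below uses plain dictionaries in place of Sycamore documents. It assumes _schema holds a JSON string; the text above only says "JSON-encoded", so the helper also accepts an already-parsed dict. The name read_schema and the sample paths are hypothetical and not part of Sycamore.

```python
import json

def read_schema(properties):
    """Return the schema as a dict, or None if no schema was detected."""
    raw = properties.get("_schema")
    if raw is None:
        return None
    # Assumption: the value is a JSON string; tolerate a parsed dict as well.
    return json.loads(raw) if isinstance(raw, str) else raw

# Stand-ins for the properties of two documents returned by ExtractBatchSchema.
batch = [
    {"path": "a.pdf", "_schema": '{"location": "string", "date_and_time": "string"}'},
    {"path": "b.pdf", "_schema": '{"location": "string", "date_and_time": "string"}'},
]

schemas = [read_schema(props) for props in batch]
```

With a batch transform, every entry in schemas is expected to be identical, since the same schema is attached to each element.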

Parameters:

child -- The source node or component that provides the dataset text for schema suggestion
schema_extractor -- An instance of a SchemaExtractor class that provides the schema extraction method
resource_args -- Additional resource-related arguments that can be passed to the extraction operation

Example

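from sycamore.transforms.extract_schema import ExtractBatchSchema

# ExampleSchemaExtractor stands in for any SchemaExtractor implementation.
custom_schema_extractor = ExampleSchemaExtractor(entity_extraction_params)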

documents = ...  # Define a source node or component that provides a dataset with text data.
documents_with_schema = ExtractBatchSchema(child=documents, schema_extractor=custom_schema_extractor)
documents_with_schema = documents_with_schema.execute()

class sycamore.transforms.extract_schema.ExtractSchema(child: Node, schema_extractor: SchemaExtractor, **resource_args)

Bases: Map

ExtractSchema is a transformation class for extracting schemas from documents using a SchemaExtractor.

This transform extracts a schema for each document in the DocSet independently. If the documents in the DocSet are instances of a common schema, consider ExtractBatchSchema, which extracts a single schema for all documents.

The dataset is returned with an additional _schema property that contains a JSON-encoded schema, if one is detected.
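The contrast between per-document and batch extraction can be pictured with a toy example. Everything below is a stand-in: toy_schema replaces the real SchemaExtractor, documents are plain dictionaries, and deriving the shared schema from the union of field names is purely an illustrative assumption, since how Sycamore derives a common schema is not described here.

```python
import copy
import json

def toy_schema(field_names):
    """Toy stand-in for a SchemaExtractor: every field is typed as "string"."""
    return json.dumps({name: "string" for name in sorted(field_names)})

docs = [
    {"fields": ["aircraft", "location"], "properties": {}},
    {"fields": ["invoice_id", "total"], "properties": {}},
]

def per_document(docs):
    """Mimics ExtractSchema: each document gets its own _schema."""
    out = copy.deepcopy(docs)
    for doc in out:
        doc["properties"]["_schema"] = toy_schema(doc["fields"])
    return out

def batch(docs):
    """Mimics ExtractBatchSchema: one shared _schema for every document."""
    out = copy.deepcopy(docs)
    shared = toy_schema({f for doc in out for f in doc["fields"]})
    for doc in out:
        doc["properties"]["_schema"] = shared
    return out

individual = per_document(docs)
common = batch(docs)
```

In this sketch the two entries of individual carry different schemas, while both entries of common carry the same one.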

Parameters:

child -- The source node or component that provides the dataset text for schema suggestion
schema_extractor -- An instance of a SchemaExtractor class that provides the schema extraction method
resource_args -- Additional resource-related arguments that can be passed to the extraction operation

Example

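from sycamore.transforms.extract_schema import ExtractSchema

# ExampleSchemaExtractor stands in for any SchemaExtractor implementation.
custom_schema_extractor = ExampleSchemaExtractor(entity_extraction_params)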

documents = ...  # Define a source node or component that provides a dataset with text data.
documents_with_schema = ExtractSchema(child=documents, schema_extractor=custom_schema_extractor)
documents_with_schema = documents_with_schema.execute()

class sycamore.transforms.extract_schema.LLMPropertyExtractor(llm: sycamore.llms.llms.LLM, schema_name: str | None = None, schema: dict | sycamore.schema.SchemaV2 | None = None, num_of_elements: int | None = None, prompt_formatter: typing.Callable[[list[sycamore.data.element.Element]], str] = <function element_list_formatter>, metadata_extraction: bool = False, embedder: sycamore.transforms.embed.Embedder | None = None, group_size: int | None = None, clustering: bool = True)

Bases: PropertyExtractor

The LLMPropertyExtractor uses an LLM to extract actual property values once a schema has been detected or provided.

Parameters:

llm -- An instance of an LLM for text processing.
schema_name -- An optional natural-language name of the class to be extracted (e.g. Corporation). If not provided, the _schema_class property added by extract_schema is used.
schema -- An optional JSON-encoded schema or Schema object to be used for property extraction. If not provided, the _schema property added by extract_schema is used.
num_of_elements -- The number of elements to consider for property extraction. Default is 10.
prompt_formatter -- A callable function to format prompts based on document elements.
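Per the signature above, prompt_formatter takes a list of elements and returns a string. A minimal sketch of a custom formatter follows. FakeElement is a hypothetical stand-in so the snippet runs without Sycamore, and the text_representation attribute it exposes is assumed to mirror the attribute on Sycamore's Element; numbered_formatter is an illustrative name, not a library function.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeElement:
    """Hypothetical stand-in for sycamore.data.element.Element."""
    type: Optional[str] = None
    text_representation: Optional[str] = None

def numbered_formatter(elements):
    """Join element texts into one prompt string, one numbered line per element."""
    lines = []
    for index, element in enumerate(elements, start=1):
        text = element.text_representation or ""  # elements may carry no text
        lines.append(f"ELEMENT {index}: {text}")
    return "\n".join(lines)

prompt = numbered_formatter(
    [FakeElement(text_representation="Incident near Denver"), FakeElement()]
)
```

A function with this shape could then be passed as prompt_formatter=numbered_formatter when constructing the extractor.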

Example

from sycamore.llms import OpenAI, OpenAIModels
from sycamore.transforms.extract_schema import LLMPropertyExtractor

schema_name = "AircraftIncident"
schema = {"location": "string", "aircraft": "string", "date_and_time": "string"}

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
property_extractor = LLMPropertyExtractor(
    llm=openai_llm, schema_name=schema_name, schema=schema, num_of_elements=35
)

docs_with_schema = ...
docs_with_schema = docs_with_schema.extract_properties(property_extractor=property_extractor)
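The example passes schema_name and schema explicitly. According to the parameter descriptions, when they are omitted the extractor falls back to the _schema_class and _schema properties on the document. The sketch below illustrates that precedence with plain dictionaries; resolve_schema is a hypothetical helper, not Sycamore's implementation, and it assumes _schema may be stored as a JSON string.

```python
import json

def resolve_schema(explicit_schema, explicit_name, properties):
    """Prefer explicit arguments; otherwise fall back to document properties."""
    schema = explicit_schema if explicit_schema is not None else properties.get("_schema")
    if isinstance(schema, str):
        # Assumption: a stored schema may be JSON-encoded text.
        schema = json.loads(schema)
    name = explicit_name if explicit_name is not None else properties.get("_schema_class")
    return name, schema

doc_properties = {
    "_schema_class": "AircraftIncident",
    "_schema": '{"location": "string"}',
}

resolved_from_doc = resolve_schema(None, None, doc_properties)
resolved_explicit = resolve_schema({"aircraft": "string"}, "Aircraft", doc_properties)
```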

class sycamore.transforms.extract_schema.OpenAIPropertyExtractor(llm: sycamore.llms.llms.LLM, schema_name: str | None = None, schema: dict | sycamore.schema.SchemaV2 | None = None, num_of_elements: int | None = None, prompt_formatter: typing.Callable[[list[sycamore.data.element.Element]], str] = <function element_list_formatter>, metadata_extraction: bool = False, embedder: sycamore.transforms.embed.Embedder | None = None, group_size: int | None = None, clustering: bool = True)

Bases: LLMPropertyExtractor

Alias for LLMPropertyExtractor for OpenAI models.

Retained for backward compatibility.

Deprecated since version 0.1.25.

Use LLMPropertyExtractor instead.

class sycamore.transforms.extract_schema.OpenAISchemaExtractor(entity_name: str, llm: sycamore.llms.llms.LLM, num_of_elements: int = 35, max_num_properties: int = 7, prompt_formatter: typing.Callable[[list[sycamore.data.element.Element]], str] = <function element_list_formatter>)

Bases: LLMSchemaExtractor

Alias for LLMSchemaExtractor for OpenAI models.

Retained for backward compatibility.

Deprecated since version 0.1.25.

Use LLMSchemaExtractor instead.
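For readers unfamiliar with this kind of backward-compatible alias, the general pattern is a thin subclass that warns and then defers to the new class. The sketch below shows that pattern with hypothetical class names (NewExtractor, OldExtractor); it is not taken from Sycamore's source, and whether the library emits a DeprecationWarning at all is an assumption.

```python
import warnings

class NewExtractor:
    """Hypothetical replacement class."""
    def __init__(self, entity_name):
        self.entity_name = entity_name

class OldExtractor(NewExtractor):
    """Hypothetical deprecated alias kept so existing code keeps working."""
    def __init__(self, *args, **kwargs):
        warnings.warn(
            "OldExtractor is deprecated; use NewExtractor instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)

# Capture warnings so the deprecation notice can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = OldExtractor("Corporation")
```

Because the alias is a subclass, code that checks isinstance against the new class continues to accept objects built through the old name.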