Extract Schema
- class sycamore.transforms.extract_schema.ExtractBatchSchema(child: Node, schema_extractor: SchemaExtractor, **resource_args)
Bases: Map
ExtractBatchSchema is a transformation class for extracting a schema from a dataset using a SchemaExtractor. It assumes that all documents in the dataset share a common schema.
If a separate schema for each document is more appropriate (such as in a heterogeneous PDF collection), consider using ExtractSchema instead.
The dataset is returned with an additional _schema property that contains a JSON-encoded schema, if any is detected. This schema will be the same for all elements of the dataset.
- Parameters:
child -- The source node or component that provides the dataset text for schema suggestion
schema_extractor -- An instance of a SchemaExtractor class that provides the schema extraction method
resource_args -- Additional resource-related arguments that can be passed to the extraction operation
Example
    custom_schema_extractor = ExampleSchemaExtractor(entity_extraction_params)
    documents = ...  # Define a source node or component that provides a dataset with text data.
    documents_with_schema = ExtractBatchSchema(child=documents, schema_extractor=custom_schema_extractor)
    documents_with_schema = documents_with_schema.execute()
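The difference between one batch schema and per-document schemas can be sketched without Sycamore itself. The snippet below is a minimal illustration, not the library's implementation: documents are plain dicts, and infer_schema is a hypothetical stand-in for a SchemaExtractor that only collects field names from "name: value" lines. Storing the schema as a JSON string follows the description above; the schema shape (field name mapped to a type name) is an assumption.

```python
import json


def infer_schema(texts):
    """Hypothetical stand-in for a schema extractor.

    Maps every field name found in 'name: value' lines to the type name 'str'.
    """
    schema = {}
    for text in texts:
        for line in text.splitlines():
            if ":" in line:
                name = line.split(":", 1)[0].strip()
                schema.setdefault(name, "str")
    return schema


# Plain dicts stand in for documents (assumption, not the Sycamore Document class).
docs = [
    {"text": "title: Report A\nyear: 2020", "properties": {}},
    {"text": "title: Report B\nauthor: Kim", "properties": {}},
]

# Batch mode: one schema is inferred from all documents and attached to each.
batch_schema = json.dumps(infer_schema(d["text"] for d in docs), sort_keys=True)
batch_docs = [
    {**d, "properties": {**d["properties"], "_schema": batch_schema}} for d in docs
]

# Per-document mode: each document gets a schema inferred from its own text only.
per_doc_docs = [
    {
        **d,
        "properties": {
            **d["properties"],
            "_schema": json.dumps(infer_schema([d["text"]]), sort_keys=True),
        },
    }
    for d in docs
]
```

In the batch case every element carries the same _schema value; in the per-document case the two values differ because each document exposes different fields.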
- class sycamore.transforms.extract_schema.ExtractProperties(child: Node, property_extractor: PropertyExtractor, **resource_args)
Bases: Map
ExtractProperties is a transformation class for extracting property values from a document once a schema has been established.
The schema may be detected by ExtractSchema or provided manually under the _schema key of Document.properties.
- Parameters:
child -- The source node or component that provides the documents, each carrying a schema, from which property values are extracted
property_extractor -- An instance of a PropertyExtractor class that provides the property detection method
resource_args -- Additional resource-related arguments that can be passed to the extraction operation
Example
    documents = ...  # Define a source node or component that provides a dataset with text data.
    custom_property_extractor = ExamplePropertyExtractor(entity_extraction_params)
    documents_with_schema = ...
    documents_with_properties = ExtractProperties(
        child=documents_with_schema, property_extractor=custom_property_extractor
    )
    documents_with_properties = documents_with_properties.execute()
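Supplying a schema by hand and then filling in values can be illustrated with a small, library-free sketch. Everything here is an assumption made for illustration: the document is a plain dict, the schema is a dict mapping property names to type names, extract_properties is a hypothetical stand-in for a PropertyExtractor, and the "extracted_values" key is invented for this example rather than taken from Sycamore.

```python
def extract_properties(text, schema):
    """Hypothetical stand-in for a property extractor.

    Reads 'name: value' lines and keeps only fields named in the schema,
    casting each value according to the schema's type name.
    """
    casts = {"int": int, "float": float, "str": str}
    found = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name in schema:
            cast = casts.get(schema[name], str)  # unknown type names fall back to str
            found[name] = cast(value.strip())
    return found


document = {
    "text": "title: Annual Report\nyear: 2020\nnotes: draft",
    # Schema provided manually under the _schema key (shape is an assumption).
    "properties": {"_schema": {"title": "str", "year": "int"}},
}

schema = document["properties"]["_schema"]
# "extracted_values" is a hypothetical key chosen for this sketch.
document["properties"]["extracted_values"] = extract_properties(document["text"], schema)
```

Fields that are absent from the schema, such as notes here, are ignored, which mirrors the idea that the schema decides which properties are extracted.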
- class sycamore.transforms.extract_schema.ExtractSchema(child: Node, schema_extractor: SchemaExtractor, **resource_args)
Bases: Map
ExtractSchema is a transformation class for extracting schemas from documents using a SchemaExtractor.
This transform extracts a separate schema for each document in the DocSet independently. If the documents in the DocSet represent instances with a common schema, consider ExtractBatchSchema, which extracts one common schema for all documents.
The dataset is returned with an additional _schema property that contains a JSON-encoded schema, if any is detected.
- Parameters:
child -- The source node or component that provides the dataset text for schema suggestion
schema_extractor -- An instance of a SchemaExtractor class that provides the schema extraction method
resource_args -- Additional resource-related arguments that can be passed to the extraction operation
Example
    custom_schema_extractor = ExampleSchemaExtractor(entity_extraction_params)
    documents = ...  # Define a source node or component that provides a dataset with text data.
    documents_with_schema = ExtractSchema(child=documents, schema_extractor=custom_schema_extractor)
    documents_with_schema = documents_with_schema.execute()
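Because the _schema property is only present when a schema is detected, downstream code may want to read it defensively. A minimal sketch, assuming the properties object behaves like a plain dict and that the schema may arrive either as a JSON string or as an already-parsed dict (both are assumptions; read_schema is a hypothetical helper, not part of Sycamore):

```python
import json


def read_schema(properties):
    """Return the schema as a dict, or None if it is missing or unreadable."""
    raw = properties.get("_schema")
    if raw is None:
        return None  # no schema was detected for this document
    if isinstance(raw, str):
        try:
            return json.loads(raw)  # JSON-encoded schema
        except json.JSONDecodeError:
            return None  # malformed JSON is treated as "no schema"
    return raw  # assumed to be already parsed


with_schema = read_schema({"_schema": '{"title": "str", "year": "int"}'})
without_schema = read_schema({})
```

Treating malformed JSON the same as a missing schema is a design choice of this sketch; stricter callers might prefer to raise instead.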
- class sycamore.transforms.extract_schema.OpenAIPropertyExtractor(llm: LLM, schema_name: str | None = None, schema: dict[str, str] | None = None, num_of_elements: int = 10, prompt_formatter: Callable[[list[Element]], str] = element_list_formatter)
Bases: LLMPropertyExtractor
Alias for LLMPropertyExtractor for OpenAI models.
Retained for backward compatibility.
Deprecated since version 0.1.25: Use LLMPropertyExtractor instead.
- class sycamore.transforms.extract_schema.OpenAISchemaExtractor(entity_name: str, llm: LLM, num_of_elements: int = 35, max_num_properties: int = 7, prompt_formatter: Callable[[list[Element]], str] = element_list_formatter)
Bases: LLMSchemaExtractor
Alias for LLMSchemaExtractor for OpenAI models.
Retained for backward compatibility.
Deprecated since version 0.1.25: Use LLMSchemaExtractor instead.
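For readers unfamiliar with how a backward-compatible alias of this kind typically works, here is a generic Python sketch of the pattern. It is not Sycamore's code: both classes are hypothetical stand-ins with "Sketch" in their names, and whether the real alias emits a warning is an assumption. Only the default values (35 and 7) are taken from the signature above.

```python
import warnings


class LLMSchemaExtractorSketch:
    """Hypothetical stand-in for the preferred class."""

    def __init__(self, entity_name, llm, num_of_elements=35, max_num_properties=7):
        self.entity_name = entity_name
        self.llm = llm
        self.num_of_elements = num_of_elements
        self.max_num_properties = max_num_properties


class OpenAISchemaExtractorSketch(LLMSchemaExtractorSketch):
    """Hypothetical deprecated alias: same behaviour, plus a warning on construction."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "OpenAISchemaExtractorSketch is deprecated; use LLMSchemaExtractorSketch instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this __init__
        )
        super().__init__(*args, **kwargs)


# Old call sites keep working; new code should construct the base class directly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = OpenAISchemaExtractorSketch("Invoice", llm=None)

preferred = LLMSchemaExtractorSketch("Invoice", llm=None)
```

Because the alias is a subclass, isinstance checks against the preferred class continue to pass, so migrating usually amounts to swapping the class name at the call site.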