Extract Schema

class sycamore.transforms.extract_schema.ExtractBatchSchema(child: Node, schema_extractor: SchemaExtractor, **resource_args)

Bases: Transform

ExtractBatchSchema is a transformation class for extracting a schema from a dataset using a SchemaExtractor. It assumes that all documents in the dataset share a common schema.

If each document needs its own schema (as in a heterogeneous PDF collection), consider using ExtractSchema instead.

The dataset is returned with an additional _schema property that contains the JSON-encoded schema, if one is detected. This schema is the same for all elements of the dataset.

Parameters:
  • child – The source node or component that provides the dataset text for schema suggestion

  • schema_extractor – An instance of a SchemaExtractor class that provides the schema extraction method

  • resource_args – Additional resource-related arguments that can be passed to the extraction operation

Example

custom_schema_extractor = ExampleSchemaExtractor(entity_extraction_params)

documents = ...  # Define a source node or component that provides a dataset with text data.
documents_with_schema = ExtractBatchSchema(child=documents, schema_extractor=custom_schema_extractor)
documents_with_schema = documents_with_schema.execute()
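
The effect of a batch schema on the dataset can be sketched without Sycamore itself. In this minimal illustration, documents are plain dictionaries and infer_shared_schema is a hypothetical stand-in for a SchemaExtractor (it simply collects "key: value" field names); neither is part of the library. It shows only the shape of the result: one JSON-encoded schema stored under _schema on every document.

```python
import json


def infer_shared_schema(texts):
    """Hypothetical stand-in for a schema extractor: collect 'key: value' field names."""
    fields = []
    for text in texts:
        for line in text.splitlines():
            key, sep, _ = line.partition(":")
            if sep and key.strip() not in fields:
                fields.append(key.strip())
    # Every field is typed as a string in this simplified sketch.
    return {name: "string" for name in fields}


def attach_batch_schema(docs):
    """Attach the same JSON-encoded schema to every document's properties."""
    schema_json = json.dumps(infer_shared_schema(d["text"] for d in docs))
    for doc in docs:
        doc.setdefault("properties", {})["_schema"] = schema_json
    return docs


docs = [
    {"text": "name: Acme\nfounded: 1990"},
    {"text": "name: Globex\nheadquarters: Springfield"},
]
docs_with_schema = attach_batch_schema(docs)
```

Because the schema is computed once over the whole collection, a field that appears in only one document (such as headquarters here) still ends up in the schema attached to every document.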

class sycamore.transforms.extract_schema.ExtractProperties(child: Node, property_extractor: PropertyExtractor, **resource_args)

Bases: Transform

ExtractProperties is a transformation class for extracting property values from a document once a schema has been established.

The schema may be detected by ExtractSchema or provided manually under the _schema key of Document.properties.

Parameters:
  • child – The source node or component that provides the dataset, with a schema already attached, for property extraction

  • property_extractor – An instance of a PropertyExtractor class that provides the property detection method

  • resource_args – Additional resource-related arguments that can be passed to the extraction operation

Example

documents = ...  # Define a source node or component that provides a dataset with text data.
custom_property_extractor = ExamplePropertyExtractor(entity_extraction_params)

documents_with_schema = ...  # A dataset whose documents already carry a _schema property.
documents_with_properties = ExtractProperties(
    child=documents_with_schema,
    property_extractor=custom_property_extractor
)
documents_with_properties = documents_with_properties.execute()
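
As a rough illustration of the schema-then-values flow, the sketch below supplies a schema manually under the _schema key and then fills in values for the fields it names. Everything here is an assumption made for illustration: documents are plain dictionaries, the extraction logic is simple "key: value" parsing rather than an LLM call, and the "entity" key used to hold the results is hypothetical, not the library's output format.

```python
import json


def extract_properties(doc):
    """Fill in values for the fields named in the document's _schema (stand-in logic)."""
    schema = json.loads(doc["properties"]["_schema"])
    found = {}
    for line in doc["text"].splitlines():
        key, sep, value = line.partition(":")
        # Only keep fields that the schema declares; ignore everything else.
        if sep and key.strip() in schema:
            found[key.strip()] = value.strip()
    # The schema stays in place; values go under a hypothetical "entity" key.
    doc["properties"]["entity"] = found
    return doc


doc = {
    "text": "name: Acme\nfounded: 1990\nnote: internal",
    # Schema provided manually as a JSON-encoded string.
    "properties": {"_schema": json.dumps({"name": "string", "founded": "string"})},
}
doc = extract_properties(doc)
```

The note field is dropped because it is not part of the schema, which is the main reason to establish a schema before extracting values.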

class sycamore.transforms.extract_schema.ExtractSchema(child: Node, schema_extractor: SchemaExtractor, **resource_args)

Bases: Transform

ExtractSchema is a transformation class for extracting schemas from documents using a SchemaExtractor.

This transform extracts a schema for each document in the DocSet independently. If the documents in the DocSet are instances of a common schema, consider ExtractBatchSchema, which extracts a single schema shared by all documents.

The dataset is returned with an additional _schema property that contains the JSON-encoded schema, if one is detected.

Parameters:
  • child – The source node or component that provides the dataset text for schema suggestion

  • schema_extractor – An instance of a SchemaExtractor class that provides the schema extraction method

  • resource_args – Additional resource-related arguments that can be passed to the extraction operation

Example

custom_schema_extractor = ExampleSchemaExtractor(entity_extraction_params)

documents = ...  # Define a source node or component that provides a dataset with text data.
documents_with_schema = ExtractSchema(child=documents, schema_extractor=custom_schema_extractor)
documents_with_schema = documents_with_schema.execute()
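
For a heterogeneous collection, the per-document behavior might look like the following sketch. It uses plain dictionaries and a hypothetical infer_schema helper in place of a real SchemaExtractor, so it illustrates only the outcome: each document carries a _schema derived from its own content.

```python
import json


def infer_schema(text):
    """Hypothetical stand-in extractor: treat each 'key: value' line as a string field."""
    schema = {}
    for line in text.splitlines():
        key, sep, _ = line.partition(":")
        if sep:
            schema[key.strip()] = "string"
    return schema


def attach_per_document_schema(docs):
    """Infer and attach a JSON-encoded schema for each document independently."""
    for doc in docs:
        doc.setdefault("properties", {})["_schema"] = json.dumps(infer_schema(doc["text"]))
    return docs


docs = attach_per_document_schema([
    {"text": "invoice_number: 17\ntotal: 250.00"},
    {"text": "patient: J. Doe\ndiagnosis: sprain"},
])
schemas = [json.loads(d["properties"]["_schema"]) for d in docs]
```

Here the invoice and the medical record end up with unrelated schemas, which is the case where a single batch schema would be a poor fit.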

class sycamore.transforms.extract_schema.OpenAIPropertyExtractor(llm: sycamore.llms.llms.LLM, num_of_elements: int = 10, prompt_formatter: typing.Callable[[list[sycamore.data.element.Element]], str] = <function element_list_formatter>)

Bases: PropertyExtractor

OpenAIPropertyExtractor uses one of OpenAI’s large language models (LLMs) to extract property values once a schema has been detected or provided.

Parameters:
  • llm – An instance of an OpenAI language model for text processing.

  • num_of_elements – The number of elements to consider for property extraction. Default is 10.

  • prompt_formatter – A callable function to format prompts based on document elements.

Example

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
property_extractor = OpenAIPropertyExtractor(llm=openai_llm, num_of_elements=35)

docs_with_schema = ...
docs_with_schema = docs_with_schema.extract_properties(property_extractor=property_extractor)
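
The prompt_formatter parameter accepts any callable that takes a list of elements and returns a string. The sketch below shows what a custom formatter could look like. FakeElement is a hypothetical stand-in for sycamore.data.element.Element, and its type and text_representation attributes are assumptions made for this example. The slice mimics how num_of_elements might limit the input; it is not a statement of how the library applies that limit.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for sycamore.data.element.Element."""
    type: str
    text_representation: str


def numbered_formatter(elements):
    """Render each element as 'ELEMENT n (type): text', one per line."""
    return "\n".join(
        f"ELEMENT {i} ({el.type}): {el.text_representation}"
        for i, el in enumerate(elements, start=1)
    )


elements = [
    FakeElement("Title", "Annual Report"),
    FakeElement("Text", "Revenue grew."),
]
num_of_elements = 1  # Mirrors the extractor parameter of the same name.
prompt_text = numbered_formatter(elements[:num_of_elements])
```

A formatter with this signature could then be passed as prompt_formatter=numbered_formatter when constructing an extractor, assuming the real Element objects expose the attributes the formatter reads.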

class sycamore.transforms.extract_schema.OpenAISchemaExtractor(entity_name: str, llm: sycamore.llms.llms.LLM, num_of_elements: int = 35, max_num_properties: int = 7, prompt_formatter: typing.Callable[[list[sycamore.data.element.Element]], str] = <function element_list_formatter>)

Bases: SchemaExtractor

OpenAISchemaExtractor uses one of OpenAI’s large language models (LLMs) to extract a schema, given a suggested entity type.

Parameters:
  • entity_name – A natural-language name of the class to be extracted (e.g. Corporation)

  • llm – An instance of an OpenAI language model for text processing.

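  • num_of_elements – The number of elements to consider for schema extraction. Default is 35.

  • max_num_properties – The maximum number of properties to include in the extracted schema. Default is 7.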

  • prompt_formatter – A callable function to format prompts based on document elements.

Example

openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
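schema_extractor = OpenAISchemaExtractor("Corporation", llm=openai_llm, num_of_elements=35)

paths = ...  # Path(s) to the source PDF files.
context = sycamore.init()
# Wrap the chained calls in parentheses so the expression can span multiple lines.
pdf_docset = (
    context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .extract_schema(schema_extractor=schema_extractor)
)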