Extract Entity
- class sycamore.transforms.extract_entity.ExtractEntity(child: Node, entity_extractor: EntityExtractor, context: Context | None = None, **resource_args)
Bases: Map
ExtractEntity is a transformation class for extracting entities from a dataset using an EntityExtractor.
The Extract Entity Transform extracts semantically meaningful information from your documents. The extracted entities are then added to each document as properties.
- Parameters:
child -- The source node or component that provides the dataset containing text data.
entity_extractor -- An instance of an EntityExtractor class that defines the entity extraction method to be applied.
resource_args -- Additional resource-related arguments that can be passed to the extraction operation.
Example
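from sycamore.transforms.extract_entity import ExtractEntity

# Define a source node or component that provides a dataset with text data.
source_node = ...
# MyEntityExtractor and entity_extraction_params are placeholders for your own EntityExtractor subclass and its arguments.
custom_entity_extractor = MyEntityExtractor(entity_extraction_params)
extraction_transform = ExtractEntity(child=source_node, entity_extractor=custom_entity_extractor)
extracted_entities_dataset = extraction_transform.execute()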
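To illustrate how an extracted entity ends up as a document property, here is a minimal, self-contained sketch. It does not use Sycamore itself: Doc, FirstLineTitleExtractor, and extract_entity_map are hypothetical stand-ins invented for this illustration, and the real EntityExtractor interface may differ.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Doc:
    """Stand-in for a Sycamore document (hypothetical, not the real class)."""
    text: str
    properties: dict[str, Any] = field(default_factory=dict)


class FirstLineTitleExtractor:
    """Hypothetical extractor that treats the first non-empty line as the entity."""

    def __init__(self, entity_name: str) -> None:
        self.entity_name = entity_name

    def extract_entity(self, doc: Doc) -> Doc:
        lines = [ln.strip() for ln in doc.text.splitlines() if ln.strip()]
        # Store the result under the entity name, leaving other properties intact.
        doc.properties[self.entity_name] = lines[0] if lines else None
        return doc


def extract_entity_map(docs: list[Doc], extractor: FirstLineTitleExtractor) -> list[Doc]:
    """Map-style transform: apply the extractor to every document independently."""
    return [extractor.extract_entity(d) for d in docs]


docs = [Doc("A Study of Widgets\nAbstract ..."), Doc("")]
result = extract_entity_map(docs, FirstLineTitleExtractor("title"))
print([d.properties for d in result])
```

The point of the sketch is the shape of the operation: each document is processed on its own, and the entity is written into properties under the name given to the extractor.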
- class sycamore.transforms.extract_entity.OpenAIEntityExtractor(entity_name: str, llm: sycamore.llms.llms.LLM | None = None, prompt_template: str | None = None, num_of_elements: int = 10, prompt_formatter: typing.Callable[[list[sycamore.data.element.Element], str], str] = <function element_list_formatter>, use_elements: bool | None = True, prompt: list[dict] | str | None = [], field: str = 'text_representation')
Bases: EntityExtractor
OpenAIEntityExtractor uses one of OpenAI's large language models (LLMs) for entity extraction.
This class inherits from EntityExtractor and is designed for extracting a specific entity from a document using an OpenAI language model. It can use either zero-shot or few-shot prompting to extract the entity. The extracted entity is stored in the document's properties.
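The difference between zero-shot and few-shot prompting can be sketched as follows. This is an illustration only: build_prompt and the prompt wording are assumptions made for this example, not the prompts Sycamore actually sends to the model.

```python
from typing import Optional


def build_prompt(entity_name: str, document_text: str, prompt_template: Optional[str] = None) -> str:
    """Return a few-shot prompt when worked examples are supplied, otherwise a zero-shot prompt."""
    question = f"What is the {entity_name} of the following document?"
    if prompt_template is None:
        # Zero-shot: the model sees only the question and the document text.
        return f"{question}\n\n{document_text}"
    # Few-shot: worked examples come first so the model can imitate their format.
    return f"{prompt_template}\n\n{question}\n\n{document_text}"


# Hypothetical worked example used as the few-shot template.
examples = "ELEMENT 1: Deep Residual Learning ...\n=> title: Deep Residual Learning"
zero_shot = build_prompt("title", "ELEMENT 1: Some paper text")
few_shot = build_prompt("title", "ELEMENT 1: Some paper text", prompt_template=examples)
```

In this sketch, leaving prompt_template as None selects zero-shot prompting, which mirrors the documented default of None for that parameter.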
- Parameters:
entity_name -- The name of the entity to be extracted.
llm -- An instance of an OpenAI language model for text processing.
prompt_template -- A template for constructing prompts for few-shot prompting. Default is None.
num_of_elements -- The number of elements to consider for entity extraction. Default is 10.
prompt_formatter -- A callable function to format prompts based on document elements.
Example
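import sycamore
# The two import paths below are assumed from the Sycamore package layout.
from sycamore.llms import OpenAI, OpenAIModels
from sycamore.transforms.partition import UnstructuredPdfPartitioner
from sycamore.transforms.extract_entity import OpenAIEntityExtractor

title_context_template = "template"  # placeholder for a few-shot prompt template
openai_llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
entity_extractor = OpenAIEntityExtractor("title", llm=openai_llm, prompt_template=title_context_template)

context = sycamore.init()
# paths must point to the input PDF files and is defined elsewhere.
# The parentheses allow the chained calls to span several lines.
pdf_docset = (
    context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())
    .extract_entity(entity_extractor=entity_extractor)
)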
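A custom prompt_formatter must match the signature shown above: it takes a list of elements and a string, and returns a string. The sketch below assumes the string argument is the name of the element field to read (matching the field parameter's default of 'text_representation') and that num_of_elements limits formatting to the first N elements. Both are assumptions, and Element here is a stand-in dataclass, not the real sycamore.data.element.Element.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Stand-in for sycamore.data.element.Element (hypothetical, not the real class)."""
    text_representation: Optional[str] = None


def numbered_formatter(elements: list[Element], field: str = "text_representation") -> str:
    """Render the chosen field of each element on its own numbered line."""
    lines = []
    for i, element in enumerate(elements, start=1):
        value = getattr(element, field, None)
        lines.append(f"ELEMENT {i}: {value}")
    return "\n".join(lines)


elements = [Element("A Study of Widgets"), Element("Jane Doe"), Element("Abstract ...")]
num_of_elements = 2  # assumed to mean: format only the first N elements
prompt_text = numbered_formatter(elements[:num_of_elements])
print(prompt_text)
```

A formatter like this would be passed as prompt_formatter=numbered_formatter when constructing the extractor, provided it accepts the real Element type.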