Extract Table

class sycamore.transforms.extract_table.CachedTextractTableExtractor(s3_cache_location, run_full_textract: bool = False, s3_textract_upload_path: str = '', profile_name: str | None = None, region_name: str | None = None, kms_key_id: str = '')

Bases: TextractTableExtractor

Extends TextractTableExtractor with S3-based caching of raw Textract results.

CachedTextractTableExtractor overrides the 'get_textract_result' method to do the following:
  1. If the cache has an entry for the current document, get the result from the cache and return it; otherwise continue.

  2. If run_full_textract is enabled, call textractor on the whole document and go to step 4.

  3. Otherwise, clip the pages that contain tables and run table extraction on them using textractor.

  4. Update the cache with the textractor result and return the result.
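The steps above can be sketched as a small cache-then-compute function. This is an illustrative sketch only, not Sycamore's implementation: the function name cached_result, the dict used in place of the S3 cache, and the two callables standing in for the textractor calls are all assumptions made for the example.

```python
from typing import Callable, Dict


def cached_result(
    doc_key: str,
    cache: Dict[str, dict],
    run_full_textract: bool,
    analyze_whole_document: Callable[[], dict],
    analyze_table_pages: Callable[[], dict],
) -> dict:
    """Illustrative control flow only; every name here is hypothetical."""
    # Step 1: on a cache hit, return the stored result immediately.
    if doc_key in cache:
        return cache[doc_key]
    if run_full_textract:
        # Step 2: analyze the whole document.
        result = analyze_whole_document()
    else:
        # Step 3: analyze only the pages that contain tables.
        result = analyze_table_pages()
    # Step 4: store the fresh result, then return it.
    cache[doc_key] = result
    return result


# A plain dict stands in for the S3 cache; the lambdas stand in for Textract calls.
cache: Dict[str, dict] = {}
first = cached_result("doc-1", cache, False, lambda: {"mode": "full"}, lambda: {"mode": "tables"})
# The second call hits the cache, so run_full_textract has no effect on it.
second = cached_result("doc-1", cache, True, lambda: {"mode": "full"}, lambda: {"mode": "tables"})
```

In this sketch a cached entry is returned regardless of the run_full_textract setting, which mirrors step 1 taking priority over step 2 in the list above.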

class sycamore.transforms.extract_table.MissingS3UploadPath

Bases: Exception

Raised when an S3 upload path is needed but none was provided.
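A minimal sketch of the guard pattern this exception suggests follows. The class below is a local stand-in so that the example runs without Sycamore installed, and require_upload_path is a hypothetical helper; exactly where Sycamore performs this check is not stated here.

```python
class MissingS3UploadPath(Exception):
    """Local stand-in for the Sycamore exception of the same name."""


def require_upload_path(s3_upload_path: str) -> str:
    # Hypothetical helper: an empty string is treated as "not provided",
    # matching the '' defaults in the constructor signatures.
    if not s3_upload_path:
        raise MissingS3UploadPath("an S3 upload path is required but none was provided")
    return s3_upload_path


try:
    require_upload_path("")
    raised = False
except MissingS3UploadPath:
    raised = True
```

Callers that need an upload can catch the exception and report that the upload path parameter was left unset.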

class sycamore.transforms.extract_table.TextractTableExtractor(profile_name: str | None = None, region_name: str | None = None, kms_key_id: str = '', s3_upload_root: str = '')

Bases: TableExtractor

TextractTableExtractor uses Amazon Textract, a cloud-based document text and data extraction service from AWS, to extract tables from documents.

Parameters:
  • profile_name -- The AWS profile name to use for authentication. Default is None.

  • region_name -- The AWS region name where the Textract service is available. Default is None.

  • kms_key_id -- The AWS Key Management Service (KMS) key ID for encryption. Default is an empty string.

Example

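import sycamore
from sycamore.transforms.extract_table import TextractTableExtractor
from sycamore.transforms.partition import UnstructuredPdfPartitioner

table_extractor = TextractTableExtractor(profile_name="my-profile", region_name="us-east-1")

context = sycamore.init()
# paths is the location (or list of locations) of the PDF files to read.
pdf_docset = (context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner(), table_extractor=table_extractor))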