Extract Table

class sycamore.transforms.extract_table.CachedTextractTableExtractor(s3_cache_location, run_full_textract: bool = False, s3_textract_upload_path: str = '', profile_name: str | None = None, region_name: str | None = None, kms_key_id: str = '')

Bases: TextractTableExtractor

Extends TextractTableExtractor with S3-based cache support for raw Textract results.

CachedTextractTableExtractor overrides the 'get_textract_result' method to do the following:

1. If the cache has an entry for the current document, fetch it from the cache and return it; otherwise continue.
2. If run_full_textract is enabled, call textractor on the whole document and go to step 4.
3. Otherwise, clip the pages that contain tables and run table extraction on them using textractor.
4. Update the cache with the textractor result and return the result.
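The control flow of the steps above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function name get_textract_result_sketch, the in-memory dict standing in for the S3 cache, the SHA-256 cache key, and the two callables standing in for the Textract calls are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict, List


def get_textract_result_sketch(
    document_bytes: bytes,
    cache: Dict[str, List[dict]],
    run_full_textract: bool,
    analyze_full: Callable[[bytes], List[dict]],
    analyze_table_pages: Callable[[bytes], List[dict]],
) -> List[dict]:
    # Hypothetical cache key: a content hash of the document.
    key = hashlib.sha256(document_bytes).hexdigest()

    # Step 1: on a cache hit, return the cached result.
    if key in cache:
        return cache[key]

    if run_full_textract:
        # Step 2: analyze the whole document.
        result = analyze_full(document_bytes)
    else:
        # Step 3: analyze only the pages that contain tables.
        result = analyze_table_pages(document_bytes)

    # Step 4: update the cache and return the result.
    cache[key] = result
    return result


# Stand-in analyzers that record which path was taken.
calls: List[str] = []


def fake_full(doc: bytes) -> List[dict]:
    calls.append("full")
    return [{"BlockType": "TABLE", "source": "full"}]


def fake_clipped(doc: bytes) -> List[dict]:
    calls.append("clipped")
    return [{"BlockType": "TABLE", "source": "clipped"}]


cache: Dict[str, List[dict]] = {}
first = get_textract_result_sketch(b"%PDF-example", cache, False, fake_full, fake_clipped)
second = get_textract_result_sketch(b"%PDF-example", cache, False, fake_full, fake_clipped)
```

In this sketch the second call is served from the cache, so the stand-in analyzer runs only once for the same document.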

class sycamore.transforms.extract_table.MissingS3UploadPath

Bases: Exception

Raised when an S3 upload path is needed but none was provided.
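As a rough illustration of this pattern, the sketch below validates an upload path before use. The exception class here is a local stand-in defined for the example rather than an import from sycamore, and require_upload_path is a hypothetical helper; the exact condition under which the library raises the exception is not documented here.

```python
class MissingS3UploadPath(Exception):
    """Local stand-in for illustration; not imported from sycamore."""


def require_upload_path(s3_upload_path: str) -> str:
    # Hypothetical helper: reject an empty path before any upload is attempted.
    if not s3_upload_path:
        raise MissingS3UploadPath("an S3 upload path is required but was not provided")
    return s3_upload_path


try:
    require_upload_path("")
    outcome = "ok"
except MissingS3UploadPath as exc:
    outcome = f"failed: {exc}"
```

Callers that catch the exception can then prompt for a path or fall back to a different code path instead of failing midway through an upload.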

class sycamore.transforms.extract_table.TextractTableExtractor(profile_name: str | None = None, region_name: str | None = None, kms_key_id: str = '', s3_upload_root: str = '')

Bases: TableExtractor

TextractTableExtractor utilizes Amazon Textract to extract tables from documents.

This class inherits from TableExtractor. Amazon Textract is a cloud-based AWS service for extracting text and data from documents.

Parameters:

profile_name -- The AWS profile name to use for authentication. Default is None.
region_name -- The AWS region name where the Textract service is available. Default is None.
kms_key_id -- The AWS Key Management Service (KMS) key ID for encryption. Default is an empty string.

Example

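import sycamore
from sycamore.transforms.extract_table import TextractTableExtractor
from sycamore.transforms.partition import ArynPartitioner

table_extractor = TextractTableExtractor(profile_name="my-profile", region_name="us-east-1")

context = sycamore.init()
pdf_docset = (context.read.binary(paths, binary_format="pdf")  # paths: your input PDF location(s)
    .partition(partitioner=ArynPartitioner(), table_extractor=table_extractor))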