Extract Table
- class sycamore.transforms.extract_table.CachedTextractTableExtractor(s3_cache_location, run_full_textract: bool = False, s3_textract_upload_path: str = '', profile_name: str | None = None, region_name: str | None = None, kms_key_id: str = '')
Bases:
TextractTableExtractor
Extends TextractTableExtractor with S3-based cache support for raw Textract results.
- CachedTextractTableExtractor overrides the 'get_textract_result' method by doing the following:
1. If there is a cache hit for the current document, get the result from the cache and return it; otherwise continue.
2. If run_full_textract is enabled, call textractor on the whole document and go to step 4.
3. Otherwise, clip the pages that contain tables and run table extraction on them using textractor.
4. Update the cache based on the textractor result and return the result.
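The control flow described above can be sketched in plain Python. This is a minimal illustration, not Sycamore's implementation: the function name, the dict standing in for the S3 cache, and the three injected callables are all hypothetical, chosen so the sketch runs without AWS access.

```python
from typing import Callable, Dict, List


def get_result_with_cache(
    doc_key: str,
    cache: Dict[str, dict],
    run_full_textract: bool,
    analyze_whole_document: Callable[[str], dict],
    find_table_pages: Callable[[str], List[int]],
    analyze_pages: Callable[[str, List[int]], dict],
) -> dict:
    """Hypothetical cache-then-extract flow; `cache` stands in for the S3 cache."""
    # Cache hit: return the stored raw result without calling the service.
    if doc_key in cache:
        return cache[doc_key]

    if run_full_textract:
        # Full mode: analyze every page of the document.
        result = analyze_whole_document(doc_key)
    else:
        # Clipped mode: analyze only the pages that contain tables.
        pages = find_table_pages(doc_key)
        result = analyze_pages(doc_key, pages)

    # Store the fresh result so the next call for this document is a cache hit.
    cache[doc_key] = result
    return result
```

With this shape, a second call for the same document returns the cached value and never reaches the extraction callables, which is the behaviour the cache is meant to provide.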
- class sycamore.transforms.extract_table.MissingS3UploadPath
Bases:
Exception
Raised when an S3 upload path is needed but none was provided.
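As a rough illustration of when such an exception could be raised, the sketch below resolves the S3 location of a document. It rests on assumptions: the exception class here is a local stand-in rather than the Sycamore class, `resolve_s3_path` is a hypothetical helper, and the rule that only non-S3 documents need an upload path is assumed rather than taken from the library.

```python
class MissingS3UploadPath(Exception):
    """Local stand-in for sycamore.transforms.extract_table.MissingS3UploadPath."""


def resolve_s3_path(document_path: str, s3_upload_root: str = "") -> str:
    """Hypothetical helper: return the S3 location a document would be read from."""
    # Assumption: documents already in S3 need no upload path.
    if document_path.startswith("s3://"):
        return document_path

    # A local document would have to be uploaded, so an upload root is required.
    if not s3_upload_root:
        raise MissingS3UploadPath(
            f"no S3 upload path configured for {document_path!r}"
        )

    # Keep only the file name and place it under the upload root.
    filename = document_path.replace("\\", "/").rsplit("/", 1)[-1]
    return f"{s3_upload_root.rstrip('/')}/{filename}"
```

Callers can catch the exception to report a configuration problem early, before any request is sent to AWS.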
- class sycamore.transforms.extract_table.TextractTableExtractor(profile_name: str | None = None, region_name: str | None = None, kms_key_id: str = '', s3_upload_root: str = '')
Bases:
TableExtractor
TextractTableExtractor utilizes Amazon Textract to extract tables from documents.
This class inherits from TableExtractor. Amazon Textract is a cloud-based document text and data extraction service from AWS.
- Parameters:
profile_name -- The AWS profile name to use for authentication. Default is None.
region_name -- The AWS region name where the Textract service is available. Default is None.
kms_key_id -- The AWS Key Management Service (KMS) key ID for encryption. Default is an empty string.
Example
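import sycamore
from sycamore.transforms.extract_table import TextractTableExtractor
from sycamore.transforms.partition import UnstructuredPdfPartitioner

table_extractor = TextractTableExtractor(profile_name="my-profile", region_name="us-east-1")
context = sycamore.init()
# `paths` is the location of the input PDFs, defined by the caller.
pdf_docset = (
    context.read.binary(paths, binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner(), table_extractor=table_extractor)
)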