partition

aryn_sdk.partition.convert_image_element(elem: dict, format: str = 'PIL', b64encode: bool = False) -> Image | bytes | str | None

Convert an image element to a more usable format. If no format is specified, return a PIL Image object. If a format is specified, return the bytes of the image in that format. If b64encode is set to True, base64-encode the bytes and return them as a string.

Parameters:
  • elem -- an image element from the 'elements' field of a partition_file response

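  • format -- optional image format for the output bytes, for example 'JPEG' or 'PNG'. default: 'PIL', which returns a PIL Image object instead of bytes

  • b64encode -- base64-encode the output bytes and return them as a string. format must be set to use this. default: False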

Example

from aryn_sdk.partition import partition_file, convert_image_element

with open("my-favorite-pdf.pdf", "rb") as f:
    data = partition_file(
        f,
        extract_images=True
    )
# Keep only the image elements; this example assumes at least three were found
image_elts = [e for e in data['elements'] if e['type'] == 'Image']

pil_img = convert_image_element(image_elts[0])
jpg_bytes = convert_image_element(image_elts[1], format='JPEG')
png_str = convert_image_element(image_elts[2], format="PNG", b64encode=True)
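
With b64encode=True the return value is a base64 string rather than raw bytes. The following is a minimal sketch of turning such a string back into bytes with the standard library. The png_str value is a hand-made stand-in (the 8-byte PNG signature), not real output from the SDK, and decode_image_string is a hypothetical helper name.

```python
import base64

# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_image_string(encoded: str) -> bytes:
    """Decode a base64 string back into the raw image bytes."""
    return base64.b64decode(encoded)


# Stand-in for a value returned with format="PNG", b64encode=True.
png_str = base64.b64encode(PNG_SIGNATURE).decode("ascii")

png_bytes = decode_image_string(png_str)
print(png_bytes.startswith(PNG_SIGNATURE))
```

The decoded bytes can then be written to a file opened in binary mode.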

aryn_sdk.partition.draw_with_boxes(pdf_file: PathLike | BinaryIO | str, partitioning_data: dict, draw_table_cells: bool = False) -> list[Image]

Create a list of images from the provided pdf, one for each page, with bounding boxes detected by the partitioner drawn on.

Parameters:
  • pdf_file -- an open file or path to a pdf file upon which to draw

  • partitioning_data -- the output from aryn_sdk.partition.partition_file

  • draw_table_cells -- whether to draw individually detected cells of tables. default: False

Returns:

a list of images of pages of the pdf, each with bounding boxes drawn on

Example

from aryn_sdk.partition import partition_file, draw_with_boxes

with open("my-favorite-pdf.pdf", "rb") as f:
    data = partition_file(
        f,
        aryn_api_key="MY-ARYN-TOKEN",
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
pages = draw_with_boxes("my-favorite-pdf.pdf", data, draw_table_cells=True)
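
Since draw_with_boxes returns one image per page, a common next step is writing them to disk. This is a sketch only: save_pages, the output directory, and the file naming scheme are hypothetical, and the code assumes each returned item is a PIL Image exposing the standard save(path, format=...) method.

```python
from pathlib import Path


def save_pages(pages, out_dir, prefix="page"):
    """Save each page image as a PNG and return the written paths.

    Assumes every item in `pages` has PIL's Image.save(path, format=...).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    # Number files from 1 so they line up with PDF page numbers.
    for number, page in enumerate(pages, start=1):
        path = out / f"{prefix}-{number:03d}.png"
        page.save(path, format="PNG")
        paths.append(path)
    return paths
```

For example, save_pages(pages, "annotated") would write annotated/page-001.png, annotated/page-002.png, and so on.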

aryn_sdk.partition.partition_file(file: BinaryIO | str | PathLike, aryn_api_key: str | None = None, aryn_config: ArynConfig | None = None, threshold: float | Literal['auto'] | None = None, use_ocr: bool = False, ocr_images: bool = False, extract_table_structure: bool = False, extract_images: bool = False, selected_pages: list[list[int] | int] | None = None, aps_url: str = 'https://api.aryn.cloud/v1/document/partition', ssl_verify: bool = True, output_format: str | None = None) -> dict

Sends the file to the Aryn Partitioning Service and returns a dict of its document structure and text.

Parameters:
  • file -- pdf file to partition

  • aryn_api_key -- aryn api key, provided as a string

  • aryn_config -- ArynConfig object, used for finding an api key. If aryn_api_key is set it will override this. default: The default ArynConfig looks in the env var ARYN_API_KEY and the file ~/.aryn/config.yaml

  • threshold -- cutoff for detecting bounding boxes. Must be set to "auto" or a floating point value between 0.0 and 1.0. default: None (APS will choose)

  • use_ocr -- extract text using an OCR model instead of extracting embedded text in PDF. default: False

  • ocr_images -- attempt to use OCR to generate a text representation of detected images. default: False

  • extract_table_structure -- extract tables and their structural content. default: False

  • extract_images -- extract image contents. default: False

  • selected_pages -- list of individual pages (1-indexed) from the pdf to partition. Per the type signature, each entry is either an int or a list of ints. default: None

  • aps_url -- url of the Aryn Partitioning Service endpoint. default: "https://api.aryn.cloud/v1/document/partition"

  • ssl_verify -- verify ssl certificates. In Databricks, set this to False to fix ssl incompatibilities. default: True

  • output_format -- controls output representation; can be set to markdown. default: None (JSON elements)

Returns:

A dictionary containing "status" and "elements". If output_format is set to markdown, the dictionary contains "status" and "markdown" instead.
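
The threshold rule in the parameter list ("auto" or a float between 0.0 and 1.0) can be checked on the client before a request is sent. The sketch below is illustrative only: check_threshold is a hypothetical helper, not part of aryn_sdk, and it assumes the bounds 0.0 and 1.0 are inclusive, which the text above does not state.

```python
def check_threshold(threshold):
    """Validate a threshold value before passing it to partition_file.

    Accepts None, the string "auto", or a number in [0.0, 1.0].
    Inclusive bounds are an assumption of this sketch.
    """
    if threshold is None or threshold == "auto":
        return threshold
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        raise TypeError('threshold must be None, "auto", or a number')
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return float(threshold)
```

Failing early this way avoids uploading a file only to have the request rejected for a malformed option.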

Example

from aryn_sdk.partition import partition_file

with open("my-favorite-pdf.pdf", "rb") as f:
    data = partition_file(
        f,
        aryn_api_key="MY-ARYN-TOKEN",
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
elements = data['elements']
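
Because the response shape depends on output_format, code that consumes it may need to branch on which key is present. The sketch below uses hand-written stand-in dictionaries rather than real service output: the "status" values are placeholders, and summarize_response is a hypothetical helper. Only the "elements", "markdown", and per-element "type" keys come from the description above.

```python
from collections import Counter


def summarize_response(data: dict) -> dict:
    """Summarize a partitioning response according to its shape."""
    if "markdown" in data:
        # Markdown output: report only the length of the text.
        return {"kind": "markdown", "length": len(data["markdown"])}
    # JSON output: count the elements by their 'type' field.
    counts = Counter(e["type"] for e in data.get("elements", []))
    return {"kind": "elements", "counts": dict(counts)}


# Stand-in responses; the "status" values are placeholders.
json_like = {
    "status": [],
    "elements": [{"type": "table"}, {"type": "Image"}, {"type": "table"}],
}
markdown_like = {"status": [], "markdown": "# Title\n"}

print(summarize_response(json_like))
print(summarize_response(markdown_like))
```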

aryn_sdk.partition.table_elem_to_dataframe(elem: dict) -> DataFrame | None

Create a pandas DataFrame representing the tabular data inside the provided table element. If the element is not of type 'table' or doesn't contain any table data, return None instead.

Parameters:

elem -- An element from the 'elements' field of a partition_file response.

Example

from aryn_sdk.partition import partition_file, table_elem_to_dataframe

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )

# Find the first table and convert it to a dataframe
df = None
for element in data['elements']:
    if element['type'] == 'table':
        df = table_elem_to_dataframe(element)
        break

aryn_sdk.partition.tables_to_pandas(data: dict) -> list[tuple[dict, DataFrame | None]]

For every table element in the provided partitioning response, create a pandas DataFrame representing the tabular data. Return a list containing all the elements, with tables paired with their corresponding DataFrames.

Parameters:

data -- a response from partition_file

Example

from aryn_sdk.partition import partition_file, tables_to_pandas

with open("my-favorite-pdf.pdf", "rb") as f:
    data = partition_file(
        f,
        aryn_api_key="MY-ARYN-TOKEN",
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
elts_and_dataframes = tables_to_pandas(data)
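
The return type pairs every element with either a DataFrame or None, so downstream code usually separates the two cases. A minimal sketch follows, using stand-in data in which a plain string takes the place of each DataFrame so the example runs without pandas; split_tables is a hypothetical helper name.

```python
def split_tables(elts_and_dataframes):
    """Separate entries that carry a DataFrame from those that do not."""
    tables = [(elt, df) for elt, df in elts_and_dataframes if df is not None]
    others = [elt for elt, df in elts_and_dataframes if df is None]
    return tables, others


# Stand-in for the list returned by tables_to_pandas; the strings
# stand in for DataFrame objects.
pairs = [
    ({"type": "Text"}, None),
    ({"type": "table"}, "dataframe-1"),
    ({"type": "table"}, "dataframe-2"),
]

tables, others = split_tables(pairs)
print(len(tables), len(others))
```

Checking with `is not None` rather than truthiness matters with real data, since evaluating a pandas DataFrame in a boolean context raises a ValueError.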