
How can I run pytesseract / tesseract in Foundry Code Repositories?

I am trying to use the function image_to_string from the library pytesseract in a repository to perform OCR of PDFs. However, I am getting the following error:

[Image: enter image description here]

From the checks I would assume the library was loaded correctly:

[Image: enter image description here]

Does anyone have an idea how to troubleshoot this?

It seems that Foundry does not run the environment activation script https://github.com/conda-forge/tesseract-feedstock/blob/main/recipe/activate.sh, which would normally set the TESSDATA_PREFIX environment variable automatically. However, we can infer the value manually and pass it to the pytesseract API calls.
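One way to check this suspicion is to look at the environment of the running Python process. The following is a minimal diagnostic sketch, not part of the original answer; the helper name tessdata_prefix_status is hypothetical, and it only reports what the current process can see.

```python
import os


def tessdata_prefix_status(environ=os.environ):
    """Report whether TESSDATA_PREFIX is visible to the current process."""
    value = environ.get("TESSDATA_PREFIX")
    if value is None:
        # The activation script did not run, or its export did not reach us.
        return "TESSDATA_PREFIX is not set"
    if not os.path.isdir(value):
        return f"TESSDATA_PREFIX is set to {value!r}, but that directory does not exist"
    return f"TESSDATA_PREFIX is set to {value!r}"


print(tessdata_prefix_status())
```

If this reports that the variable is not set, the activation script most likely did not run in that process.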

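To infer the path, define the following helper function:

import os
from pathlib import Path

def _get_tessdata_directory_path():
    """Return the tessdata directory of the active conda environment as a string."""
    if 'PYSPARK_PYTHON' in os.environ:
        # PYSPARK_PYTHON is assumed to point at <envroot>/bin/python, so go up two levels.
        pyspark_python = Path(os.environ['PYSPARK_PYTHON'])
        env_root = pyspark_python.parent.parent
    elif 'CONDA_PREFIX' in os.environ:
        # CONDA_PREFIX is the environment root itself.
        env_root = Path(os.environ['CONDA_PREFIX'])
    else:
        raise ValueError('Neither PYSPARK_PYTHON nor CONDA_PREFIX is set.')
    share_dir = env_root / 'share' / 'tessdata'
    # Raise explicitly: an assert would be skipped when Python runs with -O.
    if not share_dir.exists():
        raise FileNotFoundError(f'tessdata directory does not exist: {share_dir}')
    return str(share_dir)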

and use it as shown in the following snippet:

import pytesseract

# `...` stands for any other arguments you pass to image_to_string.
tessdata_dir_config = f'--tessdata-dir "{_get_tessdata_directory_path()}"'
pytesseract.image_to_string(image, ..., config=tessdata_dir_config)
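As an alternative to passing --tessdata-dir on every call, it may be enough to set TESSDATA_PREFIX in the Python process yourself, which is what the activation script would normally do. The following is a minimal sketch under some assumptions: export_tessdata_prefix is a hypothetical helper name, the tesseract version is assumed to expect TESSDATA_PREFIX to point at the tessdata directory itself (Tesseract 4 or later), and the variable must be set in the same process that calls pytesseract (for example inside the function that runs on a Spark executor).

```python
import os
from pathlib import Path


def export_tessdata_prefix(tessdata_dir, environ=os.environ):
    """Set TESSDATA_PREFIX to tessdata_dir unless it is already set.

    Returns the value that is in effect afterwards.
    """
    path = Path(tessdata_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"tessdata directory not found: {path}")
    # setdefault keeps a value that an activation script may already have set.
    return environ.setdefault("TESSDATA_PREFIX", str(path))
```

Calling export_tessdata_prefix(_get_tessdata_directory_path()) once before the first pytesseract call should then make the config argument unnecessary, since the tesseract subprocess inherits the environment of the Python process.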
