How can I run pytesseract / tesseract in Foundry Code Repositories?

I am trying to use the function image_to_string from the pytesseract library in a repository to run OCR on PDFs. However, I am getting the following error:

[Screenshot of the error message, not included]

Judging from the checks, I would assume the library was loaded correctly:

[Screenshot of the checks, not included]

Does anyone have an idea how to troubleshoot this?

It seems that Foundry does not run the environment activation script https://github.com/conda-forge/tesseract-feedstock/blob/main/recipe/activate.sh, which would otherwise set the TESSDATA_PREFIX environment variable automatically. However, we can infer the value manually and pass it to the pytesseract API calls.

Define the following helper function:

import os
from pathlib import Path


def _get_tessdata_directory_path():
    """Return <env_root>/share/tessdata as a string."""
    if 'PYSPARK_PYTHON' in os.environ:
        # PYSPARK_PYTHON is assumed to be <env_root>/bin/python, so go up two levels.
        env_root = Path(os.environ['PYSPARK_PYTHON']).parent.parent
    elif 'CONDA_PREFIX' in os.environ:
        env_root = Path(os.environ['CONDA_PREFIX'])
    else:
        raise ValueError('Neither PYSPARK_PYTHON nor CONDA_PREFIX is set.')
    share_dir = env_root / 'share' / 'tessdata'
    # Explicit check instead of assert, which is skipped when Python runs with -O.
    if not share_dir.exists():
        raise FileNotFoundError(f'tessdata directory does not exist: {share_dir}')
    return str(share_dir)

and use it as shown in the following snippet:

tessdata_dir_config = f'--tessdata-dir "{_get_tessdata_directory_path()}"'
pytesseract.image_to_string(image, ..., config=tessdata_dir_config)
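
As a possible alternative to passing --tessdata-dir on every call, one could set TESSDATA_PREFIX once for the current process, which is roughly what the activation script would have done. The sketch below is only an illustration under assumptions: the function names find_tessdata_dir and export_tessdata_prefix are hypothetical, the directory layout <env_root>/share/tessdata is taken from the helper above, and it assumes a Tesseract version that expects TESSDATA_PREFIX to point at the tessdata directory itself (older versions may expect its parent).

```python
import os
from pathlib import Path


def find_tessdata_dir(environ=os.environ):
    """Return <env_root>/share/tessdata as a Path, or None if it cannot be located."""
    if "PYSPARK_PYTHON" in environ:
        # Assumed layout: <env_root>/bin/python, so the env root is two levels up.
        env_root = Path(environ["PYSPARK_PYTHON"]).parent.parent
    elif "CONDA_PREFIX" in environ:
        env_root = Path(environ["CONDA_PREFIX"])
    else:
        return None
    tessdata = env_root / "share" / "tessdata"
    return tessdata if tessdata.is_dir() else None


def export_tessdata_prefix(environ=os.environ):
    """Set TESSDATA_PREFIX in `environ` if it is missing; return the value in effect."""
    if "TESSDATA_PREFIX" not in environ:
        tessdata = find_tessdata_dir(environ)
        if tessdata is None:
            raise FileNotFoundError("could not locate <env_root>/share/tessdata")
        # Assumption: TESSDATA_PREFIX should name the tessdata directory itself.
        environ["TESSDATA_PREFIX"] = str(tessdata)
    return environ["TESSDATA_PREFIX"]
```

Calling export_tessdata_prefix() once before the first OCR call should be enough, since the tesseract binary is started as a child process and inherits the environment of the Python process. An already present TESSDATA_PREFIX is left untouched, so the sketch does not override a correctly activated environment.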


