Using PDF2Image in Code Repository on Palantir Foundry

I am trying to use the pdf2image library in a Code Repository on Palantir Foundry and am getting this error:

pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

It is raised when calling the function convert_from_bytes.

Does anyone know how to reference the poppler path and get rid of this error?

Thanks!

Here is the code:

from pdf2image import convert_from_bytes
import pytesseract


def extract_pdf_text(input_bytes, language='eng', dpi=200):
    # Rasterize each PDF page to an image (pdf2image calls poppler's command-line tools).
    pages = convert_from_bytes(input_bytes, dpi=dpi)
    # OCR every page and concatenate the text in page order.
    return ''.join(pytesseract.image_to_string(page, lang=language) for page in pages)

And the meta.yaml for reference:

# If you need to modify the runtime requirements for your package,
# update the 'requirements.run' section in this file

package:
  name: "{{ PACKAGE_NAME }}"
  version: "{{ PACKAGE_VERSION }}"

source:
  path: ../src

requirements:
  # Tools required to build the package. These packages are run on the build system and include
  # things such as revision control systems (Git, SVN) make tools (GNU make, Autotool, CMake) and
  # compilers (real cross, pseudo-cross, or native when not cross-compiling), and any source pre-processors.
  # https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#build
  build:
    - python 3.8.*
    - setuptools

  # Packages required to run the package. These are the dependencies that are installed automatically
  # whenever the package is installed.
  # https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#run
  run:
    - python 3.8.*
    - transforms {{ PYTHON_TRANSFORMS_VERSION }}
    - transforms-expectations
    - transforms-verbs
    - pytesseract
    - pdfplumber
    - googletrans
    - regex
    - pdf2image
    - langdetect
    - pandas
    - numpy
    - selenium
    - requests
    - pypdf2
    - poppler

build:
  script: python setup.py install --single-version-externally-managed --record=record.txt

I found the problem when inspecting the CI checks. They failed before poppler was pulled. After I cleaned up meta.yaml and the checks succeeded, everything seems to work fine.
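
For anyone who still sees the error after the checks pass, here is a minimal sketch of how one might confirm whether poppler's pdfinfo binary is visible from Python. The helper name find_poppler_bin_dir and its extra_dirs parameter are made up for illustration, and the fallback to the bin directory under sys.prefix is an assumption about where a conda environment puts its executables; I have not verified that layout on Foundry.

```python
import os
import shutil
import sys


def find_poppler_bin_dir(extra_dirs=()):
    """Return the directory that contains `pdfinfo`, or None if it is not found.

    Hypothetical helper for diagnosing PDFInfoNotInstalledError.
    """
    # First check the regular PATH, which is what pdf2image relies on by default.
    on_path = shutil.which("pdfinfo")
    if on_path:
        return os.path.dirname(on_path)
    # Assumption: conda-installed binaries live in <environment prefix>/bin.
    candidates = list(extra_dirs) + [os.path.join(sys.prefix, "bin")]
    for directory in candidates:
        if shutil.which("pdfinfo", path=directory):
            return directory
    return None
```

If the helper returns a directory, it can be passed to pdf2image through the poppler_path argument, for example convert_from_bytes(input_bytes, dpi=200, poppler_path=find_poppler_bin_dir()). If it returns None, poppler most likely never made it into the environment, which matches what the failing checks were hiding in my case.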
