Using pdf2image in a Code Repository on Palantir Foundry

Question

I am trying to use the pdf2image library in a Code Repository on Palantir Foundry and am getting the following error:

pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

when using the function convert_from_bytes.

Does anyone know how to reference the poppler path and get rid of this error?

Thanks!

Here is the code:

import pytesseract
from pdf2image import convert_from_bytes


def extract_pdf_text(input_bytes, language='eng', dpi=200):
    # Render each PDF page to an image (pdf2image calls poppler for this)
    pages = convert_from_bytes(input_bytes, dpi=dpi)
    # OCR every page and concatenate the recognized text
    return ''.join(pytesseract.image_to_string(page, lang=language) for page in pages)

And the meta.yaml for reference:

# If you need to modify the runtime requirements for your package,
# update the 'requirements.run' section in this file

package:
  name: "{{ PACKAGE_NAME }}"
  version: "{{ PACKAGE_VERSION }}"

source:
  path: ../src

requirements:
  # Tools required to build the package. These packages are run on the build system and include
  # things such as revision control systems (Git, SVN) make tools (GNU make, Autotool, CMake) and
  # compilers (real cross, pseudo-cross, or native when not cross-compiling), and any source pre-processors.
  # https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#build
  build:
    - python 3.8.*
    - setuptools

  # Packages required to run the package. These are the dependencies that are installed automatically
  # whenever the package is installed.
  # https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#run
  run:
    - python 3.8.*
    - transforms {{ PYTHON_TRANSFORMS_VERSION }}
    - transforms-expectations
    - transforms-verbs
    - pytesseract
    - pdfplumber
    - googletrans
    - regex
    - pdf2image
    - langdetect
    - pandas
    - numpy
    - selenium
    - requests
    - pypdf2
    - poppler

build:
  script: python setup.py install --single-version-externally-managed --record=record.txt

Answer 1

I found the problem when inspecting the CI checks. They failed before poppler was pulled. After I cleaned up meta.yaml and the checks succeeded, everything seems to work fine.
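For anyone hitting the same error, a quick way to confirm whether poppler actually made it into the environment is to look for its pdfinfo binary, which is what pdf2image calls to get the page count. The following is only a minimal sketch: find_poppler_dir and env_prefix are hypothetical names, and it assumes a conda-style layout where binaries live under <prefix>/bin and CONDA_PREFIX is set, which may not hold in every Foundry environment.

```python
import os
import shutil


def find_poppler_dir(env_prefix=None):
    """Return the directory that contains poppler's pdfinfo binary, or None."""
    # First look on PATH, which is where pdf2image searches by default.
    found = shutil.which("pdfinfo")
    if found:
        return os.path.dirname(found)
    # Assumption: a conda environment keeps its binaries under <prefix>/bin.
    prefix = env_prefix or os.environ.get("CONDA_PREFIX")
    if prefix:
        candidate = os.path.join(prefix, "bin")
        if shutil.which("pdfinfo", path=candidate):
            return candidate
    return None
```

If this returns None, poppler was probably never installed (for example because the checks failed before the environment was resolved). If it returns a directory that is not on PATH, that directory can be passed to pdf2image as convert_from_bytes(input_bytes, dpi=dpi, poppler_path=...).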

Answered 2022-03-28 11:40:10
