How to convert a PDF to grayscale in Python

Question

Is it possible to convert a PDF file to its grayscale equivalent using a Python library? I tried the ghostscript module:

import locale
from io import BytesIO
import ghostscript as gs

ENCO = locale.getpreferredencoding()
STDOUT = BytesIO()
STDERR = BytesIO()

with open('adob_in.pdf', 'r') as infile:
    ARGS = f"""DUMMY -sOutputFile=adob_out.pdf -sDEVICE=pdfwrite
     -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray
     -dNOPAUSE -dBATCH {infile.name}"""

    ARGSB = [arg.encode(ENCO) for arg in ARGS.split()]

    gs.Ghostscript(*ARGSB, stdout=STDOUT, stderr=STDERR)

print(STDOUT.getvalue().decode(ENCO))
print(STDERR.getvalue().decode(ENCO))

The standard output and error streams are:

GPL Ghostscript 9.52 (2020-03-19)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1

Unfortunately, the grayscale PDF is corrupted. In fact, debugging it with Ghostscript shows the following errors:

GPL Ghostscript 9.52 (2020-03-19)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
   **** Error: Cannot find a 'startxref' anywhere in the file.
               Output may be incorrect.
   **** Error:  An error occurred while reading an XREF table.
   **** The file has been damaged.  This may have been caused
   **** by a problem while converting or transfering the file.
   **** Ghostscript will attempt to recover the data.
   **** However, the output may be incorrect.
   **** Error:  Trailer dictionary not found.
                Output may be incorrect.
   No pages will be processed (FirstPage > LastPage).

   **** This file had errors that were repaired or ignored.
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

   **** The rendered output from this file may be incorrect.
GS>

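Note that the string ARGS contains valid Ghostscript arguments (tested on the Linux command line with GPL Ghostscript 9.52), and ARGSB is just the corresponding byte-string representation of that string: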

print(ARGSB)
[b'DUMMY', b'-sOutputFile=adob_out.pdf', b'-sDEVICE=pdfwrite', b'-sColorConversionStrategy=Gray', b'-dProcessColorModel=/DeviceGray', b'-dNOPAUSE', b'-dBATCH', b'adob_in.pdf']

How can I do this correctly? My sample input and output files can be found here. Thank you very much in advance.

Answer 1

I don't know how to do this with ghostscript, but the following code, which uses pdf2image and img2pdf, does the job:

from os.path import join
from tempfile import TemporaryDirectory
from pdf2image import convert_from_path # https://pypi.org/project/pdf2image/
from img2pdf import convert # https://pypi.org/project/img2pdf/

with TemporaryDirectory() as temp_dir: # Saves images temporarily on disk rather than in RAM to speed up parsing
    # Converting pages to images
    print("Parsing pages to grayscale images. This may take a while")
    images = convert_from_path(
        "your_pdf_path.pdf",
        output_folder=temp_dir,
        grayscale=True,
        fmt="jpeg",
        thread_count=4
    )

    image_list = list()
    for page_number in range(1, len(images) + 1):
        path = join(temp_dir, "page_" + str(page_number) + ".jpeg")
        image_list.append(path)
        images[page_number-1].save(path, "JPEG") # (page_number - 1) because index starts from 0

    with open("Gray_PDF.pdf", "bw") as gray_pdf:
        gray_pdf.write(convert(image_list))

    print("The new PDF is saved as Gray_PDF.pdf in the current directory.")

The PDF file containing the grayscale images is saved as Gray_PDF.pdf in the current working directory.

Explanation: the following code:

with TemporaryDirectory() as temp_dir: # Saves images temporarily on disk rather than in RAM. This speeds up parsing
    # Converting pages to images
    print("Parsing pages to grayscale images. This may take a while")
    images = convert_from_path(
        "your_pdf_path.pdf",
        output_folder=temp_dir,
        grayscale=True,
        fmt="jpeg",
        thread_count=4
    )

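performs the following tasks:

Converts the PDF pages to grayscale images.
Stores them temporarily in a directory.
Creates a list, images, of PIL image objects.

Next, the code below: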

    image_list = list()
    for page_number in range(1, len(images) + 1):
        path = join(temp_dir, "page_" + str(page_number) + ".jpeg")
        image_list.append(path)
        images[page_number-1].save(path, "JPEG") # (page_number - 1) because index starts from 0

saves the images again as page_1.jpeg, page_2.jpeg, and so on in the same directory. It also builds a list of the paths to these new images.

Finally, the code below:
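The naming scheme in that loop can be sketched on its own with the standard library. The helper name page_paths and its width parameter are hypothetical and not part of the code above; width zero-pads the page numbers so that a plain sorted() of the file names keeps page order, which matters only if the paths are ever collected by listing the directory rather than built in order as they are here.

```python
from os.path import join


def page_paths(directory, page_count, width=0):
    """Return the paths page_1.jpeg ... page_N.jpeg inside `directory`.

    `width` (hypothetical extra) zero-pads the page number to that many
    digits; the default of 0 reproduces the unpadded names used above.
    """
    return [
        join(directory, "page_" + str(number).zfill(width) + ".jpeg")
        for number in range(1, page_count + 1)
    ]
```

For example, page_paths(temp_dir, len(images)) yields the same paths, in the same order, as the image_list built in the loop.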

    with open("Gray_PDF.pdf", "bw") as gray_pdf:
        gray_pdf.write(convert(image_list))

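creates a PDF named Gray_PDF from the grayscale images created earlier and saves it in the working directory.

Additional tip: this approach gives you a lot of flexibility if you want to perform further image processing with OpenCV, because every page is now an image. Just make sure all of those operations happen inside the first with statement, i.e. the following: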

with TemporaryDirectory() as temp_dir: # Saves images temporarily on disk rather than in RAM. This speeds up parsing
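For completeness: the question states that the same arguments work on the Linux command line, so another option may be to run the Ghostscript executable as a separate process instead of going through the Python binding. The sketch below is untested against the sample files. The flags are copied from the question. The function names build_gs_args and pdf_to_gray are hypothetical, and "gs" as the executable name is an assumption (Windows installs typically use a different name, such as gswin64c).

```python
import shutil
import subprocess


def build_gs_args(src, dst, executable="gs"):
    """Return the Ghostscript command line for a grayscale conversion.

    The flags are the ones from the question; the executable name "gs"
    is an assumption about the local installation.
    """
    return [
        executable,
        f"-sOutputFile={dst}",
        "-sDEVICE=pdfwrite",
        "-sColorConversionStrategy=Gray",
        "-dProcessColorModel=/DeviceGray",
        "-dNOPAUSE",
        "-dBATCH",
        src,
    ]


def pdf_to_gray(src, dst, executable="gs"):
    """Run Ghostscript as a child process and return the CompletedProcess."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} was not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    # subprocess.run waits for the process to finish before returning.
    return subprocess.run(
        build_gs_args(src, dst, executable),
        check=True,
        capture_output=True,
        text=True,
    )
```

It would be called as result = pdf_to_gray("adob_in.pdf", "adob_out.pdf"), after which result.stdout and result.stderr hold the Ghostscript messages. Unlike the image-based approach above, this does not rasterize the pages.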

Solution 1
1 accepted 2021-01-16 23:49:06
