Batch processing multipage.pdf from a folder of single_files.tiff

Question

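I want to convert the single-page .tif files in my_folder into multi-page .pdf files in pdf_folder. TIFFs that have no subsequent pages should also be converted, to single-page PDFs. Ultimately, I want text PDFs created by running OCR over multiple image-based TIFF files.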

So I assume the .tif files that belong together have to be grouped by their filename pattern:

Drs_1_00109_1_ADS.tif
Drs_1_00099_1_ADS_000.tif
Drs_1_00099_1_ADS_001.tif
Drs_1_00099_1_ADS_002.tif
Drs_1_00186_1_ADS.tif
Drs_1_00192_1_ADS_000.tif
Drs_1_00192_1_ADS_001.tif

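For example, from Drs_1_00192_1_ADS_000.tif and Drs_1_00192_1_ADS_001.tif (two single-page images) I want to create a 2-page Drs_1_00192_1_ADS.pdf that contains the text data of both images. The code below works for creating a single PDF. How can I make the multi-page pattern in the filenames work for this job? I would prefer to do it with pytesseract or cv2, because I want to set some configuration parameters and preprocess the image orientation. I have not found any CLI solution for creating multi-page PDFs.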

import os
import img2pdf

my_folder = "/path/to/images"
images = []
for fname in os.listdir(my_folder):
    # Skip anything that is not a .tif file
    if not fname.endswith(".tif"):
        continue
    path = os.path.join(my_folder, fname)
    if os.path.isdir(path):
        continue
    images.append(path)
# Write all collected images into one PDF
with open("name.pdf", "wb") as f:
    f.write(img2pdf.convert(images))

Thanks!

Answer 1

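I would do this by globbing all files that end in 000.tif, which is presumably the start of a multi-page document, and then appending the files that result from incrementing the suffix until a file is missing.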

#!/usr/bin/env python3

import os
from PIL import Image
from glob import glob

# Iterate over all files ending in '000.tif' and find their friends (subsequent pages)
for filename in glob('*_000.tif'):
   # Work out stem of filename
   stem = filename.replace('_000.tif', '')
   print(f'DEBUG: stem={stem}')

   # Build list of images to be put in this PDF
   images = [Image.open(filename)]
   index = 1
   while True:
      this = f'{stem}_{index:03d}.tif'
      print(f'DEBUG: this={this}')
      if os.path.isfile(this):
         images.append(Image.open(this))
         index += 1
      else:
         break
   output = stem + '.pdf'
   print(f'DEBUG: Saving {len(images)} pages to {output}')
   images[0].save(output, save_all=True, append_images=images[1:])

Sample output

DEBUG: stem=Drs_1_00192_1_ADS
DEBUG: this=Drs_1_00192_1_ADS_001.tif
DEBUG: this=Drs_1_00192_1_ADS_002.tif
DEBUG: this=Drs_1_00192_1_ADS_003.tif
DEBUG: this=Drs_1_00192_1_ADS_004.tif
DEBUG: Saving 4 pages to Drs_1_00192_1_ADS.pdf
DEBUG: stem=Drs_1_00099_1_ADS
DEBUG: this=Drs_1_00099_1_ADS_001.tif
DEBUG: this=Drs_1_00099_1_ADS_002.tif
DEBUG: this=Drs_1_00099_1_ADS_003.tif
DEBUG: Saving 3 pages to Drs_1_00099_1_ADS.pdf
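
The loop above only picks up documents that start with a _000 page, so a file such as Drs_1_00109_1_ADS.tif, which has no page suffix, is not converted. One way to cover both cases is to group the filenames by stem first. The following is a minimal sketch, not part of the original answer: the helper name group_pages and the regular expression are assumptions, and the pattern assumes that a page suffix is always exactly three digits and that the extension is a lowercase .tif. A file whose real name happens to end in _NNN would be treated as a page.

```python
import re
from collections import defaultdict

# Assumed naming scheme: an optional "_NNN" page suffix directly before ".tif".
PAGE_RE = re.compile(r'^(?P<stem>.+?)(?:_(?P<page>\d{3}))?\.tif$')

def group_pages(filenames):
    """Map each document stem to its page files, sorted by page number."""
    groups = defaultdict(list)
    for name in filenames:
        match = PAGE_RE.match(name)
        if match is None:
            continue  # not a .tif file, ignore it
        # Files without a suffix count as page 0 of a one-page document
        page = int(match.group('page') or 0)
        groups[match.group('stem')].append((page, name))
    # Sort each group by page number and keep only the filenames
    return {stem: [name for _, name in sorted(pages)]
            for stem, pages in groups.items()}
```

Each list returned by group_pages can then be opened with Image.open and written with save(..., save_all=True, append_images=...) in the same way as in the loop above, using the stem as the PDF name.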

Note that you can easily read the files with OpenCV instead by replacing:

image = Image.open(filename)

with

image = cv2.imread(filename)

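However, you cannot write PDFs with OpenCV as simply as you can with PIL, so I only used PIL. If you remember that PIL uses RGB ordering and OpenCV uses BGR, you can easily move between PIL and OpenCV, so you can go from PIL to OpenCV with: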

OpenCVImage = np.array(PILImage)[...,::-1]

and back again with:

PILImage = Image.fromarray(OpenCVImage[...,::-1])
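
One caveat with the two one-liners above: they assume a 3-channel image. Scanned TIFFs are often greyscale, in which case np.array(PILImage) is a 2-D array and [...,::-1] reverses the columns, which mirrors the page instead of swapping channels. The sketch below is one possible guard, not part of the original answer. The function name swap_rgb_bgr is hypothetical, and the copy via np.ascontiguousarray is there because some OpenCV functions are known to reject the negative-stride view that plain slicing produces.

```python
import numpy as np

def swap_rgb_bgr(pixels):
    """Swap RGB <-> BGR for H x W x 3 arrays; leave 2-D greyscale untouched."""
    if pixels.ndim == 2:
        # Greyscale: there is no channel axis, so there is nothing to swap.
        return pixels
    if pixels.ndim != 3 or pixels.shape[2] != 3:
        raise ValueError(f'expected H x W or H x W x 3, got shape {pixels.shape}')
    # Reverse the channel axis and return a contiguous copy, not a view
    return np.ascontiguousarray(pixels[..., ::-1])
```

With this helper, swap_rgb_bgr(np.array(PILImage)) gives an array for OpenCV, and Image.fromarray(swap_rgb_bgr(OpenCVImage)) goes back to PIL. For bilevel (mode "1") TIFFs it is probably simplest to call PILImage.convert('L') before converting to an array.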


Solution 1
1 2021-10-16 10:47:57
