
Print to excel first line of each page in pdf file

I am new to Python and have written only one script so far, for searching strings in PDFs. Now I would like to build a script that writes its results to a new CSV/xlsx file containing the first line of each page of a given PDF file, along with its page number. For now I have the code below, which prints a whole page:

from PyPDF2 import PdfFileReader

pdf_document = "example.pdf"
with open(pdf_document, "rb") as filehandle:
    pdf = PdfFileReader(filehandle)
    info = pdf.getDocumentInfo()
    pages = pdf.getNumPages()
    print (info)
    print ("number of pages: %i" % pages)
    page1 = pdf.getPage(0)
    print(page1)
    print(page1.extractText())

You can read the PDF file page by page, split each page's text on '\n' (if that is the character that separates lines), then use the csv package to write the result into a CSV file, as in the script below. Note that if the PDF contains images instead of text, this code will not be able to extract anything; you need an OCR module to convert the images to text first.

from PyPDF2 import PdfFileReader  # legacy API; PyPDF2 3.x replaces it with PdfReader
import csv

pdf_document = "test.pdf"
with open(pdf_document, "rb") as filehandle:
    pdf = PdfFileReader(filehandle)
    # newline='' stops the csv module from writing blank rows on Windows
    with open('result.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerow(['page number', 'first line'])
        for i in range(pdf.getNumPages()):
            content = pdf.getPage(i).extractText().split('\n')
            print(content[0])  # prints first line
            print(i + 1)  # prints page number
            print('-------------')
            csv_writer.writerow([i + 1, content[0]])
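
One thing that may be worth guarding against: extracted text can start with blank or whitespace-only lines, in which case content[0] would be empty. The sketch below is only an illustration under that assumption. The helper names first_nonempty_line and write_first_lines are made up for this example, and the sample strings stand in for whatever extractText() returns for each page, so it runs without a PDF.

```python
import csv
import io


def first_nonempty_line(page_text):
    """Return the first line with visible characters, or '' if there is none."""
    for line in page_text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return ""


def write_first_lines(page_texts, out_file):
    """Write one (page number, first line) row per page to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["page number", "first line"])
    for page_number, text in enumerate(page_texts, start=1):
        # 'text or ""' covers a page whose extraction yielded None
        writer.writerow([page_number, first_nonempty_line(text or "")])


# Stand-in strings for the text extracted from each page (hypothetical data).
sample_pages = ["\n  Chapter 1\nBody text", "", "Appendix\nMore text"]
buffer = io.StringIO()
write_first_lines(sample_pages, buffer)
print(buffer.getvalue())
```

To use it with the script above, you could pass a generator such as (pdf.getPage(i).extractText() for i in range(pdf.getNumPages())) together with the file opened for result.csv.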
