
How to extract text from pdf line by line in python 2.7

I'm trying to read and parse a PDF file containing a table...

This is the table in the PDF:

[Image: table in the PDF]

and this is my code:

import PyPDF2
import re
from PyPDF2 import PdfFileReader , PdfFileWriter
FileRead = open("C:\\Users\\Zahraa Jawad\\S40rooms.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(FileRead)
pdfwriter = PdfFileWriter()
for page in pdfReader.pages:
    print page.extractText()

What I want is to read (split) each line in the table separately and save all the information in the line (YEAR, SEMESTER, ROOM, DAY, COURSE NO, INSTRUCTOR, TIME FROM, TIME TO, NUMBER OF STUDENTS) in an array. After each '\n', I'd like to save the data in a new index in the array.

However, my code does not work; it reads all the information and returns it as a paragraph! I don't know how to split each line.

For example (see the PDF above):

341 458 01 Gazwa Sleebekh UTH 09:00 09:50 30

Output: YEAR, SEMESTER, ROOM, DAY, COURSE NO, INSTRUCTOR, TIME FROM, TIME TO, NUMBER OF STUDENTS

2015/2016, Second, S40-021, U, 341, Ghazwa Sleebekh, 09:00, 09:50, 30
2015/2016, Second, S40-021, T, 341, Ghazwa Sleebekh, 09:00, 09:50, 30
2015/2016, Second, S40-021, H, 341, Ghazwa Sleebekh, 09:00, 09:50, 30

It's split by the UTH (Day), but my problem is how to read each line in the PDF and search within it using a regular expression :)

When converting PDF to text, I've had the best results using pdftotext from the poppler utilities. (You can find ms-windows binaries in several places [1], [2].)

import subprocess

def pdftotext(pdf, page=None):
    """Retrieve all text from a PDF file.

    Arguments:
        pdf: Path of the file to read.
        page: Number of the page to read. If None, read all the pages.

    Returns:
        A list of lines of text.
    """
    if page is None:
        args = ['pdftotext', '-layout', '-q', pdf, '-']
    else:
        args = ['pdftotext', '-f', str(page), '-l', str(page), '-layout',
                '-q', pdf, '-']
    try:
        txt = subprocess.check_output(args, universal_newlines=True)
        lines = txt.splitlines()
    except subprocess.CalledProcessError:
        lines = []
    return lines
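
Once the text is available as a list of lines, each table row can be matched with a regular expression and expanded into one record per day letter, as the question asks. The following is only a sketch: the column layout is inferred from the single sample row in the question, the names ROW_RE and parse_row are made up for illustration, and the year, semester and room are passed in by the caller on the assumption that they come from the table header rather than from the row itself.

```python
from __future__ import print_function  # keeps print() working on Python 2.7

import re

# Assumed column layout, taken from the sample row in the question:
# three numeric fields, instructor name, day letters, two times, a count.
ROW_RE = re.compile(
    r"^\s*(?P<course>\d+)\s+(?P<code>\d+)\s+(?P<section>\d+)\s+"
    r"(?P<instructor>.+?)\s+(?P<days>[A-Z]+)\s+"
    r"(?P<start>\d{2}:\d{2})\s+(?P<end>\d{2}:\d{2})\s+"
    r"(?P<students>\d+)\s*$"
)


def parse_row(line, year, semester, room):
    """Return one tuple per day letter, or [] if the line is not a table row."""
    match = ROW_RE.match(line)
    if match is None:
        return []
    # Collapse the runs of spaces that a layout-preserving extraction may leave.
    instructor = " ".join(match.group("instructor").split())
    return [
        (year, semester, room, day, match.group("course"), instructor,
         match.group("start"), match.group("end"),
         int(match.group("students")))
        for day in match.group("days")
    ]


sample = "341 458 01 Gazwa Sleebekh UTH 09:00 09:50 30"
for record in parse_row(sample, "2015/2016", "Second", "S40-021"):
    print(record)
```

Lines that do not match the pattern (headers, blank lines, page footers) yield an empty list, so the function can be applied to every extracted line and the results concatenated. If the real table has a different column order, the pattern needs adjusting.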

Note that text extraction only works if the PDF file actually contains text! Some PDF files only contain scanned images of text, in which case you'll need an OCR solution.
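
One rough way to notice this case is to check whether the extracted lines contain anything besides whitespace. The helper below is a hypothetical sketch (the name has_text is made up here). An empty result is only a hint: a list of lines can also come back empty because the extraction command itself failed.

```python
def has_text(lines):
    """Return True if any extracted line contains a non-whitespace character."""
    return any(line.strip() for line in lines)


# Whitespace-only output (including form feed characters) counts as "no text".
print(has_text(["", "   ", "\f"]))
print(has_text(["341 458 01 Gazwa Sleebekh UTH 09:00 09:50 30"]))
```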
