
How to read a PDF file line by line as a string in Python/Django?

I am working with text and PDF files of 5 KB or less. If the file is a text file, I take the file from the form and read the required input into a string to summarize:

 file = file.readlines()
 file = ''.join(file)
 result = summarize(file, num_sentences)

That is easy, but for a PDF file it turns out to be harder. Is there a way to get the sentences of a PDF file as a string in Python/Django, as I did with my txt file?

I don't think it is possible to read PDFs the same way you read txt files. You need to convert the PDF into text first (see "Python module for converting PDF to text") and then process it. You can also use this recipe to convert PDF to text easily: http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/

In Django you can do this:

views.py:

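import io

import PyPDF2
from django.shortcuts import render


def upload_pdf(request):
    content = []  # extracted text, one string per page
    if request.method == 'POST' and request.FILES.get('myfile'):
        # Read the uploaded file into memory and wrap it in a file-like object
        pdf_bytes = request.FILES['myfile'].read()
        # PyPDF2 >= 3.0 API; older releases used PdfFileReader, numPages,
        # getPage(i) and extractText()
        pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_bytes))
        for page in pdf_reader.pages:
            content.append(page.extract_text())
    # depends on what you want to do with the pdf parsing results
    return render(request, .....)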

HTML part:

<form method="post" enctype="multipart/form-data" action="/url">
    {% csrf_token %}
    <!-- the name must match the key used in request.FILES['myfile'] -->
    <input type="file" name="myfile">
    <button class="butto" type="submit">Upload</button>
</form>

In plain Python you can simply do this:

import PyPDF2

file_name = "path/test.pdf"
content = []  # extracted text, one string per page

# PyPDF2 >= 3.0 API; older releases used PdfFileReader, numPages,
# getPage(i) and extractText()
with open(file_name, 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    for page in pdf_reader.pages:
        content.append(page.extract_text())
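
To get from the per-page `content` list to the single string (or list of lines) that the question asks for, something like the following might work. This is only a minimal sketch: the helper names `pages_to_text` and `text_to_lines` are made up for illustration, and the sample `content` list is a hypothetical stand-in for real extraction output. Note that PDF text extraction does not always preserve the line breaks you see in the document, so the resulting lines may differ from the visual layout.

```python
def pages_to_text(pages):
    """Join per-page strings into one string, skipping empty pages."""
    return "\n".join(p for p in pages if p)


def text_to_lines(text):
    """Split text into non-empty lines with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical stand-in for the `content` list built by the loop above
content = ["First line.\nSecond line.\n", "", "Third line on page two."]

text = pages_to_text(content)   # one string for the whole document
lines = text_to_lines(text)     # the same text, line by line
```

The joined `text` could then be passed to `summarize(text, num_sentences)` in the same way as the text-file case in the question.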
