
How to read the pdf file in a line by line string in Python/Django?

I am dealing with text and PDF files of 5 KB or less. If the file is a text file, I get it from the form and read its contents into a string to summarize:

 file = file.readlines()
 file = ''.join(file)
 result = summarize(file, num_sentences)

That is easy, but for a PDF file it turns out to be harder. Is there a way to get the sentences of a PDF file as a string, as I did with my txt file, in Python/Django?

I don't think it's possible to read PDFs the way you read txt files. You need to convert the PDF to text first (refer to "Python module for converting PDF to text") and then process it. You can also use this recipe to convert PDF to text easily: http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/

In Django you can do this:

views.py :

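import io

import PyPDF2
from django.shortcuts import render

def upload_pdf(request):
    content = []
    # .get() avoids a KeyError when the form was posted without a file
    if request.method == 'POST' and request.FILES.get('myfile'):
        pdfFileObj = request.FILES['myfile'].read()
        # PdfReader API of recent PyPDF2 releases; the older PdfFileReader,
        # numPages, getPage and extractText names were removed in PyPDF2 3.0
        pdfReader = PyPDF2.PdfReader(io.BytesIO(pdfFileObj))
        for page in pdfReader.pages:
            content.append(page.extract_text())
    # depends on what you want to do with the pdf parsing results
    return render(request, .....)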

html part:

<form method="post" enctype="multipart/form-data" action="/url">
    {% csrf_token %}
    <input type="file" name="myfile"> <!-- the name must match the key used in request.FILES['myfile'] -->
    <button class="butto" type="submit">Upload</button>
</form>
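Since the question handles both text and PDF uploads of at most 5 KB, the view needs some way to decide which path to take. A minimal sketch, assuming the raw bytes have already been read from the uploaded file: `classify_upload` and `MAX_UPLOAD_BYTES` are hypothetical names, not part of Django or PyPDF2, and the check relies only on the `%PDF-` header that PDF files start with.

```python
MAX_UPLOAD_BYTES = 5 * 1024  # the 5 KB limit mentioned in the question


def classify_upload(data: bytes) -> str:
    """Return "pdf" or "text" for uploaded bytes; reject oversized uploads."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("upload exceeds %d bytes" % MAX_UPLOAD_BYTES)
    # PDF files begin with the "%PDF-" header followed by a version number
    if data.startswith(b"%PDF-"):
        return "pdf"
    # anything else is treated as plain text in this sketch
    return "text"
```

In the view, the result could select between the PyPDF2 branch and the plain `readlines()` branch from the question.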

In plain Python, you can do this:

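import PyPDF2

fileName = "path/test.pdf"
content = []
# open in binary mode; the with block closes the file when done
with open(fileName, 'rb') as pdfFileObj:
    # PdfReader API of recent PyPDF2 releases; the older PdfFileReader,
    # numPages, getPage and extractText names were removed in PyPDF2 3.0
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    for page in pdfReader.pages:
        content.append(page.extract_text())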
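Both snippets leave the extracted text as a list with one entry per page, while the question wants a single string as with the txt file. A minimal sketch of that last step, assuming `content` is the per-page list built above: `pages_to_string` is a hypothetical helper name, and the sample list stands in for real extraction output.

```python
def pages_to_string(pages):
    """Join per-page text into one string with one non-empty line per row."""
    lines = []
    for page_text in pages:
        # a page without extractable text may yield an empty string or None
        for line in (page_text or "").splitlines():
            line = line.strip()
            if line:
                lines.append(line)
    return "\n".join(lines)


# stand-in for the list produced by the extraction loop
content = ["First sentence. \nSecond sentence.", "", "Third sentence.\n"]
text = pages_to_string(content)
```

The resulting `text` can then be passed on the same way as the txt contents, e.g. `summarize(text, num_sentences)` from the question.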


 