Incorrect output result: Text extraction for .pdf, .pptx, and .docx in Python

I created a function that opens each file in a directory, extracts the text from it, and writes the result to an Excel sheet using Pandas. The indexing for each file type seems to be working just fine. However, once the text has been extracted from the first file in the directory, it seems to replace the text extracted from the other files with the first file's text. Please help, thank you!

from pathlib import Path 
import shutil
from datetime import datetime
import time
from configparser import ConfigParser
import glob
import fileinput
import pandas as pd
import os
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
import docx2txt
from pptx import Presentation

p = Path('C:/Users/XXXX/Desktop/test_folder')

txt_files = list(p.rglob('*txt'))
PDF_files = list(p.rglob('*pdf'))
csv_files = list(p.rglob('*csv'))
docx_files = list(p.rglob('*docx'))
pptx_files = list(p.rglob('*pptx'))


def loader(path):
    with open(str(path.resolve()),"r",encoding = "ISO-8859-1") as f:
        docx_out,pptx_out = [],[]
        data = []
        print(pptx_files)
        if path.suffix == ".pdf":
            for name1 in PDF_files:
                 data.append(pdf_to_text(name1))
                 return data
        elif path.suffix == ".docx":
            for name2 in docx_files:
                docx_out = (docx2txt.process(name2))
                return docx_out
        elif path.suffix == ".pptx":
            for file in pptx_files:
                prs = Presentation(file)
                for slide in prs.slides:
                    for shape in slide.shapes:
                        if not shape.has_text_frame:
                            continue
                        for paragraph in shape.text_frame.paragraphs:
                            for run in paragraph.runs:
                                pptx_out.append(run.text)
                return pptx_out
        else:
                return f.readlines()

Example of the output:

Text content      file name
this is a test    first_pdf.pdf
this is a test    second_pdf.pdf

  • "second_pdf.pdf" does not contain "this is a test", but for some reason it gets whatever text was extracted from the first PDF. (The same goes for all file types.)

This block

    if path.suffix == ".pdf":
        for name1 in PDF_files:
             data.append(pdf_to_text(name1))
             return data

returns from your function after appending the first PDF file. It never gets to the second one because you are returning from inside the for loop. This should fix it:

    if path.suffix == ".pdf":
        for name1 in PDF_files:
            data.append(pdf_to_text(name1))
        return data
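
Moving the return out of the loop fixes the early exit, but note that loader(path) would then return the text of every PDF on each call, so each row in the sheet may still show the same content. If the goal is one row per file, a minimal sketch that extracts only from the file passed in might look like the following. The `extractors` mapping, `read_text`, and `load_all` are hypothetical names introduced for illustration; in the question's code the mapping would presumably point ".pdf" at pdf_to_text and ".docx" at docx2txt.process. Plain-text files stand in for real documents so the example runs without those libraries.

```python
from pathlib import Path
import tempfile


def read_text(path):
    """Fallback extractor: read the file as plain text (same encoding as the question)."""
    return Path(path).read_text(encoding="ISO-8859-1")


def loader(path, extractors):
    """Extract text from the single file at `path`, not from every file of that type."""
    # Pick the extractor by suffix; unknown suffixes fall back to plain text.
    extract = extractors.get(path.suffix.lower(), read_text)
    return extract(path)


def load_all(paths, extractors):
    """Build one (file name, text) row per file."""
    rows = []
    for path in paths:
        rows.append((path.name, loader(path, extractors)))
    # The return sits after the loop; inside it, only the first file would be processed.
    return rows


# Demo: two plain-text files with different content (assumed stand-ins for PDFs).
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "first.txt"
    second = Path(tmp) / "second.txt"
    first.write_text("this is a test", encoding="ISO-8859-1")
    second.write_text("different content", encoding="ISO-8859-1")
    rows = load_all([first, second], extractors={})
    print(rows)
```

The resulting list of (file name, text) pairs could then be handed to pd.DataFrame(rows, columns=["file name", "Text content"]) before writing the Excel sheet.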
