Iterate through folder, extract text, create single df

Question

I am trying to iterate through a number of PDF files within a folder on my desktop. My goal is to read the text from each of these PDFs (they are all only one page long) and place each distinct PDF's text into a new row within one dataframe.

I have tried looping through the folder, and it works in the sense that I get text output from every PDF in it (I created a folder with two "test" PDFs to check the code). However, the text is not concatenated into one single dataframe. I would like the code to produce a single dataframe with one row per PDF's text, so that I can export it to a CSV afterward. Instead I get two separate dataframes, and when I export to a CSV their text does not all end up in the file. In fact, I believe the code I have written overwrites every dataframe except for the last one created, leaving only one object called "df".

Any help would be greatly appreciated. I hope this question is clear enough; I have seen related threads but have not been able to find one that solves this exact issue.

rootdir = 'directory file path'
for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            doc = fitz.open(file)
            page = doc[0]
            text = page.getText("text")

            text_list = []                    #create list to store text in

            text_list.append(text)            # append the text to the list
            df = pd.DataFrame(text_list)      #create a df from the list
            df.columns = ['text']

            doc.close()

            print(df)

Output is below:

         text
0  Dummy PDF file\n
                                                text
0   \n \n \n \n \n \nThis is a test PDF document....

Answer 1

Although the question is quite old, here is an answer in case someone else runs into a similar issue.

I believe overwrites every dataframe except for the last one created

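That happens because you re-create both the object (df) and the list (text_list) in every iteration, so each pass throws away the previous result. For example:

df (result of iteration 1) is replaced by df (result of iteration 2)
df (result of iteration 2) is replaced by df (result of iteration 3)
df (result of iteration 3) is replaced by df (result of iteration 4)

and so on, until df contains only the result of the last iteration. Create the list once before the loop and build the dataframe once after it. Here is your code, fixed: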

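import os

import fitz  # PyMuPDF
import pandas as pd

rootdir = 'directory file path'
text_list = []  # create the list once, before the loops, so it is never reset

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        # os.walk yields bare file names, so join each one with its folder
        doc = fitz.open(os.path.join(subdir, file))
        page = doc[0]  # every PDF here is a single page
        text = page.getText("text")  # renamed get_text in newer PyMuPDF releases

        text_list.append(text)  # append inside the inner loop: one entry per file
        doc.close()

# create the df from the list and name the column in one step
df = pd.DataFrame(text_list, columns=['text'])
print(df)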
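
Since the goal in the question is a CSV export, it may help to keep the file name next to each text so every row can be traced back to its PDF. Below is a minimal sketch of the same accumulate-first, build-once pattern, under some assumptions: the helper names collect_rows and write_rows_csv and the extract_text parameter are hypothetical (they do not come from the original post), and the text extractor is passed in as a function so that the loop itself only needs the standard library.

```python
import csv
import os


def collect_rows(rootdir, extract_text):
    """Return one {'file', 'text'} dict per PDF found under rootdir.

    extract_text is any callable that takes a file path and returns a string
    (a hypothetical parameter, introduced only for this sketch).
    """
    rows = []  # created once, before the loops, so every file adds to it
    for subdir, dirs, files in os.walk(rootdir):
        dirs.sort()  # visit subfolders in a stable order
        for name in sorted(files):
            if not name.lower().endswith(".pdf"):
                continue  # skip anything that is not a PDF
            path = os.path.join(subdir, name)  # full path, not just the name
            rows.append({"file": name, "text": extract_text(path)})
    return rows


def write_rows_csv(rows, csv_path):
    """Write the collected rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "text"])
        writer.writeheader()
        writer.writerows(rows)
```

With PyMuPDF, extract_text could be a small function that opens the path with fitz.open, reads page 0, closes the document, and returns the text. Because rows is a list of dicts, pd.DataFrame(rows) should give a single dataframe with a "file" and a "text" column, which df.to_csv can then export as an alternative to write_rows_csv.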
