Unable to append to df while iterating through Try/Except

I am trying to iterate through PDFs to extract information from emails. Each regex works when I test it on an individual example, but when I put the code together in a for loop to process multiple PDFs at once, nothing gets appended to my aggregate df (it just stays empty). I need the try/except because not all emails have all fields (e.g. some do not have the 'Attachments' field). Below is the code I have written so far:

import os
import re
import pandas as pd
pd.options.display.max_rows=999
import numpy
from numpy import NaN
from tika import parser

root = r"my_dir"

agg_df = pd.DataFrame()

for directory, subdirectory, files in os.walk(root):
    for file in files:
        filepath = os.path.join(directory, file)
        print(file)
        raw = parser.from_file(filepath)
        img = raw['content']
        img = img.replace('\n', '')

        try:
            from_field = re.search(r'From:(.*?)Sent:', img).group(1)
        except:
            pass
        try:
            sent_field = re.search(r'Sent:(.*?)To:', img).group(1)
        except:
            pass
        try:    
            to_field = re.search(r'To:(.*?)Cc:', img).group(1)
        except:
            pass
        try:    
            cc_field = re.search(r'Cc:(.*?)Subject:', img).group(1)
        except:
            pass
        try:   
            subject_field = re.search(r'Subject:(.*?)Attachments:', img).group(1)
        except:
            pass
        try:
            attachments_field = re.search(r'Attachments:(.*?)NOTICE', img).group(1)
        except:
            pass

        img_df = pd.DataFrame(columns=['From', 'Sent', 'To', 
                                       'Cc', 'Subject', 'Attachments'])
        img_df['From'] = from_field
        img_df['Sent'] = sent_field
        img_df['To'] = to_field
        img_df['Cc'] = cc_field
        img_df['Subject'] = subject_field
        img_df['Attachments'] = attachments_field

        agg_df = agg_df.append(img_df)

There are two things to change:

  1. When you don't get a match, you shouldn't just pass on the exception. Use a default value (such as None) instead, so the field is always defined for every file.
  2. Don't append to your dataframe on each pass through the loop. That is slow, because every append copies the whole dataframe. Keep everything in a dictionary, and then construct the dataframe once at the end.
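To illustrate the first point, here is a minimal sketch of what can go wrong when a failed match is silently passed inside a loop. The sample strings and the function names are made up for illustration; they are not from the original code. When the second text has no Cc field, the variable keeps the value left over from the previous iteration, so the result is quietly wrong rather than missing.

```python
import re

# Hypothetical sample texts; the second one has no "Cc:" field.
emails = [
    "From: alice Sent: Monday To: bob Cc: carol Subject: hi",
    "From: dave Sent: Tuesday To: erin Subject: no cc here",
]

def extract_cc_with_pass(texts):
    """Mimics the try/except/pass pattern from the question."""
    results = []
    for text in texts:
        try:
            cc_field = re.search(r'Cc:(.*?)Subject:', text).group(1)
        except AttributeError:
            pass  # cc_field keeps whatever the previous iteration assigned
        results.append(cc_field.strip())
    return results

def extract_cc_with_default(texts):
    """Checks the match object and falls back to None."""
    results = []
    for text in texts:
        match = re.search(r'Cc:(.*?)Subject:', text)
        results.append(match.group(1).strip() if match else None)
    return results

stale = extract_cc_with_pass(emails)
safe = extract_cc_with_default(emails)
print(stale)  # ['carol', 'carol']  <- second value is stale
print(safe)   # ['carol', None]
```

If the very first file were the one missing the field, the pass version would instead fail with a NameError, because the variable would never have been assigned.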

For example, the loop could be restructured like this:

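import os
import re
from collections import defaultdict

import pandas as pd
from tika import parser

root = r"my_dir"

# One list per column; every file appends exactly one value to each list
data = defaultdict(list)

for directory, _, files in os.walk(root):
    for file in files:
        filepath = os.path.join(directory, file)
        print(file)
        raw = parser.from_file(filepath)
        img = raw['content']
        img = img.replace('\n', '')

        # re.search returns None when there is no match, so test the
        # match object instead of catching an exception
        from_match = re.search(r'From:(.*?)Sent:', img)
        if not from_match:
            sent_by = None
        else:
            sent_by = from_match.group(1)
        data["from"].append(sent_by)

        # The text between "Sent:" and "To:" is the sent date
        sent_match = re.search(r'Sent:(.*?)To:', img)
        if not sent_match:
            sent_on = None
        else:
            sent_on = sent_match.group(1)
        data["sent"].append(sent_on)

        # Repeat the same pattern for the remaining fields
        # (To, Cc, Subject, Attachments)

# Build the dataframe once, after the loop
df = pd.DataFrame(data)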
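The repeated match/default blocks can also be folded into a small helper. The following is only a sketch under some assumptions: FIELD_MARKERS and extract_fields are hypothetical names, the marker pairs are copied from the regexes in the question, and plain strings stand in for the text that tika would return, so the example runs without any PDFs.

```python
import re
import pandas as pd

# Hypothetical lookup table: column name -> (start marker, end marker),
# taken from the regexes in the question.
FIELD_MARKERS = {
    "From": ("From:", "Sent:"),
    "Sent": ("Sent:", "To:"),
    "To": ("To:", "Cc:"),
    "Cc": ("Cc:", "Subject:"),
    "Subject": ("Subject:", "Attachments:"),
    "Attachments": ("Attachments:", "NOTICE"),
}

def extract_fields(text):
    """Return one row (a dict) with None for every field that is not found."""
    row = {}
    for name, (start, end) in FIELD_MARKERS.items():
        pattern = re.escape(start) + r"(.*?)" + re.escape(end)
        match = re.search(pattern, text)
        row[name] = match.group(1).strip() if match else None
    return row

# Stand-ins for the extracted PDF text (made-up examples)
sample_texts = [
    "From: alice Sent: Mon To: bob Cc: carol Subject: hi Attachments: a.pdf NOTICE ...",
    "From: dave Sent: Tue To: erin Cc: frank Subject: no attachments here",
]

rows = [extract_fields(text) for text in sample_texts]
df = pd.DataFrame(rows)  # built once, from a list of dicts
print(df)
```

Note that with these marker pairs a missing Attachments field also leaves Subject unmatched, because the Subject pattern only ends at "Attachments:". If that matters for your data, the end marker would need to allow for alternatives.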

Also, if you're doing this for a lot of files, you should look into using compiled regular expressions (re.compile), so each pattern is built once and reused for every file.
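As a rough sketch of what that could look like, the patterns can be compiled once at module level and reused inside the loop. FROM_RE, SENT_RE and first_group are hypothetical names chosen for this example, and the sample text is made up.

```python
import re

# Compiled once, outside any loop
FROM_RE = re.compile(r'From:(.*?)Sent:')
SENT_RE = re.compile(r'Sent:(.*?)To:')

def first_group(pattern, text):
    """Return the first captured group, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1).strip() if match else None

text = "From: alice Sent: Monday To: bob"
sender = first_group(FROM_RE, text)
sent = first_group(SENT_RE, text)
missing = first_group(FROM_RE, "no headers in this text")
print(sender, sent, missing)
```

The re module already caches recently used patterns internally, so the speed gain from compiling may be modest; the main benefit is often that the patterns are named and defined in one place.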
