
Python text parsing - how to capture and write multiple lines

I am trying to use Python to extract four data elements from about 6,500 form-generated emails: the subject field, the sender's email address, the date stamp, and the sender's physical address.

I have written a simple Python script that successfully copies the first three data elements from each message and writes them to a new file. This is easy to do, because each of these three elements has an unambiguous marker ("Subject", "From", or "Date"). Here is my Python script that grabs those first three data elements:

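with open("samplefile.txt") as f:
    with open("samplefileout.txt", "w") as f1:
        for line in f:
            line = line.rstrip()
            # rstrip() removed the newline, so add one back on each write;
            # otherwise every match ends up on a single output line.
            if "Subject: " in line:
                f1.write(line + "\n")
            if "From: " in line:
                f1.write(line + "\n")
            if "Date: " in line:
                f1.write(line + "\n")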

The fourth data element I want to capture, sender's physical address, is handled differently. Due to the webform nature of these emails, the sender's name and home address are ALWAYS in the same place in each message. After the line that starts with "Date:" there is one blank line, then the sender's real name is always on the next line, the sender's home address is always on the next line, and then the sender's city and zip code are always on the next line.

My question is this: What can I add to the above code so that it not only writes the "Date:" line to the output file, but also writes the 2nd, 3rd, and 4th lines after the "Date:" line to the output file? I have been unable to find anything about how to handle either multi-line or relative line references.

A second, related question: I have started receiving what seems like a second batch of form emails. In this second batch, the sender's name and address are at the bottom of each message. It is easy enough to go through and find the start of each message. How would I write the 1st, 2nd, 3rd, and 4th lines from the bottom of each message? To me, this seems like the same type of multi-line and/or relative line reference issue.

with open("samplefile.txt") as inf, open("samplefileout.txt", "w") as outf:
    for line in inf:
        if line.startswith("Subject: ") or line.startswith("From: "):
            outf.write(line)
        elif line.startswith("Date: "):
            outf.write(line)
            next(inf, "")               # 1: skip the blank line
            outf.write(next(inf, ""))   # 2
            outf.write(next(inf, ""))   # 3
            outf.write(next(inf, ""))   # 4

For the second question, I would feed each line from inf into a collections.deque(maxlen=4); when you reach the end of a message (check for it before feeding the next line into the deque), the deque contains exactly the last four lines you want.
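The deque idea could be sketched roughly as follows. The question does not say how the start of a message is recognized, so the is_start predicate, the function name tail_of_each_message, and the sample data are all hypothetical; treat this as one possible shape under those assumptions rather than a finished solution.

```python
from collections import deque


def tail_of_each_message(lines, is_start, n=4):
    """Yield the last n lines of every message found in lines.

    is_start is a hypothetical predicate: it should return True for the
    line that begins a new message (the real marker is an assumption).
    """
    tail = deque(maxlen=n)      # holds only the n most recent lines
    in_message = False          # ignore anything before the first message
    for line in lines:
        if is_start(line):
            if in_message:
                yield list(tail)    # tail of the message that just ended
            tail.clear()
            in_message = True
        if in_message:
            tail.append(line)
    if in_message:
        yield list(tail)        # the final message has no following start line


# Small in-memory demo; "Subject: " as the start marker is an assumption.
sample = [
    "Subject: first\n", "some body text\n",
    "Jane Doe\n", "1 Main St\n", "Springfield 12345\n", "USA\n",
    "Subject: second\n", "more body text\n",
    "John Roe\n", "2 Oak Ave\n", "Shelbyville 54321\n", "USA\n",
]
tails = list(tail_of_each_message(sample, lambda l: l.startswith("Subject: ")))
```

Because the function only iterates over lines, an open file object can be passed in place of the list, and each yielded list can be written with outf.writelines(...). If messages end with trailing blank lines, those would need to be filtered out before they reach the deque.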

You could read the file into a list, then loop over an integer index that runs from 0 to the number of lines minus one, so that lines relative to the current one can be reached by offset:

with open("test.txt") as f:
    lines = f.readlines()   # each entry keeps its trailing newline

with open("samplefileout.txt", "w") as f1:
    for x, line in enumerate(lines):
        if "Subject: " in line or "From: " in line:
            f1.write(line)
        elif "Date: " in line:
            f1.write(line)
            # lines[x+1] is the blank line; x+2..x+4 hold name, street, city/zip.
            # Slicing avoids an IndexError if the file ends early.
            f1.writelines(lines[x+2:x+5])

And for the last 4 lines of the file:

with open("test.txt") as f:
    lines = f.readlines()

with open("samplefileout.txt", "w") as f1:
    # A negative slice takes the last four lines (fewer if the file is shorter).
    f1.writelines(lines[-4:])


 