Extract data between two lines from text file

Say I have hundreds of text files like this example:

NAME
John Doe

DATE OF BIRTH

1992-02-16

BIO 

THIS is
 a PRETTY
 long sentence

 without ANY structure 

HOBBIES 
//..etc..

NAME, DATE OF BIRTH, BIO, and HOBBIES (and others) are always there, but text content and the number of lines between them can sometimes change.

I want to iterate through the file and store the string between each of these keys. For example, a variable called Name should contain the value stored between 'NAME' and 'DATE OF BIRTH'.

This is what I came up with:

bio = ""  # must exist before the += below
lines = f.readlines()  # f is an already-open file object
for line_number, line in enumerate(lines):
    if "NAME" in line:
        name = lines[line_number + 1]  # In all files, the name is one line long.
    elif "DATE OF BIRTH" in line:
        date = lines[line_number + 2]  # The date is always two lines after its key.
    elif "BIO" in line:
        # The bio has a variable length; scan until HOBBIES, stopping at the end of the file.
        for x in range(line_number + 1, min(line_number + 20, len(lines))):
            if "HOBBIES" not in lines[x]:
                bio += lines[x]
            else:
                break
    elif "HOBBIES" in line:
        pass  # ...

This works well enough, but I feel like instead of using many double loops, there must be a smarter and less hacky way to do it.

I'm looking for a general solution where NAME would store everything until DATE OF BIRTH, BIO would store everything until HOBBIES, and so on, with the intention of cleaning up and removing extra blank lines later.

Is it possible?

Edit: While reading through the answers, I realized I forgot a really significant detail: the keys will sometimes be repeated (in the same order).

That is, a single text file can contain more than one person, so a list of persons should be created. The key NAME signals the start of a new person.

I did it by storing everything in a dictionary; see the code below.

dict_text = {"NAME": [], "DATEOFBIRTH": [], "BIO": []}
location = None  # key of the section currently being read
with open("test.txt") as f:
    for line in f:
        if "NAME" in line or "DATE OF BIRTH" in line or "BIO" in line:
            # Key line: drop all whitespace so "DATE OF BIRTH" becomes "DATEOFBIRTH".
            location = "".join(line.split())
        elif location is not None:
            # Text line: store it under the most recent key.
            dict_text[location].append(line.replace("\n", ""))
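
Since the edit above says one file can hold several people, with NAME starting each new record, one way to handle that is a small line-by-line state machine that opens a new dictionary whenever it meets NAME. The following is only a sketch under assumptions: the function name parse_people, the KEYS tuple and the sample text (including the second, made-up person) are illustrative, and it assumes each key sits alone on its own line.

```python
KEYS = ("NAME", "DATE OF BIRTH", "BIO", "HOBBIES")  # assumed set of keys


def parse_people(lines):
    """Group lines into one dict per person; a NAME line starts a new person."""
    people = []
    current_key = None
    for raw in lines:
        line = raw.strip()
        if line in KEYS:
            # Start a new record on NAME (or if no record exists yet).
            if line == "NAME" or not people:
                people.append({})
            current_key = line
            people[-1].setdefault(current_key, [])
        elif line and current_key is not None:
            # Non-empty text line: attach it to the most recent key.
            people[-1][current_key].append(line)
    # Join the collected lines of each field into a single string.
    return [{k: " ".join(v) for k, v in person.items()} for person in people]


# Made-up sample with two people, only for illustration.
sample = """NAME
John Doe

DATE OF BIRTH

1992-02-16

BIO

THIS is
 a PRETTY
 long sentence

NAME
Jane Roe
DATE OF BIRTH
1990-01-01
"""

people = parse_people(sample.splitlines())
for person in people:
    print(person)
```

Blank lines are dropped and the remaining lines of a field are joined with single spaces, which also covers the clean-up step mentioned in the question. If the original line breaks matter, the lists could be kept as they are instead of being joined.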

You could use a regular expression:

import re

keys = """
NAME
DATE OF BIRTH
BIO 
HOBBIES 
""".strip().splitlines()

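# re.escape keeps any special characters in a key from being read as regex syntax
key_pattern = '|'.join(re.escape(key.strip()) for key in keys)
pattern = re.compile(fr'^({key_pattern})', re.M)

# uncomment to see the pattern
# print(pattern)

filename = "test.txt"  # placeholder path; replace with your own file
with open(filename) as f:
    text = f.read()

parts = pattern.split(text)

# ... process parts ...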

parts will be a list of strings. The odd-indexed positions (parts[1], parts[3], ...) will be the keys ('NAME', etc.) and the even-indexed positions from 2 onward (parts[2], parts[4], ...) will be the text between the keys. parts[0] will be whatever came before the first key.
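
One possible way to process parts, given the layout just described, is to pair each odd-indexed key with the even-indexed text that follows it. This is a sketch rather than the answerer's code: the sample text and the names pairs and people are made up for illustration, and re.escape is used on the assumption that the keys should be matched literally.

```python
import re

keys = ["NAME", "DATE OF BIRTH", "BIO", "HOBBIES"]
pattern = re.compile("^(" + "|".join(re.escape(k) for k in keys) + ")", re.M)

# Made-up sample; the second NAME starts a second person.
text = """NAME
John Doe

DATE OF BIRTH

1992-02-16

BIO

THIS is
 a PRETTY
 long sentence
NAME
Jane Roe
"""

parts = pattern.split(text)

# parts[0] precedes the first key; keys sit at odd indexes, text at even ones.
pairs = list(zip(parts[1::2], parts[2::2]))

people = []
for key, value in pairs:
    # NAME opens a new record, as described in the question's edit.
    if key == "NAME" or not people:
        people.append({})
    # Collapse runs of whitespace (including newlines) into single spaces.
    people[-1][key] = " ".join(value.split())

print(people)
```

Because the pattern is only anchored at the start of a line, a text line that happens to begin with one of the key words would also be treated as a key; anchoring the end of the line as well would tighten this if the files need it.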

You can try the following.

keys = ["NAME","DATE OF BIRTH","BIO","HOBBIES"]

f = open("data.txt", "r")
result = {}
for line in f:
    line = line.strip('\n')
    if any(v in line for v in keys):
        last_key = line
    else:
        result[last_key] = result.get(last_key, "") + line

print(result)

Output

{'NAME': 'John Doe', 'DATE OF BIRTH': '1992-02-16', 'BIO ': 'THIS is a PRETTY long sentence without ANY structure ', 'HOBBIES ': '//..etc..'}
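
Note that the keys in this output keep their trailing spaces ('BIO ', 'HOBBIES ') because last_key is the raw line, and that v in line is a substring test, so a text line that merely contains the word NAME would also be taken for a key. A possible variation, assuming every key sits alone on its own line, compares the stripped line against a set of keys instead. The name parse_record and the sample lines are illustrative only.

```python
KEYS = {"NAME", "DATE OF BIRTH", "BIO", "HOBBIES"}  # assumed set of keys


def parse_record(lines):
    """Map each key to the non-empty text lines that follow it."""
    collected = {}
    last_key = None
    for raw in lines:
        line = raw.strip()
        if line in KEYS:  # exact match on the stripped line, not a substring test
            last_key = line
            collected[last_key] = []
        elif line and last_key is not None:
            collected[last_key].append(line)
    return {key: " ".join(chunks) for key, chunks in collected.items()}


# The BIO text deliberately contains the word NAME to show it is not mistaken for a key.
sample = ["NAME\n", "John Doe\n", "\n", "BIO \n", "\n", "My NAME is John\n", " and so on \n"]
record = parse_record(sample)
print(record)
```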

Instead of reading lines, you could read the file as one long string. Use str.index() to find the start index of each trigger word, then store everything from the end of that word up to the start of the next trigger word in a variable.

Something like:

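def get_data(string, previous_end_index, current_start_index):
    # Return the text between the end of one key and the start of the next.
    return string[previous_end_index:current_start_index]

string = f.read()  # f is an open file object; str(f) would give its repr, not its contents
important_words = ['NAME', 'DATE OF BIRTH']
data = {}
last_word, last_end = None, 0
for phrase in important_words:
    # Search only past the previous key so earlier text is not matched again.
    phrase_start = string.index(phrase, last_end)
    if last_word is not None:
        data[last_word] = get_data(string, last_end, phrase_start)
    last_word, last_end = phrase, phrase_start + len(phrase)
# Everything after the last key belongs to that key.
data[last_word] = string[last_end:]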

Better/shorter variable names should probably be used

You can read the text in as one long string and then use .split(). This only works if the categories are in order and don't repeat. Like so:

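Categories = ["NAME", "DATE OF BIRTH", "BIO"]  # in the order they appear in the text
Output = {}
Text = f.read()  # f is an open file object; str(f) would not return its contents
Text = Text.split(Categories[0], 1)[1]  # drop the first key itself
for i in range(1, len(Categories)):
    # Split once on the next key: the left part belongs to the previous key.
    SplitText = Text.split(Categories[i], 1)
    Output.update({Categories[i-1]: SplitText[0]})
    Text = SplitText[1]
Output.update({Categories[-1]: Text})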
