
Python Program to extract sections of a txt file from a list of words

I want a Python program that prints each section of a text file. A section is defined by a keyword taken from a list of words: it starts at the line on which the keyword appears and ends at the line just before the next section starts. For example, consider the following text file:

word1
abcdef
ghis jsd sjdhd jshj
word2
dgjgj dhkjhf
khkhkjd
word23
dfjkg fjidkfh
word5
diow299 udhgbhdi
jkdkjd
word89
eyuiywiou299092    
word3
...
...
...

The required output of the program is:

Sections Found: [word1, word2, word3, word5, word89]

**********word1--SECTION**********
line 1: word1
line 2: abcdef
line 3: ghis jsd sjdhd jshj

**********word2--SECTION**********
line 4: word2
line 5: dgjgj dhkjhf
line 6: khkhkjd

**********word3--SECTION**********
line 14: word3
line 15: ....

''' If word4 is not found in the text file, the program should skip it and move on to the next word found '''
**********word5--SECTION**********
line 9: word5
line 10: diow299 udhgbhdi
line 11: jkdkjd

...
...
...
...

''' Continue until the end of the list of words '''

Approach:

list_of_words = ['word1', 'word2', 'word3', 'word4', 'word5', 'word6', ....]

First, find the start_line of each word in list_of_words and store the start lines in a list.

Then find the end_line of each word: sort the list of start lines so that the nearest greater start line is easy to find. The section ends on the line just before it.

Finally, print each section found, with every line prefixed by its line number in the text file.
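The three steps above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the function name find_sections is hypothetical, the sample is the text file from the question with word23 and word89 added to the word list so that they count as sections, and the whole-word regex match is an assumption made so that word2 does not also match the word23 line.

```python
import re

def find_sections(lines, words):
    """Return (word, start_line, end_line) tuples, 1-based and inclusive."""
    starts = {}
    for word in words:
        # Assumption: whole-word match, so 'word2' does not fire on 'word23'.
        pattern = re.compile(r"\b" + re.escape(word) + r"\b")
        for num, line in enumerate(lines, 1):
            if pattern.search(line):
                starts[word] = num  # keep the first occurrence only
                break
    # Sort by start line; each section ends just before the next one starts.
    ordered = sorted(starts.items(), key=lambda item: item[1])
    sections = []
    for i, (word, start) in enumerate(ordered):
        end = ordered[i + 1][1] - 1 if i + 1 < len(ordered) else len(lines)
        sections.append((word, start, end))
    return sections

sample = """\
word1
abcdef
ghis jsd sjdhd jshj
word2
dgjgj dhkjhf
khkhkjd
word23
dfjkg fjidkfh
word5
diow299 udhgbhdi
jkdkjd
word89
eyuiywiou299092
word3
""".splitlines()

words = ["word1", "word2", "word3", "word4", "word5", "word23", "word89"]
for word, start, end in find_sections(sample, words):
    print(f"**********{word}--SECTION**********")
    for num in range(start, end + 1):
        print(f"line {num}: {sample[num - 1]}")
    print()
```

Words that never occur (word4 here) simply get no entry in the dictionary, so they are skipped without any special handling. The sections come out in file order; sorting the result by the position of each word in the list would give the list order instead.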

Code used to get the line number (how can I create a separate variable for each n in list_of_words?):

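# Read the file once instead of reopening it for every word.
with open(file_txt, 'r', encoding="utf8") as f:
    data_file = f.readlines()

# A dict keyed by word stands in for "one variable per word".
start_lines = {}
for n in list_of_words:
    for num, line in enumerate(data_file, 1):
        # Substring test: note that 'word2' also matches a line containing 'word23'.
        if n in line:
            start_lines[n] = num
            break  # keep the first occurrence only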

Code used to find the nearest number in the list of start lines that is greater than a given start line (val):

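def closest(array_list, val):
    # Smallest value strictly greater than val, or None when there is none
    # (the original indexing raised IndexError for the last section).
    return min((j for j in array_list if j > val), default=None)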
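If the start lines are sorted once up front, the same lookup can be done with the standard bisect module instead of filtering the whole list on every call. A small sketch, where next_greater is a hypothetical name and the start lines are the ones from the sample file:

```python
import bisect

def next_greater(sorted_values, val):
    """Smallest element of sorted_values strictly greater than val, else None."""
    i = bisect.bisect_right(sorted_values, val)
    return sorted_values[i] if i < len(sorted_values) else None

start_lines = sorted([9, 1, 14, 4, 7, 12])
print(next_greater(start_lines, 4))   # 7
print(next_greater(start_lines, 14))  # None
```

The None result marks the last section, which runs to the end of the file.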

pyparsing has a generator function, scanString, that yields the matched tokens together with the start and end locations of each match. Pass the start location to pyparsing's lineno function to get the line number of the match.

import pyparsing as pp

marker = pp.oneOf("word1 word2 word3 word4 word5 word23")

txt = """\
word1
abcdef
ghis jsd sjdhd jshj
word2
dgjgj dhkjhf
khkhkjd
word23
dfjkg fjidkfh
word5
diow299 udhgbhdi word2
jkdkjd
word89
eyuiywiou299092    
word3
"""

previous = None
for t, s, e in (pp.LineStart() + marker | pp.StringEnd()).scanString(txt):
    current_line_number = pp.lineno(s, txt)
    if t:
        current = t[0]
        if previous is not None:
            print(previous, "ended on line", current_line_number - 1)
        print("found", current, "on line", current_line_number)
        previous = current
    else:
        if previous is not None:
            print(previous, "ended on line", current_line_number)

Prints:

found word1 on line 1
word1 ended on line 3
found word2 on line 4
word2 ended on line 6
found word23 on line 7
word23 ended on line 8
found word5 on line 9
word5 ended on line 13
found word3 on line 14
word3 ended on line 15

You should be able to take it from here.
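For comparison, here is a rough standard-library sketch of the same scanning idea that goes one step further and prints the sections in the requested format. It swaps pyparsing for the re module, so it is not the code above: scan_sections is a hypothetical name, and the line number is computed by counting the newlines that precede the match offset.

```python
import re

def scan_sections(text, words):
    """Yield (word, start_line, end_line) for markers found at line starts."""
    # Longest alternatives first, so 'word23' is tried before 'word2'.
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    marker = re.compile(r"^(" + alternatives + r")\b", re.MULTILINE)
    total_lines = len(text.splitlines())
    previous = None
    for match in marker.finditer(text):
        # Line number = newlines before the match offset, plus one.
        line_number = text.count("\n", 0, match.start()) + 1
        if previous is not None:
            yield previous[0], previous[1], line_number - 1
        previous = (match.group(1), line_number)
    if previous is not None:
        yield previous[0], previous[1], total_lines

txt = """\
word1
abcdef
ghis jsd sjdhd jshj
word2
dgjgj dhkjhf
khkhkjd
word23
dfjkg fjidkfh
word5
diow299 udhgbhdi word2
jkdkjd
word89
eyuiywiou299092
word3
"""

lines = txt.splitlines()
markers = ["word1", "word2", "word3", "word4", "word5", "word23"]
for word, start, end in scan_sections(txt, markers):
    print(f"**********{word}--SECTION**********")
    for num in range(start, end + 1):
        print(f"line {num}: {lines[num - 1]}")
    print()
```

As in the pyparsing version, the word2 in the middle of line 10 is ignored because markers only count at the start of a line, and word5 runs through line 13 because word89 is not in the marker list. The last section ends on line 14 here rather than 15, since this sketch counts actual lines instead of the position after the trailing newline.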
