
Python Program to extract sections of a txt file from a list of words

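I want a Python program that prints each section of a text file. A section is defined by a keyword found in a list of words: it starts on the line where the keyword appears and ends just before the line where the next section starts. For example, consider the following text file: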

word1
abcdef
ghis jsd sjdhd jshj
word2
dgjgj dhkjhf
khkhkjd
word23
dfjkg fjidkfh
word5
diow299 udhgbhdi
jkdkjd
word89
eyuiywiou299092    
word3
...
...
...

The required output of the program is:

Sections Found: [word1, word2, word3, word5, word89]

**********word1--SECTION**********
line 1: word1
line 2: abcdef
line 3: ghis jsd sjdhd jshj

**********word2--SECTION**********
line 4: word2
line 5: dgjgj dhkjhf
line 6: khkhkjd

**********word3--SECTION**********
line 14: word3
line 15: ....

''' Suppose word4 is not found in the txt file then it should continue and move to next word found''' 
**********word5--SECTION**********
line 9: word5
line 10: diow299 udhgbhdi
line 11: jkdkjd

...
...
...
...

'''Continue till the end of list of words '''

Approach:

list_of_words = ['word1', 'word2', 'word3', 'word4', 'word5', 'word6', ....]

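Find the start_line of each word in list_of_words and store them in a list.

Then find the end_line of each word by sorting that list, so that the nearest following start line is easy to look up.

Then print each section found together with its line numbers: line_in_text_file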
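
The three steps above can be sketched in plain Python roughly as follows. This is a minimal sketch rather than a definitive implementation: the function name `find_sections` is hypothetical, and it assumes a keyword only starts a section when it is the first token on a line (so that `word2` does not also match `word23`) and that only its first occurrence counts.

```python
def find_sections(lines, list_of_words):
    """Return {word: (start_line, end_line)}, 1-based and inclusive.

    Assumption: a keyword starts a section only when it is the first
    whitespace-separated token on a line; only its first occurrence counts.
    """
    wanted = set(list_of_words)
    starts = {}
    for num, line in enumerate(lines, 1):
        tokens = line.split()
        if tokens and tokens[0] in wanted and tokens[0] not in starts:
            starts[tokens[0]] = num

    # Sort by start line; each section ends on the line before the next start.
    ordered = sorted(starts.items(), key=lambda item: item[1])
    sections = {}
    for i, (word, start) in enumerate(ordered):
        if i + 1 < len(ordered):
            end = ordered[i + 1][1] - 1
        else:
            end = len(lines)  # the last section runs to the end of the file
        sections[word] = (start, end)
    return sections


sample = """word1
abcdef
ghis jsd sjdhd jshj
word2
dgjgj dhkjhf
khkhkjd
word23
dfjkg fjidkfh
word5
diow299 udhgbhdi
jkdkjd
word89
eyuiywiou299092
word3
""".splitlines()

words = ["word1", "word2", "word3", "word4", "word5", "word23", "word89"]
sections = find_sections(sample, words)
for word, (start, end) in sections.items():
    print(word, start, end)
```

Words that never appear in the file (here `word4`) are simply absent from the result, so the caller can skip them without special handling.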

Code used to get the line numbers (how can I create a variable for each n in list_of_words?):

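# Read the file once instead of reopening it for every word.
with open(file_txt, 'r', encoding="utf8") as f:
    data_file = f.readlines()

# One dict entry per word takes the place of "a variable for each n".
start_lines = {}
for n in list_of_words:
    for num, line in enumerate(data_file, 1):
        if n in line:
            start_lines[n] = num  # first line that contains the word
            break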

Code for finding the closest number in start_line_list that is greater than n_start_line (val):

def closest(array_list, val):
    # Smallest value strictly greater than val, or None if there is none
    # (the original raised IndexError for the last section).
    return min((j for j in array_list if j > val), default=None)
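
Since the start lines get sorted anyway, the same lookup could also be done with the standard-library `bisect` module instead of filtering the list on every call. The sketch below is an alternative to `closest`, not the asker's code; the name `next_start` and the choice to return `None` when there is no later section are assumptions.

```python
from bisect import bisect_right


def next_start(sorted_starts, val):
    """Return the smallest entry of sorted_starts strictly greater than val.

    sorted_starts must already be sorted in ascending order.
    Returns None when no entry is greater than val.
    """
    i = bisect_right(sorted_starts, val)
    return sorted_starts[i] if i < len(sorted_starts) else None


start_line_list = sorted([9, 1, 14, 4, 7, 12])
print(next_start(start_line_list, 4))   # start of the section after line 4
print(next_start(start_line_list, 14))  # no later section
```

The end line of a section is then `next_start(...) - 1`, or the last line of the file when the result is `None`.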

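pyparsing has a generator function, scanString, that yields the matched tokens along with the start and end positions of each match. Using the start position, call pyparsing's lineno function to get the line number of the match.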

import pyparsing as pp

marker = pp.oneOf("word1 word2 word3 word4 word5 word23")

txt = """\
word1
abcdef
ghis jsd sjdhd jshj
word2
dgjgj dhkjhf
khkhkjd
word23
dfjkg fjidkfh
word5
diow299 udhgbhdi word2
jkdkjd
word89
eyuiywiou299092    
word3
"""

previous = None
for t, s, e in (pp.LineStart() + marker | pp.StringEnd()).scanString(txt):
    current_line_number = pp.lineno(s, txt)
    if t:
        current = t[0]
        if previous is not None:
            print(previous, "ended on line", current_line_number - 1)
        print("found", current, "on line", current_line_number)
        previous = current
    else:
        if previous is not None:
            print(previous, "ended on line", current_line_number)

This prints:

found word1 on line 1
word1 ended on line 3
found word2 on line 4
word2 ended on line 6
found word23 on line 7
word23 ended on line 8
found word5 on line 9
word5 ended on line 13
found word3 on line 14
word3 ended on line 15

You should be able to take it from here.
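
As one possible way of taking it from there, the sketch below prints sections in the layout requested in the question once each keyword's start and end line are known. The function name `format_sections` and the `(word, start, end)` tuple shape are assumptions made for illustration; the ranges could come from the scan above or from any other method.

```python
def format_sections(lines, ranges):
    """Return the report as a single string.

    lines  -- the file's lines, without trailing newlines
    ranges -- iterable of (word, start, end) with 1-based, inclusive line
              numbers, already in the order they should be reported
    """
    ranges = list(ranges)
    found = ", ".join(word for word, _, _ in ranges)
    out = ["Sections Found: [" + found + "]"]
    for word, start, end in ranges:
        out.append("")
        out.append("*" * 10 + word + "--SECTION" + "*" * 10)
        # Clamp the end in case it was reported past the last line of the file.
        for num in range(start, min(end, len(lines)) + 1):
            out.append(f"line {num}: {lines[num - 1]}")
    return "\n".join(out)


lines = ["word1", "abcdef", "ghis jsd sjdhd jshj",
         "word2", "dgjgj dhkjhf", "khkhkjd"]
report = format_sections(lines, [("word1", 1, 3), ("word2", 4, 6)])
print(report)
```

The function keeps the order of `ranges` as given, so the caller decides whether sections are reported in file order or in the order of the word list.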
