Regular expression to extract contents between two specific words using Python (or NLTK)
I'm trying to build a class that treats each poem as an object with three attributes: title (the text that follows "POEM:"), author and content. I have extracted the titles and authors and put them in lists. However, I don't know how to extract the content and put it into a list.
I have a txt file which includes many poems. Sample poems are:
POEM: lala AUTHOR: la
aaaaaaaaaaaaaa,
aaaaaaaaa,
akaaaaaaaa
POEM: alal AUTHOR: al
llllllllllll,
llllll.
llllllll,
lllllllllll
POEM: lal AUTHOR:as
sssssssss,
sssssss,
sssssss
This is what I did:
import re
f=open('Poems.txt', 'r')
data=f.read().replace('\n','')
re.findall(r"^POEM:.*?(?=POEM)",data)
I want to get all the poems as separate strings in a list, but I can only get the first poem:
'POEM: lala AUTHOR: la, aaaaaaaaaaaaaa, aaaaaaaaa, akaaaaaaaa'
A much easier solution that does not use regular expressions, explained step by step.
First, open the file and read its contents:
f=open('Poems.txt', 'r').read()
Splitting on the POEM keyword and re-adding it to each piece gives the list of poems in the form shown at the end of your question:
poems_list = ["POEM" + s for s in f.split("POEM")]
Then delete the first element, which carries no poem: the file starts with the keyword, so the split yields an empty string in front of it.
poems_list.pop(0)
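A minimal sketch of the two steps above, runnable without Poems.txt. The variable `text` and its shortened sample are stand-ins for the file contents and are not part of the original answer:

```python
# Stand-in for open('Poems.txt', 'r').read(); shortened sample for illustration.
text = "POEM: lala AUTHOR: la\naaaa,\naaa\nPOEM: alal AUTHOR: al\nllll,\nll\n"

# str.split removes the separator, so the keyword is put back on every piece.
poems_list = ["POEM" + s for s in text.split("POEM")]

# The text starts with the keyword, so the first piece holds no poem at all.
poems_list.pop(0)

for poem in poems_list:
    print(repr(poem))
```

Each remaining element starts with "POEM:" and still ends with the newline that preceded the next poem, which is why later steps may want to strip whitespace.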
Up to here, poems_list gives us the list of poem strings asked for in the question. But if you actually want to parse the data, which I guess was your intention in using regex, you can go ahead and do the following.
We go over each poem in the list to analyse the data it contains:
for poem in poems_list:
First we split on the POEM keyword. Remember that there must be a space between the colon and the poem name, or it won't work (without modifying the code):
    i1 = poem.split('POEM: ')
Now we split on the AUTHOR keyword, again with the surrounding spaces as they appear in the file. We take the second element of i1 because the first one is the empty string in front of the keyword; the rest of the text is in the second element. Note that a header written without the space, like AUTHOR:as in the third sample poem, is not split here and makes i2[1] below raise an IndexError.
    i2 = i1[1].split(' AUTHOR: ')
Again we take the second element of the list to get the remaining part of the text. We split it on the newline character because the poem begins on the line after its author:
    i3 = i2[1].split('\n')
We save the values that we have obtained. The first element of i3 is the author and all remaining lines are the content, so they are joined back together:
    poem_name = i2[0]
    poem_author = i3[0]
    poem_content = '\n'.join(i3[1:]).strip()
And now it's your turn to process the data however you wish. I recommend storing it in a dictionary.
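One way to follow the dictionary suggestion is sketched below. It is an alternative to the chain of splits, not the answer's own code: the helper name `parse_poem` and the shortened sample text are made up for illustration, and `str.partition` is used so that "AUTHOR:" is handled with or without a following space. It assumes the word POEM never occurs inside a poem's verses.

```python
def parse_poem(poem):
    """Turn one 'POEM: ... AUTHOR: ...' chunk into a dictionary (hypothetical helper)."""
    header, _, body = poem.partition("\n")               # first line vs. the verses
    title_part, _, author = header.partition("AUTHOR:")  # tolerant of 'AUTHOR:as'
    title = title_part.replace("POEM:", "", 1).strip()
    return {"title": title, "author": author.strip(), "content": body.strip()}


# Shortened stand-in for the contents of Poems.txt.
text = "POEM: lala AUTHOR: la\naaaa,\naaa\nPOEM: lal AUTHOR:as\nssss,\nss\n"

chunks = ["POEM" + s for s in text.split("POEM")][1:]  # drop the empty first piece
poems = [parse_poem(chunk) for chunk in chunks]

for poem in poems:
    print(poem)
```

Keeping each poem as a dictionary means the whole collection is a plain list that can later be handed to csv, pickle or pandas without further parsing.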
All the code without explanation (for copy-paste):
f=open('Poems.txt', 'r').read()
poems_list = ["POEM" + s for s in f.split("POEM")]
poems_list.pop(0)
for poem in poems_list:
    i1 = poem.split('POEM: ')
    i2 = i1[1].split(' AUTHOR: ')
    i3 = i2[1].split('\n')
    poem_name = i2[0]
    poem_author = i3[0]
    poem_content = '\n'.join(i3[1:]).strip()  # every line after the author line
I do not recommend storing your data in a file like that. It is very inefficient, and tiny modifications to the file would break the code, which would then need major changes. Storing the dictionaries in a database, with pandas, in CSV format or even with pickle is much more advisable, or at least format the file a little more consistently.
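As an illustration of the CSV option, here is a small sketch using the standard csv module. The list `poems` is hand-written sample data, and an in-memory `io.StringIO` buffer stands in for a real file (a name such as poems.csv would be an assumption), so the example runs without touching the disk:

```python
import csv
import io

# Hand-written sample data in the dictionary form suggested above.
poems = [
    {"title": "lala", "author": "la", "content": "aaaa,\naaa"},
    {"title": "lal", "author": "as", "content": "ssss,\nss"},
]

# Stand-in for a real file opened with open(..., "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "author", "content"])
writer.writeheader()
writer.writerows(poems)  # fields with commas or newlines are quoted automatically

# Read the data back to check that nothing was lost.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored == poems)
```

Because the csv module quotes fields that contain commas or line breaks, multi-line poem content survives the round trip without any custom delimiter handling.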
See regex in use here
\s*(?=POEM:)
Note: The regex above simply matches whitespace and asserts the position with a positive lookahead. See the explanation for more details.
See code in use here
The basics
import re
s = "Your string here"
r = r"\s*(?=POEM:)"
print([p for p in re.split(r, s) if p])  # Python 3.7+ also splits on empty matches, so drop empty strings
In practice (with your sample string)
import re
s = """POEM: lala AUTHOR: la
aaaaaaaaaaaaaa,
aaaaaaaaa,
akaaaaaaaa
POEM: alal AUTHOR: al
llllllllllll,
llllll.
llllllll,
lllllllllll
POEM: lal AUTHOR:as
sssssssss,
sssssss,
sssssss"""
r = r"\s*(?=POEM:)"
print([p for p in re.split(r, s) if p])  # Python 3.7+ also splits on empty matches, so drop empty strings
[
'POEM: lala AUTHOR: la\naaaaaaaaaaaaaa,\naaaaaaaaa,\nakaaaaaaaa',
'POEM: alal AUTHOR: al\nllllllllllll,\nllllll.\nllllllll,\nlllllllllll',
'POEM: lal AUTHOR:as\nsssssssss,\nsssssss,\nsssssss'
]
Explanation
\s* Match any number of whitespace characters
(?=POEM:) Positive lookahead ensuring what follows matches POEM: literally
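If the goal is the class described in the question, the split can be replaced by a single pattern with named groups. The following is a sketch under assumptions, not part of either answer: the names `Poem` and `POEM_RE` are hypothetical, every poem is assumed to carry both "POEM:" and "AUTHOR:" on its first line, and `dataclasses` needs Python 3.7 or later.

```python
import re
from dataclasses import dataclass


@dataclass
class Poem:
    title: str
    author: str
    content: str


# re.DOTALL lets '.' cross line breaks so the content group can span several lines.
POEM_RE = re.compile(
    r"POEM:\s*(?P<title>.*?)\s*AUTHOR:\s*(?P<author>[^\n]*)\n"  # header line
    r"(?P<content>.*?)"                                          # verses, matched lazily
    r"\s*(?=POEM:|\Z)",                                          # stop at next poem or end
    re.DOTALL,
)

text = """POEM: lala AUTHOR: la
aaaaaaaaaaaaaa,
aaaaaaaaa,
akaaaaaaaa
POEM: alal AUTHOR: al
llllllllllll,
llllll.
llllllll,
lllllllllll
POEM: lal AUTHOR:as
sssssssss,
sssssss,
sssssss"""

poems = [
    Poem(m["title"], m["author"].strip(), m["content"])
    for m in POEM_RE.finditer(text)
]

for poem in poems:
    print(poem.title, "|", poem.author, "|", repr(poem.content))
```

The lookahead plays the same role as in the split pattern above: it marks where a poem ends without consuming the next "POEM:" keyword, and the `\Z` alternative lets the last poem match even though nothing follows it.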