
Regular expression to extract contents between two specific words using Python (or NLTK)

I'm trying to build a class that treats each poem as an object with three attributes: the title (which follows "POEM:"), the author, and the content. I have extracted the titles and authors and put them in a list. However, I don't know how to extract the content and put it into a list.

I have a txt file that contains many poems. Sample poems are:

POEM: lala AUTHOR: la
aaaaaaaaaaaaaa,
aaaaaaaaa,
akaaaaaaaa

POEM: alal AUTHOR: al
llllllllllll,
llllll.

llllllll,
lllllllllll

POEM: lal AUTHOR:as
sssssssss,
sssssss,
sssssss

This is what I did:

import re
f=open('Poems.txt', 'r')
data=f.read().replace('\n','')
re.findall(r"^POEM:.*?(?=POEM)",data)

I want to get all the poems as separate strings in a list, but I can only get the first poem:

'POEM: lala AUTHOR: la, aaaaaaaaaaaaaa, aaaaaaaaa, akaaaaaaaa'

Here is a much easier solution that does not use regular expressions, with an explanation.

Explanation line by line

First, open the file and read its contents:

f=open('Poems.txt', 'r').read()

Splitting on "POEM" and putting the marker back in front of each piece gives a list of poems in the form shown in the last part of your question:

poems_list = ["POEM" + s for s in f.split("POEM")]

We delete the first element because nothing precedes the first "POEM" in the file, so the split leaves an empty piece there and that element is just the bare "POEM" prefix:

poems_list.pop(0)

Up to here, poems_list gives us what the other user is posting in his question. But if you actually want to parse the data, which I guess was your intention in using regex, you can go ahead and do the following:

We go over each poem in the poem list to analyse the data it contains:

for poem in poems_list:

First we split it on the poem keyword. Remember that you must leave a space between the colon and the poem name, or it won't work (without modifying the code):

    i1 = poem.split('POEM: ')

Now we split it on the author keyword, again keeping the surrounding spaces as appropriate. We take the second element of i1 because the first one is the empty string that precedes "POEM: "; the poem name and the rest of the content are in the second element of the list:

    i2 = i1[1].split(' AUTHOR: ')

Again we take the second element of the list to get the remaining part of the text. We split it at the newline because the poem body begins on the line after the author's name:

    # maxsplit=1 keeps the whole poem body together in i3[1]
    i3 = i2[1].split('\n', 1)

We save the values that we have obtained:

    poem_name = i2[0]
    poem_author = i3[0]
    poem_content = i3[1]

And now it's your turn to process the data however you wish. I recommend storing it in a dictionary.
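The dictionary idea could look roughly like the sketch below. It is only an illustration: parse_poems and the key names title, author and content are hypothetical, and it uses str.partition instead of the split calls above so that a header such as "AUTHOR:as" (no space after the colon) is still handled. It assumes the text "POEM:" never appears inside a poem body.

```python
def parse_poems(text):
    """Split raw text into a list of {'title', 'author', 'content'} dicts."""
    poems = []
    # Every chunk after the first one starts right after a "POEM:" marker.
    for chunk in text.split("POEM:")[1:]:
        # The first line is the header; everything after it is the body.
        header, _, body = chunk.partition("\n")
        # partition tolerates "AUTHOR:as" as well as "AUTHOR: as".
        title, _, author = header.partition("AUTHOR:")
        poems.append({
            "title": title.strip(),
            "author": author.strip(),
            "content": body.strip(),
        })
    return poems


sample = """POEM: lala AUTHOR: la
aaaaaaaaaaaaaa,
aaaaaaaaa,
akaaaaaaaa

POEM: alal AUTHOR: al
llllllllllll,
llllll.

llllllll,
lllllllllll

POEM: lal AUTHOR:as
sssssssss,
sssssss,
sssssss"""

poems = parse_poems(sample)
print(len(poems))          # 3
print(poems[2]["author"])  # as
```

Each dictionary keeps the full multi-line body under "content", including the blank line inside the second poem.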

The full code

All the code without explanation (for copy-paste):

f=open('Poems.txt', 'r').read()
poems_list = ["POEM" + s for s in f.split("POEM")]
poems_list.pop(0)

for poem in poems_list:

    i1 = poem.split('POEM: ')
    i2 = i1[1].split(' AUTHOR: ')
    # maxsplit=1 keeps the whole poem body together in i3[1]
    i3 = i2[1].split('\n', 1)

    poem_name = i2[0]
    poem_author = i3[0]
    poem_content = i3[1]

Further thoughts

I do not recommend storing your data in a file like that. It is very inefficient, and it is fragile: a tiny change in the file's formatting can break the code, which would then need substantial changes. Using a database, pandas, the csv format, or even pickle to store dictionaries is much more advisable; at the very least, give the file a more regular format.
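As one possible illustration of the csv option, the sketch below writes a couple of poem records and reads them back with Python's standard csv module. The records and field names are made up for the example, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical records, shaped like poems parsed from the sample file.
poems = [
    {"title": "lala", "author": "la",
     "content": "aaaaaaaaaaaaaa,\naaaaaaaaa,\nakaaaaaaaa"},
    {"title": "lal", "author": "as",
     "content": "sssssssss,\nsssssss,\nsssssss"},
]

# Write to an in-memory buffer; for a real file you would use something
# like open("poems.csv", "w", newline="") instead (file name is hypothetical).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "author", "content"])
writer.writeheader()
writer.writerows(poems)

# Read the rows back. The csv module quotes fields that contain commas
# or line breaks, so multi-line poem bodies survive the round trip.
buffer.seek(0)
restored = [dict(row) for row in csv.DictReader(buffer)]
print(restored == poems)  # True
```

The point of the design is that quoting and escaping are handled by the library, so the poem text itself no longer has to avoid special marker words.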

Code

See regex in use here

\s*(?=POEM:)

Note: The regex above simply matches whitespace and asserts, with a positive lookahead, that POEM: follows at that position. See the explanation for more details.

Usage

See code in use here

The basics

import re

s = "Your string here"
r = r"\s*(?=POEM:)"

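# Drop empty strings: Python 3.7+ also splits on the empty matches this pattern allows
print([part for part in re.split(r, s) if part])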

In practice (with your sample string)

import re

s = """POEM: lala AUTHOR: la
aaaaaaaaaaaaaa,
aaaaaaaaa,
akaaaaaaaa

POEM: alal AUTHOR: al
llllllllllll,
llllll.

llllllll,
lllllllllll

POEM: lal AUTHOR:as
sssssssss,
sssssss,
sssssss"""

r = r"\s*(?=POEM:)"

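# Drop empty strings: Python 3.7+ also splits on the empty matches this pattern allows
print([part for part in re.split(r, s) if part])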

Results

[
    'POEM: lala AUTHOR: la\naaaaaaaaaaaaaa,\naaaaaaaaa,\nakaaaaaaaa',
    'POEM: alal AUTHOR: al\nllllllllllll,\nllllll.\n\nllllllll,\nlllllllllll',
    'POEM: lal AUTHOR:as\nsssssssss,\nsssssss,\nsssssss'
]

Explanation

  • \s* Match any number of whitespace characters
  • (?=POEM:) Positive lookahead ensuring what follows matches POEM: literally
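For comparison, the re.findall attempt from the question can probably be made to work with a few changes. Without re.MULTILINE, "^" matches only at the very start of the string, so at most one poem can match; and the last poem has no following "POEM" for the lookahead to find. The sketch below drops the anchor, keeps the line breaks in the data, and lets the lookahead also accept the end of the input. The pattern is one possible fix, not the only one.

```python
import re

data = """POEM: lala AUTHOR: la
aaaaaaaaaaaaaa,
aaaaaaaaa,
akaaaaaaaa

POEM: alal AUTHOR: al
llllllllllll,
llllll.

llllllll,
lllllllllll

POEM: lal AUTHOR:as
sssssssss,
sssssss,
sssssss"""

# No "^" anchor, so a match may start at any "POEM:" in the string.
# re.DOTALL lets "." cross line breaks.
# The lookahead stops each match before the next "POEM:" (ignoring the
# whitespace in between) or at the end of the input (\Z) for the last poem.
pattern = re.compile(r"POEM:.*?(?=\s*POEM:|\s*\Z)", re.DOTALL)

poems = pattern.findall(data)
print(len(poems))  # 3
```

With the sample string this yields the same three strings as the re.split approach shown in the results.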


