Finding text between two specified words in Python, when one of the two words changes
Basically, I am trying to extract the text between two strings inside a loop, where one of the two words changes after each piece of information is extracted.
So, for example, the string is:
string = alpha 111 bravo 222 alpha somethingA end, 333 bravo somethingB end 444 alpha 555 bravo
I want to extract the text between alpha and end, and then between bravo and end. I have quite a few of these unique words in my file, so I have a list and a counter to go through them. See the code below:
string = 'alpha 111 bravo 222 alpha somethingA end, 333 bravo somethingB end 444 alpha 555 bravo'
words = ['alpha', 'bravo']  # there will be more words here
counter = 0
stringOut = ''

# going through the list of words
while counter < len(words):
    firstWord = words[counter]
    lastWord = 'end'
    data = string[string.find(firstWord)+len(firstWord):string.find(lastWord)].strip()
    # this gives the text between the first occurrence of "alpha" and "end";
    # since I want just the smallest string between "alpha" and "end", I use
    # another while loop to see if firstWord occurs again
    while firstWord in data:
        ignore, ignore2, data = data.partition(str(firstWord))
    counter = counter + 1
    stringOut += str(data) + str('\n')

print('output string is \n' + str(stringOut))
# This code gives the correct output for the text between the first word ("alpha") and "end".
# But when the list moves to the next word, "bravo", it takes the text between the first "bravo"
# and the "end" that was associated with the information required for "alpha" ("somethingA").
Any suggestions appreciated. Many thanks.
I morphed your request into a function (a generator). I hope this helps you :)
string = 'alpha 111 bravo 222 alpha somethingA end, 333 bravo somethingB end 444 alpha 555 bravo'
words = ['alpha', 'bravo']

def method(string, words, end_word):
    # each segment ends where an end_word used to be
    segments = string.split(end_word)
    counter = 0
    while counter < len(words):
        # keep only the text after the last occurrence of the word in its segment
        data = segments[counter].split(words[counter])[-1]
        counter += 1
        yield data.strip()

for r in method(string, words, 'end'):
    print(r)
>>>
somethingA
somethingB
Note: this solution works if the string is parsed forward and never needs to be looked back on.
Please note that, without further input from you, I do not know exactly how to restrict this, but at the moment the number of entries in words must be less than or equal to the number of occurrences of end_word in the string.
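One way to make the forward-only parsing explicit is to track a search position, so that each word is paired with the next end marker after the previous match rather than with the first one in the string. The following is a minimal sketch and not part of the answer above; the name extract_between and its parameters are illustrative assumptions, and it assumes the data for each word precedes the next end marker.

```python
def extract_between(text, words, end_word="end"):
    """Hypothetical helper: pair each word with the next end_word, scanning forward."""
    results = []
    pos = 0  # search position; only moves forward
    for word in words:
        end_idx = text.find(end_word, pos)
        if end_idx == -1:
            break  # no end marker left for the remaining words
        # last occurrence of the word before this end marker
        start_idx = text.rfind(word, pos, end_idx)
        if start_idx != -1:
            results.append(text[start_idx + len(word):end_idx].strip())
        # continue after this end marker, even if the word was not found
        pos = end_idx + len(end_word)
    return results


text = 'alpha 111 bravo 222 alpha somethingA end, 333 bravo somethingB end 444 alpha 555 bravo'
print(extract_between(text, ['alpha', 'bravo']))  # ['somethingA', 'somethingB']
```

Because the loop stops when no end marker is left, passing more words than there are end markers does not raise an error here; the extra words are simply skipped.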
Just use regex.
import re

string = 'alpha 111 bravo 222 alpha somethingA end, 333 bravo somethingB end 444 alpha 555 bravo'
words = ['alpha', 'bravo']  # there will be more words here

for word in words:
    # the greedy .* skips ahead to the last occurrence of word that is still followed by 'end'
    expr = re.compile(r'.*' + word + '(.+?)end')
    out = expr.findall(string)
    print(word + " => " + str(out[0]).strip())
Output:
>>>
alpha => somethingA
bravo => somethingB
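If the words may contain characters that are special in regular expressions, or a word may be missing from the string, the same pattern can be wrapped with re.escape and a guard for the no-match case. This is a minimal sketch under those assumptions; the function name last_span_before_end is hypothetical. Like the pattern above, it picks the last occurrence of the word that is still followed by the end marker, which happens to be the wanted one in this example string.

```python
import re


def last_span_before_end(text, word, end_word="end"):
    """Hypothetical helper: text between the last `word` that is followed by `end_word`."""
    # re.escape keeps characters such as '.' or '+' in the words literal
    pattern = re.compile(r".*" + re.escape(word) + r"(.+?)" + re.escape(end_word))
    match = pattern.search(text)
    # return None instead of raising IndexError when nothing matches
    return match.group(1).strip() if match else None


text = 'alpha 111 bravo 222 alpha somethingA end, 333 bravo somethingB end 444 alpha 555 bravo'
for word in ['alpha', 'bravo', 'zulu']:
    print(word, '=>', last_span_before_end(text, word))
```

Note that "." does not match newlines by default, so for multi-line input the pattern would need the re.DOTALL flag.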
Using your new subset:
string = 'alpha bravo ... alpha charlie somethingAC end ... ... bravo delta somethingBD end alpha ... bravo ...'
words = ['alpha','bravo','charlie','delta']

def method(string, words, end_word, single=True):
    segments = string.split(end_word)
    for word in words:
        for segment in segments:
            if word in segment:
                data = segment.split(word)[-1]
                yield (word, data.strip())
                if single:
                    break
Notice the new argument: single.
By default, only one result per word is yielded. If you pass single=False, it searches for each word in every segment of the string instead. Since I am not sure which behavior you want, you can always remove the argument later.
# each word only once
for r in method(string, words, 'end'):
    print(r)
>>>
('alpha', 'charlie somethingAC')
('bravo', '... alpha charlie somethingAC')
('charlie', 'somethingAC')
('delta', 'somethingBD')
and:
# each word for each segment
for r in method(string, words, 'end', False):
    print(r)
>>>
('alpha', 'charlie somethingAC')
('alpha', '... bravo ...')
('bravo', '... alpha charlie somethingAC')
('bravo', 'delta somethingBD')
('bravo', '...')
('charlie', 'somethingAC')
('delta', 'somethingBD')
As a bonus, I am including the same logic as a generator expression wrapped around a list comprehension:
def method1(string, words, end_word, single=True):
    return ([(word, segment.split(word)[-1]) for segment in string.split(end_word) if word in segment][:(1 if single else None)] for word in words)
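Unlike method, method1 yields one list of tuples per word and does not strip the surrounding whitespace. The sketch below shows one way to flatten and strip its results so they line up with the output of method; the helper flat_results is a hypothetical addition, not part of the answer.

```python
from itertools import chain


def method1(string, words, end_word, single=True):
    # same logic as above, reproduced so the example is self-contained
    return ([(word, segment.split(word)[-1])
             for segment in string.split(end_word) if word in segment][:(1 if single else None)]
            for word in words)


def flat_results(string, words, end_word, single=True):
    """Hypothetical helper: flatten the per-word lists and strip each result."""
    grouped = method1(string, words, end_word, single)
    return [(word, data.strip()) for word, data in chain.from_iterable(grouped)]


string = 'alpha bravo ... alpha charlie somethingAC end ... ... bravo delta somethingBD end alpha ... bravo ...'
words = ['alpha', 'bravo', 'charlie', 'delta']
for r in flat_results(string, words, 'end'):
    print(r)
```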