Extract characters between two patterns using Python
I have a file with many lines, and I want to extract the following items into a list: ['sheep', 'cow', 'buffalo']
animal wild -list {
tiger lion hyena
}
aaaa
bbbb
cccc
animal domesticated_0 -list {
sheep
}
dddd
animal domesticated_1 -list {
cow buffalo
}
eeee
I am using the code below, but it is far from what I want.
# `file` and `f1` are the input and output file handles, opened elsewhere
temp_list = ['domesticated_0', 'domesticated_1']
start = False
for i in temp_list:
    for line in file:
        li = line.strip()  # assumed: `li` was the stripped line (it was never defined)
        if start:
            f1.write(line)
            if li.endswith("}"):
                start = False
        elif not li.startswith("animal"):
            start = False
        elif li.startswith("animal") and i in line:
            f1.write(line)
            start = True
            if li.endswith("}"):
                start = False
This solution uses a regular expression:
(?:\banimal\s+domesticated_[01]\s+-list\s+{\s*)((?:\b\w+\b(?:\s*))+)(?:})
(?:\banimal\s+domesticated_[01]\s+-list\s+{\s*) matches animal on a word boundary, followed by one or more whitespace characters, followed by domesticated_, followed by either 0 or 1, followed by one or more whitespace characters, followed by -list, followed by one or more whitespace characters, followed by {, followed by zero or more whitespace characters, all in a non-capturing group.
((?:\b\w+\b(?:\s*))+) matches one or more occurrences of a word on a word boundary followed by zero or more whitespace characters (Group 1).
(?:}) matches } in a non-capturing group.
After a string of animals is captured by the above regex, for example 'cow buffalo ', trailing whitespace is removed, and the string is split on whitespace and appended to a list of animals.
The code:
import re
text = """
animal wild -list {
tiger
lion
hyena
}
aaaa
bbbb
cccc
animal domesticated_0 -list {sheep}
dddd
animal domesticated_1 -list {
cow
buffalo
}
eeee """
animals = []
for m in re.finditer(r'(?:\banimal\s+domesticated_[01]\s+-list\s+{\s*)((?:\b\w+\b(?:\s*))+)(?:})', text):
    animals.extend(re.split(r'\s+', m.group(1).strip()))
print(animals)
Prints:
['sheep', 'cow', 'buffalo']
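As a reading aid, the same pattern can be laid out with re.VERBOSE so that each piece described above sits on its own commented line. This is only a sketch: the names PATTERN and sample are made up for illustration, and the braces are written escaped (\{ and \}), which matches the same literal characters as the bare braces used above.

```python
import re

# Same pattern as above, spread out with re.VERBOSE.
# In verbose mode, unescaped whitespace in the pattern is ignored
# and `#` starts a comment that runs to the end of the line.
PATTERN = re.compile(
    r"""
    \b animal \s+                # the word "animal", then whitespace
    domesticated_[01] \s+        # domesticated_0 or domesticated_1
    -list \s+ \{ \s*             # "-list", whitespace, opening brace
    ( (?: \b\w+\b \s* )+ )       # group 1: one or more words
    \}                           # closing brace
    """,
    re.VERBOSE,
)

sample = "animal domesticated_1 -list {\ncow buffalo\n}"
match = PATTERN.search(sample)
captured = match.group(1)   # the raw text between the braces
names = captured.split()    # split on any run of whitespace
print(names)
```

Calling split() with no arguments drops leading and trailing whitespace on its own, so it can stand in for the strip() plus re.split(r'\s+', ...) combination.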
You can, and should, replace the regex with:
(?:\banimal\s+domesticated_\d+\s+-list\s+{\s*)((?:\b\w+\b(?:\s*))+)(?:})
if domesticated_ can be followed by any number besides 0 and 1.
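A minimal sketch of that generalization, wrapped in a helper so the block name is not hard-coded. The function name extract_lists and its prefix parameter are hypothetical, and the capture group is simplified to "anything that is not a brace", which is an assumption about the file format rather than part of the answer above.

```python
import re

def extract_lists(text, prefix="domesticated_"):
    """Return all words inside `animal <prefix><digits> -list { ... }` blocks."""
    pattern = re.compile(
        r"\banimal\s+" + re.escape(prefix) + r"\d+\s+-list\s+\{\s*([^{}]*)\}"
    )
    words = []
    for m in pattern.finditer(text):
        # Group 1 holds everything between the braces; split on whitespace.
        words.extend(m.group(1).split())
    return words

sample = """
animal wild -list { tiger lion }
animal domesticated_0 -list {sheep}
animal domesticated_12 -list {
cow buffalo
}
"""
print(extract_lists(sample))
```

Because \d+ accepts any run of digits, the domesticated_12 block is picked up as well, while the wild block is still skipped.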
This sounds like a job for regular expressions to me.
I would do something like this:
# Example of input
txt = """
animal wild -list {
tiger
lion
hyena
}
aaaa
bbbb
cccc
animal domesticated_0 -list {sheep}
dddd
animal domesticated_1 -list {
cow
buffalo
}
eeee
"""
import re
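# Raw string so \s and \d reach the regex engine unchanged; \d+ allows multi-digit suffixes
animals = re.findall(r"animal\s+domesticated_\d+\s+-list\s+\{\s*([^}]+)\}", txt)
# Split each captured block on any whitespace, so names sharing a line are separated too
animals = [name for block in animals for name in block.split()]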
print(animals)
The code above outputs: ['sheep', 'cow', 'buffalo']
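Since the question starts from a file rather than a string, here is a small sketch of the same idea applied to a file on disk. The helper name animals_from_file and the file name animals.txt are assumptions for illustration, and the whole file is read into memory at once, which is only reasonable for moderately sized inputs.

```python
import re
import tempfile
from pathlib import Path

# One capture group: everything between the braces of a domesticated_N block.
BLOCK = re.compile(r"animal\s+domesticated_\d+\s+-list\s+\{([^}]*)\}")

def animals_from_file(path):
    """Read the whole file and return the words found in matching blocks."""
    text = Path(path).read_text()
    return [name for block in BLOCK.findall(text) for name in block.split()]

# Demo with a throwaway file (the file name is made up for this example).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "animals.txt"
    demo.write_text("animal domesticated_0 -list {\nsheep\n}\nxxxx\n")
    print(animals_from_file(demo))
```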
I wrote a brute-force solution that does not use regex:
a="""animal wild -list {
tiger
lion
hyena
}
aaaa
bbbb
cccc
animal domesticated_0 -list {sheep}
dddd
animal domesticated_1 -list {
cow
buffalo
}
eeee """
temp_list = ['domesticated_0', 'domesticated_1']
output = []

def getcontent(index, content):
    # Collect lines from the header up to and including the line with '}'
    temp_answer = []
    while index < len(content):
        temp_answer.append(content[index])
        if '}' in content[index]:
            break
        index += 1
    # Join with a space so words on separate lines do not run together
    answerwithbrackets = ' '.join(temp_answer)
    index1 = answerwithbrackets.index('{')
    index2 = answerwithbrackets.index('}')
    return [answerwithbrackets[index1 + 1:index2].split(), index]

index = 0
content = a.split('\n')
while index < len(content):
    for word in temp_list:
        if word in content[index]:
            tempoutput = getcontent(index, content)
            index = tempoutput[1]  # jump to the line holding the closing brace
            output.extend(tempoutput[0])
            break  # this block is done; do not test the other names against it
    index += 1
print(output)
Output:
['sheep', 'cow', 'buffalo']
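For comparison, a line-by-line variant closer to the loop in the question might look like the sketch below. It is not taken from any of the answers: the generator name iter_block_words is made up, it assumes the opening brace is on the same line as the block header, and it matches names by substring just as the code above does.

```python
def iter_block_words(lines, names):
    """Yield words inside `{ ... }` blocks whose header mentions one of `names`."""
    inside = False
    for line in lines:
        if not inside:
            is_header = line.lstrip().startswith("animal") and any(n in line for n in names)
            if not is_header:
                continue
            # Keep only what follows the opening brace on the header line.
            _, brace, line = line.partition("{")
            if not brace:
                continue  # assumption: a header without '{' is ignored
            inside = True
        # Everything before '}' belongs to the block; '}' (if present) ends it.
        body, closing, _ = line.partition("}")
        yield from body.split()
        if closing:
            inside = False

sample = """animal wild -list {
tiger lion hyena
}
aaaa
animal domesticated_0 -list {
sheep
}
dddd
animal domesticated_1 -list {
cow buffalo
}
eeee"""
print(list(iter_block_words(sample.splitlines(), ["domesticated_0", "domesticated_1"])))
```

Because it only iterates over lines, the same function can be given an open file object instead of a list, which avoids loading the whole file into memory.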