Extract the characters between two patterns using Python

Question

I have a file with many lines, and I want to extract the following items into a list: ['sheep', 'cow', 'buffalo']

animal wild -list {
    tiger lion hyena
}
aaaa 
bbbb 
cccc 
animal domesticated_0 -list {
    sheep
}
dddd 
animal domesticated_1 -list {
    cow buffalo
}
eeee

I am using the code below, but it is far from what I want.

temp_list = ['domesticated_0','domesticated_1']
start = False

for i in temp_list:
   for line in file:   
      if start:
         f1.write(line)
         if li.endswith("}"):
            start = False
      elif not li.startswith("animal"):
         start = False
      elif li.startswith("animal") and i in line:
         f1.write(line)
         start = True
         if li.endswith("}"):
            start = False

Answer 1

This solution uses a regular expression:

(?:\banimal\s+domesticated_[01]\s+-list\s+{\s*)((?:\b\w+\b(?:\s*))+)(?:})

(?:\banimal\s+domesticated_[01]\s+-list\s+{\s*) matches animal on a word boundary, followed by one or more whitespace characters, domesticated_, either 0 or 1, one or more whitespace characters, -list, one or more whitespace characters, {, and zero or more whitespace characters, all in a non-capturing group.
((?:\b\w+\b(?:\s*))+) matches one or more occurrences of a word on a word boundary followed by zero or more whitespace characters (Group 1).
(?:}) matches } in a non-capturing group.
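The three parts described above can also be laid out one per line with re.VERBOSE, which lets each part carry its own comment. This is only a sketch: the name ANIMAL_BLOCK and the sample string are made up for illustration, and the braces are escaped here purely for readability.

```python
import re

# Same pattern as above, split into its three parts.
# ANIMAL_BLOCK is a hypothetical name chosen for this sketch.
ANIMAL_BLOCK = re.compile(r"""
    (?:\banimal\s+domesticated_[01]\s+-list\s+\{\s*)  # header, up to and including the opening brace
    ((?:\b\w+\b(?:\s*))+)                             # group 1: the words between the braces
    (?:\})                                            # the closing brace
""", re.VERBOSE)

# Made-up sample block in the same shape as the question's input.
sample = "animal domesticated_1 -list {\n    cow buffalo\n}"

match = ANIMAL_BLOCK.search(sample)
words = match.group(1).split()  # split the captured text on any whitespace
print(words)
```

Only group 1 is captured, so match.group(1) holds just the animal names and the whitespace between them.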

After the regex above captures a string of animals, for example 'cow buffalo ', the trailing whitespace is stripped, the string is split on whitespace, and the resulting words are appended to a list of animals.

The code:

import re

text = """
animal   wild  -list  {
                            tiger
                           lion
                          hyena
         }
aaaa
bbbb
cccc
animal   domesticated_0  -list  {sheep}
dddd
animal   domesticated_1  -list  {
                            cow
                           buffalo
         }
eeee """

animals = []
for m in re.finditer(r'(?:\banimal\s+domesticated_[01]\s+-list\s+{\s*)((?:\b\w+\b(?:\s*))+)(?:})', text):
    animals.extend(re.split(r'\s+', m.group(1).strip()))
print(animals)

Prints:

['sheep', 'cow', 'buffalo']

You can and should replace the regex with:

(?:\banimal\s+domesticated_\d+\s+-list\s+{\s*)((?:\b\w+\b(?:\s*))+)(?:})

if domesticated_ can be followed by numbers other than 0 and 1.
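A quick way to see the difference between [01] and \d+ is to run both patterns on a block with a larger suffix. The input below (domesticated_12 and goat) is invented for this sketch, and the patterns are lightly simplified versions of the ones above with the braces escaped.

```python
import re

# Hypothetical input with a two-digit suffix, made up for this sketch.
text = "animal domesticated_12 -list { goat }"

narrow = r'\banimal\s+domesticated_[01]\s+-list\s+\{\s*((?:\b\w+\b\s*)+)\}'
broad = r'\banimal\s+domesticated_\d+\s+-list\s+\{\s*((?:\b\w+\b\s*)+)\}'

# [01] accepts exactly one digit, so "12" is rejected.
narrow_hits = re.findall(narrow, text)

# \d+ accepts any run of digits, so the block is found.
broad_hits = [word for block in re.findall(broad, text) for word in block.split()]

print(narrow_hits)
print(broad_hits)
```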

See Demo

Answer 2

This sounds like a job for regular expressions to me.

I would do something like this:

# Example of input
txt = """

animal   wild  -list  {
                            tiger
                           lion
                          hyena
         } 
aaaa
bbbb
cccc
animal   domesticated_0  -list  {sheep}  
dddd
animal   domesticated_1  -list  {
                            cow
                           buffalo
         }
eeee 
"""

import re

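# Raw string avoids invalid-escape warnings; \d+ allows multi-digit suffixes.
animals = re.findall(r"animal\s+domesticated_\d+\s+-list\s+[{]\s*([^},]+)[}]", txt)
# Split each captured block on any whitespace, so names on one line are separated too.
animals = [a for block in animals for a in block.split()]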
print(animals)

The code above outputs: ['sheep', 'cow', 'buffalo']

Answer 3

I just wrote a brute-force solution that does not use regex:

a="""animal   wild  -list  {
                            tiger
                           lion
                          hyena
         } 
aaaa
bbbb
cccc
animal   domesticated_0  -list  {sheep}  
dddd
animal   domesticated_1  -list  {
                            cow
                           buffalo
         }
eeee """
temp_list = ['domesticated_0','domesticated_1']
output = []


def getcontent(index,content):
  temp_answer = []
  while(index < len(content)):
    temp_answer.append(content[index])
    if '}' in content[index]:
      break
    index+=1
  answerwithbrackets = ' '.join(temp_answer)  # join with a space so words on adjacent lines are not fused
  index1=answerwithbrackets.index('{')
  index2=answerwithbrackets.index("}")
  return [answerwithbrackets[index1 + 1:index2 ].split(),index]



index =0
content = a.split('\n')
while (index < len(content)):
  for word in temp_list:
    if word in content[index]:
      tempoutput =getcontent(index,content)
      index = tempoutput[1]
      output.extend(tempoutput[0])
  index+=1
print(output)

Output

['sheep', 'cow', 'buffalo']
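Since the question reads from a file, the same no-regex idea can also be sketched as a single pass over the lines with an "inside a block" flag. This is only an illustration under a few assumptions: the opening { sits on the header line, as in the question's input, and extract_lists, sample, and the io.StringIO stand-in for the file are names invented here.

```python
import io

def extract_lists(lines, names):
    """Collect the words inside each 'animal <name> -list { ... }' block whose name is in names."""
    found = []
    inside = False  # True while we are between the braces of a wanted block
    for line in lines:
        if not inside:
            parts = line.split()
            is_header = (len(parts) >= 3 and parts[0] == "animal"
                         and parts[1] in names and "{" in line)
            if not is_header:
                continue
            inside = True
            line = line.split("{", 1)[1]  # keep only the text after the opening brace
        # Take everything up to the closing brace, if it is on this line.
        body, brace, _ = line.partition("}")
        found.extend(body.split())
        if brace:
            inside = False
    return found

# Stand-in for the real file, using the layout from the question.
sample = io.StringIO("""animal wild -list {
    tiger lion hyena
}
aaaa
animal domesticated_0 -list {
    sheep
}
dddd
animal domesticated_1 -list {
    cow buffalo
}
eeee
""")

result = extract_lists(sample, {"domesticated_0", "domesticated_1"})
print(result)
```

Comparing parts[1] against the set of names is an exact match, so a name such as domesticated_1 would not accidentally match domesticated_10. With a real file, the open file object can be passed in place of sample.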


Answer 1: 1 vote, accepted, 2019-11-29 12:54:24

Answer 2: 0 votes, 2019-11-29 12:21:37

Answer 3: 0 votes, 2019-11-29 12:57:21
