
Regex Python [python-2.7]

I'm working on a Python program that sifts through a .txt file to find the genus and species name. The lines are formatted like this (yes, the equals signs are consistently around the common name):

1. =Common Name= Genus Species some other words that I don't want.
2. =Common Name= Genus Species some other words that I don't want.

I can't seem to figure out a regex that will match only the genus and species and not the common name. I know the equals signs (=) will probably help in some way, but I cannot think of how to use them.

Edit: Some real data:

1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species, chiefly interior regions of North America.

2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north; range, all of North America.

3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last.

4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident; rare in eastern, common in western Colorado; breeds from plains to 8,000 feet; partial to alkali lakes; western species.

You probably don't need a regex for this one. If the order and the count of the words are always the same, you can just split each line into a list of substrings and take the third (genus) and the fourth (species) element of that list. The code will probably look like this:

with open('myfilename.txt', 'r') as myfile:
    for line in myfile:
        words = line.split()
        if len(words) < 4:
            continue  # skip blank or too-short lines
        genus, species = words[2], words[3]

It just looks a little more "pythonic" to me.
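As a quick check of what whitespace splitting returns, here is a small sketch on two made-up lines (both sample strings are assumptions for illustration, not data from the question). It shows that indices 2 and 3 only line up when the common name is a single word:

```python
# Hypothetical sample lines: one with a single-word common name, one with two words.
single = "1. =Grebe= Genus Species some other words"
multi = "1. =Common Name= Genus Species some other words"

single_words = single.split()
multi_words = multi.split()

# With a one-word common name, the genus and species sit at indices 2 and 3.
print(single_words[2] + " " + single_words[3])

# With a two-word common name, everything shifts by one position.
print(multi_words[2] + " " + multi_words[3])
```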

If the common name can consist of multiple words, the code above will return an incorrect result. To get the right result in that case too, you can use this code:

with open('myfilename.txt', 'r') as myfile:
    for line in myfile:
        # Take the text after the second "=". If the results are wrong, try
        # index 1 or 3 instead of 2: the right index depends on how many "="
        # characters precede the genus on each line.
        words = line.split('=')[2].split()
        genus, species = words[0], words[1]
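To try the split('=') approach without a file, here is a minimal sketch run on lines shaped like the real data in the question. The helper name genus_species, the blank-line guard, and the stripping of the trailing period are my own assumptions, not part of the original answer:

```python
def genus_species(line):
    """Return (genus, species) from a line like '1. =Name.= GENUS SPECIES. ...',
    or None if the line does not have that shape (hypothetical helper)."""
    parts = line.split('=')
    if len(parts) < 3:
        return None  # blank line or no "=Common name=" section
    words = parts[2].split()
    if len(words) < 2:
        return None
    # The real data ends the scientific name with a period, so strip it.
    return words[0], words[1].rstrip('.')

sample_lines = [
    "2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north.",
    "",
    "3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant.",
]

results = [genus_species(line) for line in sample_lines]
results = [r for r in results if r is not None]
for genus, species in results:
    print(genus + " " + species)
```

Note that this sketch keeps only the first two words, so a three-part name such as COLYMBUS NIGRICOLLIS CALIFORNICUS would lose its last part.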

If capturing the words in groups is enough (and you don't need a direct match), you can try:

(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))


The desired values will be in the groups <genus> and <species>. The whole regex is a positive lookahead, so it matches a zero-width position at the start of the line, but it still captures content into the groups.

  • (?=\d\.\s*=[^=]+=\s - a digit followed by a dot, optional whitespace, some content between equals signs, and one whitespace character,
  • (?:(?P<genus>\w+)\s(?P<species>\w+))) - captures the first word into the genus group and the second word into the species group.
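A minimal sketch of how the lookahead pattern above could be used from Python with the standard re module. The sample line is taken from the question's data; the variable names are only illustrative:

```python
import re

# The lookahead pattern from the answer, compiled as a raw string.
pattern = re.compile(r'(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))')

line = "3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last."

m = pattern.search(line)
if m:
    # The overall match is empty (zero-width); the values live in the named groups.
    genus = m.group('genus')
    species = m.group('species')
    print(genus + " " + species)
```

On Python 2.7, \w matches only ASCII word characters unless the text is a unicode string and re.UNICODE is passed, so a name such as ÆCHMOPHORUS would probably need that flag.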

You can try something like:

import re

txt='1. =Common Name= Genus Species some other words that I don\'t want.'

re1='.*?'   # Non-greedy match on filler
re2='(?:[a-z][a-z]+)'   # Uninteresting: word
re3='.*?'   # Non-greedy match on filler
re4='(?:[a-z][a-z]+)'   # Uninteresting: word
re5='.*?'   # Non-greedy match on filler
re6='((?:[a-z][a-z]+))' # Word 1
re7='.*?'   # Non-greedy match on filler
re8='((?:[a-z][a-z]+))' # Word 2

rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
if m:
    word1=m.group(1)
    word2=m.group(2)
    print("("+word1+")"+"("+word2+")"+"\n")

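For your test input as shown in txt, this will print the output below. Because the pattern skips exactly two words before capturing, it assumes the common name contains exactly two words.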

(Genus)(Species)

You can use this awesome site to help build regexes like this!

Hope this helps
