Using re.MULTILINE and re.DOTALL together in Python

Question

Basically, the input file looks like this:

>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
  cds. #some records don't have this line (see below)
  Length = 2575
(some text)

>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
  Length = 2575
(some text)

(etc...)

I wrote this to extract the line starting with > and the number after Length:

import re
regex = re.compile("^(>.*)\r\n.*Length\s=\s(\d+)", re.MULTILINE)
match = regex.findall(sample_blast.read())

print match[0]

This extracts a record correctly when the Length line comes immediately after the > line.

Then I tried re.DOTALL, which should let .* span line breaks, so that (.*Length) matches any record whether or not it has the extra line.

regex = re.compile("^(>.*)\r\n.*(?:\r\n*.?)Length\s=\s(\d+)", re.MULTILINE|re.DOTALL)

But it doesn't work. I also tried re.MULTILINE and re.DOTALL (with the and keyword) instead of the pipe, and that doesn't work either.

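So the question is: how do I write a regex that matches a record and returns the desired groups whether or not the record has the extra line? It would be nice if someone could show this with re.VERBOSE. Sorry for the long post, and thanks in advance for any help. :)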

Answer 1

Your problem is probably your use of \r\n. Try using just \n instead:

>>> x = """
... >U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
... 
...        cds. #some records don't have this line (see below)
... 
...        Length = 2575
... (some text)
... 
... >U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
... 
...        Length = 2575
... (some text)
... 
... (etc...)
... """
>>> re.search("^(>.*)\n.*(?:\n*.?)Length\s=\s(\d+)", x, re.MULTILINE|re.DOTALL)
<_sre.SRE_Match object at 0x10c937e00>
>>> _.group(2)
'2575'
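
Whether the text contains \r\n or \n depends on how the file was read. In Python 3, for example, open() in text mode with the default newline setting translates \r\n to \n, so a literal \r\n in the pattern never matches. A minimal sketch of a pattern that tolerates both endings, assuming Python 3 and a shortened, made-up header line (it only covers the case where Length is on the line right after the header):

```python
import re

# Hypothetical sample data: the same record with Windows and Unix line endings.
windows_text = ">U51677 Human gene, complete\r\n       Length = 2575\r\n(some text)\r\n"
unix_text = windows_text.replace("\r\n", "\n")

# \r?\n accepts either line ending; [^\r\n]* keeps the header capture on one line.
pattern = re.compile(r"^(>[^\r\n]*)\r?\n.*Length\s=\s(\d+)", re.MULTILINE)

windows_matches = pattern.findall(windows_text)
unix_matches = pattern.findall(unix_text)
print(windows_matches)
print(unix_matches)
```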

Also, your first .* is too greedy. Try this instead: ^(>.*?)$.*?Length\s=\s(\d+)

>>> re.findall("^(>.*?)$.*?Length\s=\s(\d+)", x, re.MULTILINE|re.DOTALL)
[('>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete', '2575'), ('>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete', '2575')]
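
Since the question asked for a re.VERBOSE version, here is one way the same pattern could be laid out. This is a sketch in Python 3 syntax; the names record_re, sample and records are illustrative, and the sample assumes \n line endings:

```python
import re

# Same pattern as above, spread over several lines with comments.
# In VERBOSE mode, whitespace in the pattern is ignored and # starts a comment.
record_re = re.compile(
    r"""
    ^(>.*?)$        # group 1: the header line, from '>' to the end of that line
    .*?             # lazily skip everything, line breaks included (needs DOTALL)
    Length\s=\s     # the literal label 'Length = '
    (\d+)           # group 2: the digits of the length
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

sample = """\
>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
       cds.
       Length = 2575
(some text)

>U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
       Length = 2575
(some text)
"""

records = record_re.findall(sample)
for header, length in records:
    print(header, length)
```

One caveat with the lazy .*?: if a record has no Length line at all, the match runs on into the next record and picks up that record's length.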

Answer 2

Try this regex:

"^(>[^\r\n]*).*?Length\s=\s(\d+)"

Set both options (combine them with the pipe symbol).
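
To illustrate why the pipe matters: the flags are bit flags, so | combines them, whereas the and keyword is boolean logic and simply returns its second operand when the first is truthy. A small sketch in Python 3 syntax, using made-up two-record text and the lazy pattern from the first answer:

```python
import re

# Hypothetical two-record sample with \n line endings.
text = ">a\nLength = 1\n>b\nLength = 2\n"
pattern = r"^(>.*?)$.*?Length\s=\s(\d+)"

# Bitwise OR sets both flags.
both_flags = re.MULTILINE | re.DOTALL

# `and` does not combine anything: re.MULTILINE is truthy, so the
# expression evaluates to its second operand, re.DOTALL alone.
and_flags = re.MULTILINE and re.DOTALL

with_both = re.findall(pattern, text, both_flags)
with_and = re.findall(pattern, text, and_flags)

print(and_flags == re.DOTALL)
print(with_both)
print(with_and)
```

With only DOTALL set, ^ and $ no longer match at every line boundary, so the second call does not find the per-line records.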

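The first capture group matches the > line up to, but not including, the first line-break character, whichever line ending the operating system uses. Then .*? matches any characters until it meets the first Length. The rest is the same as your first attempt.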

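The problem with your earlier attempt seems to be that you used .*, which with re.DOTALL can match anything and is also greedy, so it consumes as much as possible, including the following Length = 2575.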
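
The difference between the greedy and the lazy quantifier can be seen on a small example. This is a sketch in Python 3 syntax; the record names and lengths are made up so that the two results are easy to tell apart:

```python
import re

# Hypothetical records with different lengths.
sample = ">rec1\n  Length = 100\n(some text)\n\n>rec2\n  Length = 200\n(some text)\n"
flags = re.MULTILINE | re.DOTALL

# Greedy .* runs to the end of the text and backtracks to the LAST "Length",
# so the first header gets paired with the last record's length.
greedy = re.findall(r"^(>[^\r\n]*).*Length\s=\s(\d+)", sample, flags)

# Lazy .*? stops at the FIRST "Length" after each header.
lazy = re.findall(r"^(>[^\r\n]*).*?Length\s=\s(\d+)", sample, flags)

print(greedy)  # [('>rec1', '200')]
print(lazy)    # [('>rec1', '100'), ('>rec2', '200')]
```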

Answer 1: score 5, 2012-10-28 16:59:31

Answer 2: score 0, 2012-10-28 17:01:53
