[英]Capturing portions of text with non-greedy regex
我想使用re.findall提取分配給每個PCR的值。
>>> z
'PCR-09: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \r\nPCR-10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \r\nPCR-11: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \r\nPCR-12: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \r\nPCR-13: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \r\nPCR-14: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \r\nPCR-15: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \r\nPCR-16: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \r\n
>>> print z
PCR-09: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
PCR-10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
PCR-11: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
PCR-12: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
PCR-13: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
PCR-14: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
PCR-15: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
PCR-16: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
最初,我嘗試過這種方法,但是有人可以指出所使用的正則表達式有什么問題嗎?
>>> re.search('PCR-09:(.*?)', z).groups()
('',)
非貪婪的expr (.*?)
是否應該匹配所有字符,直到找到換行符?
使用稍微修改的正則表達式,我得到了預期的結果:
>>> re.search('PCR-09:(.*?)\s\r\n', z).groups()
(' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00',)
同樣,這行不通:
>>> re.findall(r'(PCR-\d+):(.*?)', z)
[('PCR-09', ''), ('PCR-10', ''), ('PCR-11', ''), ('PCR-12', ''), ('PCR-13', ''), ('PCR-14', ''), ('PCR-15', ''), ('PCR-16', ''),
但這確實是:
>>> re.findall(r'(PCR-\d+):(.*?)\s\r\n', z,re.DOTALL)
[('PCR-09', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00'), ('PCR-10', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00'), ('PCR-11', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00'), ('PCR-12', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00'), ('PCR-13', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00'), ('PCR-14', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00'), ('PCR-15', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00'), ('PCR-16', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00'),
希望有人可以解釋我的方法出了什么問題。
謝謝
r'PCR-09:(.*?)'
與您期望的不符的原因是,非貪婪的正則表達式一旦有效就會停止。
因此(.*?)
可以匹配''
,因此正則表達式立即停止。
相反, r'(PCR-\\d+):(.*?)\\s\\r\\n'
是非貪婪的,但是由於它需要找到`\\ s \\ r \\ n',因此將強制展開為工作。
我建議使用貪婪的正則表達式,其中僅包含您希望找到的字符: r'(PCR-\\d+):([0-9 ]*)'
。
模式PCR-09:(.*?)
告訴Python在PCR-09:
之后非貪婪地匹配零個或多個字符。 因此,它正是這樣做並匹配零個字符。
您需要讓Regex保持貪婪 ,以便將所有內容匹配到換行符:
>>> re.search('PCR-09:(.*)', z).groups()
(' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \r',)
>>>
請注意,您的PCR-09:(.*?)\\s\\r\\n
模式可以正常工作,因為它告訴Python在PCR-09:
之后獲得零個或多個字符, 直到 \\s\\r\\n
為止 。 換句話說,獲得它們之間的一切。
嘗試使用: split
[ x.split(':') for x in z.split('\r\n')]
輸出:
[['PCR-09', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 '], ['PCR-10', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 '], ['PCR-11', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 '], ['PCR-12', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 '], ['PCR-13', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 '], ['PCR-14', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 '], ['PCR-15', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 '], ['PCR-16', ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 '], ['']]
使用正則表達式
re.findall('(PCR-\d+)(.*)',z)
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.