Python: re.find longest sequence
I have a string that is randomly generated:
polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
I'd like to find the longest sequence of "diNCO diol" and the longest of "diNCO diamine". So in the case above the longest "diNCO diol" sequence is 1 and the longest "diNCO diamine" is 3.
How would I go about doing this using Python's re module?
Thanks in advance.
EDIT:
I mean the largest number of consecutive repeats of a given string. So the longest run of "diNCO diamine" is 3:
diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine
Expanding on Ealdwulf's answer (see the documentation on re.findall):
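import re

def getLongestSequenceSize(search_str, polymer_str):
    # Each match is one run of consecutive repeats of search_str.
    matches = re.findall(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str)
    # key=len picks the longest run; plain max() would compare alphabetically.
    longest_match = max(matches, key=len, default='')
    return longest_match.count(search_str)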
This could be written as one line, but it becomes less readable in that form.
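For comparison, a one-line form might look like the sketch below. It is not the answer's exact code: the name `longest_run_one_liner` is made up for illustration, and it assumes `re.escape` on the search string plus `max(..., key=len, default='')`, so that the longest match is chosen and a missing pattern yields 0.

```python
import re

def longest_run_one_liner(search_str, polymer_str):
    # Hypothetical helper: longest consecutive run of search_str, as a single expression.
    return max(re.findall(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str), key=len, default='').count(search_str)

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
print(longest_run_one_liner("diNCO diol", polymer_str))     # 1
print(longest_run_one_liner("diNCO diamine", polymer_str))  # 3
```

The line is long enough that the multi-line version is probably still the better choice in real code.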
Alternative:
If polymer_str is huge, it will be more memory efficient to use re.finditer. Here's how you might go about it:
import re

def getLongestSequenceSize(search_str, polymer_str):
    longest_match = ''
    # finditer yields matches one at a time instead of building a full list.
    for match in re.finditer(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str):
        if len(match.group(0)) > len(longest_match):
            longest_match = match.group(0)
    return longest_match.count(search_str)
The biggest difference between findall and finditer is that the first returns a list of matched strings, while the second returns an iterator that yields Match objects. Also, the finditer approach will be somewhat slower.
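To make the difference concrete, here is a small sketch using the question's string. The variable names `pattern`, `runs` and `spans` are only illustrative.

```python
import re

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
pattern = r'(?:\bdiNCO diamine\b\s?)+'

# findall builds the whole list of matched strings up front.
runs = re.findall(pattern, polymer_str)
print(runs)  # ['diNCO diamine diNCO diamine diNCO diamine ', 'diNCO diamine']

# finditer yields one Match object at a time; each one also knows its position.
spans = [m.span() for m in re.finditer(pattern, polymer_str)]
print(spans)
```

With finditer only one match needs to be held at a time, which is where the memory saving on a very large string would come from.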
I think the OP wants the longest contiguous sequence. You can get all contiguous sequences like:
seqs = re.findall(r"(?:diNCO diamine ?)+", polymer_str)
and then find the longest. (The optional trailing space in the pattern is needed so that adjacent repeats, which are separated by a space, are matched as one sequence.)
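The "find the longest" step could be sketched as follows. This assumes the pattern allows an optional space after each repeat, so that neighbouring repeats join into a single match; `longest` is just an illustrative name.

```python
import re

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"

# Assumption: the optional trailing space lets adjacent repeats form one match.
seqs = re.findall(r"(?:diNCO diamine ?)+", polymer_str)

# Pick the longest run by length; default="" covers the case of no match at all.
longest = max(seqs, key=len, default="")
print(longest.count("diNCO diamine"))  # 3
```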
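import re

# Split on anything that is not "|" so that only runs of "|" remain.
pat = re.compile("[^|]+")
# Replace each repeat with "|" and remove spaces so adjacent repeats touch.
p = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine".replace("diNCO diamine", "|").replace(" ", "")
print(max(map(len, pat.split(p))))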
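The trick may be easier to follow with the intermediate values printed. The sketch below walks through the same steps and assumes the input never contains a literal "|"; `marked` and `runs` are illustrative names.

```python
import re

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"

# Step 1: mark each occurrence with "|" and drop the spaces between units.
marked = polymer_str.replace("diNCO diamine", "|").replace(" ", "")
print(marked)  # diol|||diNCOdiol|

# Step 2: split on everything that is not a marker, leaving only runs of "|".
runs = re.split(r"[^|]+", marked)
print(runs)  # ['', '|||', '|']

# Step 3: the longest run of markers is the longest run of repeats.
print(max(map(len, runs)))  # 3
```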
One way is to use findall:
import re
polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
len(re.findall("diNCO diamine", polymer_str))  # returns 4, the total number of occurrences
Using re:
m = re.search(r"(?:\bdiNCO diamine\b\s?)+", polymer_str)
# re.search returns only the first run; count the repeats inside it.
m.group(0).count("diNCO diamine")  # returns 3
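One caveat is that re.search stops at the first run, which happens to be the longest one in the question's string. The sketch below uses a made-up input (`sample`) where the longest run comes later, to show how the two approaches can differ.

```python
import re

# Hypothetical input where the longest run is not the first one.
sample = "diNCO diamine diol diNCO diamine diNCO diamine"
pattern = r"(?:\bdiNCO diamine\b\s?)+"

# re.search returns the first run only.
first_run = re.search(pattern, sample).group(0)
print(first_run.count("diNCO diamine"))  # 1

# Scanning every run and keeping the longest gives the intended answer.
longest_run = max(re.findall(pattern, sample), key=len)
print(longest_run.count("diNCO diamine"))  # 2
```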