
Python: re.find longest sequence

I have a string that is randomly generated:

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"

I'd like to find the longest sequence of "diNCO diol" and the longest of "diNCO diamine". In the case above, the longest "diNCO diol" sequence is 1 and the longest "diNCO diamine" sequence is 3.

How would I go about doing this using Python's re module?

Thanks in advance.

EDIT:
I mean the largest number of consecutive repeats of a given string. So the longest run of "diNCO diamine" is 3:
diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine

Expanding on Ealdwulf's answer:

re.findall is documented in the re module section of the Python standard library documentation.

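import re

def getLongestSequenceSize(search_str, polymer_str):
    # Each match is one contiguous run of search_str repeats
    matches = re.findall(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str)
    # Pick the longest run by length (a plain max() would compare alphabetically)
    longest_match = max(matches, key=len) if matches else ''
    return longest_match.count(search_str)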

This could be written as one line, but it becomes less readable in that form.
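For comparison, a single-expression form might look like the sketch below. The name longest_run is hypothetical. It assumes Python 3 (where max accepts a default argument), picks the run by length with key=len, and escapes the search string with re.escape so it is matched literally.

```python
import re

def longest_run(search_str, polymer_str):
    # Hypothetical single-expression variant: take the longest contiguous run,
    # then count how many times search_str occurs inside it
    return max(re.findall(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str),
               key=len, default='').count(search_str)

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
print(longest_run("diNCO diamine", polymer_str))  # 3
print(longest_run("diNCO diol", polymer_str))     # 1
```

When nothing matches, the default value makes the function return 0 instead of raising ValueError.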

Alternative:

If polymer_str is huge, it will be more memory efficient to use re.finditer. Here's how you might go about it:

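import re

def getLongestSequenceSize(search_str, polymer_str):
    longest_match = ''
    # re.escape makes search_str match literally, even if it contains regex metacharacters
    for match in re.finditer(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str):
        # Keep only the longest run seen so far instead of storing every run
        if len(match.group(0)) > len(longest_match):
            longest_match = match.group(0)
    return longest_match.count(search_str)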

The biggest difference between findall and finditer is that the first returns a list of matched strings, while the second returns an iterator that yields Match objects one at a time. Also, the finditer approach will be somewhat slower.
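A small sketch of that difference, using the example string from the question. The variable names are illustrative only.

```python
import re

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
pattern = re.compile(r'(?:\bdiNCO diamine\b\s?)+')

# findall builds the whole list of matched strings up front
runs = pattern.findall(polymer_str)

# finditer yields Match objects one at a time; each one also carries its position
spans = [m.span() for m in pattern.finditer(polymer_str)]

print(runs)
print(spans)
```

The Match objects are useful when you also need to know where each run starts and ends, which the plain strings from findall cannot tell you.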

I think the OP wants the longest contiguous sequence. You can get all contiguous sequences like:

seqs = re.findall(r"(?:diNCO diamine\s?)+", polymer_str)

and then find the longest. (The \s? inside the group is needed because the repeats are separated by spaces; without it, each match would contain only a single repeat.)
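The last step, finding the longest, could be sketched like this. It assumes the repeats are separated by single spaces, so an optional \s is placed inside the group to let adjacent repeats join into one run. The names unit and repeats_per_run are hypothetical.

```python
import re

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
unit = "diNCO diamine"

# Each element is one contiguous run of the unit (the \s? absorbs the separator)
seqs = re.findall(r"(?:%s\s?)+" % re.escape(unit), polymer_str)

# Number of repeats in each run, then the largest of them (0 if there are no runs)
repeats_per_run = [s.count(unit) for s in seqs]
longest = max(repeats_per_run) if repeats_per_run else 0

print(repeats_per_run, longest)
```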

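import re
# Replace each repeat unit with a single "|" marker and remove the spaces
pat = re.compile("[^|]+")
p = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine".replace("diNCO diamine","|").replace(" ","")
# Splitting on the non-marker text leaves runs of "|"; the longest run is the answer
print(max(map(len, pat.split(p))))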
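To see why this works, it may help to trace the intermediate values. The sketch below is written for Python 3 and only restates the steps above with hypothetical variable names. Note that the trick assumes the input never contains a literal | character.

```python
import re

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"

# Replace every repeat unit with a single marker character and drop the spaces
p = polymer_str.replace("diNCO diamine", "|").replace(" ", "")
print(p)  # diol|||diNCOdiol|

# Splitting on runs of non-marker characters leaves only the runs of markers
pieces = re.split(r"[^|]+", p)
print(pieces)  # ['', '|||', '|']

# The longest piece has one "|" per consecutive repeat
longest = max(map(len, pieces))
print(longest)  # 3
```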

One way is to use findall:

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
len(re.findall("diNCO diamine", polymer_str))  # returns 4, the total number of occurrences (not the longest run)

Using re:

m = re.search(r"(?:\bdiNCO diamine\b\s?)+", polymer_str)
# Count the repeats inside the matched run; note that re.search only returns the first run
m.group(0).count("diNCO diamine")
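One caveat worth illustrating: re.search returns only the first run, which is not necessarily the longest. The example string below is made up for the illustration (it is not from the question).

```python
import re

# Hypothetical input in which the longest run comes second
s = "diNCO diamine diol diNCO diamine diNCO diamine"
pattern = re.compile(r"(?:\bdiNCO diamine\b\s?)+")

# search stops at the first run it finds
first = pattern.search(s).group(0)
first_count = first.count("diNCO diamine")

# findall collects every run, so the longest can be selected by length
longest = max(pattern.findall(s), key=len)
longest_count = longest.count("diNCO diamine")

print(first_count, longest_count)  # 1 2
```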
