
How can I use Regex to find a string of characters in alphabetical order using Python?

So I have a challenge I'm working on: find the longest string of alphabetical characters in a string. For example, "abcghiijkyxz" should result in "ghiijk" (yes, the i is doubled).

I've been doing quite a bit with loops to solve the problem: iterating over the entire string, then, for each character, starting a second loop using lower and ord. No help needed writing that loop.

However, it was suggested to me that Regex would be great for this sort of thing. My regex is weak (I know how to grab a static set; my knowledge of lookaheads extends to knowing they exist). How would I write a Regex to look forward and check whether the following characters are next in alphabetical order? Or is the suggestion to use Regex not practical for this type of thing?

Edit: The general consensus seems to be that Regex is indeed terrible for this type of thing.

Just to demonstrate why regex is not practical for this sort of thing, here is a regex that would match ghiijk in your given example of abcghiijkyxz. Note that it will also match abc, y, x and z, since each of them is technically a run of characters in alphabetical order. Unfortunately, you can't determine which match is the longest with regex alone, but this does give you all the candidates.

Please note that this regex works for PCRE and will not work with Python's re module! Also note that Python's regex library does not currently support (*ACCEPT). Although I haven't tested it, the pyre2 package (a Python wrapper for Google's RE2 using Cython) claims to support the (*ACCEPT) control verb, so this may currently be possible using Python.

See regex in use here

((?:a+(?(?!b)(*ACCEPT))|b+(?(?!c)(*ACCEPT))|c+(?(?!d)(*ACCEPT))|d+(?(?!e)(*ACCEPT))|e+(?(?!f)(*ACCEPT))|f+(?(?!g)(*ACCEPT))|g+(?(?!h)(*ACCEPT))|h+(?(?!i)(*ACCEPT))|i+(?(?!j)(*ACCEPT))|j+(?(?!k)(*ACCEPT))|k+(?(?!l)(*ACCEPT))|l+(?(?!m)(*ACCEPT))|m+(?(?!n)(*ACCEPT))|n+(?(?!o)(*ACCEPT))|o+(?(?!p)(*ACCEPT))|p+(?(?!q)(*ACCEPT))|q+(?(?!r)(*ACCEPT))|r+(?(?!s)(*ACCEPT))|s+(?(?!t)(*ACCEPT))|t+(?(?!u)(*ACCEPT))|u+(?(?!v)(*ACCEPT))|v+(?(?!w)(*ACCEPT))|w+(?(?!x)(*ACCEPT))|x+(?(?!y)(*ACCEPT))|y+(?(?!z)(*ACCEPT))|z+(?(?!$)(*ACCEPT)))+)

Results in:

abc
ghiijk
y
x
z

Explanation of a single option, i.e. a+(?(?!b)(*ACCEPT)):

  • a+ Matches a (literally) one or more times. This catches instances where the same character appears several times in a row, such as aa.
  • (?(?!b)(*ACCEPT)) If clause evaluating the condition.
    • (?!b) Condition for the if clause. Negative lookahead ensuring that what follows is not b. If it is not b, we want the following control verb to take effect.
    • (*ACCEPT) If the condition above is met, we accept the current solution. This control verb causes the regex to end successfully, skipping the rest of the pattern. Because this token sits inside a capturing group, the text matched so far is captured when the match ends at that location.

So what happens if the condition is not met? That means (?!b) evaluated to false: the following character is, in fact, b, so we allow the matching (or rather, capturing in this instance) to continue. Note that all the options are wrapped in (?:...)+, which lets us match consecutive options until the (*ACCEPT) control verb or the end of the line is reached.

The only exception in this whole regular expression is z. Since it's the last character in the English alphabet (which I presume is the target of this question), we don't care what comes after it, so we can simply put z+(?(?!$)(*ACCEPT)), which ensures nothing matches after z. If you instead want to match za (circular alphabetical order matching; I don't know if this is the proper terminology, but it sounds right to me), you can use z+(?(?!a)(*ACCEPT)) as the final option, as seen here.
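The 26 options all follow the same template, so the pattern does not have to be typed out by hand. The following is a minimal sketch that only builds the pattern string; the function name build_pcre_pattern and its wrap_around parameter are made up for illustration, and the result still needs a PCRE-compatible engine, since Python's re module cannot compile it.

```python
import string

def build_pcre_pattern(wrap_around=False):
    """Build the PCRE pattern described above as a plain string (hypothetical helper)."""
    letters = string.ascii_lowercase
    options = []
    # One option per letter a..y: the letter repeated, then accept unless the next letter follows.
    for current, following in zip(letters, letters[1:]):
        options.append(f"{current}+(?(?!{following})(*ACCEPT))")
    # z is the special case: test for end of line, or for 'a' when wrapping around.
    last = "a" if wrap_around else "$"
    options.append(f"z+(?(?!{last})(*ACCEPT))")
    return "((?:" + "|".join(options) + ")+)"

pattern = build_pcre_pattern()
print(pattern[:40] + " ... " + pattern[-24:])
```

Calling build_pcre_pattern(wrap_around=True) swaps only the final option for the circular za variant.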

As mentioned, regex is not the best tool for this. Since you are interested in a continuous sequence, you can do this with a single for loop:

def LNDS(s):
    start = 0
    cur_len = 1
    max_len = 1
    for i in range(1,len(s)):
        if ord(s[i]) in (ord(s[i-1]), ord(s[i-1])+1):
            cur_len += 1
        else:
            if cur_len > max_len:
                max_len = cur_len
                start = i - cur_len
            cur_len = 1
    if cur_len > max_len:
        max_len = cur_len
        start = len(s) - cur_len
    return s[start:start+max_len]

>>> LNDS('abcghiijkyxz')
'ghiijk'

We keep a running count of how many characters in a row are either equal to the previous character or the letter directly after it. When such a run ends, we compare it to the longest run seen previously and update our "best seen so far" if the new one is longer.
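The same bookkeeping can also be split into two steps: first produce every run, then pick the longest. This is only a sketch of that variation, not part of the answer above; the names alphabetical_runs and longest_alphabetical_run are invented here.

```python
def alphabetical_runs(s):
    """Yield each maximal run in which every character equals or directly follows the previous one."""
    start = 0
    for i in range(1, len(s) + 1):
        # A run ends at the end of the string, or when the step between
        # neighbouring characters is neither 0 (same letter) nor +1 (next letter).
        if i == len(s) or ord(s[i]) - ord(s[i - 1]) not in (0, 1):
            yield s[start:i]
            start = i

def longest_alphabetical_run(s):
    # max() keeps the first of several equally long runs; default handles the empty string.
    return max(alphabetical_runs(s), key=len, default="")

print(list(alphabetical_runs("abcghiijkyxz")))   # ['abc', 'ghiijk', 'y', 'x', 'z']
print(longest_alphabetical_run("abcghiijkyxz"))  # ghiijk
```

Keeping the runs separate from the selection makes it easy to change the tie-breaking rule or to return every run instead of just one.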

Generate a regex for every substring of the alphabet, such as ^a+b+c+$ (longest to shortest). Then match each of those regexes against all the substrings (longest to shortest) of "abcghiijkyxz" and stop at the first match.

import re

def all_substrings(s):
    # Yield every substring of s, longest first.
    n = len(s)
    for i in range(n, 0, -1):
        for j in range(n - i + 1):
            yield s[j:j + i]

def longest_alphabetical_substring(s):
    for t in all_substrings("abcdefghijklmnopqrstuvwxyz"):
        # e.g. t = "abc" gives the pattern ^a+b+c+$
        r = re.compile("^" + "".join(map(lambda x: x + "+", t)) + "$")
        for u in all_substrings(s):
            if r.match(u):
                return u

print(longest_alphabetical_substring("abcghiijkyxz"))

That prints "ghiijk".

Regex: char+ for each character in turn, meaning a+b+c+...

Details:

  • + Matches between one and unlimited times

Python code:

import re

def LNDS(text):
    array = []

    for y in range(97, 122):  # a - y (the first letter of each pattern)
        st = r"%s+" % chr(y)
        for x in range(y+1, 123):  # b - z
            st += r"%s+" % chr(x)
            match = re.findall(st, text)

            if match:
                array.append(max(match, key=len))
            else:
                break

        if array:
            array = [max(array, key=len)]

    return array

Output:

print(LNDS('abababababab abc'))  # ['abc']
print(LNDS('abcghiijkyxz'))      # ['ghiijk']

For the string abcghiijkyxz, the regex patterns tried are:

a+b+                    i+j+k+l+
a+b+c+                  j+k+
a+b+c+d+                j+k+l+
b+c+                    k+l+
b+c+d+                  l+m+
c+d+                    m+n+
d+e+                    n+o+
e+f+                    o+p+
f+g+                    p+q+
g+h+                    q+r+
g+h+i+                  r+s+
g+h+i+j+                s+t+
g+h+i+j+k+              t+u+
g+h+i+j+k+l+            u+v+
h+i+                    v+w+
h+i+j+                  w+x+
h+i+j+k+                x+y+
h+i+j+k+l+              y+z+
i+j+
i+j+k+

Code demo

To actually "solve" the problem, you could use

string = 'abcxyzghiijkl'

def sort_longest(string):
    stack = []
    result = []

    for char in string:
        c = ord(char)
        if stack and c in (stack[-1][1], stack[-1][1] + 1):
            # same letter as, or the successor of, the item before (a tuple)
            stack.append((char, c))
        else:
            # append the finished stack (if any) to the overall result
            # and reinitialize the stack
            if stack:
                result.append(stack)
            stack = [(char, c)]

    # store the last run, which the loop itself never closes
    if stack:
        result.append(stack)

    return ["".join(item[0] for item in sublst)
            for sublst in sorted(result, key=len, reverse=True)]

print(sort_longest(string))

Which yields

['ghiijkl', 'abc', 'xyz']

in this example.


The idea is to loop over the string and keep track of a stack variable, which is filled according to your requirements using ord().

It's really easy with regexps!

(Using trailing contexts, i.e. lookaheads, here)

import re

rexp=re.compile(
    "".join(['(?:(?=.' + chr(ord(x)+1) + ')'+ x +')?'
            for x in "abcdefghijklmnopqrstuvwxyz"])
    +'[a-z]')

a = 'bcabhhjabjjbckjkjabckkjdefghiklmn90'

re.findall(rexp, a)

#Answer: ['bc', 'ab', 'h', 'h', 'j', 'ab', 'j', 'j', 'bc', 'k', 'jk', 'j', 'abc', 'k', 'k', 'j', 'defghi', 'klmn']
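As the output shows, the pattern above treats a doubled letter as the start of a new run ('h', 'h'), so it would not return ghiijk for the question's example. The following is a sketch of one possible extension under Python's re module, not the original author's code: each optional group accepts one or more copies of a letter when the next letter follows, and a backreference lets the final letter repeat. The names RUN_PATTERN, regex_runs and longest_regex_run are invented here, and only lowercase ASCII letters are assumed.

```python
import re
import string

letters = string.ascii_lowercase

# For every letter a..y: optionally match one or more copies of it, but only
# if the next letter of the alphabet comes right after. The run then ends
# with a final letter, which may itself be repeated (backreference \1).
RUN_PATTERN = re.compile(
    "".join(f"(?:{cur}+(?={nxt}))?" for cur, nxt in zip(letters, letters[1:]))
    + r"([a-z])\1*"
)

def regex_runs(text):
    # finditer + group(0) gives the whole match rather than the captured group.
    return [m.group(0) for m in RUN_PATTERN.finditer(text)]

def longest_regex_run(text):
    # The regex only lists candidates; picking the longest still needs max().
    return max(regex_runs(text), key=len, default="")

print(regex_runs("abcghiijkyxz"))         # ['abc', 'ghiijk', 'y', 'x', 'z']
print(longest_regex_run("abcghiijkyxz"))  # ghiijk
```

This produces the same candidates as the PCRE pattern earlier in the thread, and the choice of the longest one still happens outside the regex.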
