
Parse 4th capital letter of line in Python?

How can I parse lines of text from the 4th occurrence of a capital letter onward? For example, given the lines:

adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj
oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ

I would like to capture:

`ZsdalkjgalsdkjTlaksdjfgasdkgj`
`PlsdakjfsldgjQ`

I'm sure there is probably a better way than regular expressions, but I attempted a non-greedy match, something like this:

match = re.search(r'[A-Z].*?$', line).group()

I present two approaches.

Approach 1: all-out regex

In [1]: import re

In [2]: s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'

In [3]: re.match(r'(?:.*?[A-Z]){3}.*?([A-Z].*)', s).group(1)
Out[3]: 'ZsdalkjgalsdkjTlaksdjfgasdkgj'

The .*?[A-Z] consumes characters up to, and including, the first uppercase letter.

The (?: ... ){3} repeats the above three times without creating any capture groups.

The following .*? matches the remaining characters before the fourth uppercase letter.

Finally, the ([A-Z].*) captures the fourth uppercase letter and everything that follows into a capture group.
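One caveat with this pattern: re.match returns None when a line has fewer than four uppercase letters, so calling .group(1) directly raises AttributeError. Here is a minimal sketch of a guarded wrapper. The name from_fourth_capital and the choice to return an empty string are assumptions made for illustration, not part of the answer above.

```python
import re

# Same pattern as in Approach 1, compiled once for reuse.
PATTERN = re.compile(r'(?:.*?[A-Z]){3}.*?([A-Z].*)')

def from_fourth_capital(line):
    """Return the text from the 4th uppercase ASCII letter onward, or ''."""
    m = PATTERN.match(line)
    # m is None when the line has fewer than four uppercase letters.
    return m.group(1) if m else ''
```

Returning '' keeps the behaviour in line with the non-regex answers in this thread; raising an exception instead would be an equally reasonable choice.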

Approach 2: simpler regex

In [1]: import re

In [2]: s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'

In [3]: ''.join(re.findall(r'[A-Z][^A-Z]*', s)[3:])
Out[3]: 'ZsdalkjgalsdkjTlaksdjfgasdkgj'

This attacks the problem directly, and I think it is easier to read.
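As a sketch of how Approach 2 generalizes, the same findall call can be wrapped in a function that takes the occurrence number as a parameter. The name from_nth_capital and the parameter n are hypothetical, introduced only for this example.

```python
import re

def from_nth_capital(line, n=4):
    """Return the text from the n-th uppercase ASCII letter onward, or ''."""
    # Each chunk is one uppercase letter plus the run of non-uppercase
    # characters that follows it.
    chunks = re.findall(r'[A-Z][^A-Z]*', line)
    # Skip the first n-1 chunks; slicing past the end simply yields [].
    return ''.join(chunks[n - 1:])
```

Because list slicing never raises on a short list, a line with fewer than n capitals produces an empty string rather than an error.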

Anyway, not using regular expressions may seem too verbose, although at the bytecode level it is a very simple algorithm, and therefore lightweight.

It may be that regexps are faster, since they are implemented in native code, but the "one obvious way to do it", though boring, certainly beats any suitable regexp in readability hands down:

def find_capital(string, n=4):
    count = 0
    for index, letter in enumerate(string):
        # The boolean value counts as 0 for False or 1 for True
        count += letter.isupper()  
        if count == n:
            return string[index:]
    return ""
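One difference from the regex answers is worth keeping in mind: str.isupper() is not limited to A-Z. The short check below, assuming Python 3 strings and a made-up sample, shows the two notions of "capital letter" disagreeing on accented characters.

```python
import re

sample = 'aÉbÀcDeFg'  # hypothetical input containing non-ASCII capitals

# The character class [A-Z] only matches ASCII uppercase letters.
ascii_caps = re.findall(r'[A-Z]', sample)

# str.isupper() also accepts uppercase letters outside ASCII.
unicode_caps = [c for c in sample if c.isupper()]
```

For the all-ASCII lines in the question both approaches agree, so this only matters if the real input can contain other alphabets.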

Found this simpler to deal with by using a regular expression to split the string, then slicing the resulting list:

import re

text = ["adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj",
        "oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ"]

for t in text:
    print("".join(re.split("([A-Z])", t, maxsplit=4)[7:]))

Conveniently, this gives you an empty string if there aren't enough capital letters.
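To see why the slice starts at index 7, it helps to look at the list that re.split returns when the pattern contains a capture group: the captured separators are kept, so the pieces alternate between text and capital letter. The sample string below is made up for readability.

```python
import re

line = 'aaBccDeeFggHiiJkk'  # hypothetical short input

# With a capturing group, the matched capitals are kept in the result.
# maxsplit=4 stops after the 4th capital, leaving the rest unsplit.
parts = re.split('([A-Z])', line, maxsplit=4)
# parts: ['aa', 'B', 'cc', 'D', 'ee', 'F', 'gg', 'H', 'iiJkk']

# Index 7 is the 4th capital and index 8 is everything after it.
tail = ''.join(parts[7:])

# With fewer than four capitals the list is shorter than 8 items,
# so the slice is empty and the join gives ''.
short_tail = ''.join(re.split('([A-Z])', 'aBc', maxsplit=4)[7:])
```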

A nice, one-line solution could be:

>>> import re
>>> s1 = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
>>> s2 = 'oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ'
>>> s1[list(re.finditer('[A-Z]', s1))[3].start():]
'ZsdalkjgalsdkjTlaksdjfgasdkgj'
>>> s2[list(re.finditer('[A-Z]', s2))[3].start():]
'PlsdakjfsldgjQ'

Why does this work (in just one line)?

  • Searches for all capital letters in the string: re.finditer('[A-Z]', s1)
  • Gets the 4th capital letter found: [3]
  • Returns the position of the 4th capital letter: .start()
  • Using slicing notation, we get the part we need from the string: s1[position:]
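Note that indexing the list with [3] raises IndexError when the line has fewer than four capital letters. A possible variation, sketched here with itertools.islice, takes only the 4th match from the iterator and falls back to an empty string. The name from_nth_capital_lazy and the '' fallback are assumptions for illustration.

```python
import re
from itertools import islice

def from_nth_capital_lazy(line, n=4):
    """Return the text from the n-th uppercase ASCII letter onward, or ''."""
    # islice skips the first n-1 matches; next() takes the n-th one,
    # or returns None if the iterator runs out first.
    match = next(islice(re.finditer('[A-Z]', line), n - 1, None), None)
    return line[match.start():] if match else ''
```

This also avoids building a list of every capital in the line, since iteration stops at the n-th match.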

I believe that this will work for you, and be fairly easy to extend in the future:

import re

check = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
print(re.match('([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', check).group(2))

The first part of the regex, ([^A-Z]*[A-Z]){3}, is the real key: it matches the first three upper case letters together with the characters before each of them (a repeated group keeps only its last repetition, so group 1 is not useful here). Then we skip any number of non-upper case letters after the third upper case letter, and finally we capture the rest of the string in group 2.
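To see concretely what each group holds for the sample line, here is a short check. The variable names are made up for this example.

```python
import re

check = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
m = re.match('([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', check)

# The group repeats three times ('adsgasdlkgasY', 'asdgjaU', 'U'),
# but only the last repetition is retained in group 1.
last_repetition = m.group(1)

# Group 2 is the part the question asks for.
tail = m.group(2)
```

If group 1 is never needed, writing it as a non-capturing group (?: ... ) makes the wanted text group 1 instead.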

Testing a variety of methods. I originally wrote string_after_Nth_upper and didn't post it, seeing that jsbueno's method was similar; however, because his method does an addition and a count comparison for every character (even lowercase letters), it is slightly slower.

s='adsasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
import re
def string_after_Nth_upper(your_str, N=4):
    upper_count = 0
    for i, c in enumerate(your_str):
        if c.isupper():
            upper_count += 1
            if upper_count == N:
                return your_str[i:]
    return ""

def find_capital(string, n=4):
    count = 0
    for index, letter in enumerate(string):
        # The boolean value counts as 0 for False or 1 for True
        count += letter.isupper()  
        if count == n:
            return string[index:]
    return ""

def regex1(s):
    return re.match(r'(?:.*?[A-Z]){3}.*?([A-Z].*)', s).group(1)
def regex2(s):
    return re.match(r'([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', s).group(2)
def regex3(s):
    return s[list(re.finditer('[A-Z]', s))[3].start():]
if __name__ == '__main__':
    from timeit import Timer
    t_simple = Timer("string_after_Nth_upper(s)", "from __main__ import s, string_after_Nth_upper")
    print('Simple:', t_simple.timeit())
    t_jsbueno = Timer("find_capital(s)", "from __main__ import s, find_capital")
    print('jsbueno:', t_jsbueno.timeit())
    t_regex1 = Timer("regex1(s)", "from __main__ import s, regex1; import re")
    print('Regex1:', t_regex1.timeit())
    t_regex2 = Timer("regex2(s)", "from __main__ import s, regex2; import re")
    print('Regex2:', t_regex2.timeit())
    t_regex3 = Timer("regex3(s)", "from __main__ import s, regex3; import re")
    print('Regex3:', t_regex3.timeit())

Results:

Simple: 4.80558681488
jsbueno: 5.92122507095
Regex1: 3.21153497696
Regex2: 2.80767202377
Regex3: 6.64155721664

So regex2 wins on time.

It's not the prettiest approach, but:

re.match(r'([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', line).group(2)

import re
strings = [
    'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj',
    'oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ',
]

for s in strings:
    m = re.match('[a-z]*[A-Z][a-z]*[A-Z][a-z]*[A-Z][a-z]*([A-Z].+)', s)
    if m:
        print(m.group(1))

Parsing almost always involves regular expressions. However, a regex by itself does not make a parser. In the simplest sense, a parser consists of:

text input stream -> tokenizer 

Usually it has an additional step:

text input stream -> tokenizer -> parser

The tokenizer handles opening the input stream and collecting text in a proper manner, so that the programmer doesn't have to think about it. It consumes text elements until there is only one match available to it. Then it runs the code associated with this "token". If you don't have a tokenizer, you have to roll it yourself (in pseudocode):

while stuffInStream:
    currChars += getNextCharFromString
    if regex('firstCase'):
        do stuff
    elif regex('other stuff'):
        do more stuff

This loop code is full of gotchas, unless you build them all the time. It is also easy to have a computer produce it from a set of rules. That's how Lex/flex works. You can have the rules associated with a token pass the token to yacc/bison as your parser, which adds structure.

Notice that the lexer is just a state machine. It can do anything when it migrates from state to state. I've written lexers that would strip characters from the input stream, open files, print text, send email and so on.

So, if all you want is to collect the text after the fourth capital letter, a regex is not only appropriate, it is the correct solution. BUT if you want to do parsing of textual input, with different rules for what to do and an unknown amount of input, then you need a lexer/parser. I suggest PLY since you are using Python.
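For a rough sense of what the tokenizer stage looks like in Python without a generator tool, here is a minimal sketch built only on the standard re module. It is not PLY and does not use PLY's API; the token names (UPPER, LOWER, OTHER) and the functions tokenize and from_nth_upper are made up for this example.

```python
import re

# Hypothetical token rules for illustration: (name, pattern) pairs.
TOKEN_RULES = [
    ('UPPER', r'[A-Z]'),    # a single uppercase ASCII letter
    ('LOWER', r'[a-z]+'),   # a run of lowercase ASCII letters
    ('OTHER', r'.'),        # any other single character (except newline)
]

# Combine the rules into one pattern with a named group per token type.
MASTER = re.compile('|'.join('(?P<%s>%s)' % rule for rule in TOKEN_RULES))

def tokenize(text):
    """Yield (token_name, matched_text, start_offset) for each token."""
    for m in MASTER.finditer(text):
        # lastgroup is the name of the alternative that matched.
        yield m.lastgroup, m.group(), m.start()

def from_nth_upper(text, n=4):
    """Consume tokens until the n-th UPPER token, then return the rest."""
    seen = 0
    for name, _, start in tokenize(text):
        if name == 'UPPER':
            seen += 1
            if seen == n:
                return text[start:]
    return ''
```

The rule table plays the role of the lexer specification, and from_nth_upper is the code that acts on the token stream. For the question as asked this is more machinery than needed, which is the point made above.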

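caps = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
inputStr = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'  # sample line from the question
temp = ''
for char in inputStr:
    if char in caps:
        temp += char  # collect capital letters in order
        if len(temp) == 4:
            print(temp[-1])  # the 4th capital letter itself
            break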
Alternatively, you could use re.sub to get rid of anything that's not a capital letter and get the 4th character of what's left.
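A minimal sketch of that re.sub idea follows. The function name nth_capital is hypothetical, and note that this returns only the 4th capital letter itself, not the rest of the line.

```python
import re

def nth_capital(line, n=4):
    """Return the n-th uppercase ASCII letter of line, or '' if absent."""
    # Delete everything that is not an uppercase ASCII letter.
    capitals = re.sub(r'[^A-Z]', '', line)
    # A one-character slice avoids IndexError on short input.
    return capitals[n - 1:n]
```

To get the text from that letter onward, as the question asks, one of the position-based approaches in this thread is a better fit, since the stripped string no longer carries the original offsets.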

Another version... not that pretty, but gets the job done.

def stringafter4thupper(s):    
    i,r = 0,''
    for c in s:
        if c.isupper() and i < 4:
            i+=1
        if i==4:
            r+=c
    return r

Examples:

stringafter4thupper('adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj')
stringafter4thupper('oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ')
stringafter4thupper('')
stringafter4thupper('abcdef')
stringafter4thupper('ABCDEFGH')

The respective results:

'ZsdalkjgalsdkjTlaksdjfgasdkgj'
'PlsdakjfsldgjQ'
''
''
'DEFGH'
