简体   繁体   中英

Parse 4th capital letter of line in Python?

How can I parse lines of text from the 4th occurrence of a capital letter onward? For example given the lines:

adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj
oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ

I would like to capture:

`ZsdalkjgalsdkjTlaksdjfgasdkgj`
`PlsdakjfsldgjQ`

I'm sure there is probably a better way than regular expressions, but I was attempted to do a non-greedy match; something like this:

match = re.search(r'[A-Z].*?$', line).group()

I present two approaches.

Approach 1: all-out regex

In [1]: import re

In [2]: s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'

In [3]: re.match(r'(?:.*?[A-Z]){3}.*?([A-Z].*)', s).group(1)
Out[3]: 'ZsdalkjgalsdkjTlaksdjfgasdkgj'

The .*?[AZ] consumes characters up to, and including, the first uppercase letter.

The (?: ... ){3} repeats the above three times without creating any capture groups.

The following .*? matches the remaining characters before the fourth uppercase letter.

Finally, the ([AZ].*) captures the fourth uppercase letter and everything that follows into a capture group.

Approach 2: simpler regex

In [1]: import re

In [2]: s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'

In [3]: ''.join(re.findall(r'[A-Z][^A-Z]*', s)[3:])
Out[3]: 'ZsdalkjgalsdkjTlaksdjfgasdkgj'

This attacks the problem directly, and I think is easier to read.

Anyway not using regular expressions will seen to be too verbose - although at the bytcodelevel it is a very simple algorithm running, and therefore lightweight.

It may be that regexpsare faster, since they are implemented in native code, but the "one obvious way to do it", though boring, certainly beats any suitable regexp in readability hands down:

def find_capital(string, n=4):
    count = 0
    for index, letter in enumerate(string):
        # The boolean value counts as 0 for False or 1 for True
        count += letter.isupper()  
        if count == n:
            return string[index:]
    return ""

Found this simpler to deal with by using a regular expression to split the string, then slicing the resulting list:

import re

text = ["adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj",
        "oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ"]

for t in text:
     print "".join(re.split("([A-Z])", t, maxsplit=4)[7:])

Conveniently, this gives you an empty string if there aren't enough capital letters.

A nice, one-line solution could be:

>>> s1 = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
>>> s2 = 'oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ'
>>> s1[list(re.finditer('[A-Z]', s1))[3].start():]
'ZsdalkjgalsdkjTlaksdjfgasdkgj'
>>> s2[list(re.finditer('[A-Z]', s2))[3].start():]
'PlsdakjfsldgjQ'

Why this works (in just one line)?

  • Searches for all capital letters in the string: re.finditer('[AZ]', s1)
  • Gets the 4th capital letter found: [3]
  • Returns the position from the 4th capital letter: .start()
  • Using slicing notation, we get the part we need from the string s1[position:]

I believe that this will work for you, and be fairly easy to extend in the future:

check = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
print re.match('([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', check ).group(2)

The first part of the regex ([^AZ]*[AZ]){3} is the real key, this finds the first three upper case letters and stores them along with the characters between them in group 1, then we skip any number of non-upper case letters after the third upper case letter, and finally, we capture the rest of the string.

Testing a variety of methods. I original wrote string_after_Nth_upper and didn't post it; seeing that jsbueno's method was similar; except by doing additions/count comparisons for every character (even lowercase letters) his method is slightly slower.

s='adsasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
import re
def string_after_Nth_upper(your_str, N=4):
    upper_count = 0
    for i, c in enumerate(your_str):
        if c.isupper():
            upper_count += 1
            if upper_count == N:
               return your_str[i:]
    return ""

def find_capital(string, n=4):
    count = 0
    for index, letter in enumerate(string):
        # The boolean value counts as 0 for False or 1 for True
        count += letter.isupper()  
        if count == n:
            return string[index:]
    return ""

def regex1(s):
    return re.match(r'(?:.*?[A-Z]){3}.*?([A-Z].*)', s).group(1)
def regex2(s):
    return re.match(r'([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', s).group(2)
def regex3(s):
    return s[list(re.finditer('[A-Z]', s))[3].start():]
if __name__ == '__main__':
    from timeit import Timer
    t_simple = Timer("string_after_Nth_upper(s)", "from __main__ import s, string_after_Nth_upper")
    print 'simple:', t_simple.timeit()
    t_jsbueno = Timer("find_capital(s)", "from __main__ import s, find_capital")
    print 'jsbueno:', t_jsbueno.timeit()
    t_regex1 = Timer("regex1(s)", "from __main__ import s, regex1; import re")
    print  "Regex1:",t_regex1.timeit()
    t_regex2 = Timer("regex2(s)", "from __main__ import s, regex2; import re")
    print "Regex2:", t_regex2.timeit()

    t_regex3 = Timer("regex3(s)", "from __main__ import s, regex3; import re")
    print "Regex3:", t_regex3.timeit()

Results:

Simple: 4.80558681488
jsbueno: 5.92122507095
Regex1: 3.21153497696
Regex2: 2.80767202377
Regex3: 6.64155721664

So regex2 wins for time.

这不是最漂亮的方法,但是:

re.match(r'([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', line).group(2)
import re
strings = [
    'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj',
    'oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ',
]

for s in strings:
    m = re.match('[a-z]*[A-Z][a-z]*[A-Z][a-z]*[A-Z][a-z]*([A-Z].+)', s)
    if m:
        print m.group(1)

Parsing almost always involves regular expressions. However, a regex by itself does not make a parser. In the most simple sense, a parser consists of:

text input stream -> tokenizer 

Usually it has an additional step:

text input stream -> tokenizer -> parser

The tokenizer handles opening the input stream and collecting text in a proper manner, so that the programmer doesn't have to think about it. It consumes text elements until there is only one match available to it. Then it runs the code associated with this "token". If you don't have a tokenizer, you have to roll it yourself(in pseudocode):

while stuffInStream:
    currChars + getNextCharFromString
    if regex('firstCase'):
         do stuff
    elif regex('other stuff'):
         do more stuff

This loop code is full of gotchas, unless you build them all the time. It is also easy to have a computer produce it from a set of rules. That's how Lex/flex works. You can have the rules associated with a token pass the token to yacc/bison as your parser, which adds structure.

Notice that the lexer is just a state machine . It can do anything when it migrates from state to state. I've written lexers that used would strip characters from the input stream, open files, print text, send email and so on.

So, if all you want is to collect the text after the fourth capital letter, a regex is not only appropriate, it is the correct solution. BUT if you want to do parsing of textual input , with different rules for what to do and an unknown amount of input, then you need a lexer/parser. I suggest PLY since you are using python.

caps = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
temp = ''
for char in inputStr:
if char in caps:
    temp += char
    if len(temp) == 4:
        print temp[-1] # this is the answer that you are looking for
        break

Alternatively, you could use re.sub to get rid of anything that's not a capital letter and get the 4th character of what's left

Another version... not that pretty, but gets the job done.

def stringafter4thupper(s):    
    i,r = 0,''
    for c in s:
        if c.isupper() and i < 4:
            i+=1
        if i==4:
            r+=c
    return r

Examples:

stringafter4thupper('adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj')
stringafter4thupper('oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ')
stringafter4thupper('')
stringafter4thupper('abcdef')
stringafter4thupper('ABCDEFGH')

Respectively results:

'ZsdalkjgalsdkjTlaksdjfgasdkgj'
'PlsdakjfsldgjQ'
''
''
'DEFGH'

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM