
How to get the first capital letter and then each that isn't followed by another capital letter in Python?

I am developing a script that creates abbreviations for a list of names that are too long for me to use. I need to split each name into parts divided by dots and then take each capital letter that is at the beginning of a word. Just like this:

InternetGatewayDevice.DeviceInfo.Description -> IGD.DI.D

However, if there are several consecutive capital letters (as in the following example), I only want to take the first one and then the one that is not followed by a capital letter. So, from "WANDevice" I want to get "WD". Like this:

InternetGatewayDevice.WANDevice.1.WANConnectionDevice.1.WANIPConnection.1.PortMapping.7.ExternalPort -> IGD.WD1.WCD1.WC1.PM7.EP

So far I have written this script:

import json

with open('./cwmp/tr069/test.json') as f:
    data = json.load(f)

def shorten(i):
    x = i.split(".")
    abbreviations = []
    for each in x:
        abbrev = ''
        for each_letter in each:
            if each_letter.isupper():
                abbrev = abbrev + each_letter
        abbreviations.append(abbrev)
    short_string = ".".join(abbreviations)
    return short_string

for i in data["mappings"]["cwmp_genieacs"]["properties"]:
    if "." in i:
        # Print the result; calling shorten(i) alone discards it
        print(shorten(i))

It correctly "translates" the first example, but I am not sure how to do the rest. I think that, if I had to, I could come up with some way to do it (maybe by splitting the strings into single characters), but I am looking for an efficient and smart way. I would be grateful for any advice.

I am using Python 3.6.

EDIT:

I decided to try a different approach and iterate over single characters and I pretty easily achieved what I wanted. Nevertheless, thank you for your answers and suggestions, I will most certainly go through them.

def char_by_char(i):
    abbrev= ""
    for index, each_char in enumerate(i):
        # Define previous and next characters 
        if index == 0:
            previous_char = None
        else:
            previous_char = i[index - 1]

        if index == len(i) - 1:
            next_char = None
        else:
            next_char = i[index + 1]
        # Character is uppercase
        if each_char.isupper():
            if next_char is not None:
                if next_char.isupper():
                    if (previous_char == ".") or (previous_char is None):
                        abbrev = abbrev + each_char
                    else:
                        pass
                else:
                    abbrev = abbrev + each_char
            else:
                pass
        # Character is "."
        elif each_char == ".":
            if next_char is not None and next_char.isdigit():
                pass
            else:
                abbrev = abbrev + each_char

        # Character is a digit              
        elif each_char.isdigit():
            abbrev = abbrev + each_char

        # Character is lowercase            
        else:
            pass
    print(abbrev)


for i in data["mappings"]["cwmp_genieacs"]["properties"]:
    if "." in i:
        char_by_char(i)
    else:
        pass    

You could use a regular expression for that. For instance, you could use capture groups for the characters that you want to keep, and perform a substitution where you only keep those captured characters:

import re

def shorten(s):
    return re.sub(r'([A-Z])(?:[A-Z]*(?=[A-Z])|[^A-Z.]*)|\.(\d+)[^A-Z.]*', r'\1\2', s)  

Explanation:

  • ([A-Z]) : capture a capital letter
  • (?: ) : this is a grouping to make clear what the scope is of the | operation inside of it. This is not a capture group like above (so this will be deleted)
  • [A-Z]* : zero or more capital letters (greedy)
  • (?=[A-Z]) : one more capital letter should follow, but don't process it -- leave it for the next match
  • | : logical OR
  • [^A-Z.]* : zero or more non-capitals, non-point (following the captured capital letter): these will be deleted
  • \.(\d+) : a literal point followed by one or more digits: capture the digits (in order to throw away the dot).

In the replacement argument, the captured groups are injected again:

  • \1 : first capture group (this is the capital letter)
  • \2 : second capture group (these are the digit(s) that followed a dot)

In one match, only one of the capture groups will have something, the other will just be the empty string. But the regular expression matching is repeated throughout the whole input string.
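To make the pieces easier to follow, here is the same pattern laid out with re.VERBOSE and checked against the two examples from the question. This is only a sketch: the name PATTERN and the verbose layout are illustrative additions, not part of the answer above.

```python
import re

# Same pattern as in shorten() above, spread out with re.VERBOSE so each
# piece can carry a comment. The name PATTERN is illustrative only.
PATTERN = re.compile(r"""
    ([A-Z])               # group 1: a capital letter to keep
    (?:
        [A-Z]*(?=[A-Z])   # rest of a run of capitals, stopping before its last one
      |
        [^A-Z.]*          # or the non-capital remainder of the word
    )
  |
    \.(\d+)               # a dot followed by digits; group 2 keeps only the digits
    [^A-Z.]*              # anything after the digits up to the next capital or dot
""", re.VERBOSE)


def shorten(s):
    # Each match is replaced by whichever group took part in it
    return PATTERN.sub(r"\1\2", s)


print(shorten("InternetGatewayDevice.DeviceInfo.Description"))
# IGD.DI.D
print(shorten("InternetGatewayDevice.WANDevice.1.WANConnectionDevice.1."
              "WANIPConnection.1.PortMapping.7.ExternalPort"))
# IGD.WD1.WCD1.WC1.PM7.EP
```

Note that this relies on Python 3.5 or later, where a group that did not take part in the match is replaced by an empty string in re.sub.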

Here is a non-regex solution.

def shorten(i):
    abr_list = []
    abrev = ''
    parts = i.split('.')
    for word in parts:
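        for x in range(len(word)):
            is_last = x + 1 == len(word)  # guards word[x + 1] against an IndexError
            # Keep the leading capital, a capital that ends a run of capitals, or a digit
            if (x == 0 and word[x].isupper()
                    or word[x].isupper() and (is_last or not word[x + 1].isupper())
                    or word[x].isnumeric()):
                abrev += word[x]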
        abr_list.append(abrev)
        abrev = ''
    return join_parts(abr_list)


def join_parts(part_list):
    ret = part_list[0]
    for part in part_list[1:]:
        if not part.isnumeric():
            ret += '.%s' % part
        else:
            ret += part
    return ret

import re

def foo(s):
    # Join every match: a capital that starts or ends a run of capitals, or a dot
    print(''.join(m.group(0) for m in re.finditer(
        r'(?<![A-Z])[A-Z]|[A-Z](?![A-Z])|\.', s)))

foo('InternetGatewayDevice.DeviceInfo.Description')
foo('WANDevice')
# output: 
# IGD.DI.D
# WD

There are three major parts to the regex:

  1. match if it's a capital letter with no capital letter in front of it (?<![A-Z])[A-Z] or
  2. match if it's a capital letter with no capital letter after it [A-Z](?![A-Z]) or
  3. match if it's a literal period \.
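
As written, that pattern keeps every dot and never matches a digit, so the numbered parts in the question's second example would not come out as requested. One possible variant, offered as a sketch rather than as part of the answer above, adds two assumptions: a dot is skipped when a digit follows it, and digits are kept. The names TOKEN and abbreviate are hypothetical.

```python
import re

# Hypothetical variant of the lookaround pattern above:
#   (?<![A-Z])[A-Z]  capital not preceded by a capital (start of a run)
#   [A-Z](?![A-Z])   capital not followed by a capital (end of a run)
#   \.(?!\d)         dot, unless a digit follows (assumption: "X.1" -> "X1")
#   \d               any digit (assumption: instance numbers are kept)
TOKEN = re.compile(r'(?<![A-Z])[A-Z]|[A-Z](?![A-Z])|\.(?!\d)|\d')


def abbreviate(s):
    # The pattern has no capture groups, so findall returns the full matches
    return ''.join(TOKEN.findall(s))


print(abbreviate("InternetGatewayDevice.WANDevice.1.WANConnectionDevice.1."
                 "WANIPConnection.1.PortMapping.7.ExternalPort"))
# IGD.WD1.WCD1.WC1.PM7.EP
```

Returning the string instead of printing it, as foo does, makes the function easier to reuse when building a mapping from long names to abbreviations.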

https://docs.python.org/3.6/library/re.html
