Using str.split with regex to split between uppercase and propercase strings

I have a column of strings containing full names. Lastnames are distinguished as groups of all-uppercase letters while Firstnames are given in propercase. The majority of names are ordered as (Firstname, LASTNAME) but many contain LASTNAME information in the middle or at the beginning of the string, as in the last entries here.

0       Manuel JOSE
1       Vincent MUANDUMBA
2       Alejandro DE LORRES
3       Luis FILIPE da Rivera
4       LIM Jock Hoi

I would like to split this column into separate Firstname and Lastname columns according to whether each word in the string is in propercase (Firstname) or in all-caps (Lastname).

new = df["FullName"].str.split(pat=r'(?=[A-Z][a-z])', n=1, expand = True)
df['FirstName'] = new[0]
df['LastName'] = new[1]

All words in propercase or lowercase should be grouped in new[0], while all words in uppercase should be grouped in new[1].

However, I can't get this output because my regex isn't right. I've also tried pat=r'[A-Z](?:[A-Z]*(?![a-z])|[a-z]*)'.
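To see why the lookahead pattern from the attempt above does not separate the two cases, here is a minimal sketch using plain re.split. It assumes that the pandas regex split behaves like re.split with maxsplit equal to n, and that Python 3.7 or later is used (earlier versions do not split on empty matches). The sample names are taken from the question.

```python
import re

# Pattern from the attempt above: a zero-width lookahead that matches
# just before any uppercase letter followed by a lowercase letter.
pattern = re.compile(r'(?=[A-Z][a-z])')

names = ["Manuel JOSE", "Luis FILIPE da Rivera", "LIM Jock Hoi"]

# maxsplit=1 plays the role of n=1 in the pandas call (an assumption
# about how pandas delegates to the re module).
results = {name: pattern.split(name, maxsplit=1) for name in names}

for name, parts in results.items():
    print(repr(name), '->', parts)
```

For a name that starts with a propercase word, the only split happens at position 0, so the first piece is empty and the second piece is the whole name. The split point depends on where the first propercase word starts, not on which words are all-caps, so a single split cannot handle a name such as "Luis FILIPE da Rivera".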

You can use regex:

df['LastName'] = df['FullName'].str.findall(r'\b[A-Z]+(?:\s+[A-Z]+)*\b').str.join(' ')
df['FirstName'] = df['FullName'].str.findall(r"[A-Z]{0,1}[a-z]+").str.join(' ')

Output:

                FullName   LastName       FirstName
0            Manuel JOSE       JOSE          Manuel
1      Vincent MUANDUMBA  MUANDUMBA         Vincent
2    Alejandro DE LORRES  DE LORRES       Alejandro
3  Luis FILIPE da Rivera     FILIPE  Luis da Rivera
4           LIM Jock Hoi        LIM        Jock Hoi
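The two findall patterns above can be checked on plain strings without a DataFrame. The following is a small sketch, not part of the original answer; the helper name split_name is hypothetical, and `[A-Z]?` is used as an equivalent spelling of `[A-Z]{0,1}`.

```python
import re

# Runs of all-caps words, possibly separated by whitespace.
LAST_RE = re.compile(r'\b[A-Z]+(?:\s+[A-Z]+)*\b')
# An optional capital followed by one or more lowercase letters.
FIRST_RE = re.compile(r'[A-Z]?[a-z]+')


def split_name(full_name):
    """Return (first, last) for one name (hypothetical helper)."""
    last = ' '.join(LAST_RE.findall(full_name))
    first = ' '.join(FIRST_RE.findall(full_name))
    return first, last


for name in ["Alejandro DE LORRES", "Luis FILIPE da Rivera", "LIM Jock Hoi"]:
    print(name, '->', split_name(name))
```

The word boundary after `[A-Z]+` is what stops the capital "M" of "Manuel" from being picked up as a last name: a lowercase letter follows it, so there is no boundary and the match fails. Under the same assumptions, the function could be applied to a column with df['FullName'].apply(split_name).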

This code is a bit longer than a single regex pattern, but it guarantees that every word of the name ends up in either FirstName or LastName. The trick is the str.istitle() method.

# Split every string in the FullName column into a list of words
new = df["FullName"].str.split(' ')

# Lists that will become the new df columns
FirstName = []
LastName = []

# Iterate over every split name (one list of words per row)
for name in new:
    Propercase = []  # words that belong to FirstName
    Allcaps = []     # words that belong to LastName (all-caps)
    # Iterate over every word in the name
    for n in name:
        # Propercase ('Luis') or lowercase ('da') goes to FirstName
        if n.istitle() or n.islower():
            Propercase.append(n)
        # Anything else is treated as all-caps
        else:
            Allcaps.append(n)
    # Join the propercase words into the FirstName value
    FirstName.append(' '.join(Propercase))
    # Join the all-caps words into the LastName value
    LastName.append(' '.join(Allcaps))

# Create columns
df['FirstName'] = FirstName
df['LastName'] = LastName

Output:

                FullName       FirstName   LastName
0            Manuel JOSE          Manuel       JOSE
1      Vincent MUANDUMBA         Vincent  MUANDUMBA
2    Alejandro DE LORRES       Alejandro  DE LORRES
3  Luis FILIPE da Rivera  Luis da Rivera     FILIPE
4           LIM Jock Hoi        Jock Hoi        LIM
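The same istitle()/islower() classification can be written as a small standalone function, which makes it easy to try on individual names. This is a sketch only; the name split_by_case is hypothetical and not from the answer above.

```python
def split_by_case(full_name):
    """Return (first, last) using the istitle()/islower() test per word."""
    first, last = [], []
    for word in full_name.split():
        # Propercase or fully lowercase words count as first-name parts;
        # everything else is treated as a last-name part.
        if word.istitle() or word.islower():
            first.append(word)
        else:
            last.append(word)
    return ' '.join(first), ' '.join(last)


for name in ["Luis FILIPE da Rivera", "LIM Jock Hoi", "Ian McDonald"]:
    print(name, '->', split_by_case(name))
```

One caveat worth knowing: a mixed-case word such as "McDonald" is neither title case nor lowercase according to str.istitle() and str.islower(), so this test sends it to the last-name side even though it is not all-caps.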

This can be faster if you are sure that the first word of the name is either the complete Firstname or the complete Lastname (true in most cultures, but less general):

new = df["FullName"].str.split(' ', n=1)

FirstName = []
LastName = []
for name in new:
    if name[0].istitle():
        FirstName.append(name[0])
        LastName.append(name[1])
    else:
        FirstName.append(name[1])
        LastName.append(name[0])

df['FirstName'] = FirstName
df['LastName'] = LastName
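Note that the loop above indexes name[1], which does not exist when a name consists of a single word. A guarded variant might look like the sketch below; it uses str.partition instead of a split, and the function name split_first_word is hypothetical.

```python
def split_first_word(full_name):
    """Return (first, last), assuming the first word is a complete
    Firstname or a complete Lastname (hypothetical helper)."""
    # partition returns (head, sep, tail); tail is '' when there is no space.
    head, _, tail = full_name.partition(' ')
    if head.istitle():
        return head, tail
    return tail, head


for name in ["Manuel JOSE", "LIM Jock Hoi", "Madonna", "Luis FILIPE da Rivera"]:
    print(name, '->', split_first_word(name))
```

The last sample shows the limit of this shortcut: for "Luis FILIPE da Rivera" everything after the first word is assigned to the last name, so names with a LASTNAME in the middle still need the word-by-word approach.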
