Split pandas dataframe column into 2 by regex

I have a data set like the one below. I want to split the Name column into two columns: Name should be overwritten with the first name and surname, and a new MiddleName column should contain only the middle name, including the brackets.

In [1]: import pandas as pd

In [2]: dd = {'Name' : ['Daniel [Jack] Horn', 'Marcus [Martin] Dwell', 'Greg [Alex] Waltz']}

In [3]: dd_frame = pd.DataFrame(dd)

In [4]: dd_frame
Out[4]: 
                    Name
0     Daniel [Jack] Horn
1  Marcus [Martin] Dwell
2      Greg [Alex] Waltz

The expected output is

             Name    MiddleName
0     Daniel Horn        [Jack]  
1    Marcus Dwell      [Martin]
2      Greg Waltz        [Alex]

What would be a simple way to do this without splitting into three columns and then merging the first and third?

Try using regex:

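df = dd_frame
# expand=False returns a Series; the group captures the text inside the brackets
df['Middle Name'] = df['Name'].str.extract(r"\[(.*)\]", expand=False)
# regex=True is needed in pandas 2.0+, where str.replace defaults to a literal match
df['Name'] = df['Name'].str.replace(r"\s+\[(.*)\]", "", regex=True)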

    Name            Middle Name
0   Daniel Horn     Jack
1   Marcus Dwell    Martin
2   Greg Waltz      Alex

df["Middle Name"] = df.Name.apply(lambda x: x.split(" ")[1][1:-1])


                Name      Middle Name
0   Daniel [Jack] Horn       Jack
1   Marcus [Martin] Dwell   Martin
2   Greg [Alex] Waltz        Alex

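This is far from the most elegant way to do what you want, but it works. It first splits each name on " " (a space). The middle item of the resulting list is the bracketed middle name, and [1:-1] then takes what is between the brackets. If you want to keep the brackets, remove the [1:-1]. Note that this relies on the middle name always being the second space-separated token.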
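The steps described above can be traced on a single string. This is only an illustrative sketch in plain Python; the four-word name at the end is a hypothetical example (not from the question) added to show where the positional approach stops working.

```python
name = "Daniel [Jack] Horn"

# Step 1: split on single spaces.
parts = name.split(" ")             # ['Daniel', '[Jack]', 'Horn']

# Step 2: the middle item is the bracketed middle name.
with_brackets = parts[1]            # '[Jack]'

# Step 3: drop the first and last character to remove the brackets.
without_brackets = parts[1][1:-1]   # 'Jack'

# Hypothetical name with two first names: index 1 is no longer the
# bracketed token, so the slice silently returns the wrong text.
fragile = "Mary Ann [Sue] Lee".split(" ")[1][1:-1]   # 'n'
```

If names may contain more than three tokens, matching the brackets with a regular expression is likely the safer choice.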

import re
df["MidName"] = df.Name.apply(lambda x: re.findall(r'\[[^\]]*\]', x)[0])

#output

          Name           Middle Name    MidName
0   Daniel [Jack] Horn      Jack        [Jack]
1   Marcus [Martin] Dwell   Martin      [Martin]
2   Greg [Alex] Waltz       Alex        [Alex]

This uses a regular expression, although I am not a regex expert. re.findall returns a list of every bracketed match, so you have to take the first element of that list; otherwise each cell would hold a list such as ['[Jack]'] instead of the string [Jack].
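To make the point about findall concrete, the sketch below shows the list it returns for one name, followed by a vectorized alternative using Series.str.extract. The alternative is a suggestion rather than part of the answer above; it assumes the pattern is wrapped in one capturing group so that the brackets are kept.

```python
import re
import pandas as pd

pattern = r"\[[^\]]*\]"

# findall always returns a list, even when there is only one match.
matches = re.findall(pattern, "Daniel [Jack] Horn")   # ['[Jack]']
first_match = matches[0]                              # '[Jack]'

dd_frame = pd.DataFrame(
    {"Name": ["Daniel [Jack] Horn", "Marcus [Martin] Dwell", "Greg [Alex] Waltz"]}
)

# Vectorized variant: one capturing group around the whole bracketed part.
# With expand=False and a single group, str.extract returns a Series.
mid_names = dd_frame["Name"].str.extract(r"(\[[^\]]*\])", expand=False)
```

One practical difference: for a name with no brackets, findall(...)[0] raises an IndexError, whereas str.extract leaves a missing value in that row.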

In addition to the excellent answers above, here is one using Series.str.split:

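# The capturing group keeps the bracketed part in the split result;
# \s* on either side drops the surrounding spaces.
extracts = [(f"{first} {last}", middle)
             for first, middle, last in
             dd_frame.Name.str.split(r"\s*(\[.+\])\s*")]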

pd.DataFrame(extracts, columns=["Name", "MiddleName"])

        Name      MiddleName
0   Daniel Horn      [Jack]
1   Marcus Dwell    [Martin]
2   Greg Waltz      [Alex]

Here's how you can do it!

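# Assumes every name has exactly three whitespace-separated parts
split = dd_frame['Name'].str.split(expand=True)   # columns 0, 1, 2
dd_frame['Name'] = split[0] + ' ' + split[2]
dd_frame['MiddleName'] = split[1]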

