
Extract a substring from a column and replace the column in a data frame

I need some help extracting a substring from a column in my data frame and then replacing that column with the substring. I was also wondering whether plain Python string stripping would perform better than using a regular expression to substitute the string with the substring.

The string looks something like this in the column:

Person
------
<Person 1234567 Tom Brady>
<Person 456789012 Mary Ann Thomas>
<Person 92145 John Smith>

What I would like is this:

Person
------
Tom Brady
Mary Ann Thomas
John Smith

The regular expression I have so far is this:

/^([^.]+[.]+[^.]+)[.]/g

That only gets the '<Person 1234567 ' part, and I'm not sure how to get the '>' at the end.

Python's re module has a function called search that finds the first match of a pattern in a string. With the examples given, you can use a regex to extract the names with:

import re
s = "<Person 1234567 John Smith>"
re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", s).group(0)
# 'John Smith'

The regular expression [A-Z][a-z]+(\s[A-Z][a-z]+)+ matches just the names (Tom Brady, Mary Ann Thomas, etc.).
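To see what the pattern picks out, here is a minimal sketch that runs it over the three sample strings from the question. The helper name first_name_match is hypothetical, and returning None for input without a match is an assumption, not part of the answer above.

```python
import re

# A capitalized word followed by one or more further capitalized words.
NAME_PATTERN = re.compile(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+")

def first_name_match(text):
    """Return the first run of capitalized words, or None if there is none."""
    match = NAME_PATTERN.search(text)
    return match.group(0) if match else None

samples = [
    "<Person 1234567 Tom Brady>",
    "<Person 456789012 Mary Ann Thomas>",
    "<Person 92145 John Smith>",
]
names = [first_name_match(s) for s in samples]
print(names)
```

Because each word has to be one capital letter followed by lowercase letters, a name such as "Tom O'Neil" would not match at all, so this only suits data shaped like the samples.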

I like to use pandas' apply function to apply an operation to each row, so the final result would look like this:

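import re
import pandas as pd

def extract_name(row):
    # Replace the raw value with the first run of capitalized words.
    row["Person"] = re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", row["Person"]).group(0)
    return row

# Sample data from the question; substitute your own DataFrame here.
df = pd.DataFrame({"Person": ["<Person 1234567 Tom Brady>",
                              "<Person 456789012 Mary Ann Thomas>",
                              "<Person 92145 John Smith>"]})
df2 = df.apply(extract_name, axis=1)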

df2 then has the Person column with the extracted names.
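As an alternative to a row-wise apply, the same pattern can be passed to pandas' Series.str.extract, which works on the whole column at once. This is a minimal sketch, assuming the sample data from the question. The outer capture group is added here because str.extract requires one, and rows without a match become NaN instead of raising an error.

```python
import pandas as pd

df = pd.DataFrame({"Person": ["<Person 1234567 Tom Brady>",
                              "<Person 456789012 Mary Ann Thomas>",
                              "<Person 92145 John Smith>"]})

# The outer group captures the full name; the inner group is non-capturing
# so that str.extract returns a single column.
pattern = r"([A-Z][a-z]+(?:\s[A-Z][a-z]+)+)"

df2 = df.copy()
# expand=False returns a Series when the pattern has exactly one group.
df2["Person"] = df2["Person"].str.extract(pattern, expand=False)
print(df2)
```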

To keep things simple, you can first pull out every run of characters that is not an angle bracket, a digit, or a hyphen:

import re
string = "<Person 1234567 Tom Brady>"
res = re.findall(r"[^<>0-9-]+", string)
res[1].strip()

For the string above, findall returns the list ['Person ', ' Tom Brady'], so the name of the person is the second element, res[1].

** Remark: the matches keep their surrounding spaces, which is why strip() is applied to the result.

You can read more about re.findall() online or in the documentation.
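The same idea can be applied to a whole column with Series.str.findall. This is a sketch under stated assumptions: the character class used here excludes angle brackets, digits, and hyphens ([^<>0-9-]), the sample data comes from the question, and every row is assumed to yield at least two pieces.

```python
import pandas as pd

df = pd.DataFrame({"Person": ["<Person 1234567 Tom Brady>",
                              "<Person 456789012 Mary Ann Thomas>",
                              "<Person 92145 John Smith>"]})

# Each row becomes a list such as ['Person ', ' Tom Brady'].
parts = df["Person"].str.findall(r"[^<>0-9-]+")

# Take the second element of each list and trim the surrounding spaces.
names = parts.str[1].str.strip()
print(names.tolist())
```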

There are multiple ways, but you can use str.replace():

import pandas as pd

df = pd.DataFrame({'Person': ['<Person 1234567 Tom Brady>',
                              '<Person 456789012 Mary Ann Thomas>',
                              '<Person 92145 John Smith>']})
df['Person'] = df['Person'].str.replace(r'(?:<Person[\d\s]+|>)', '', regex=True)

print(df)

Prints:

            Person
0        Tom Brady
1  Mary Ann Thomas
2       John Smith

Pattern used: (?:<Person[\d\s]+|>) (see an online demo):

  • (?: - Open non-capture group for alternation;
    • <Person[\d\s]+ - Match literal '<Person' followed by 1+ whitespace characters or digits;
    • | - Or;
    • > - A literal '>'
    • ) - Close group.
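Since the question also asks about plain string stripping, here is a sketch without a regular expression for comparison. It assumes every value has exactly the shape '<Person ID NAME>' with single spaces between the parts. No claim is made here about which variant is faster; timing both on your own data would settle that.

```python
import pandas as pd

df = pd.DataFrame({"Person": ["<Person 1234567 Tom Brady>",
                              "<Person 456789012 Mary Ann Thomas>",
                              "<Person 92145 John Smith>"]})

# Drop the leading '<' and trailing '>'.
stripped = df["Person"].str.strip("<>")

# Split on the first two spaces only, so a name that itself contains
# spaces stays in one piece as the third element.
df["Person"] = stripped.str.split(" ", n=2).str[2]
print(df)
```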
