
How to extract multiple strings using Regex?

I have a column in a df that contains the following values:

>>> import pandas as pd
>>> df = pd.DataFrame({'Sentence':['This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm', 'I have researched the product KEY_abc_def, and KEY_blt_chm as requested', 'He got the idea from your message KEY_mno_pqr']})
>>> df
                                                Sentence
0       This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm
1  I have researched the product KEY_abc_def, and KEY_blt_chm as requested
2            He got the idea from your message KEY_mno_pqr

I would like to use regex to extract each KEY into a new column, without the "KEY_" prefix. For sentences that contain more than one KEY, the keys should be joined with a comma. The output should look like this:

>>> df
                                                Sentence                               KEY
0      This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm    abc_def, mno_pqr, blt_chm
1  I have researched the product KEY_abc_def, and KEY_blt_chm as requested          abc_def, blt_chm     
2           He got the idea from your message KEY_mno_pqr                           mno_pqr  

I tried the code below, but it only captures the first KEY and ignores the rest. I'm new to regex, so any suggestions would be greatly appreciated.

df['KEY'] = df['Sentence'].str.extract(r"KEY_(\w+)", expand=True)

Use

df['KEY'] = df['Sentence'].str.findall(r"KEY_(\w+)").str.join(",")

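Series.str.findall returns, for each row, a list of every match of the pattern; because the pattern contains one capturing group, each list holds only the captured part (the text after "KEY_"). str.join(",") then joins each list into a single comma-separated string. Use ", " as the separator if you want a space after each comma, as in the expected output.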
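As a self-contained sketch of the same idea, the snippet below rebuilds the question's DataFrame and joins the matches with ", " to mirror the spacing of the expected output. The fourth row (a sentence without any KEY) is a hypothetical addition, not part of the original data; it is there only to show that a row with no match ends up as an empty string.

```python
import pandas as pd

df = pd.DataFrame({
    "Sentence": [
        "This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm",
        "I have researched the product KEY_abc_def, and KEY_blt_chm as requested",
        "He got the idea from your message KEY_mno_pqr",
        "No keys in this sentence",  # hypothetical row, not in the original data
    ]
})

# findall returns one list per row; with a single capturing group,
# each list contains only the text matched by (\w+), i.e. without "KEY_".
matches = df["Sentence"].str.findall(r"KEY_(\w+)")

# Join each list into one string; an empty list becomes "".
df["KEY"] = matches.str.join(", ")

print(df["KEY"].tolist())
```

Since \w also matches the underscore, "abc_def" is captured as a whole, while the comma after "KEY_abc_def," in the second sentence ends the match.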

Pandas test:

>>> df['KEY'] = df['Sentence'].str.findall(r"KEY_(\w+)").str.join(",")
>>> df
                                                                    Sentence                      KEY
0  This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm  abc_def,mno_pqr,blt_chm
1    I have researched the product KEY_abc_def, and KEY_blt_chm as requested          abc_def,blt_chm
2                              He got the idea from your message KEY_mno_pqr                  mno_pqr

(Note, in case you did not know: I used pd.set_option('display.max_colwidth', None) to display the full contents of each column; see How to display full (non-truncated) dataframe information in html when converting from pandas dataframe to html?)
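If you would rather not change the display setting globally, a possible variant is to scope it with pd.option_context. This is only a sketch: the 80-character sample string and the variable names are made up for illustration, and the width of 50 is set explicitly so the comparison does not depend on the local default.

```python
import pandas as pd

long_text = "x" * 80  # made-up sample value, longer than the truncation width
df = pd.DataFrame({"Sentence": [long_text]})

# With a column width limit of 50, long cell values are shortened in the repr.
with pd.option_context("display.max_colwidth", 50):
    short_repr = repr(df)

# With the limit set to None, cell values are shown in full.
with pd.option_context("display.max_colwidth", None):
    full_repr = repr(df)

print(short_repr)
print(full_repr)
```

The option reverts to its previous value when each with block ends, so other output in the same session is unaffected.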

