
How to extract text before and after a keyword and date

I would like to separate the author's name, the domain, and the date out of a DataFrame column.

While .split(" in ") works well to separate the author's name on the left, I also want to separate the domain and the date, which are not separated by a space.

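from pandas import DataFrame

data = {'Details': ['Daniel Jacobs in HackeMoon.comJul 31, 2017',
                    'Wil Zelk in websiteabc.deJan 28',
                    'Wil Zelk in anotherwebsite.chJan 28, 2019']}

df = DataFrame(data, columns=['Details'])
print(df)

# Splitting on " in " isolates the author, but the domain and date stay fused in the second column.
# The result is stored under a new name so that df['Details'] remains available.
parts = df.Details.str.split(" in ", expand=True)
print(parts)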

You can try Series.str.extract for this, in combination with a regex that uses named groups:

df['Details'].str.extract(r'(?P<author>.*?) in (?P<url>.*)(?P<date>[A-Z].*)', expand=True)

This yields:

          author                url          date
0  Daniel Jacobs      HackeMoon.com  Jul 31, 2017
1       Wil Zelk      websiteabc.de        Jan 28
2       Wil Zelk  anotherwebsite.ch  Jan 28, 2019

To separate the strings I make use of the following assumptions:

  • the name and the url are separated by " in "
  • the first character (and only the first character) of the date is an upper case letter (so the last upper case character in the string marks the first character of the date part)
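The second assumption breaks as soon as an upper-case letter appears after the start of the date, or when a row has no date at all. As a rough sketch of a tighter alternative, the variant below anchors the date on an English month abbreviation instead. The names `MONTHS`, `loose` and `strict` are made up for this illustration, and the date shape ("Mon D" or "Mon D, YYYY") is inferred only from the three sample rows, so treat it as an assumption rather than a general solution.

```python
import pandas as pd

df = pd.DataFrame({
    "Details": [
        "Daniel Jacobs in HackeMoon.comJul 31, 2017",
        "Wil Zelk in websiteabc.deJan 28",
        "Wil Zelk in anotherwebsite.chJan 28, 2019",
    ]
})

# Pattern from the answer: the last upper-case letter marks the start of the date.
loose = r"(?P<author>.*?) in (?P<url>.*)(?P<date>[A-Z].*)"

# Hypothetical stricter pattern: the date must start with a month abbreviation,
# followed by a day and an optional ", YYYY" (assumed from the sample data).
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
strict = (
    rf"^(?P<author>.+?) in (?P<url>\S+?)"
    rf"(?P<date>(?:{MONTHS}) \d{{1,2}}(?:, \d{{4}})?)$"
)

parts_loose = df["Details"].str.extract(loose)
parts_strict = df["Details"].str.extract(strict)

print(parts_strict)
```

On the sample rows both patterns give the same three columns. With the stricter pattern, rows that do not end in a date of the assumed shape come back as NaN instead of being split at an arbitrary capital letter.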
