I would like to separate the author's name, the domain, and the date out of a DataFrame column.
While .split(" in ") works well to separate the author's name on the left, I also want to separate the domain and the date, which are not separated by a space.
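from pandas import DataFrame

# Sample data: author, domain and date are fused into one string per row
data = {'Details': ['Daniel Jacobs in HackeMoon.comJul 31, 2017',
                    'Wil Zelk in websiteabc.deJan 28',
                    'Wil Zelk in anotherwebsite.chJan 28, 2019']}
df = DataFrame(data, columns=['Details'])
print(df)
# Splits off the author, but leaves domain and date fused in the second column.
# Stored under a new name so that df keeps its 'Details' column.
parts = df.Details.str.split(" in ", expand=True)
print(parts)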
You can try Series.str.extract for this, in combination with a regex with named groups:
df['Details'].str.extract(r'(?P<author>.*?) in (?P<url>.*)(?P<date>[A-Z].*)', expand=True)
This yields:
          author                url          date
0  Daniel Jacobs      HackeMoon.com  Jul 31, 2017
1       Wil Zelk      websiteabc.de        Jan 28
2       Wil Zelk  anotherwebsite.ch  Jan 28, 2019
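To see how the three named groups divide a single string, the same pattern can be run with Python's re module. This is only an illustrative sketch. The variable names (pattern, sample, groups) are not from the original post.

```python
import re

# Same pattern as in the answer: lazy author, greedy url,
# and a date that starts at the last capital letter in the string.
pattern = re.compile(r'(?P<author>.*?) in (?P<url>.*)(?P<date>[A-Z].*)')

sample = 'Daniel Jacobs in HackeMoon.comJul 31, 2017'
groups = pattern.search(sample).groupdict()

print(groups['author'])  # Daniel Jacobs
print(groups['url'])     # HackeMoon.com
print(groups['date'])    # Jul 31, 2017
```

Because the url group is greedy, it first swallows the rest of the string and then backtracks until the date group can start on a capital letter, which is why the split lands on the "J" of "Jul" and not on the "H" or "M" of "HackeMoon".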
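To separate the strings I make use of the following assumptions: the author's name is followed by " in ", the date starts with a capital letter and contains no further capitals, and the domain is everything between the two.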
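The pattern takes the date to begin at the last capital letter in the string. If that proves too loose for real data, one possible tightening is to anchor the date on a month abbreviation. The following is a minimal sketch, assuming every date starts with a three-letter English month abbreviation followed by a day and an optional year. The names MONTHS and strict, as well as the fourth sample row, are hypothetical and not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({'Details': [
    'Daniel Jacobs in HackeMoon.comJul 31, 2017',
    'Wil Zelk in websiteabc.deJan 28',
    'Wil Zelk in anotherwebsite.chJan 28, 2019',
    'Ann Lee in Example.org',  # hypothetical row without a date
]})

# Assumption: dates look like "Jan 28" or "Jan 28, 2019"
MONTHS = 'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec'
strict = (
    r'(?P<author>.+?) in '                               # author, up to the first " in "
    r'(?P<url>\S+?)'                                     # domain, no whitespace allowed
    rf'(?P<date>(?:{MONTHS}) \d{{1,2}}(?:, \d{{4}})?)$'  # month, day, optional year
)

result = df['Details'].str.extract(strict, expand=True)
print(result)
```

With this variant, a row that has no recognizable date yields NaN in all three columns, whereas the looser pattern would treat "Example.org" as the date because it starts with a capital letter.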