How to extract text before and after a keyword and date

I would like to separate the author's name, the domain, and the date out of a DataFrame column.

While .split(" in ") works well to separate the author's name on the left, I also want to separate the domain from the date, and those two are not separated by a space.

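from pandas import DataFrame

# Sample data: author, domain and date packed into one string per row
data = {'Details': ['Daniel Jacobs in HackeMoon.comJul 31, 2017',
                    'Wil Zelk in websiteabc.deJan 28',
                    'Wil Zelk in anotherwebsite.chJan 28, 2019']}

df = DataFrame(data, columns=['Details'])
print(df)

# Splitting on " in " isolates the author, but domain and date stay glued together.
# The result goes into a new variable so that df['Details'] remains available.
split_df = df['Details'].str.split(" in ", expand=True)
print(split_df)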

You can use Series.str.extract (called here as df['Details'].str.extract) in combination with a regex that has named groups:

df['Details'].str.extract(r'(?P<author>.*?) in (?P<url>.*)(?P<date>[A-Z].*)', expand=True)

This yields:

          author                url          date
0  Daniel Jacobs      HackeMoon.com  Jul 31, 2017
1       Wil Zelk      websiteabc.de        Jan 28
2       Wil Zelk  anotherwebsite.ch  Jan 28, 2019
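
If the extracted columns should sit next to the original column, one option is to join the result back onto the frame. This is a minimal sketch that reuses the pattern above; the names `parts` and `result` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Details": [
        "Daniel Jacobs in HackeMoon.comJul 31, 2017",
        "Wil Zelk in websiteabc.deJan 28",
        "Wil Zelk in anotherwebsite.chJan 28, 2019",
    ]
})

# Same pattern as in the answer: lazy author, greedy url, date from the last capital
pattern = r"(?P<author>.*?) in (?P<url>.*)(?P<date>[A-Z].*)"

# str.extract returns one column per named group, indexed like df
parts = df["Details"].str.extract(pattern)

# Join on the shared index so the original strings are kept alongside
result = df.join(parts)
print(result)
```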

To separate the strings I make use of the following assumptions:

  • the name and the url are separated by " in "
  • the date starts with an upper-case letter and contains no other upper-case letters, so the last upper-case character in the string marks where the date begins (the greedy url group backtracks to exactly that point)
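
If the second assumption feels too loose for your data, a stricter variant could anchor the date on English month abbreviations instead of "the last upper-case letter". The sketch below is not part of the original answer. It assumes that dates always look like "Jul 31, 2017" or "Jan 28" (three-letter month, day, optional year), and the `MONTHS` constant is introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Details": [
        "Daniel Jacobs in HackeMoon.comJul 31, 2017",
        "Wil Zelk in websiteabc.deJan 28",
        "Wil Zelk in anotherwebsite.chJan 28, 2019",
    ]
})

# Assumption: dates use three-letter English month abbreviations
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

pattern = (
    r"^(?P<author>.*?) in "   # everything before the first " in "
    r"(?P<url>.*?)"           # lazily grow until the date matches up to the end
    rf"(?P<date>(?:{MONTHS}) \d{{1,2}}(?:, \d{{4}})?)$"  # month, day, optional year
)

parts = df["Details"].str.extract(pattern)
print(parts)
```

Rows that do not match this stricter pattern come back as NaN in all three columns, which makes unexpected formats easy to spot.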
