
Compare two pandas series in a dataframe and apply instructions on that

I have been trying to compare substrings of two series in a pandas dataframe. The two series are "titles" and "News", which are respectively the news headline and the news body that I scraped from a newspaper website. Many of the "News" entries include the headline as their first line, and I want to remove it from the "News" series.

For example:

df["News"][0] = "Mother Killed, police official injured in Madaripur road accidentA woman was killed .... flee the scene.AH/MUS"
df["titles"][0] = "Mother Killed, police official injured in Madaripur road accident"

I want to remove the titles from the News. In the above example, this should yield "A woman was killed .... flee the scene.AH/MUS".

I have done it like this:

df["replaced"] = [(df["News"][i].replace(df["titles"][i], ""))
                   for i in range(df.shape[0])
                 ]

This works, but I want to know what the fastest method would be. Specifically, I am looking for a more pandas-idiomatic way and would rather not loop or use a list comprehension. How can I apply this to the whole series without looping?

Try this, it should work like a charm:

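def getit(row):
    # Remove the headline from the news body for a single row.
    try:
        return row.get("News").replace(row.get("titles"), "")
    except (AttributeError, TypeError):
        # Either value is not a string (e.g. NaN or None): keep the body as-is.
        return row.get("News")

# axis=1 calls getit once per row, passing the row as a Series.
df["replaced"] = df.apply(getit, axis=1)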
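A note on speed: apply with axis=1 still calls a Python function once per row, so it is not really vectorized. As far as I know, pandas has no built-in string method that takes a different pattern for each row from another column, so some per-row work is hard to avoid. Below is a minimal sketch of an alternative that zips the two columns together. The sample data and the helper name strip_title are made up for illustration. Unlike str.replace, it only strips the headline when the body starts with it, which is a deliberate change in behavior. It is worth timing both versions on your own data before picking one.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "titles": [
        "Mother Killed, police official injured in Madaripur road accident",
        "Some headline",
        None,  # missing title
    ],
    "News": [
        "Mother Killed, police official injured in Madaripur road accident"
        "A woman was killed .... flee the scene.AH/MUS",
        "Body without the headline in it.",
        "Body whose title is missing.",
    ],
})


def strip_title(news, title):
    """Drop the headline only when the body starts with it (hypothetical helper)."""
    if isinstance(news, str) and isinstance(title, str) and title and news.startswith(title):
        return news[len(title):]
    # Non-string values or no leading headline: return the body unchanged.
    return news


# zip walks both columns in parallel without building a Series for every row.
df["replaced"] = [strip_title(n, t) for n, t in zip(df["News"], df["titles"])]
```

If you want the original behavior (remove the headline wherever it appears), swap the slice for news.replace(title, "") inside the helper.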
