Remove a string from a pandas column depending on another column

Question

I have a sample dataframe:

      col1                                   col2  
0     Hello, is it me you're looking for     Hello   
1     Hello, is it me you're looking for     me 
2     Hello, is it me you're looking for     looking 
3     Hello, is it me you're looking for     for   
4     Hello, is it me you're looking for     Lionel  
5     Hello, is it me you're looking for     Richie

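I want to change col1 so that the string in col2 is removed from it, and get the modified dataframe back. I also want to remove one character before and one character after the string. For example, the desired output for index 1 would be: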

      col1                                   col2
1     Hello, is ityou're looking for         me

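I have tried using pd.apply() and pd.map() together with the .replace() function, but I can't get .replace() to take df['col2'] as an argument. I also have a feeling this isn't the best way to go about it.

Any help? I'm mostly new to pandas and I want to learn, so please ELI5.

Thanks!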

Answer 1

My guess is that you left out axis=1, so apply passed each column to your function instead of each row:

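import pandas as pd

# Sample data: one "col1;col2" pair per line
A = """Hello, is it me you're looking for;Hello
Hello, is it me you're looking for;me
Hello, is it me you're looking for;looking
Hello, is it me you're looking for;for
Hello, is it me you're looking for;Lionel
Hello, is it me you're looking for;Richie
"""
# Split into lines, drop the empty string left after the final newline,
# then split each line on ";"
df = pd.DataFrame([a.split(";") for a in A.split("\n")][:-1],
                  columns=["col1", "col2"])

# axis=1 passes each row to the lambda, so both columns are available
df.col1 = df.apply(lambda x: x.col1.replace(x.col2, ""), axis=1)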
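To make the effect of axis concrete, here is a minimal sketch on a two-row frame (the small frame and the variable names by_row and failed are only for illustration). With axis=1 the lambda sees one row at a time; with the default axis=0 it sees one column at a time, so looking up "col1" on it fails.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["Hello, is it me you're looking for"] * 2,
    "col2": ["me", "Lionel"],
})

# axis=1: the lambda receives one row at a time, so both columns are reachable.
by_row = df.apply(lambda row: row["col1"].replace(row["col2"], ""), axis=1)
print(by_row.tolist())

# Default axis=0: the lambda receives one column at a time, labelled 0, 1, ...
# so asking it for "col1" raises a KeyError.
try:
    df.apply(lambda col: col["col1"].replace(col["col2"], ""))
    failed = False
except KeyError:
    failed = True
print(failed)
```

Note that plain replace leaves the spaces on both sides of the removed word in place, which is why the first result contains a double space.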

Answer 2

To run a function on every row of a dataframe, you can use:

df.apply(func, axis=1)

func receives each row as a Series, passed in as its argument.
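As a quick check of what func actually receives, this small sketch (the helper name describe is made up for illustration) reports the type of the row and its labels:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["Hello, is it me you're looking for"], "col2": ["me"]})

def describe(row):
    # row is a pandas Series whose index holds the column names
    return f"{type(row).__name__} with labels {list(row.index)}"

result = df.apply(describe, axis=1).iloc[0]
print(result)
```

Because the row is labelled by column name, row['col1'] and row['col2'] both work inside the function.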

df['col1'] = df.apply(lambda row: row['col1'].replace(row['col2'], ''), axis=1)

However, removing one character before and after takes a bit more work.

So define func:

def func(row):
    c1 = row['col1']  # string from col1
    c2 = row['col2']  # string from col2
    find_index = c1.find(c2)  # index of the first occurrence of c2, searching from the left
    if find_index == -1:  # c2 not found
        return c1  # leave unchanged
    else:
        start_index = max(find_index - 1, 0)  # one character before, but never negative
        end_index = find_index + len(c2) + 1  # one character after; slicing tolerates an end past the string
        return c1.replace(c1[start_index:end_index], '')  # remove every occurrence of that slice

Then:

df['col1'] = df.apply(func, axis=1)

*To avoid a SettingWithCopyWarning, use:

df = df.assign(col1=df.apply(func, axis=1))
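One thing to be aware of: str.replace removes every occurrence of the slice, not only the one that was found. If only the first match should be cut out, a slicing variant might look like the sketch below. The name remove_with_neighbors is hypothetical and not part of the answer above; the data is the sample from the question.

```python
import pandas as pd

def remove_with_neighbors(row):
    """Cut out only the first match of col2 in col1, plus one character on each side."""
    text, target = row["col1"], row["col2"]
    pos = text.find(target)
    if pos == -1:
        return text  # nothing to remove
    start = max(pos - 1, 0)          # one character before, clamped at the start
    end = pos + len(target) + 1      # one character after; slicing tolerates overshoot
    return text[:start] + text[end:]

sentence = "Hello, is it me you're looking for"
df = pd.DataFrame({
    "col1": [sentence] * 6,
    "col2": ["Hello", "me", "looking", "for", "Lionel", "Richie"],
})
df = df.assign(col1=df.apply(remove_with_neighbors, axis=1))
print(df["col1"].tolist())
```

For index 1 this gives "Hello, is ityou're looking for", which matches the output asked for in the question. Rows whose col2 value does not appear in col1 are left untouched.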

Answer 3

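There may be a more pythonic or elegant way, but this is how I did it quickly. It works best if you don't need much flexibility in manipulating the strings and getting a fix in place quickly matters more than performance.

I pull the dataframe's columns out as two separate Series: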

col1Series = df['col1']
col2Series = df['col2']

Next, create an empty list to hold the final string values:

rowxList = []

Iterate as follows to fill the list:

for x,y in zip(col1Series,col2Series):
    rowx  = x.replace(y,'')
    rowxList.append(rowx)

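Finally, put rowxList back into the original dataframe as a new column. You could overwrite the old column instead, but it is safer to write to a new column, check the output against the original two columns, and then drop the old column once it is no longer needed: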

df['newCol'] = rowxList
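The same steps can be condensed as in the sketch below, assuming a three-row version of the sample data. The list comprehension pairs values exactly like the zip loop; the variable name cleaned and the final drop/rename step are one possible way to finish, not something the steps above prescribe.

```python
import pandas as pd

sentence = "Hello, is it me you're looking for"
df = pd.DataFrame({"col1": [sentence] * 3, "col2": ["me", "for", "Lionel"]})

# Same pairing of values as the zip loop, written as a list comprehension.
df["newCol"] = [x.replace(y, "") for x, y in zip(df["col1"], df["col2"])]

# Inspect the three columns side by side before deciding to drop col1.
print(df[["col1", "col2", "newCol"]])

# Once satisfied, drop the old column and give the new one its name.
cleaned = df.drop(columns="col1").rename(columns={"newCol": "col1"})
print(cleaned)
```

Like the loop, this only removes the col2 string itself; the characters before and after it stay in place.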


3 solutions

Solution 1
2 2017-11-19 13:25:35

Solution 2
2 2017-11-19 13:26:02

Solution 3
0 2020-04-30 10:41:02
