A more Pythonic way to drop rows in a Pandas DataFrame whose value starts with another row's value

Question

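I am working with a Pandas DataFrame and would like to drop rows whose "Full Path" is already contained in another "Full Path" of the DataFrame.

In the example below I want to drop rows 1, 2, 3 and 4, because c:/dir/ "contains" them (we are talking about file system paths here):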

     Full Path        Value
0    c:/dir/            x
1    c:/dir/sub1/       x
2    c:/dir/sub2/       x
3    c:/dir/sub2/a      x
4    c:/dir/sub2/b      x
5    c:/anotherdir/     x
6    c:/anotherdir_A/   x
7    c:/anotherdir_C/   x

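Rows 6 and 7 are kept because the path of row 5 is not contained in them (the a in b test in my code below).

The code I came up with is below, where res is the initial DataFrame: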

to_drop = []
for index, row in res.iterrows():
    a = row['Full Path']
    for idx, row2 in res.iterrows():
        b = row2['Full Path']
        if a != b and a in b:
            to_drop.append(idx)
res2 = res.loc[~res.index.isin(to_drop)]

It works, but the code does not feel 100% Pythonic to me. I am pretty sure there is a more elegant/clever way to do this. Any ideas?

Answer 1

# Raw string avoids invalid-escape warnings; '/' needs no escaping in a Python regex.
pd.concat([df, df['Full Path'].str.extract(r'(.*:/.*?/)')], axis=1)\
  .drop_duplicates([0])\
  .drop(columns=0)

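You can use .str.extract with a regex to extract the base directory, concatenate the extracted column back onto the original df, drop the duplicates of the base directory, and finally drop the extracted column.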
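To make the intermediate step visible, here is a minimal sketch on the sample data from the question. It uses duplicated() on the extracted Series instead of concat/drop_duplicates, which should be equivalent here; the variable names base and result are only for illustration. Note that it keeps the first row seen for each top-level directory, so it assumes the parent row comes before its sub-paths.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'Full Path': ['c:/dir/', 'c:/dir/sub1/', 'c:/dir/sub2/', 'c:/dir/sub2/a',
                  'c:/dir/sub2/b', 'c:/anotherdir/', 'c:/anotherdir_A/',
                  'c:/anotherdir_C/'],
    'Value': ['x'] * 8,
})

# Non-greedy match up to the first '/' after the drive prefix, e.g. 'c:/dir/'.
base = df['Full Path'].str.extract(r'(.*:/.*?/)', expand=False)
print(base.tolist())

# Keep only the first row seen for each base directory.
result = df[~base.duplicated()]
print(result['Full Path'].tolist())
# ['c:/dir/', 'c:/anotherdir/', 'c:/anotherdir_A/', 'c:/anotherdir_C/']
```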

Edit: if the paths are not in order, use this instead:

df[df['Full Path'] == df['Full Path'].str.extract(r'(.*:/.*?/)', expand=False)]
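
The difference between the two variants shows up once a sub-directory appears before its parent. A small sketch with a hypothetical reordered frame (the data and the names by_first and by_equality are made up for illustration):

```python
import pandas as pd

# Hypothetical reordering: the sub-directory comes before its parent.
df = pd.DataFrame({
    'Full Path': ['c:/dir/sub1/', 'c:/dir/', 'c:/anotherdir/'],
    'Value': ['x', 'x', 'x'],
})

base = df['Full Path'].str.extract(r'(.*:/.*?/)', expand=False)

# First-occurrence de-duplication keeps whichever row comes first per base directory.
by_first = df[~base.duplicated()]
print(by_first['Full Path'].tolist())     # ['c:/dir/sub1/', 'c:/anotherdir/']

# Comparing each path with its own base directory keeps only top-level rows,
# regardless of row order.
by_equality = df[df['Full Path'] == base]
print(by_equality['Full Path'].tolist())  # ['c:/dir/', 'c:/anotherdir/']
```

One caveat with the equality variant: if a top-level directory has no row of its own, all of its sub-paths are dropped.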

Answer 2

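The time complexity of this is poor (it checks every path against every other path, so it is quadratic in the number of rows), but here is a one-line solution using str.startswith: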

import pandas as pd

df = pd.DataFrame({'Full Path': ['c:/dir/', 'c:/dir/sub/', 'c:/anotherdir/dir',
                                 'c:/anotherdir/'],
                   'Value': ['A', 'B', 'C', 'D']})

# Keep row b only if some other path a starts with b.
print(df[[any(a.startswith(b) if a != b else False for a in df['Full Path'])
          for b in df['Full Path']]])

Output

        Full Path Value
0         c:/dir/     A
3  c:/anotherdir/     D
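
As a side note, the one-liner above keeps a row only when some other path starts with it, so a path with nothing nested under it (such as rows 6 and 7 in the question) is filtered out as well. Below is a minimal sketch of one possible alternative that drops a row only when its path starts with another row's path. It sorts the paths first, so each path only needs to be compared with the most recently kept one. The helper name drop_nested_paths is hypothetical, it assumes prefix semantics (str.startswith) rather than the substring test in the question, and it treats exact duplicate paths as nested, keeping only one of them.

```python
import pandas as pd

def drop_nested_paths(df, column='Full Path'):
    """Drop rows whose path starts with another row's path (hypothetical helper)."""
    paths = df[column].tolist()
    keep = [False] * len(paths)
    last_kept = None
    # In sorted order, every path starting with a kept path directly follows it.
    for pos in sorted(range(len(paths)), key=lambda i: paths[i]):
        if last_kept is None or not paths[pos].startswith(last_kept):
            keep[pos] = True
            last_kept = paths[pos]
    # A list of booleans acts as a row mask, so the original row order is preserved.
    return df[keep]

# Sample data from the question.
res = pd.DataFrame({
    'Full Path': ['c:/dir/', 'c:/dir/sub1/', 'c:/dir/sub2/', 'c:/dir/sub2/a',
                  'c:/dir/sub2/b', 'c:/anotherdir/', 'c:/anotherdir_A/',
                  'c:/anotherdir_C/'],
    'Value': ['x'] * 8,
})

res2 = drop_nested_paths(res)
print(res2.index.tolist())  # [0, 5, 6, 7]
```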

Answer 1
2 2020-09-14 16:51:25

Answer 2
1 2020-09-14 16:56:01
