How to drop rows of Pandas dataframe with same value based on condition in different column

I am new to Python and Pandas, so please bear with me. I have a rather simple problem to solve, I suppose, but cannot seem to get it right. I have a CSV file that I would like to edit with a pandas dataframe. The data presents flows from home to work locations: each row holds the two locations' ids, their names, their lat/lon coordinates, and a value for the flow.

id_home,name_home,lat_home,lon_home,id_work,work,lat_work,lon_work,value
1001,"Flensburg",54.78879007,9.4459971,1002,"Kiel",54.34189351,10.13048288,695
1001,"Flensburg",54.78879007,9.4459971,1003,"Lübeck, Hansestadt",53.88132436,10.72749774,106
1001,"Flensburg",54.78879007,9.4459971,1004,"Neumünster, Stadt",54.07797524,9.974475148,124
1001,"Flensburg",54.78879007,9.4459971,1051,"Dithmarschen",54.12904835,9.120139194,39
1001,"Flensburg",54.78879007,9.4459971,10,"Schleswig-Holstein",54.212,9.959,7618
1001,"Flensburg",54.78879007,9.4459971,1,"Schleswig-Holstein",54.20896049,9.957114419,7618
1001,"Flensburg",54.78879007,9.4459971,2000,"Hamburg, Freie und Hansestadt",53.57071859,9.943770215,567
1001,"Flensburg",54.78879007,9.4459971,20,"Hamburg",53.575,9.941,567
1001,"Flensburg",54.78879007,9.4459971,2,"Hamburg",53.57071859,9.943770215,567
1003,"Lübeck",53.88132436,10.72749774,100,"Saarland",49.379,6.979,25
1003,"Lübeck",53.88132436,10.72749774,10,"Saarland",54.212,9.959,25
1003,"Lübeck",53.88132436,10.72749774,11000,"Berlin, Stadt",52.50395948,13.39337765,274
1003,"Lübeck",53.88132436,10.72749774,110,"Berlin",52.507,13.405,274
1003,"Lübeck",53.88132436,10.72749774,11,"Berlin",52.50395948,13.39337765,274

I would like to delete all adjacent duplicate rows with the same value and only keep the last row of each group, where id_work is either one or two digits. All other rows should be deleted. How can I achieve this? What I essentially need is the following output:

id_home,name_home,lat_home,lon_home,id_work,work,lat_work,lon_work,value
1001,"Flensburg",54.78879007,9.4459971,1002,"Kiel",54.34189351,10.13048288,695
1001,"Flensburg",54.78879007,9.4459971,1003,"Lübeck, Hansestadt",53.88132436,10.72749774,106
1001,"Flensburg",54.78879007,9.4459971,1004,"Neumünster, Stadt",54.07797524,9.974475148,124
1001,"Flensburg",54.78879007,9.4459971,1051,"Dithmarschen",54.12904835,9.120139194,39
1001,"Flensburg",54.78879007,9.4459971,1,"Schleswig-Holstein",54.20896049,9.957114419,7618
1001,"Flensburg",54.78879007,9.4459971,2,"Hamburg",53.57071859,9.943770215,567
1003,"Lübeck",53.88132436,10.72749774,10,"Saarland",54.212,9.959,25
1003,"Lübeck",53.88132436,10.72749774,11,"Berlin",52.50395948,13.39337765,274

Super thankful for any help!

drop_duplicates has a keep param; set this to 'last':
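The outputs below come from a small illustrative frame rather than the question's CSV; the answer never shows it, so here is a plausible reconstruction (an assumption, chosen so the outputs line up):

In [187]:
import pandas as pd

# hypothetical sample frame: four adjacent rows share value 567, and only
# the last of them has a one-digit id
df = pd.DataFrame({'id':    [345, 12, 200, 20, 2000, 2],
                   'name':  ['name1', 'name2', 'name3', 'name4', 'name5', 'name6'],
                   'value': [456, 220, 567, 567, 567, 567]})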

In [188]:
df.drop_duplicates(subset=['value'], keep='last')

Out[188]:
    id   name  value
0  345  name1    456
1   12  name2    220
5    2  name6    567

Actually I think the following is what you want:

In [197]:
df.drop(df.index[(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) &
                 (df['id'].astype(str).str.len() != 1)])

Out[197]:
    id   name  value
0  345  name1    456
1   12  name2    220
5    2  name6    567

Here we drop the row labels that have duplicated values and where the 'id' length is not 1. A breakdown:

In [198]:
df['value'].duplicated()

Out[198]:
0    False
1    False
2    False
3     True
4     True
5     True
Name: value, dtype: bool

In [199]:
df.loc[df['value'].duplicated(), 'value']

Out[199]:
3    567
4    567
5    567
Name: value, dtype: int64

In [200]:
df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())

Out[200]:
0    False
1    False
2     True
3     True
4     True
5     True
Name: value, dtype: bool

In [201]:

(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)

Out[201]:
0    False
1    False
2     True
3     True
4     True
5    False
dtype: bool

In [202]:
df.index[(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)]

Out[202]:
Int64Index([2, 3, 4], dtype='int64')

So the above uses duplicated to return the duplicated values, unique to return just the unique duplicated values, and isin to test for membership; we cast the 'id' column to str so we can test the length using str.len, and we use the boolean mask to mask the index labels.
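On your actual data the duplicates are adjacent and you always want the last row of each run, so a simpler variant works as well. This is a sketch under the assumption that adjacent rows with equal values always belong to the same duplicate group, as in your sample:

In [203]:
# keep a row only when the next row has a different value, i.e. the last row
# of each adjacent run; the final row compares against NaN and is always kept
df[df['value'] != df['value'].shift(-1)]

On the CSV from the question this keeps exactly the eight rows of the desired output.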

Let's simplify this to the case where you have a single array:

arr = np.array([1, 1, 1, 2, 0, 0, 1, 1, 2, 0, 0, 0, 0, 2, 1, 0, 0, 1, 1, 1])

Now let's generate an array of bools which shows us the places where the values change:

arr[1:] != arr[:-1]
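
For the sample arr above this evaluates to (one element shorter than arr, since it compares each value with its successor):

array([False, False,  True,  True, False,  True, False,  True,  True,
       False, False, False,  True,  True,  True, False,  True, False,
       False])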

That tells us which values we want to keep: the values which are different from the next ones. But it leaves out the last value, which should always be included, so:

mask = np.hstack((arr[1:] != arr[:-1], True))

Now, arr[mask] gives us:

array([1, 2, 0, 1, 2, 0, 2, 1, 0, 1])

And in case you don't believe the last occurrence of each element was selected, you can check mask.nonzero() to get the indexes numerically:

array([ 2,  3,  5,  7,  8, 12, 13, 14, 16, 19])

Now that you know how to generate the mask for a single column, you can simply apply it to your entire dataframe as df[mask].
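
Put together for the question's frame, a minimal sketch (assuming df is ordered as in the CSV and that adjacent rows with equal values always form one duplicate group):

import numpy as np

vals = df['value'].values
# True where a row's value differs from the next row's, plus True for the
# final row: the last row of every adjacent run of equal values
mask = np.hstack((vals[1:] != vals[:-1], True))
result = df[mask]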
