How to drop Pandas DataFrame rows with condition to keep specific column value
How to drop rows of Pandas dataframe with same value based on condition in different column
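I am new to Python and Pandas, so please bear with me. I think I have a fairly simple problem, but I cannot seem to solve it. I have a CSV file that I want to edit with a Pandas DataFrame. The data shows commuter flows from home to work locations, with each location's id and latitude/longitude coordinates, plus a value for each flow.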
id_home,name_home,lat_home,lon_home,id_work,work,lat_work,lon_work,value
1001,"Flensburg",54.78879007,9.4459971,1002,"Kiel",54.34189351,10.13048288,695
1001,"Flensburg",54.78879007,9.4459971,1003,"Lübeck, Hansestadt",53.88132436,10.72749774,106
1001,"Flensburg",54.78879007,9.4459971,1004,"Neumünster, Stadt",54.07797524,9.974475148,124
1001,"Flensburg",54.78879007,9.4459971,1051,"Dithmarschen",54.12904835,9.120139194,39
1001,"Flensburg",54.78879007,9.4459971,10,"Schleswig-Holstein",54.212,9.959,7618
1001,"Flensburg",54.78879007,9.4459971,1,"Schleswig-Holstein",54.20896049,9.957114419,7618
1001,"Flensburg",54.78879007,9.4459971,2000,"Hamburg, Freie und Hansestadt",53.57071859,9.943770215,567
1001,"Flensburg",54.78879007,9.4459971,20,"Hamburg",53.575,9.941,567
1001,"Flensburg",54.78879007,9.4459971,2,"Hamburg",53.57071859,9.943770215,567
1003,"Lübeck",53.88132436,10.72749774,100,"Saarland",49.379,6.979,25
1003,"Lübeck",53.88132436,10.72749774,10,"Saarland",54.212,9.959,25
1003,"Lübeck",53.88132436,10.72749774,11000,"Berlin, Stadt",52.50395948,13.39337765,274
1003,"Lübeck",53.88132436,10.72749774,110,"Berlin",52.507,13.405,274
1003,"Lübeck",53.88132436,10.72749774,11,"Berlin",52.50395948,13.39337765,274
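I want to drop all adjacent duplicate rows that share the same value and keep only the last row of each group, which is the one where id_work has one or two digits. The other rows in each duplicate group should be dropped. How can I do this? What I basically need is the following output: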
id_home,name_home,lat_home,lon_home,id_work,work,lat_work,lon_work,value
1001,"Flensburg",54.78879007,9.4459971,1002,"Kiel",54.34189351,10.13048288,695
1001,"Flensburg",54.78879007,9.4459971,1003,"Lübeck, Hansestadt",53.88132436,10.72749774,106
1001,"Flensburg",54.78879007,9.4459971,1004,"Neumünster, Stadt",54.07797524,9.974475148,124
1001,"Flensburg",54.78879007,9.4459971,1051,"Dithmarschen",54.12904835,9.120139194,39
1001,"Flensburg",54.78879007,9.4459971,1,"Schleswig-Holstein",54.20896049,9.957114419,7618
1001,"Flensburg",54.78879007,9.4459971,2,"Hamburg",53.57071859,9.943770215,567
1003,"Lübeck",53.88132436,10.72749774,10,"Saarland",54.212,9.959,25
1003,"Lübeck",53.88132436,10.72749774,11,"Berlin",52.50395948,13.39337765,274
Any help is greatly appreciated!
drop_duplicates has a keep parameter; set it to 'last':
In [188]:
df.drop_duplicates(subset=['value'], keep='last')
Out[188]:
id name value
0 345 name1 456
1 12 name2 220
5 2 name6 567
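The frame used in this session is not shown in the post, so here is a minimal self-contained sketch of the same call. The ids and names of rows 2 to 4 are assumptions chosen only to be consistent with the outputs shown; the real values may differ.

```python
import pandas as pd

# Hypothetical reconstruction of the answer's frame: rows 0, 1 and 5 match
# the printed output; the ids and names of rows 2-4 are made up.
df = pd.DataFrame({
    "id": [345, 12, 1000, 100, 10, 2],
    "name": ["name1", "name2", "name3", "name4", "name5", "name6"],
    "value": [456, 220, 567, 567, 567, 567],
})

# Keep only the last row for every distinct 'value'.
deduped = df.drop_duplicates(subset=["value"], keep="last")
print(deduped)
```

Note that keep='last' looks at all rows with the same value, not only adjacent ones, so it fits only if a value never reappears later in the file.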
Actually, I think the following is what you want:
In [197]:
df.drop(df.index[(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)])
Out[197]:
id name value
0 345 name1 456
1 12 name2 220
5 2 name6 567
Here we drop the row labels whose value is duplicated and whose 'id' is not one character long. Breaking it down:
In [198]:
df['value'].duplicated()
Out[198]:
0 False
1 False
2 False
3 True
4 True
5 True
Name: value, dtype: bool
In [199]:
df.loc[df['value'].duplicated(), 'value']
Out[199]:
3 567
4 567
5 567
Name: value, dtype: int64
In [200]:
df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())
Out[200]:
0 False
1 False
2 True
3 True
4 True
5 True
Name: value, dtype: bool
In [201]:
(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)
Out[201]:
0 False
1 False
2 True
3 True
4 True
5 False
dtype: bool
In [202]:
df.index[(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)]
Out[202]:
Int64Index([2, 3, 4], dtype='int64')
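So the above uses duplicated to flag the repeated values, unique to reduce them to the distinct repeated values, and isin to test membership. We convert the 'id' column to str so that we can test its length with str.len, and then use the boolean mask to select the index labels to drop.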
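The same steps can be written with named intermediates, which may be easier to read than the one-liner. This is only a sketch: the frame is a hypothetical reconstruction (the ids and names of rows 2 to 4 are assumptions), and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical frame consistent with the outputs shown in the session;
# the ids and names of rows 2-4 are assumptions.
df = pd.DataFrame({
    "id": [345, 12, 1000, 100, 10, 2],
    "name": ["name1", "name2", "name3", "name4", "name5", "name6"],
    "value": [456, 220, 567, 567, 567, 567],
})

# Distinct values that occur more than once in the column.
dup_values = df.loc[df["value"].duplicated(), "value"].unique()

# Rows whose value is one of those repeated values.
has_dup_value = df["value"].isin(dup_values)

# Rows whose id is not a single character when rendered as a string.
id_not_single_digit = df["id"].astype(str).str.len() != 1

# Index labels to drop: both conditions hold.
to_drop = df.index[has_dup_value & id_not_single_digit]
result = df.drop(to_drop)
print(result)
```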
Let's simplify this to the case of a single array:
import numpy as np
arr = np.array([1, 1, 1, 2, 0, 0, 1, 1, 2, 0, 0, 0, 0, 2, 1, 0, 0, 1, 1, 1])
Now let's generate a boolean array that shows where the value changes:
arr[1:] != arr[:-1]
This tells us which values to keep: those that differ from the next one. But it leaves out the last value, which should always be included, so:
mask = np.hstack((arr[1:] != arr[:-1], True))
Now arr[mask] gives us:
array([1, 2, 0, 1, 2, 0, 2, 1, 0, 1])
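If you are not convinced that the last element of each run of equal values is selected, you can check mask.nonzero()[0] to get the indices numerically: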
array([ 2, 3, 5, 7, 8, 12, 13, 14, 16, 19])
Now that you know how to generate the mask for a single column, you can simply apply it to the whole DataFrame as df[mask].
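As a rough sketch of that last step, here is the mask applied to the question's data. To keep it short, the coordinate and name_home columns are left out, which is a simplification of mine and not part of the original answer; the approach also assumes the duplicates are always adjacent and never span two different id_home groups.

```python
import io

import numpy as np
import pandas as pd

# The question's rows, trimmed to four columns for brevity (an assumption
# of this sketch; the full file has nine columns).
csv_text = """id_home,id_work,work,value
1001,1002,"Kiel",695
1001,1003,"Lübeck, Hansestadt",106
1001,1004,"Neumünster, Stadt",124
1001,1051,"Dithmarschen",39
1001,10,"Schleswig-Holstein",7618
1001,1,"Schleswig-Holstein",7618
1001,2000,"Hamburg, Freie und Hansestadt",567
1001,20,"Hamburg",567
1001,2,"Hamburg",567
1003,100,"Saarland",25
1003,10,"Saarland",25
1003,11000,"Berlin, Stadt",274
1003,110,"Berlin",274
1003,11,"Berlin",274
"""
df = pd.read_csv(io.StringIO(csv_text))

# True where a row's value differs from the next row's value; the final
# row is always kept.
values = df["value"].to_numpy()
mask = np.hstack((values[1:] != values[:-1], True))

result = df[mask]
print(result)
```

If runs could cross from one id_home to the next, comparing id_home as well as value when building the mask would be one way to guard against that.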