pandas - drop_duplicates not working as expected
Following an answer from here, I am trying to remove the rows of one DataFrame that are also present in another DataFrame.
It works well for this input:
csv1:
sale_date,price,latitude,longitude
Wed May 21 00:00:00 EDT 2008,141000,38.423251,-121.444489
Wed May 21 00:00:00 EDT 2008,146250,38.48742
csv2:
sale_date,price,latitude,longitude
Wed May 21 00:00:00 EDT 2008,146250,38.48742
Code:
>>> a = pd.read_csv('../test.csv', escapechar='\\')
>>> a
sale_date price latitude longitude
0 Wed May 21 00:00:00 EDT 2008 141000 38.423251 -121.444489
1 Wed May 21 00:00:00 EDT 2008 146250 38.487420 NaN
>>> b = pd.read_csv('../test1.csv', escapechar='\\')
>>> b
sale_date price latitude longitude
0 Wed May 21 00:00:00 EDT 2008 146250 38.48742 NaN
>>> pd.concat([a,b]).drop_duplicates(keep=False)
sale_date price latitude longitude
0 Wed May 21 00:00:00 EDT 2008 141000 38.423251 -121.444489
This works as expected. But as soon as there are more rows in the first csv, it doesn't work.
Scenario 2, with an extra row in csv1:
csv1:
sale_date,price,latitude,longitude
Wed May 21 00:00:00 EDT 2008,141000,38.423251,-121.444489
Wed May 21 00:00:00 EDT 2008,146250,38.48742
Wed May 21 00:00:00 EDT 2008,147308,38.658246a,-121.375469a
csv2:
sale_date,price,latitude,longitude
Wed May 21 00:00:00 EDT 2008,146250,38.48742
Code:
>>> a = pd.read_csv('../test.csv', escapechar='\\')
>>> a
sale_date price latitude longitude
0 Wed May 21 00:00:00 EDT 2008 141000 38.423251 -121.444489
1 Wed May 21 00:00:00 EDT 2008 146250 38.48742 NaN
2 Wed May 21 00:00:00 EDT 2008 147308 38.658246a -121.375469a
>>> b = pd.read_csv('../test1.csv', escapechar='\\')
>>> b
sale_date price latitude longitude
0 Wed May 21 00:00:00 EDT 2008 146250 38.48742 NaN
>>> pd.concat([a,b]).drop_duplicates(keep=False)
sale_date price latitude longitude
0 Wed May 21 00:00:00 EDT 2008 141000 38.423251 -121.444489
1 Wed May 21 00:00:00 EDT 2008 146250 38.48742 NaN
2 Wed May 21 00:00:00 EDT 2008 147308 38.658246a -121.375469a
0 Wed May 21 00:00:00 EDT 2008 146250 38.4874 NaN
Notice that it also changed the latitude value of the second copy of the duplicated row in the concatenated result from 38.48742 to 38.4874.
Am I missing something here, or does pandas have a bug?
As @ayhan commented, the problem is that DataFrame a has strings mixed in with the numbers in the latitude and longitude columns, so every value in those columns is read as a string.
In the other DataFrame, the same columns are parsed as floats by default. The string '38.48742' and the float 38.48742 do not compare equal, so the two rows are not detected as duplicates.
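The dtype difference can be checked directly. The following is a minimal sketch, assuming the two files from scenario 2 are inlined as strings through io.StringIO instead of being read from disk (the names csv1, csv2, lat_a and lat_b are only illustrative):

```python
import io

import pandas as pd

# Scenario 2 data, inlined so the example is self-contained
csv1 = """sale_date,price,latitude,longitude
Wed May 21 00:00:00 EDT 2008,141000,38.423251,-121.444489
Wed May 21 00:00:00 EDT 2008,146250,38.48742
Wed May 21 00:00:00 EDT 2008,147308,38.658246a,-121.375469a
"""
csv2 = """sale_date,price,latitude,longitude
Wed May 21 00:00:00 EDT 2008,146250,38.48742
"""

a = pd.read_csv(io.StringIO(csv1))
b = pd.read_csv(io.StringIO(csv2))

# Compare how each frame parsed its columns
print(a.dtypes)
print(b.dtypes)

# The "same" latitude is a Python str in a and a float in b
lat_a = a.loc[1, 'latitude']
lat_b = b.loc[0, 'latitude']
print(type(lat_a), type(lat_b))
```

Printing dtypes for both frames before concatenating is a quick way to spot this kind of mismatch.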
One possible solution is to use the dtype parameter when reading DataFrame b:
b = pd.read_csv('../test1.csv', escapechar='\\', dtype={'latitude':str, 'longitude':str})
df = pd.concat([a,b]).drop_duplicates(keep=False)
print (df)
sale_date price latitude longitude
0 Wed May 21 00:00:00 EDT 2008 141000 38.423251 -121.444489
2 Wed May 21 00:00:00 EDT 2008 147308 38.658246a -121.375469a
Or use to_numeric on the columns of a (note that errors='ignore' is deprecated in recent pandas versions):
a['latitude'] = pd.to_numeric(a['latitude'], errors='ignore')
a['longitude'] = pd.to_numeric(a['longitude'], errors='ignore')
df = pd.concat([a,b]).drop_duplicates(keep=False)
print (df)
sale_date price latitude longitude
0 Wed May 21 00:00:00 EDT 2008 141000 38.423251 -121.444489
2 Wed May 21 00:00:00 EDT 2008 147308 38.658246a -121.375469a
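A variation on the dtype approach is to read every column of both files as text, so the two frames cannot disagree on types. This is only a sketch under the assumption that comparing the raw text of each field is acceptable; the data is again inlined with io.StringIO, and csv1 and csv2 are illustrative names:

```python
import io

import pandas as pd

csv1 = """sale_date,price,latitude,longitude
Wed May 21 00:00:00 EDT 2008,141000,38.423251,-121.444489
Wed May 21 00:00:00 EDT 2008,146250,38.48742
Wed May 21 00:00:00 EDT 2008,147308,38.658246a,-121.375469a
"""
csv2 = """sale_date,price,latitude,longitude
Wed May 21 00:00:00 EDT 2008,146250,38.48742
"""

# dtype=str keeps every column as text in both frames;
# missing fields are still read as NaN
a = pd.read_csv(io.StringIO(csv1), dtype=str)
b = pd.read_csv(io.StringIO(csv2), dtype=str)

# keep=False drops every row that occurs more than once
df = pd.concat([a, b]).drop_duplicates(keep=False)
print(df)
```

With this approach price is also text, so it would need to be converted back (for example with astype) before doing arithmetic. Keep in mind that keep=False also removes rows that are duplicated within a itself.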