pandas dataframe time series drop duplicates
I am trying to update a temperature time series by combining two CSV files that may sometimes contain duplicate rows.
I have tried to use drop_duplicates, but it's not working for me.
Here is an example of what I'm trying to do:
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
dfA = DataFrame({'date' : Series(['1/1/10','1/2/10','1/3/10','1/4/10'], index=[0,1,2,3]),
'a' : Series([60,57,56,50], index=[0,1,2,3]),
'b' : Series([80,73,76,56], index=[0,1,2,3])})
print("dfA")
print(dfA)
dfB = DataFrame({'date' : Series(['1/3/10','1/4/10','1/5/10','1/6/10'], index=[0,1,2,3]),
'a' : Series([56,50,59,75], index=[0,1,2,3]),
'b' : Series([76,56,73,89], index=[0,1,2,3])})
print("dfB")
print(dfB)
dfC = dfA.append(dfB)
print(dfC.duplicated())
dfC.drop_duplicates()
print("dfC")
print(dfC)
And this is the output:
dfA
a b date
0 60 80 1/1/10
1 57 73 1/2/10
2 56 76 1/3/10
3 50 56 1/4/10
dfB
a b date
0 56 76 1/3/10
1 50 56 1/4/10
2 59 73 1/5/10
3 75 89 1/6/10
0 False
1 False
2 False
3 False
0 True
1 True
2 False
3 False
dtype: bool
dfC
a b date
0 60 80 1/1/10
1 57 73 1/2/10
2 56 76 1/3/10
3 50 56 1/4/10
0 56 76 1/3/10
1 50 56 1/4/10
2 59 73 1/5/10
3 75 89 1/6/10
How do I update a time series with overlapping data and not end up with duplicates?
The line dfC.drop_duplicates() does not actually change the DataFrame that dfC is bound to; it just returns a copy of it with the duplicate rows removed.
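To see this behaviour in isolation, here is a minimal sketch. The three-row frame is made up for illustration (it is not the data from the question), and the name result is just a placeholder.

```python
import pandas as pd

# Illustrative frame: the third row is an exact duplicate of the first.
dfC = pd.DataFrame({"date": ["1/3/10", "1/4/10", "1/3/10"],
                    "a": [56, 50, 56],
                    "b": [76, 56, 76]})

# drop_duplicates() returns a new DataFrame; dfC itself is left untouched.
result = dfC.drop_duplicates()

print(len(dfC))     # 3, the original still holds the duplicate row
print(len(result))  # 2, only the returned frame is de-duplicated
```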
You can either have the DataFrame dfC modified in place by passing the inplace keyword argument,
dfC.drop_duplicates(inplace=True)
or rebind the name dfC to the de-duplicated DataFrame that the call returns, like this:
dfC = dfC.drop_duplicates()
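Putting it together for the data in the question, a sketch along these lines should work on current pandas. A few things here are assumptions rather than part of the original answer: pd.concat is used to combine the frames because DataFrame.append is no longer available in pandas 2.0 and later, the date strings are assumed to be month/day/two-digit-year, and de-duplicating on the date column alone (keeping the row from the newer file) is one possible policy for a time series, not the only one.

```python
import pandas as pd

dfA = pd.DataFrame({"date": ["1/1/10", "1/2/10", "1/3/10", "1/4/10"],
                    "a": [60, 57, 56, 50],
                    "b": [80, 73, 76, 56]})
dfB = pd.DataFrame({"date": ["1/3/10", "1/4/10", "1/5/10", "1/6/10"],
                    "a": [56, 50, 59, 75],
                    "b": [76, 56, 73, 89]})

# Stack the two frames; ignore_index=True gives a fresh 0..n-1 index
# instead of the repeated 0..3, 0..3 seen in the question's output.
dfC = pd.concat([dfA, dfB], ignore_index=True)

# Assumption: dates are month/day/two-digit-year.
dfC["date"] = pd.to_datetime(dfC["date"], format="%m/%d/%y")

# Assumption: one row per date is wanted, and the later file wins on overlap.
dfC = (dfC.drop_duplicates(subset="date", keep="last")
          .sort_values("date")
          .reset_index(drop=True))

print(dfC)
```

If only fully identical rows should be removed, as in the original answer, calling drop_duplicates() without the subset argument is enough.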