
Remove duplicates in pandas time series

I have a csv file with a time series that has the structure: col1: date, col2: value. The csv file has dates from, say, Jan 1st to April 30. I then have a second csv file with the difference that the dates run from Feb 1st until May 31. The values in the second column from February 1st until April 30 are the same in the first and second file. The same goes for a third csv file (March 1st until June 30), a fourth, etc.: the same overlapping structure. I want to read these csv files but retain only unique dates from Jan 1st until, say, December 31, without repeated values. Is there a fast way to do this with Pandas dataframes?

One option is to concatenate the files using pandas pd.concat() and then drop the duplicates:

df = pd.concat([file1, file2, file3])
df = df.drop_duplicates()
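
For instance, assuming the files are named file1.csv, file2.csv, and file3.csv (hypothetical names) and share the col1/col2 layout described in the question, a minimal sketch might be:

import pandas as pd

# Hypothetical file names; substitute your actual paths.
paths = ["file1.csv", "file2.csv", "file3.csv"]

# Parse col1 as dates so duplicate dates compare reliably.
frames = [pd.read_csv(p, parse_dates=["col1"]) for p in paths]

df = pd.concat(frames, ignore_index=True)
df = df.drop_duplicates(subset=["col1"]).sort_values("col1")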

Without more info on your data, I'd probably do something like this:

import pandas as pd

df1, df2, df3 = load_your_data()  # pd.DataFrame objects

concat = pd.concat([df1, df2, df3], axis=0)
dedup = concat.drop_duplicates(subset=['col1'])

This assumes that your repeated dates are indeed duplicates, and that you aren't losing any information by dropping those rows. Otherwise, I'd consider converting the dates to a DatetimeIndex and resampling the data with an appropriate aggregation method, as sketched below.
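
A minimal sketch of that alternative, assuming daily data and that averaging overlapping values is acceptable (an assumption, not something stated in the question):

import pandas as pd

concat = pd.concat([df1, df2, df3], axis=0)
concat["col1"] = pd.to_datetime(concat["col1"])

# Index by date and resample to daily frequency; mean() collapses any
# overlapping dates. Swap in another aggregation if averaging is wrong here.
daily = concat.set_index("col1").resample("D").mean()

Since the question says the overlapping values are identical across files, mean() here simply returns the shared value for each date.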
