
How to efficiently remove overlapping rows during import of csv files using pandas?

I am trying to import csv files with pandas that basically look like this:

File 1:

Date;Time;Value
2019-03-07;20:43;0.051
2019-03-07;20:44;0.048
...
2019-03-07;22:55;0.095
2019-03-07;22:56;0.098  

File 2:

Date;Time;Value
2019-03-07;22:55;0.095
2019-03-07;22:56;0.098    
...
2019-03-08;09:10;0.024
2019-03-08;09:11;0.022

Currently I am importing the data like this:

data = pd.concat([pd.read_csv(open(file),sep=';') for file in files])
data.index = pd.to_datetime(data['Date'] + ' ' + data['Time'])   

Obviously, the overlapping parts of the measurement data now appear twice in my imported data frame, which looks like this when plotted:

[Figure: line plot of the overlapping data]

As I need to evaluate a large number of csv files, I would like to know the most efficient way to handle a situation like this.

I thought of these two options:

  1. Import the files inside a loop and for each file only use the parts where file[i] > file[i-1], i.e. the rows that are newer than the end of the previous file.
  2. Import the files as I do right now and remove the duplicates in an additional step.

Which of these options is more efficient, and is there maybe a more efficient option that I have not thought of?
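Option 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: the files are passed in chronological order, the rows inside each file are sorted by time, and the helper name `read_without_overlap` as well as the in-memory `file_1`/`file_2` stand-ins are hypothetical and not part of the original post.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for the two csv files shown above
# (the elided "..." rows are simply left out).
file_1 = io.StringIO(
    "Date;Time;Value\n"
    "2019-03-07;20:43;0.051\n"
    "2019-03-07;20:44;0.048\n"
    "2019-03-07;22:55;0.095\n"
    "2019-03-07;22:56;0.098\n"
)
file_2 = io.StringIO(
    "Date;Time;Value\n"
    "2019-03-07;22:55;0.095\n"
    "2019-03-07;22:56;0.098\n"
    "2019-03-08;09:10;0.024\n"
    "2019-03-08;09:11;0.022\n"
)

def read_without_overlap(files):
    """Read files in chronological order and drop rows already covered."""
    frames = []
    last_seen = None  # latest timestamp kept so far
    for file in files:
        frame = pd.read_csv(file, sep=';')
        frame.index = pd.to_datetime(frame['Date'] + ' ' + frame['Time'])
        if last_seen is not None:
            # Keep only rows that are newer than everything read before.
            frame = frame[frame.index > last_seen]
        if not frame.empty:
            last_seen = frame.index.max()
            frames.append(frame)
    return pd.concat(frames)

data = read_without_overlap([file_1, file_2])
print(len(data))  # 6 rows: 4 from file 1, 2 new rows from file 2
```

Note that this variant silently relies on the ordering assumptions above; if the files can arrive out of order, removing duplicates after concatenation is the more robust choice.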

As for removing duplicates, pandas has support for this:

import pandas as pd

data = pd.concat([pd.read_csv(file, sep=';') for file in files])
data.index = pd.to_datetime(data['Date'] + ' ' + data['Time'])
# keep only the first row for each timestamp
data = data[~data.index.duplicated()]
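To make the effect of the mask visible, here is a small self-contained sketch on made-up rows that mimic the overlap between the two files. The sample values are taken from the question; the explicit `keep="first"` (which is the default) and the final `sort_index()` are additions for illustration only.

```python
import pandas as pd

index = pd.to_datetime([
    "2019-03-07 22:55", "2019-03-07 22:56",  # end of file 1
    "2019-03-07 22:55", "2019-03-07 22:56",  # start of file 2 (overlap)
    "2019-03-08 09:10",
])
data = pd.DataFrame({"Value": [0.095, 0.098, 0.095, 0.098, 0.024]}, index=index)

# True for every repeated timestamp except its first occurrence.
mask = data.index.duplicated(keep="first")

# Drop the repeats and make sure the result is in chronological order.
deduplicated = data[~mask].sort_index()
print(len(deduplicated))  # 3
```

This only compares timestamps. If two files could report different values for the same minute, `keep="last"` or an explicit aggregation might be more appropriate.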

See also docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.duplicated.html

Regarding the "best" way to do it, that depends on the amount of data, other constraints, etc. It is impossible to answer without more context and would likely be opinion-based anyway.
