
Fastest way to iterate over Pandas DataFrame and insert a Row

I am building a tool to help automate weekly review of data from several laboratory setups. A tab-delimited text file is generated each day. Each row represents data taken every 2 seconds, so there are 43,200 rows and many columns (each file is 75 MB).

I am loading the seven text files using pandas.read_csv and extracting only the three columns I need into a pandas dataframe. This is slower than I'd like but acceptable. Then I plot the data using Plotly offline to view the interactive plot. This is a scheduled task set up to run once a week.
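Roughly, the loading step looks like this (the file names and column names below are placeholders, not the real ones):

import pandas as pd

columns_of_interest = ['datetime', 'temp1', 'temp2', 'temp3']   # placeholder column names
files = ['day1.txt', 'day2.txt', 'day3.txt', 'day4.txt',
         'day5.txt', 'day6.txt', 'day7.txt']                    # placeholder file names

# read only the needed columns from each tab-delimited file, then stack the seven days
alldata = pd.concat(
    (pd.read_csv(f, sep='\t', usecols=columns_of_interest, parse_dates=['datetime'])
     for f in files),
    ignore_index=True)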

The data is plotted against date and time. Often the test setups are temporarily offline, so there are gaps in the data. Unfortunately, when this is plotted all the data points are connected by lines, even if the test was offline for a period of hours or days.
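As far as I can tell, Plotly only breaks a line at NaN/None values (the connectgaps attribute of a scatter trace defaults to False), so consecutive real points get joined no matter how far apart in time they are. A tiny sketch of the behaviour:

import plotly.offline as pyo
import plotly.graph_objs as go

x = ['2018-11-06 00:10:00', '2018-11-06 00:10:03', '2018-11-06 12:00:00']
joined = go.Scatter(x=[x[0], x[2]], y=[1.0, 2.0], name='no gap row: 12 h gap is bridged')
broken = go.Scatter(x=x, y=[1.0, None, 2.0], name='NaN row at 00:10:03: line breaks')
pyo.plot([joined, broken], filename='gap_demo.html')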

The only way I have found to prevent this is to insert a row between each pair of rows that straddle a gap, with a datetime inside the gap and NaN for all the data columns. I have implemented this easily enough for a missing data file; however, I want to generalize it to any gap in the data greater than a certain time period. I came up with a solution that seems to work, but it is REALLY slow:

import datetime
import pandas

# alldata is a pandas dataframe with 302,000 rows and 4 columns:
# one datetime column and three float32 columns

alldata_gaps = pandas.DataFrame()  # new dataframe with gaps in it

# iterate over all rows. If the datetime difference between
# two consecutive rows is more than one minute, insert a gap row.

for i in range(0, len(alldata)):
    alldata_gaps = alldata_gaps.append(alldata.iloc[i])
    # the i + 1 guard keeps iloc[i+1] from running past the last row
    if i + 1 < len(alldata) and \
            alldata.iloc[i+1, 0] - alldata.iloc[i, 0] > datetime.timedelta(minutes=1):
        Series = pandas.Series({'datetime': alldata.iloc[i, 0]
                                + datetime.timedelta(seconds=3)})
        alldata_gaps = alldata_gaps.append(Series)
        print(Series)

Does anyone have a suggestion how I could speed this operation up so it doesn't take such an obnoxiously long time?

Here's a dropbox link to an example data file with only 100 lines

Here's a link to my current script without adding the gap rows

Almost certainly your bottleneck is from pd.DataFrame.append:

alldata_gaps = alldata_gaps.append(alldata.iloc[i])
alldata_gaps = alldata_gaps.append(Series)

As an aside, you've confusingly named a variable the same as a Pandas object, pd.Series. It's good practice to avoid such ambiguity.
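append copies the entire accumulated frame on every call, so appending row by row in a loop scales quadratically with the number of rows. Even if you kept the loop, collecting the pieces in a list and concatenating once at the end avoids that cost. A rough sketch (assuming, as in your snippet, that the datetime column is named 'datetime' and is the first column); the approach below drops the loop entirely:

import pandas as pd

pieces = []
for i in range(len(alldata) - 1):
    pieces.append(alldata.iloc[[i]])                      # the original row, as a 1-row frame
    gap = alldata.iloc[i + 1, 0] - alldata.iloc[i, 0]
    if gap > pd.Timedelta(minutes=1):
        # a NaN row 3 seconds after the last reading before the gap
        pieces.append(pd.DataFrame(
            {'datetime': [alldata.iloc[i, 0] + pd.Timedelta(seconds=3)]}))
pieces.append(alldata.iloc[[-1]])                         # don't forget the final row

alldata_gaps = pd.concat(pieces, ignore_index=True, sort=False)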

A much more efficient solution is to:

  1. Identify times after which gaps occur.
  2. Create a single dataframe with data for these times + 3 seconds.
  3. Append to your existing dataframe and sort by time.

So let's have a stab with a sample dataframe:

import numpy as np
import pandas as pd

# example dataframe setup
df = pd.DataFrame({'Date': ['00:10:15', '00:15:20', '00:15:40', '00:16:50', '00:17:55',
                            '00:19:00', '00:19:10', '00:19:15', '00:19:55', '00:20:58'],
                   'Value': list(range(10))})

df['Date'] = pd.to_datetime('2018-11-06-' + df['Date'])

# find gaps greater than 1 minute
bools = (df['Date'].diff().dt.seconds > 60).shift(-1).fillna(False)
idx = bools[bools].index
# Int64Index([0, 2, 3, 4, 8], dtype='int64')

# construct dataframe to append
df_extra = df.loc[idx].copy().assign(Value=np.nan)

# add 3 seconds
df_extra['Date'] = df_extra['Date'] + pd.to_timedelta('3 seconds')

# append to original
res = df.append(df_extra).sort_values('Date')

Result:

print(res)

                 Date  Value
0 2018-11-06 00:10:15    0.0
0 2018-11-06 00:10:18    NaN
1 2018-11-06 00:15:20    1.0
2 2018-11-06 00:15:40    2.0
2 2018-11-06 00:15:43    NaN
3 2018-11-06 00:16:50    3.0
3 2018-11-06 00:16:53    NaN
4 2018-11-06 00:17:55    4.0
4 2018-11-06 00:17:58    NaN
5 2018-11-06 00:19:00    5.0
6 2018-11-06 00:19:10    6.0
7 2018-11-06 00:19:15    7.0
8 2018-11-06 00:19:55    8.0
8 2018-11-06 00:19:58    NaN
9 2018-11-06 00:20:58    9.0
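Note that DataFrame.append was deprecated and later removed in pandas 2.x; on current versions the same append-and-sort step can be written with pd.concat:

res = pd.concat([df, df_extra]).sort_values('Date')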

My general idea is the same as jpp's answer: instead of iterating over the dataframe (which is slow for the amount of data you have), you should just identify the rows of interest and work with those. The main differences are 1) turning multiple columns to NA and 2) adjusting the NA row's timestamp to be halfway between the surrounding times.

I've added explanations throughout as comments...

import numpy as np
import pandas as pd

# after you read in your data, make sure the time column is actually a datetime
df['datetime'] = pd.to_datetime(df['datetime'])

# calculate the (time) difference between a row and the previous row
df['time_diff'] = df['datetime'].diff()

# create a subset of your df where the time difference is greater than
# some threshold. This will be a dataframe of your empty/NA rows.
# I've set a 2 second threshold here because of the sample data you provided, 
# but could be any number of seconds
empty = df[df['time_diff'].dt.total_seconds() > 2].copy()

# calculate the correct timestamp for the NA rows: halfway into the gap
# that precedes each of these rows
empty['datetime'] = empty['datetime'] - (empty['time_diff'] / 2)

# set all the columns to NA apart from the datetime column
empty.loc[:, ~empty.columns.isin(['datetime'])] = np.nan

# append this NA/empty dataframe to your original data, and sort by time
df = df.append(empty, ignore_index=True)
df = df.sort_values('datetime').reset_index(drop=True)

# optionally, remove the time_diff column we created at the beginning
df.drop('time_diff', inplace=True, axis=1)

That will give you something like this:

[screenshot of the resulting dataframe, with NaN rows inserted at the midpoints of the gaps]
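For reference, here is a minimal self-contained run of the steps above, on made-up data with two hypothetical value columns and a 60-second threshold (and pd.concat in place of the append call, which newer pandas no longer has):

import numpy as np
import pandas as pd

df = pd.DataFrame({'datetime': pd.to_datetime(['2018-11-06 00:10:15', '2018-11-06 00:15:20',
                                               '2018-11-06 00:15:40', '2018-11-06 00:16:50']),
                   'value_a': [1.0, 2.0, 3.0, 4.0],
                   'value_b': [10.0, 20.0, 30.0, 40.0]})

df['time_diff'] = df['datetime'].diff()
empty = df[df['time_diff'].dt.total_seconds() > 60].copy()
empty['datetime'] = empty['datetime'] - (empty['time_diff'] / 2)
empty.loc[:, ~empty.columns.isin(['datetime'])] = np.nan

df = pd.concat([df, empty], ignore_index=True)
df = df.sort_values('datetime').reset_index(drop=True)
df.drop('time_diff', inplace=True, axis=1)
print(df)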
