
Python: Faster way to insert rows into a DataFrame at specific locations?

I've got a DataFrame of ~40,000 rows. The DataFrame looks roughly like:

             Unix Time                           UTC  Val. 1  Val. 2  Val. 3
1    1518544176.927486    2018-02-13 17:49:36.927486    5.00    0.25    2.00
2    1518544176.929897    2018-02-13 17:49:36.929897    4.50    1.00    3.00
3    1518544176.932310    2018-02-13 17:49:36.932310    4.00    0.75    1.75
...

Columns 0 and 2-4 are of type <class 'numpy.float64'>, and column 1 is of type <class 'pandas._libs.tslib.Timestamp'>. Plotting any of the data columns against time shows a waveform. However, there are occasional breaks in the acquisition. For instance, we might have:

               Unix Time                           UTC  Val. 1  Val. 2  Val. 3
576    1518544181.755085    2018-02-13 17:49:41.755085    0.10    0.01    0.93
577    1518544182.041129    2018-02-13 17:49:42.041129    0.11    0.02    0.95
...

As one can see, there's a ~0.3 s gap between readings 576 and 577. The problem is that when plotting the data, matplotlib connects the dots, even where there's no data. This "problem" has been addressed in other questions on Stack Overflow and online at large, and though I'm not in love with... well, any of them, the best option seems to be to insert NaNs into the data gaps. Since matplotlib doesn't plot NaNs, it's a sneaky way to trick it into making the plot more realistic.
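
(As a quick standalone illustration of the trick, with made-up data rather than the question's: matplotlib simply leaves a break in the line wherever it meets a NaN.)

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(6)
y = np.array([1.0, 2.0, np.nan, np.nan, 2.0, 1.0])
plt.plot(x, y)  # no segments are drawn through the NaNs at x = 2 and x = 3
plt.show()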

To do this, I start by finding the time delta between the first two readings (this is safe) and using twice that value as my metric for "is there a gap?". I then iterate through the DataFrame, checking for gaps. Upon finding one, I create a temporary row with NaNs in the data columns and time values right in the middle of the acquisition gap in the time columns, then rebuild the DataFrame from the old one plus this row. This can be seen here:

import numpy as np
import pandas as pd

# Twice the delta-t of the first two readings is the allowed data gap.
allowed_gap = 2.0 * (df['Unix Time'].iloc[1] - df['Unix Time'].iloc[0])

df2 = df.copy()
offset = 0  # number of rows inserted into df2 so far
for i, row in df.iterrows():
    # The following code checks the delta-t of all timestamp pairs.
    # We have i > 0 because there is no previous reading on the first entry.
    if i > 0:
        delta_t_unix = row['Unix Time'] - prev_timestamp_unix
        delta_t_utc = row['UTC'] - prev_timestamp_utc
        # If delta_t_unix > the allowed data gap, add new timestamps and NaNs.
        if delta_t_unix > allowed_gap:
            time_unix = row['Unix Time'] - (delta_t_unix / 2.0)
            time_utc = row['UTC'] - (delta_t_utc / 2.0)
            new_row = pd.DataFrame({'Unix Time': time_unix, 'UTC': time_utc,
                                    'Val. 1': np.nan, 'Val. 2': np.nan,
                                    'Val. 3': np.nan}, index=[i])
            # .iloc replaces the deprecated .ix; the offset accounts for rows
            # already inserted, which shift every later position in df2.
            pos = i + offset
            df2 = pd.concat([df2.iloc[:pos], new_row,
                             df2.iloc[pos:]]).reset_index(drop=True)
            offset += 1
    # Save this row's timestamps for comparison on the next iteration.
    prev_timestamp_unix = row['Unix Time']
    prev_timestamp_utc = row['UTC']
# Restore the original column order.
df2 = df2[['Unix Time', 'UTC', 'Val. 1', 'Val. 2', 'Val. 3']]

This currently takes ~4.5 seconds, thanks to this question (it used to take ~6.5 s, as I was foolishly iterating through the DataFrame, creating new lists of each column, and then creating a new DataFrame out of those). However, that's still far slower than I'd expect or prefer. Does anyone have any ideas on how to speed this up? I'm still quite new to Pandas and DataFrames, so I'm sure this could be done better. Thanks!

EDIT: Worth mentioning that if I remove the datetime column, the run time is cut roughly in half (though unfortunately I can't remove that column in practice).

You can instead resample to 2.4 ms; bins that contain no readings come out of .mean() as NaN, which matplotlib then skips. Using the question's column names, something like:

df['UTC'] = pd.to_datetime(df['UTC'])
df.set_index('UTC')[['Val. 1', 'Val. 2', 'Val. 3']].resample('2.4ms', loffset='1.2ms').mean().reset_index()
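
Note that loffset was deprecated in pandas 1.1, so on newer versions a rough equivalent (a sketch assuming the same column names as above) is to shift the resulting index instead:

resampled = df.set_index('UTC')[['Val. 1', 'Val. 2', 'Val. 3']].resample('2.4ms').mean()
resampled.index = resampled.index + pd.Timedelta('1.2ms')  # recenter each bin
resampled = resampled.reset_index()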

Some things that can speed this up:

  1. Use df.itertuples() instead of df.iterrows(); it will probably improve the execution time somewhat, since itertuples yields plain named tuples rather than constructing a Series for every row.

If you can post the improvement you see, I would appreciate it; in my experience the difference was huge in the cases I tested (a 10x improvement, with much simpler instructions inside the loop).

  2. Instead of using pd.concat to put rows together, create a list of tuples and then generate a DataFrame from this list only after the loop:

     list_of_values = []  # initialized once, before the loop
     for i, unix_time, utc_time, val1, val2, val3 in df.itertuples():
         if i > 0:
             delta_t_unix = unix_time - prev_timestamp_unix
             delta_t_utc = utc_time - prev_timestamp_utc
             if delta_t_unix > allowed_gap:
                 new_time_unix = unix_time - (delta_t_unix / 2.0)
                 new_time_utc = utc_time - (delta_t_utc / 2.0)
                 list_of_values.append((new_time_unix, new_time_utc,
                                        np.nan, np.nan, np.nan))
         # Set the previous timestamps for use at the start of the next iteration.
         prev_timestamp_unix = unix_time
         prev_timestamp_utc = utc_time
         list_of_values.append((unix_time, utc_time, val1, val2, val3))
     df2 = pd.DataFrame(list_of_values, columns=['Unix Time', 'UTC',
                                                 'Val. 1', 'Val. 2', 'Val. 3'])

That will probably speed things up quite a lot.
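
For completeness, the loop can also be avoided entirely. Here is a minimal vectorized sketch (not from either answer; it assumes the same df and allowed_gap as in the question) that finds every gap at once with diff() and appends all the midpoint NaN rows in a single concat:

import numpy as np
import pandas as pd

is_gap = df['Unix Time'].diff() > allowed_gap             # True on rows that follow a gap
mid = df[is_gap].copy()
mid['Unix Time'] -= df['Unix Time'].diff()[is_gap] / 2.0  # midpoint timestamps
mid['UTC'] -= df['UTC'].diff()[is_gap] / 2.0
mid[['Val. 1', 'Val. 2', 'Val. 3']] = np.nan
df2 = pd.concat([df, mid]).sort_values('Unix Time').reset_index(drop=True)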
