
Fastest way to iterate over Pandas DataFrame and insert a Row

I am building a tool to help automate reviewing of data from several laboratory setups on a weekly basis. A tab-delimited text file is generated each day. Each row represents data taken every 2 seconds, so there are 43,200 rows and many columns (each file is 75 MB).

I am loading the seven text files using pandas.read_csv and extracting only the three columns I need into a pandas dataframe. This is slower than I'd like, but acceptable. Then I plot the data using Plotly offline to view the interactive plot. This is a scheduled task set up to run once a week.
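For reference, the loading step looks roughly like this (the column names here are made up, but the idea is the same):

import pandas as pd

# read only the columns of interest and parse the timestamp up front;
# 'datetime', 'temp', 'pressure' and 'flow' are placeholder names
df = pd.read_csv('setup1.txt', sep='\t',
                 usecols=['datetime', 'temp', 'pressure', 'flow'],
                 parse_dates=['datetime'],
                 dtype={'temp': 'float32', 'pressure': 'float32',
                        'flow': 'float32'})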

The data is plotted vs. date and time. Oftentimes the test setups are offline temporarily and there are gaps in the data. Unfortunately, when this is plotted, all the data points are connected by lines, even if the test was offline for a period of hours or days.

The only way I've found to prevent this is to insert a row with a timestamp between the two rows of actual data and NaN for all the data columns. I have implemented this for a missing data file easily enough; however, I want to generalize it to any gap in the data greater than a certain time period. I came up with a solution that seems to work, but it is REALLY slow:

import datetime
import pandas

# alldata is a pandas dataframe with 302,000 rows and 4 columns:
# one datetime column and three float32 columns

alldata_gaps = pandas.DataFrame()  # new dataframe with gaps in it

# Iterate over all rows. If the datetime difference between two
# consecutive rows is more than one minute, insert a gap row.
for i in range(len(alldata)):
    alldata_gaps = alldata_gaps.append(alldata.iloc[i])
    if (i + 1 < len(alldata) and
            alldata.iloc[i + 1, 0] - alldata.iloc[i, 0] > datetime.timedelta(minutes=1)):
        Series = pandas.Series({'datetime': alldata.iloc[i, 0]
                                + datetime.timedelta(seconds=3)})
        alldata_gaps = alldata_gaps.append(Series)
        print(Series)

Does anyone have a suggestion for how I could speed this operation up so it doesn't take such an obnoxiously long time?

Here's a Dropbox link to an example data file with only 100 lines

Here's a link to my current script without adding the gap rows

Almost certainly your bottleneck is pd.DataFrame.append: each call copies the entire dataframe into a new object, so appending in a loop makes the whole operation quadratic in the number of rows:

alldata_gaps = alldata_gaps.append(alldata.iloc[i])
alldata_gaps = alldata_gaps.append(Series)
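For a sense of the difference, here is a minimal sketch that keeps your same loop but collects plain rows in a Python list and builds the dataframe once at the end, avoiding the repeated copying:

import datetime
import pandas as pd

# sketch: collect plain dicts in a list, build the frame in one go
rows = []
for i in range(len(alldata) - 1):
    rows.append(alldata.iloc[i].to_dict())
    if alldata.iloc[i + 1, 0] - alldata.iloc[i, 0] > datetime.timedelta(minutes=1):
        # gap row: only the datetime is set, other columns become NaN
        rows.append({'datetime': alldata.iloc[i, 0] + datetime.timedelta(seconds=3)})
rows.append(alldata.iloc[-1].to_dict())

alldata_gaps = pd.DataFrame(rows)  # one allocation instead of 302,000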

As an aside, you've confusingly given a variable the same name as the Pandas object pd.Series. It's good practice to avoid such ambiguity.

A much more efficient solution is to:

  1. Identify times after which gaps occur.
  2. Create a single dataframe with data for these times + 3 seconds.
  3. Append to your existing dataframe and sort by time.

So let's have a stab with a sample dataframe:

import numpy as np
import pandas as pd

# example dataframe setup
df = pd.DataFrame({'Date': ['00:10:15', '00:15:20', '00:15:40', '00:16:50', '00:17:55',
                            '00:19:00', '00:19:10', '00:19:15', '00:19:55', '00:20:58'],
                   'Value': list(range(10))})

df['Date'] = pd.to_datetime('2018-11-06-' + df['Date'])

# find gaps greater than 1 minute; total_seconds() handles gaps
# longer than a day, unlike .dt.seconds
bools = (df['Date'].diff().dt.total_seconds() > 60).shift(-1).fillna(False)
idx = bools[bools].index
# Int64Index([0, 2, 3, 4, 8], dtype='int64')

# construct dataframe to append
df_extra = df.loc[idx].copy().assign(Value=np.nan)

# add 3 seconds
df_extra['Date'] = df_extra['Date'] + pd.to_timedelta('3 seconds')

# append to original
res = df.append(df_extra).sort_values('Date')

Result:

print(res)

                 Date  Value
0 2018-11-06 00:10:15    0.0
0 2018-11-06 00:10:18    NaN
1 2018-11-06 00:15:20    1.0
2 2018-11-06 00:15:40    2.0
2 2018-11-06 00:15:43    NaN
3 2018-11-06 00:16:50    3.0
3 2018-11-06 00:16:53    NaN
4 2018-11-06 00:17:55    4.0
4 2018-11-06 00:17:58    NaN
5 2018-11-06 00:19:00    5.0
6 2018-11-06 00:19:10    6.0
7 2018-11-06 00:19:15    7.0
8 2018-11-06 00:19:55    8.0
8 2018-11-06 00:19:58    NaN
9 2018-11-06 00:20:58    9.0
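Note that on pandas 2.0 and newer DataFrame.append has been removed, so the final concatenation step would instead be:

res = pd.concat([df, df_extra]).sort_values('Date')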

My general idea is the same as jpp's answer: instead of iterating over the dataframe (which is slow for the amount of data you have), you should just identify the rows of interest and work with those. The main differences are 1) turning multiple columns to NA, and 2) adjusting the NA row's timestamp to be halfway between the surrounding times.

I've added explanations throughout as comments...

import numpy as np
import pandas as pd

# after you read in your data, make sure the time column is actually a datetime
df['datetime'] = pd.to_datetime(df['datetime'])

# calculate the (time) difference between a row and the previous row
df['time_diff'] = df['datetime'].diff()

# create a subset of your df where the time difference is greater than
# some threshold. This will be a dataframe of your empty/NA rows.
# I've set a 2 second threshold here because of the sample data you provided, 
# but could be any number of seconds
empty = df[df['time_diff'].dt.total_seconds() > 2].copy()

# calculate the correct timestamp for the NA rows: each empty row was
# copied from the row *after* its gap, so move it back by half the gap
empty['datetime'] = empty['datetime'] - (empty['time_diff'] / 2)

# set all the columns to NA apart from the datetime column
empty.loc[:, ~empty.columns.isin(['datetime'])] = np.nan

# append this NA/empty dataframe to your original data, and sort by time
df = df.append(empty, ignore_index=True)
df = df.sort_values('datetime').reset_index(drop=True)

# optionally, remove the time_diff column we created at the beginning
df.drop('time_diff', inplace=True, axis=1)

That will give you something like this:

[image: the resulting dataframe with the inserted NaN gap rows]
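To close the loop on the original plotting problem: once the NaN rows are in place, a Plotly line trace breaks at them, since connectgaps defaults to False. A minimal sketch, assuming a hypothetical value column named 'value':

import plotly.graph_objs as go
import plotly.offline as pyo

# NaN y-values split the line because connectgaps defaults to False
trace = go.Scatter(x=df['datetime'], y=df['value'],
                   mode='lines', connectgaps=False)
pyo.plot([trace], filename='weekly_review.html')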
