
Append rows from a Pandas DataFrame to a new DataFrame

I have a Pandas DataFrame whose first 6 rows look like this:

               Timestamp     u1                 u2                  u3  
0              0             0.00000            23.02712            30.46594   
1              2             0.00000            22.31358            30.10915   
2              4             0.00000            19.10267            25.47093   
3              6             0.00000            18.38913            23.68700   
4              8             0.00000            19.81620            23.68700   
5             10             0.00000            18.03236            21.18952  

This data was captured by a datalogger that is only triggered under certain circumstances. This means the Timestamp values (given in hundredths of a second) do not always follow a strict sequence, and there may be gaps in time when the datalogger is inactive.

I am trying to capture the maximum u3 value, along with the corresponding values in the other columns (i.e. from the same row where the maximum u3 occurs), for every 15-minute interval. Converted to my Timestamp units, this is every 15 x 60 x 100 = 90000 hundredths of a second.
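A quick sanity check of that conversion (assuming the Timestamp unit is indeed hundredths of a second):

window = 15 * 60 * 100   # 15 min x 60 s/min x 100 ticks/s
print(window)            # 90000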

I managed to get the locations of maximum u3 values using the script below (it only prints the index numbers for now):

counter = int(df.Timestamp.max() // 90000) + 1   # number of 15-minute windows
for i in range(counter):
    # rows whose Timestamp falls in the i-th 15-minute window
    df_temp = df[(df.Timestamp >= i*90000) & (df.Timestamp < (i+1)*90000)]
    try:
        print(df_temp["u3"].argmax())
    except ValueError:        # the window contains no rows
        print("NaN")

What I want to do is collect the whole row at each of these locations and append it to a new DataFrame, using i from the script above as the new index. How can I get the entire row (since I know its index through argmax()) and append it to a new DataFrame? There is also the NaN issue: if there is no data in a given interval, the script should insert NaNs for all columns in that row. What would be an easy way to do this?

Thanks!

You could collect, for each window, the rows that have the max u3 value, and use pd.concat to put them back together -

import numpy as np
import pandas as pd

counter = int(df.Timestamp.max() // 90000) + 1   # number of 15-minute windows
collected_dfs = []
for i in range(counter):
    # rows whose Timestamp falls in the i-th 15-minute window
    df_temp = df[(df.Timestamp >= i*90000) & (df.Timestamp < (i+1)*90000)]
    if len(df_temp):
        # keep the row(s) with the maximum u3 in this window
        collected_dfs.append(df_temp[df_temp['u3'] == df_temp['u3'].max()])
    else:
        # no data in this window: add a placeholder row of NaNs
        df_nan = pd.DataFrame({'Timestamp': [i*90000], 'u1': [np.nan], 'u2': [np.nan], 'u3': [np.nan]})
        collected_dfs.append(df_nan)
result = pd.concat(collected_dfs, ignore_index=True)

If the data looks like this:

 Timestamp     u1                 u2                  u3  
 0             0.00000            23.02712            30.46594   
 2             0.00000            22.31358            30.10915   
 4             0.00000            19.10267            25.47093   
 6             0.00000            18.38913            23.68700   
 8             0.00000            19.81620            23.68700   
10             0.00000            18.03236            21.18952
16             1                  2                   3

then

import numpy as np
import pandas as pd

chunksize = 4  # change this to 90000 for the real data
df = pd.read_table('data', sep=r'\s+')

# label each row with the chunk (time window) it falls into
df['index'] = df['Timestamp'] // chunksize

# for each chunk, take the row whose u3 is largest
result = df.loc[df.groupby('index')['u3'].idxmax()]

# reindex on the chunk number so that empty chunks become rows of NaN
N = result['index'].max()
result.set_index('index', inplace=True)
result = result.reindex(index=np.arange(N+1))
print(result)

yields

   Timestamp  u1        u2        u3
0          0   0  23.02712  30.46594
1          4   0  19.10267  25.47093
2          8   0  19.81620  23.68700
3        NaN NaN       NaN       NaN
4         16   1   2.00000   3.00000

I used a chunksize of 4 to make the grouping noticeable on the small dataset; you'll want to change it to 90000 for your real dataset.


The main idea is to compute df['Timestamp']//chunksize and use these values in the call to df.groupby to group the desired rows together.
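For instance, with the sample Timestamps above and chunksize = 4, the grouping key comes out like this (a small standalone sketch, separate from the script above):

import pandas as pd

timestamps = pd.Series([0, 2, 4, 6, 8, 10, 16])
chunksize = 4
print(timestamps // chunksize)
# 0    0
# 1    0
# 2    1
# 3    1
# 4    2
# 5    2
# 6    4
# dtype: int64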

df.groupby('index')['u3'].idxmax()

finds the index labels of the rows with maximum u3 value for each group.
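As a tiny made-up illustration, the result of idxmax is the row label of the maximum, not the maximum value itself:

import pandas as pd

toy = pd.DataFrame({'index': [0, 0, 1, 1],
                    'u3':    [30.5, 30.1, 25.5, 23.7]})
print(toy.groupby('index')['u3'].idxmax())
# index
# 0    0
# 1    2
# Name: u3, dtype: int64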

Inserting NaNs where there is no data is accomplished by making the 'index' column the DataFrame index and then calling reindex.

result = result.reindex(index=np.arange(N+1))
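Its effect is easy to see in isolation (another made-up miniature, assuming integer window numbers as the index):

import numpy as np
import pandas as pd

partial = pd.DataFrame({'u3': [30.46594, 23.68700]}, index=[0, 2])
print(partial.reindex(index=np.arange(3)))
#          u3
# 0  30.46594
# 1       NaN
# 2  23.68700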
