I have a Pandas dataframe for which the first 6 lines look like below:
       Timestamp       u1        u2        u3
    0          0  0.00000  23.02712  30.46594
    1          2  0.00000  22.31358  30.10915
    2          4  0.00000  19.10267  25.47093
    3          6  0.00000  18.38913  23.68700
    4          8  0.00000  19.81620  23.68700
    5         10  0.00000  18.03236  21.18952
This data was captured by a datalogger that is triggered only under certain circumstances. As a result, the Timestamp values (given in hundredths of a second) do not always follow a strict sequence, and there may be gaps in time where the datalogger was inactive.
I am trying to capture the maximum u3 value, together with the corresponding values in the other columns (that is, from the same row where the maximum u3 occurs), for every 15-minute interval. In my Timestamp units, 15 minutes is 15 x 60 x 100 = 90000 hundredths of a second.
I managed to get the locations of maximum u3 values using the script below (it only prints the index numbers for now):
    counter = df.Timestamp.max() // 90000
    for i in range(int(counter)):
        df_temp = df[(df.Timestamp >= i*90000) & (df.Timestamp < (i+1)*90000)]
        try:
            print(df_temp["u3"].argmax())
        except ValueError:
            print("NaN")
What I am trying to do is collect the whole rows at these locations and append them to a new dataframe, using i from the script above as the index value. How can I get the entire row (since I know its index through argmax()) and append it to a new dataframe? There is also the NaN issue: if there is no data in a given interval, the script should add NaNs for all columns in that row. What would be an easy way to do this?
Thanks!
You could collect the data frames that have the max u3 values, and use pd.concat to put them back together:
    import numpy as np
    import pandas as pd

    # Number of 15-minute windows; the +1 includes the last, possibly partial, one
    counter = int(df.Timestamp.max() // 90000) + 1
    collected_dfs = []
    for i in range(counter):
        df_temp = df[(df.Timestamp >= i*90000) & (df.Timestamp < (i+1)*90000)]
        if len(df_temp):
            # Keep the row(s) where u3 equals the maximum of this window
            collected_dfs.append(df_temp[df_temp['u3'] == df_temp['u3'].max()])
        else:
            # No data in this window: add a placeholder row of NaNs
            df_nan = pd.DataFrame({'Timestamp': [i*90000], 'u1': [np.nan], 'u2': [np.nan], 'u3': [np.nan]})
            collected_dfs.append(df_nan)
    result = pd.concat(collected_dfs, ignore_index=True)
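One thing to be aware of with the boolean mask used above: if several rows in a window share the maximum u3, all of them are kept. Below is a minimal sketch of that behaviour on a made-up three-row window (the frame and its values are illustrative only), with idxmax shown as one possible way to keep just the first such row:

```python
import pandas as pd

# Illustrative window in which two rows tie for the maximum u3.
df_temp = pd.DataFrame({
    'Timestamp': [4, 6, 8],
    'u3': [23.68700, 25.47093, 25.47093],
})

# Boolean mask: keeps every row whose u3 equals the maximum (two rows here).
all_max_rows = df_temp[df_temp['u3'] == df_temp['u3'].max()]

# idxmax returns the label of the first maximum; passing it to .loc
# inside a list yields a one-row DataFrame instead of a Series.
first_max_row = df_temp.loc[[df_temp['u3'].idxmax()]]

print(len(all_max_rows), len(first_max_row))
```

Which of the two is preferable depends on whether tied rows should all appear in the result.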
If the data looks like this:
    Timestamp       u1        u2        u3
            0  0.00000  23.02712  30.46594
            2  0.00000  22.31358  30.10915
            4  0.00000  19.10267  25.47093
            6  0.00000  18.38913  23.68700
            8  0.00000  19.81620  23.68700
           10  0.00000  18.03236
           16        1         2         3
then
    import numpy as np
    import pandas as pd

    chunksize = 4   # change this to 90000
    df = pd.read_table('data', sep=r'\s+')
    # Label each row with the number of the window it falls in
    df['index'] = df['Timestamp']//chunksize
    # Select the row with the largest u3 in each window
    result = df.loc[df.groupby('index')['u3'].idxmax()]
    N = result['index'].max()
    result.set_index('index', inplace=True)
    # Add NaN rows for windows that have no data
    result = result.reindex(index=np.arange(N+1))
    print(result)
yields
       Timestamp   u1        u2        u3
    0          0    0  23.02712  30.46594
    1          4    0  19.10267  25.47093
    2          8    0  19.81620  23.68700
    3        NaN  NaN       NaN       NaN
    4         16    1   2.00000   3.00000
I used a chunksize of 4 to make the grouping noticeable on the small dataset; you'll want to change it to 90000 for your real dataset.
The main idea is to compute df['Timestamp']//chunksize and to use these values in the call to df.groupby, to group together the desired rows.
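To make the grouping key concrete, here is a small sketch of what the floor division produces for the sample timestamps. The variable names timestamps and window are only for illustration:

```python
import pandas as pd

chunksize = 4  # small value for illustration; 90000 for 15-minute windows
timestamps = pd.Series([0, 2, 4, 6, 8, 10, 16], name='Timestamp')

# Floor division maps each timestamp to the number of the window it falls in.
window = timestamps // chunksize
print(window.tolist())  # [0, 0, 1, 1, 2, 2, 4]
```

Window 3 does not appear because no timestamp falls between 12 and 15, which is the kind of gap that later has to be filled with NaNs.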
df.groupby('index')['u3'].idxmax() finds the index labels of the rows with the maximum u3 value for each group.
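As a small illustration of that step, the sketch below builds a hypothetical four-row frame that already has an 'index' column holding the window number, then uses the labels returned by idxmax to pull out the complete rows with .loc:

```python
import pandas as pd

# Hypothetical frame: two windows (0 and 1) with two rows each.
df = pd.DataFrame({
    'index': [0, 0, 1, 1],
    'u3': [30.46594, 30.10915, 25.47093, 23.68700],
})

# One label per group: the row label at which u3 is largest in that group.
labels = df.groupby('index')['u3'].idxmax()
print(labels.tolist())  # [0, 2]

# .loc with those labels selects the complete rows, not just u3.
rows = df.loc[labels]
print(rows)
```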
Inserting NaNs when there is no data is accomplished by making the index column the index and then calling reindex:
    result = result.reindex(index=np.arange(N+1))
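The effect of reindex can be seen in isolation on a tiny frame. This is only a sketch with made-up values, where window 2 is assumed to have no data:

```python
import numpy as np
import pandas as pd

# Windows 0, 1 and 3 have data; window 2 does not.
result = pd.DataFrame({'u3': [30.5, 25.5, 3.0]}, index=[0, 1, 3])

# Reindexing over the full range adds a row of NaN for the missing label 2.
full = result.reindex(index=np.arange(4))
print(full['u3'].isna().tolist())  # [False, False, True, False]
```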