
Pandas dataframe average truly unique values

I'm working with a number of measurement sets. Each set contains two columns: the datetime and the temperature. Example:

# measurement 1:
    time | value
00:00:00 | 10.1
00:00:10 | 10.12
00:00:20 | 10.14
00:00:30 | 10.12
00:00:40 | 10.11
00:00:50 | 10.13

# measurement 2:
    time | value
00:00:01 | 10.11
00:00:11 | 10.13
00:00:21 | 10.14
00:00:31 | 10.12
00:00:41 | 10.12
00:00:51 | 10.11

# measurement 3:
    time | value
00:00:00 | 10.2
00:00:10 | 10.22
00:00:20 | 10.24
00:00:30 | 10.22
00:00:40 | 10.21
00:00:50 | 10.23

I load these sets in pandas dataframes and merge them into a single dataframe using an outer join:

df = pd.merge(left=df1, right=df2, how='outer', left_on='time', right_on='time', suffixes=("1", "2"))

I want to average the values of the three dataframes. However, the times are sometimes not exactly the same, so the values end up on different rows and taking the average is difficult. Take for example the join of measurement 2 and measurement 3:

# measurement 2 & 3 merged:
    time | value2 | value3
00:00:01 | 10.11  | -
00:00:11 | 10.13  | -
00:00:21 | 10.14  | -
00:00:31 | 10.12  | -
00:00:41 | 10.12  | -
00:00:51 | 10.11  | -
00:00:00 | -      | 10.2
00:00:10 | -      | 10.22
00:00:20 | -      | 10.24
00:00:30 | -      | 10.22
00:00:40 | -      | 10.21
00:00:50 | -      | 10.23

In this case the times are not exactly the same. Is there a way to get these values on the same row so that they can be averaged?

Sometimes a device has exported the data multiple times (at different times). This means that certain measurements are not unique (exactly the same time and exactly the same value). How do I make sure that these duplicate measurements are not taken into account when averaging?

Hope someone can help.

EDIT: added an image and some clarification. I have plotted the six datasets. To make the plot easier to read, I've added offsets of 0, 10, 20, 30, 40 and 50 to the different graphs, because otherwise some would be on top of each other. The yellow, magenta and cyan measurements are exactly on top of each other, in value and in datetime, because they're from the same source (the data was just exported multiple times).

The green and red datasets are different in value (approximately 40) and weren't measured at exactly the same time (they can be off by a few minutes, for example).

From all these measurements I want to create the average line. Since magenta, cyan and yellow are the same, the average should be equal to any one of them. But from a certain point there are also blue, green and red. In that case I'm looking for a calculated average, but the datetimes are not exactly the same.

[Image: measurement plot]

To get value1, value2 and value3 in the same column, I used:

df = pd.concat([df1, df2, df3])

The example below looks like yours:

import pandas as pd

df1 = pd.DataFrame({'Time': ['00:00:00', '00:00:10', '00:00:20', '00:00:30', '00:00:40', '00:00:50'],
                    'Value': ['10', '1', '2', '3', '4', '8']})


df2 = pd.DataFrame({'Time': ['00:00:01', '00:00:11', '00:00:21', '00:00:31', '00:00:41', '00:00:51'],
                    'Value': ['10', '1', '2', '3', '4', '8']})


df3 = pd.DataFrame({'Time': ['00:00:00', '00:00:10', '00:00:20', '00:00:30', '00:00:40', '00:00:50'],
                    'Value': ['10', '1', '2', '3', '4', '8']})

df = pd.concat([df1, df2, df3])

print(df)
# Output:
       Time Value
0  00:00:00    10
1  00:00:10     1
2  00:00:20     2
3  00:00:30     3
4  00:00:40     4
5  00:00:50     8
0  00:00:01    10
1  00:00:11     1
2  00:00:21     2
3  00:00:31     3
4  00:00:41     4
5  00:00:51     8
0  00:00:00    10
1  00:00:10     1
2  00:00:20     2
3  00:00:30     3
4  00:00:40     4
5  00:00:50     8
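
Once the frames are stacked in a single column like this, exact duplicates can be dropped and the remaining rows averaged in time bins. The following is only a sketch under a few assumptions that are not in the answer above: the values are numeric rather than strings, a placeholder date is prepended so the times can be parsed as datetimes, `df3` is a re-export of `df1`, and a 10-second bin width suits the data.

```python
import pandas as pd

day = "2024-01-01 "  # placeholder date (assumption), only needed to build datetimes

df1 = pd.DataFrame({"Time": ["00:00:00", "00:00:10", "00:00:20"],
                    "Value": [10.10, 10.12, 10.14]})
df2 = pd.DataFrame({"Time": ["00:00:01", "00:00:11", "00:00:21"],
                    "Value": [10.11, 10.13, 10.14]})
df3 = df1.copy()  # same source exported a second time (assumption)

# Stack all measurements into one frame
df = pd.concat([df1, df2, df3], ignore_index=True)
df["Time"] = pd.to_datetime(day + df["Time"])

# Drop rows that repeat both the time and the value
df = df.drop_duplicates(subset=["Time", "Value"])

# Average everything that falls into the same 10-second bin
avg = df.set_index("Time")["Value"].resample("10s").mean()
print(avg)
```

The bin width decides which near-simultaneous measurements get averaged together, so it should be chosen to match the sampling interval of the devices.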

Solved it:

I first concatenated all the non-duplicate entries:

# data is a list of DataFrames, each with 'datetime' and 'value' columns
df_new = data[0]
for df in data[1:]:
    # keep rows whose datetime or value does not already occur in df_new
    is_new = ~df.datetime.isin(df_new.datetime) | ~df.value.isin(df_new.value)
    df_new = pd.concat([df_new, df[is_new]])
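
One thing to be aware of: the filter above checks the datetime column and the value column independently, so a row is also skipped when its datetime occurs on one existing row and its value on a different one. If only rows that match in both columns at once should count as duplicates, `drop_duplicates` on both columns is one possible alternative. A small sketch with made-up data (the frames `a` and `b` are hypothetical):

```python
import pandas as pd

times = pd.to_datetime(["2024-01-01 00:00:00", "2024-01-01 00:00:10"])
a = pd.DataFrame({"datetime": times, "value": [10.1, 10.2]})
# Same timestamps, values swapped: no row of b is an exact copy of a row of a
b = pd.DataFrame({"datetime": times, "value": [10.2, 10.1]})

# Column-wise isin filter, as in the loop above
isin_mask = ~b.datetime.isin(a.datetime) | ~b.value.isin(a.value)
kept_by_isin = b[isin_mask]

# Row-wise duplicate removal on the (datetime, value) pair
combined = pd.concat([a, b], ignore_index=True).drop_duplicates(
    subset=["datetime", "value"]
)

print(len(kept_by_isin))  # rows of b that survive the isin filter
print(len(combined))      # rows left after row-wise de-duplication
```

With this data the isin filter discards both rows of `b`, while `drop_duplicates` keeps all four rows because no (datetime, value) pair repeats.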

Then I set the index:

df_new = df_new.set_index('datetime')

And finally I resample and take the mean:

avg = df_new.resample('1800s').mean().dropna()

This results in the correct average.
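
For the first part of the question (getting values with slightly different timestamps on the same row), `pd.merge_asof` is another option besides resampling. This is only a sketch: the placeholder date, the 5-second tolerance and the frame names `m2` and `m3` are assumptions chosen to mirror measurements 2 and 3 from the question.

```python
import pandas as pd

day = "2024-01-01 "  # placeholder date (assumption)

m2 = pd.DataFrame({
    "time": pd.to_datetime([day + t for t in ["00:00:01", "00:00:11", "00:00:21"]]),
    "value": [10.11, 10.13, 10.14],
})
m3 = pd.DataFrame({
    "time": pd.to_datetime([day + t for t in ["00:00:00", "00:00:10", "00:00:20"]]),
    "value": [10.20, 10.22, 10.24],
})

# merge_asof needs both frames sorted by the key; each row of m3 is matched
# to the nearest row of m2 that lies within the tolerance
aligned = pd.merge_asof(
    m3.sort_values("time"),
    m2.sort_values("time"),
    on="time",
    direction="nearest",
    tolerance=pd.Timedelta("5s"),
    suffixes=("3", "2"),
)

# Row-wise average of the aligned values (NaN entries are skipped by mean)
aligned["mean"] = aligned[["value2", "value3"]].mean(axis=1)
print(aligned)
```

The result keeps the timestamps of the left frame (`m3` here), and rows of the right frame without a partner within the tolerance do not appear, so this fits best when one measurement set can serve as the reference timeline.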

