
Element wise average of arrays with different lengths from a dataframe

First, I will explain what I want to happen. I have many arrays of different lengths; take three as an example. I want the element-wise average across the arrays, where each position is averaged over only the arrays that have a value at that position.

a = [0,10,20]

b = [10,40,60,80]

c = [50,70]

Expected outcome = [20,40,40,80]

What I've tried is using zip_longest from itertools and using the mean function from statistics.

from itertools import zip_longest
from statistics import mean

outcome = [mean(n) for n in zip_longest(a, b, c, fillvalue=0)]

However, the fill value is 0, so the padded zeros are counted in the mean and the outcome is not the desired one. Because I am using the mean function, I cannot set fillvalue to None. Would I have to use a different function to calculate the mean? Or is there another method to get an element-wise average of arrays with different lengths?
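To make the problem concrete, here is a minimal sketch of both attempts on the example lists. The variable names zero_filled and none_filled are only illustrative.

```python
from itertools import zip_longest
from statistics import mean

a = [0, 10, 20]
b = [10, 40, 60, 80]
c = [50, 70]

# Padding with 0 counts the missing entries as real zeros, so the
# third and fourth positions are still divided by 3.
zero_filled = [mean(n) for n in zip_longest(a, b, c, fillvalue=0)]
print(zero_filled)

# Padding with None fails because mean() cannot average None values.
try:
    none_filled = [mean(n) for n in zip_longest(a, b, c, fillvalue=None)]
except TypeError as exc:
    none_filled = None
    print("TypeError:", exc)
```

The first two positions come out right (20 and 40) because all three lists have values there; the last two are pulled down by the padded zeros.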

Edit: Apologies, I forgot to mention where the arrays come from. They are from a pandas dataframe, where each row of one column holds an array of some length x.

Edit2: Adding more meaningful data

Create a dataframe from a csv file using pandas

Select the portion of the dataframe that satisfies 2 conditions

Try to get the element-wise average of the 3rd column for the rows that satisfy the 2 conditions

from itertools import zip_longest
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')

sec1 = df[(df['Color'] == 'blue') & (df['Type'] == 21)]

outcome = [np.nanmean(n) for n in zip_longest(sec1['time'], fillvalue=float("nan"))]

print(outcome)

Where sec1['time'] gives the following output, with arrays of different lengths:

2168    [0, 10, 20, 29, 44, 47, 59, 71, 94, 198...
2169    [0, 0, 7, 12, 47, 84, 144, 163, 222...
...

One approach is to use nan as the fillvalue and filter the nan values out (using filterfalse) when computing the mean, as below:

from itertools import zip_longest, filterfalse
from statistics import mean
from math import isnan

a = [0, 10, 20]
b = [10, 40, 60, 80]
c = [50, 70]

outcome = [mean(filterfalse(isnan, n)) for n in zip_longest(a, b, c, fillvalue=float("nan"))]
print(outcome)

Output

[20, 40, 40, 80]

I suggest you use fmean (available in statistics since Python 3.8):

from statistics import fmean

outcome = [fmean(filterfalse(isnan, n)) for n in zip_longest(a, b, c, fillvalue=float("nan"))]
print(outcome)

fmean is faster than mean. From the documentation:

This runs faster than the mean() function and it always returns a float. The data may be a sequence or iterable. If the input dataset is empty, raises a StatisticsError.
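A small sketch of the "always returns a float" part of that quote, using a hypothetical column of values that remain once the nan padding has been filtered out:

```python
from statistics import mean, fmean

column = [20, 60]  # hypothetical values left after dropping the nan padding

m = mean(column)   # integer input with an exact result stays an int
f = fmean(column)  # values are converted to float before averaging
print(m, f)        # 40 40.0
```

So switching from mean to fmean changes the printed result from [20, 40, 40, 80] to floats, which may matter if later code compares types or formats the numbers.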

Another alternative is to use numpy's nanmean:

from itertools import zip_longest
import numpy as np

a = [0, 10, 20]
b = [10, 40, 60, 80]
c = [50, 70]

outcome = [np.nanmean(n) for n in zip_longest(a, b, c, fillvalue=float("nan"))]
print(outcome)

Output (as printed by NumPy 1.x; NumPy 2.x shows each value wrapped as np.float64(...))

[20.0, 40.0, 40.0, 80.0]
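For the dataframe case from the question, the same idea should carry over if the column is unpacked with * so that each row's list becomes its own argument to zip_longest. The sketch below uses a small hypothetical dataframe in place of the filtered sec1, and assumes every cell already holds a Python list.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

# Hypothetical stand-in for the filtered rows (sec1) in the question.
sec1 = pd.DataFrame({"time": [[0, 10, 20], [10, 40, 60, 80], [50, 70]]})

# The * unpacks the column so each row's list is a separate argument.
# Without it, zip_longest receives one iterable and yields whole rows.
outcome = [
    float(np.nanmean(n))
    for n in zip_longest(*sec1["time"], fillvalue=float("nan"))
]
print(outcome)  # [20.0, 40.0, 40.0, 80.0]
```

If the column was read from a csv file, the cells may be strings such as "[0, 10, 20]" rather than lists; in that case they would need to be parsed first (for example with ast.literal_eval), which is an assumption about the data and not something shown in the question.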

