Why does mean() fail on a pandas Series of DataFrames when sum() does not, and how can I make it work?

There may be a smarter way to do this in pandas, but the following example looks like it should work and doesn't:

import pandas as pd
import numpy as np

df1 = pd.DataFrame([[1, 0], [1, 2], [2, 0]], columns=['a', 'b'])
df2 = df1.copy()
df3 = df1.copy()

idx = pd.date_range("2010-01-01", freq='H', periods=3)
s = pd.Series([df1, df2, df3], index=idx)
# This causes an error
s.mean()

I won't post the whole traceback, but the main error message is interesting:

TypeError: Could not convert    a  b
0  3  0
1  3  6
2  6  0 to numeric

It looks like the DataFrames were summed successfully, but the result was never divided by the length of the Series.

However, we can take the sum of the dataframes in the series:

s.sum()

... returns:

   a  b
0  3  0
1  3  6
2  6  0

Why wouldn't mean() work when sum() does? Is this a bug or a missing feature? This does work:

(df1 + df2 + df3)/3.0

... and so does this:

s.sum()/3.0
     a    b
0  1.0  0.0
1  1.0  2.0
2  2.0  0.0

But this of course is not ideal.
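
The hard-coded 3.0 can be replaced by the number of frames. What follows is only a minimal sketch, assuming every frame shares the same index and columns; `mean_of_frames` is a hypothetical helper name, not a pandas function. It adds the frames with `functools.reduce` instead of relying on `Series.sum()` for object data.

```python
import functools
import operator

import pandas as pd


def mean_of_frames(frames):
    """Element-wise mean of an iterable of identically labelled DataFrames."""
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one DataFrame")
    # Add the frames one after another, then divide by how many there were
    total = functools.reduce(operator.add, frames)
    return total / len(frames)


df1 = pd.DataFrame([[1, 0], [1, 2], [2, 0]], columns=["a", "b"])
result = mean_of_frames([df1, df1.copy(), df1.copy()])
print(result)
```

Because iterating over a Series yields its values, `mean_of_frames(s)` should also accept the Series `s` from the question.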

You could use a hierarchical index (as suggested by @unutbu), but when you have a three-dimensional array you should consider using a pandas Panel, especially when one of the dimensions represents time, as in this case. (Panel was deprecated in pandas 0.20 and removed in 0.25, so this applies to older versions only.)

The Panel is often overlooked, but it is, after all, where the name pandas comes from (it derives from "panel data").

The data here differs slightly from your original, so that no two dimensions have the same length:

df1 = pd.DataFrame([[1, 0], [1, 2], [2, 0], [2, 3]], columns=['a', 'b'])
df2 = df1 + 1
df3 = df1 + 10

Panels can be created in a couple of different ways, and one of them is from a dict. You can create the dict from your index and the DataFrames with:

s = pd.Panel(dict(zip(idx,[df1,df2,df3])))

The mean you are looking for is simply a matter of operating on the correct axis (axis=0 in this case):

s.mean(axis=0)

Out[80]:
          a         b
0  4.666667  3.666667
1  4.666667  5.666667
2  5.666667  3.666667
3  5.666667  6.666667

With your data, sum(axis=0) returns the expected result.
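
On pandas versions that no longer ship `Panel`, the same "reduce along the first axis" idea can be approximated with a NumPy array. This is a sketch rather than a drop-in replacement: it assumes all frames have identical shape and labels, and the names `frames`, `cube` and `mean_df` are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 0], [1, 2], [2, 0], [2, 3]], columns=["a", "b"])
df2 = df1 + 1
df3 = df1 + 10
frames = [df1, df2, df3]

# Stack into one array of shape (3, 4, 2); axis 0 runs over the frames
cube = np.stack([df.to_numpy() for df in frames])

# Average over axis 0 and put the row and column labels back
mean_df = pd.DataFrame(cube.mean(axis=0), index=df1.index, columns=df1.columns)
print(mean_df)
```

The labels are taken from the first frame, so any misalignment between frames would go unnoticed; the hierarchical index approach aligns on labels instead.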

EDIT: OK, too late for panels, as the hierarchical index approach is already "accepted". I will say that that approach is preferable if the data is known to be "ragged", with an unknown and differing number of items in each grouping. For "square" data, the panel is absolutely the way to go and will be significantly faster, with more built-in operations. Pandas 0.15 has many improvements for multi-level indexing but still has limitations and dark edge cases in real-world apps.

When you define s with

s = pd.Series([df1, df2, df3], index=idx)

you get a Series with DataFrames as items:

In [77]: s
Out[77]: 
2010-01-01 00:00:00       a  b
0  1  0
1  1  2
2  2  0
2010-01-01 01:00:00       a  b
0  1  0
1  1  2
2  2  0
2010-01-01 02:00:00       a  b
0  1  0
1  1  2
2  2  0
Freq: H, dtype: object

The sum of the items is a DataFrame:

In [78]: s.sum()
Out[78]: 
   a  b
0  3  0
1  3  6
2  6  0

but when you take the mean, nanops.nanmean is called:

def nanmean(values, axis=None, skipna=True):
    values, mask, dtype, dtype_max = _get_values(values, skipna, 0)
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_max))
    ...

Notice that _ensure_numeric is called on the resulting sum. An error is raised because a DataFrame is not numeric.
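
To make the failure mode concrete, here is a deliberately simplified stand-in for that kind of check. It is not the pandas source: `ensure_numeric_sketch` is a hypothetical name, and the real helper handles more cases. It only illustrates why a summed DataFrame is rejected while a plain number passes through.

```python
import pandas as pd


def ensure_numeric_sketch(x):
    """Simplified stand-in for a 'must be numeric' check (not pandas' code)."""
    if isinstance(x, (int, float, complex)):
        return x
    try:
        # A whole DataFrame cannot be converted to a single float
        return float(x)
    except (TypeError, ValueError) as exc:
        raise TypeError(f"Could not convert {x} to numeric") from exc


the_sum = pd.DataFrame([[3, 0], [3, 6], [6, 0]], columns=["a", "b"])

message = None
try:
    ensure_numeric_sketch(the_sum)
except TypeError as err:
    message = str(err)
    print(message)
```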

Here is a workaround. Instead of making a Series with DataFrames as items, you can concatenate the DataFrames into a new DataFrame with a hierarchical index:

In [79]: s = pd.concat([df1, df2, df3], keys=idx)

In [80]: s
Out[80]: 
                       a  b
2010-01-01 00:00:00 0  1  0
                    1  1  2
                    2  2  0
2010-01-01 01:00:00 0  1  0
                    1  1  2
                    2  2  0
2010-01-01 02:00:00 0  1  0
                    1  1  2
                    2  2  0

Now you can take the sum and the mean:

In [82]: s.sum(level=1)
Out[82]: 
   a  b
0  3  0
1  3  6
2  6  0

In [84]: s.mean(level=1)
Out[84]: 
   a  b
0  1  0
1  1  2
2  2  0
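
The `level=` argument to `sum()` and `mean()` was deprecated in pandas 1.3 and removed in 2.0. On newer versions, grouping on the inner index level should give the same aggregation. A minimal sketch, with `stacked` as an illustrative name and the hourly frequency spelled as a `Timedelta` (an assumption made here to sidestep frequency-alias differences between versions):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 0], [1, 2], [2, 0]], columns=["a", "b"])
idx = pd.date_range("2010-01-01", periods=3, freq=pd.Timedelta(hours=1))

# Outer index level: timestamp; inner level: the original row label
stacked = pd.concat([df1, df1.copy(), df1.copy()], keys=idx)

# Group rows that share the same inner label, then aggregate across timestamps
total = stacked.groupby(level=1).sum()
mean = stacked.groupby(level=1).mean()

print(total)
print(mean)
```

Note that the mean comes back as floats, even when every value is a whole number.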
