Pandas groupby object.aggregate with a custom list manipulation function
I have a CSV file, as shown below:
Hour,L,Dr,Tag,Code,Vge
0,L5,XI,PS,4R,15
0,L3,St,sst,4R,17
5,L5,XI,PS,4R,12
2,L0,St,v2T,4R,11
8,L2,TI,sst,4R,8
12,L5,XI,PS,4R,18
2,L2,St,PS,4R,9
12,L3,XI,sst,4R,16
I run the following script in an IPython notebook.
In[1]
import pandas as pd
import numpy as np
In[2]
df = pd.read_csv('/python/concepts/pandas/in.csv')
In[3]
df.head(n=9)
Out[1]:
   Hour   L  Dr  Tag Code  Vge
0     0  L5  XI   PS   4R   15
1     0  L3  St  sst   4R   17
2     5  L5  XI   PS   4R   12
3     2  L0  St  v2T   4R   11
4     8  L2  TI  sst   4R    8
5    12  L5  XI   PS   4R   18
6     2  L2  St   PS   4R    9
7    12  L3  XI  sst   4R   16
In[4]
df.groupby(('Hour'))['Vge'].aggregate(np.sum)
Out[2]:
Hour
0 32
2 20
5 12
8 8
12 34
Name: Vge, dtype: int64
Now I write a list-manipulation function, square_list.
In[4]
newlist = []
In[5]
def square_list(x):
    for item in x:
        newlist.append(item**item)
    return newlist
In [44]: df.groupby(('Hour'))['Vge'].aggregate(square_list)
Out[44]:
Hour
0 [437893890380859375, -2863221430593058543, 437...
2 [437893890380859375, -2863221430593058543, 437...
5 [437893890380859375, -2863221430593058543, 437...
8 [437893890380859375, -2863221430593058543, 437...
12 [437893890380859375, -2863221430593058543, 437...
Name: Vge, dtype: object
The output looks strange. What I expected was the squares of the items in the first output.
If I use
df.groupby(('Hour'))['Vge'].aggregate(lambda x: x ** x)
I get the following error.
ValueError Traceback (most recent call last)
/Applications/anaconda/lib/python3.5/site-packages/pandas/core/groupby.py in agg_series(self, obj, func)
1632 try:
-> 1633 return self._aggregate_series_fast(obj, func)
1634 except Exception:
/Applications/anaconda/lib/python3.5/site-packages/pandas/core/groupby.py in _aggregate_series_fast(self, obj, func)
1651 dummy)
-> 1652 result, counts = grouper.get_result()
1653 return result, counts
pandas/src/reduce.pyx in pandas.lib.SeriesGrouper.get_result (pandas/lib.c:38634)()
pandas/src/reduce.pyx in pandas.lib.SeriesGrouper.get_result (pandas/lib.c:38503)()
pandas/src/reduce.pyx in pandas.lib._get_result_array (pandas/lib.c:32023)()
ValueError: function does not reduce
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
/Applications/anaconda/lib/python3.5/site-packages/pandas/core/groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
2339 try:
-> 2340 return self._python_agg_general(func_or_funcs, *args, **kwargs)
2341 except Exception:
/Applications/anaconda/lib/python3.5/site-packages/pandas/core/groupby.py in _python_agg_general(self, func, *args, **kwargs)
1167 try:
-> 1168 result, counts = self.grouper.agg_series(obj, f)
1169 output[name] = self._try_cast(result, obj)
/Applications/anaconda/lib/python3.5/site-packages/pandas/core/groupby.py in agg_series(self, obj, func)
1634 except Exception:
-> 1635 return self._aggregate_series_pure_python(obj, func)
1636
/Applications/anaconda/lib/python3.5/site-packages/pandas/core/groupby.py in _aggregate_series_pure_python(self, obj, func)
1668 isinstance(res, list)):
-> 1669 raise ValueError('Function does not reduce')
1670 result = np.empty(ngroups, dtype='O')
ValueError: Function does not reduce
During handling of the above exception, another exception occurred:
Exception Traceback (most recent call last)
<ipython-input-47-874cf4c23d53> in <module>()
----> 1 df.groupby(('Hour'))['Vge'].aggregate(lambda x : x**x)
/Applications/anaconda/lib/python3.5/site-packages/pandas/core/groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
2340 return self._python_agg_general(func_or_funcs, *args, **kwargs)
2341 except Exception:
-> 2342 result = self._aggregate_named(func_or_funcs, *args, **kwargs)
2343
2344 index = Index(sorted(result), name=self.grouper.names[0])
/Applications/anaconda/lib/python3.5/site-packages/pandas/core/groupby.py in _aggregate_named(self, func, *args, **kwargs)
2429 output = func(group, *args, **kwargs)
2430 if isinstance(output, (Series, Index, np.ndarray)):
-> 2431 raise Exception('Must produce aggregated value')
2432 result[name] = self._try_cast(output, group)
2433
Exception: Must produce aggregated value
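Did you read the error carefully? It says the function does not reduce. Please take a few minutes to define exactly what you want. This is also precisely the problem with your square_list() function: it returns a list rather than the sum of the list's elements. It does not reduce.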
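To make "reduce" concrete, here is a minimal sketch using the Hour and Vge columns from the question, rebuilt inline. The `df` construction and the name `sum_of_squares` are illustrative assumptions, not part of the original post. A function passed to aggregate should return one scalar per group; when one value per row is wanted, transform is the usual choice.

```python
import pandas as pd

# The Hour and Vge columns from the CSV in the question, rebuilt inline.
df = pd.DataFrame({
    "Hour": [0, 0, 5, 2, 8, 12, 2, 12],
    "Vge": [15, 17, 12, 11, 8, 18, 9, 16],
})

def sum_of_squares(x):
    # Reduces a whole group to a single scalar, so aggregate() accepts it.
    return (x ** 2).sum()

per_group = df.groupby("Hour")["Vge"].aggregate(sum_of_squares)

# Element-wise work (one value per row) belongs in transform(), not aggregate().
per_row = df.groupby("Hour")["Vge"].transform(lambda x: x ** 2)

print(per_group)
print(per_row)
```

Here `per_group` has one entry per hour, while `per_row` keeps the original row index.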
If you want a simple sum:
df.groupby('Hour')['Vge'].sum()
If you want to square every element in the column:
df['Vge_squared'] = df['Vge']**2
If you want the sum of squares per group:
df.groupby('Hour')['Vge_squared'].sum()
Or,
import numpy

def square_list(x):
    x = numpy.array(x)
    return numpy.sum(numpy.multiply(x, x))

df.groupby('Hour')['Vge'].aggregate(square_list)
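Or,
def square_list(x):
    newlist = []  # fresh list for each group, not a shared global
    for item in x:
        newlist.append(item**2)  # square each item (item**item is a self-power)
    return newlist
df.groupby('Hour')['Vge'].aggregate(square_list).apply(sum)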
Hope this helps.
First, the first output is "expected", because every call to square_list appends to the global newlist.
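The effect is easy to reproduce without pandas. In this minimal sketch (the names `shared` and `append_powers` are hypothetical), every call appends to the same module-level list and returns that same object, so each "group" ends up pointing at one ever-growing list:

```python
shared = []  # module-level list, like newlist in the question

def append_powers(values):
    # Appends to the shared list instead of building a fresh one.
    for v in values:
        shared.append(v ** v)
    return shared

first = append_powers([2, 3])
second = append_powers([4])

print(first is second)  # both names refer to the same list object
print(len(first))       # the list holds the results of both calls
```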
You can create the list inside each call instead:
def square_list(x):
    newlist = []
    for item in x:
        newlist.append(item**item)
    return newlist
In [11]: df.groupby(('Hour'))['Vge'].aggregate(square_list)
Out[11]:
Hour
0 [437893890380859375, -2863221430593058543]
2 [285311670611, 387420489]
5 [8916100448256]
8 [16777216]
12 [-497033925936021504, 0]
dtype: object
But I suspect this is not what you want.
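The negative and zero entries hint at a second problem: `item**item` raises each value to its own power rather than squaring it, and values such as 17**17 do not fit in a 64-bit integer, so they wrap around. A minimal sketch in plain Python reproduces the numbers shown above. The helper `wrap_int64` is hypothetical and only mimics two's-complement wraparound.

```python
def wrap_int64(n):
    # Map an arbitrary Python int onto the signed 64-bit range,
    # the way fixed-width int64 arithmetic wraps on overflow.
    return (n + 2**63) % 2**64 - 2**63

# Print each value, its true square, and its wrapped self-power.
for v in (15, 16, 17, 18):
    print(v, v ** 2, wrap_int64(v ** v))
```

A true square would be `item**2`, which stays well within range for this data.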
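The error message is quite accurate: "Must produce aggregated value". Currently your lambda does not return a single value.
Perhaps you want the sum: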
In [21]: df.groupby(('Hour'))['Vge'].aggregate(lambda x: (x ** x).sum())
Out[21]:
Hour
0 -8785478146473916416
2 285699091100
5 8916100448256
8 16777216
12 0
Name: Vge, dtype: int64
Note: it may be faster to create a dummy column for the squares and then do a "clean" sum.
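A minimal sketch of that idea, using true squares (`** 2`) and the question's data rebuilt inline. The column name `Vge_squared` is only illustrative. Whether this is actually faster depends on the data size, so it is worth timing on real data.

```python
import pandas as pd

df = pd.DataFrame({
    "Hour": [0, 0, 5, 2, 8, 12, 2, 12],
    "Vge": [15, 17, 12, 11, 8, 18, 9, 16],
})

# Square the whole column once, vectorised, then use the built-in sum.
squared = df.assign(Vge_squared=df["Vge"] ** 2)
via_column = squared.groupby("Hour")["Vge_squared"].sum()

# Same numbers via a Python-level lambda called once per group.
via_lambda = df.groupby("Hour")["Vge"].aggregate(lambda x: (x ** 2).sum())

print(via_column.tolist() == via_lambda.tolist())
```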