More bizarre results using groupby and nlargest() in pandas
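This question is an extension of the following post: selecting the largest N of a column for each groupby group using pandas.
Let's use the same df and the workaround proposed in the selected answer. Basically, I am trying to run two groupby operations and select the nlargest N of each group. However, as you can see below, one of the operations raises an error.
Given that the original post uncovered a bug in the code (see here), I wonder whether there are additional bugs, or other manifestations of the same bug.
Unfortunately, I am stuck in my work until these issues are fixed and resolved. Can we get some attention on this? I cannot offer a bounty until tomorrow.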
DF:
{'city1': {0: 'Chicago',
1: 'Chicago',
2: 'Chicago',
3: 'Chicago',
4: 'Miami',
5: 'Houston',
6: 'Austin'},
'city2': {0: 'Toronto',
1: 'Detroit',
2: 'St.Louis',
3: 'Miami',
4: 'Dallas',
5: 'Dallas',
6: 'Dallas'},
'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0},
'plant1_type': {0: 'COMBCYCL',
1: 'COMBCYCL',
2: 'NUKE',
3: 'COAL',
4: 'NUKE',
5: 'COMBCYCL',
6: 'COAL'},
'plant2_type': {0: 'COAL',
1: 'COAL',
2: 'COMBCYCL',
3: 'COMBCYCL',
4: 'COAL',
5: 'NUKE',
6: 'NUKE'}}
You can generate df from the dict above with: pd.DataFrame(dct)
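For anyone reproducing this, here is a minimal sketch of that step. The name dct comes from the line above; the import and the print are only there so the snippet runs on its own.

```python
import pandas as pd

# The dict from the question, assigned to the name `dct` used above.
dct = {
    'city1': {0: 'Chicago', 1: 'Chicago', 2: 'Chicago', 3: 'Chicago',
              4: 'Miami', 5: 'Houston', 6: 'Austin'},
    'city2': {0: 'Toronto', 1: 'Detroit', 2: 'St.Louis', 3: 'Miami',
              4: 'Dallas', 5: 'Dallas', 6: 'Dallas'},
    'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0},
    'plant1_type': {0: 'COMBCYCL', 1: 'COMBCYCL', 2: 'NUKE', 3: 'COAL',
                    4: 'NUKE', 5: 'COMBCYCL', 6: 'COAL'},
    'plant2_type': {0: 'COAL', 1: 'COAL', 2: 'COMBCYCL', 3: 'COMBCYCL',
                    4: 'COAL', 5: 'NUKE', 6: 'NUKE'},
}

# Outer keys become columns, inner keys become the row index (0..6).
df = pd.DataFrame(dct)
print(df)
```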
First groupby: seems to produce a meaningful result
cols = ['city2','plant1_type','plant2_type']
df.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1).reset_index()
city2 plant1_type plant2_type p234_r_c
0 Toronto COMBCYCL COAL 5.0
1 Detroit COMBCYCL COAL 4.0
2 St.Louis NUKE COMBCYCL 2.0
3 Miami COAL COMBCYCL 0.5
4 Dallas NUKE COAL 1.0
5 Dallas COMBCYCL NUKE 4.0
6 Dallas COAL NUKE 3.0
Second groupby: produces an error. The only difference is the use of city1 instead of city2.
cols = ['city1','plant1_type','plant2_type']
df.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1).reset_index()
Error result:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-443-6426182b55e1> in <module>()
----> 1 test1.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1).reset_index()
C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\series.py in reset_index(self, level, drop, name, inplace)
967 else:
968 df = self.to_frame(name)
--> 969 return df.reset_index(level=level, drop=drop)
970
971 def __unicode__(self):
C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
2944 level_values = _maybe_casted_values(lev, lab)
2945 if level is None or i in level:
-> 2946 new_obj.insert(0, col_name, level_values)
2947
2948 elif not drop:
C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\frame.py in insert(self, loc, column, value, allow_duplicates)
2447 value = self._sanitize_column(column, value)
2448 self._data.insert(loc, column, value,
-> 2449 allow_duplicates=allow_duplicates)
2450
2451 def assign(self, **kwargs):
C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\internals.py in insert(self, loc, item, value, allow_duplicates)
3508 if not allow_duplicates and item in self.items:
3509 # Should this be a different kind of error??
-> 3510 raise ValueError('cannot insert %s, already exists' % item)
3511
3512 if not isinstance(loc, int):
ValueError: cannot insert plant2_type, already exists
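Finally:
How can I get the city1 column in the result of the groupby on ['city2','plant1_type','plant2_type'], and the city2 column in the result of the groupby on ['city1','plant1_type','plant2_type']?
I would like to know the corresponding city1 for each group of ['city2','plant1_type','plant2_type'], and the corresponding city2 for each group of ['city1','plant1_type','plant2_type'].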
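Update:
Why do the following results have completely different structures? The only difference is that city2 is used for #A, while city1 is used for #B.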
A)
cols = ['city2','plant1_type','plant2_type']
test1.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1)
city2 plant1_type plant2_type
Toronto COMBCYCL COAL 5.0
Detroit COMBCYCL COAL 4.0
St.Louis NUKE COMBCYCL 2.0
Miami COAL COMBCYCL 0.5
Dallas NUKE COAL 1.0
COMBCYCL NUKE 4.0
COAL NUKE 3.0
Name: p234_r_c, dtype: float64
B)
cols2 = ['city1','plant1_type','plant2_type']
test1.set_index(cols2).groupby(level=cols2)['p234_r_c'].nlargest(1)
city1 plant1_type plant2_type city1 plant1_type plant2_type
Austin COAL NUKE Austin COAL NUKE 3.0
Chicago COAL COMBCYCL Chicago COAL COMBCYCL 0.5
COMBCYCL COAL Chicago COMBCYCL COAL 5.0
NUKE COMBCYCL Chicago NUKE COMBCYCL 2.0
Houston COMBCYCL NUKE Houston COMBCYCL NUKE 4.0
Miami NUKE COAL Miami NUKE COAL 1.0
Name: p234_r_c, dtype: float64
Try this:
In [76]: df.groupby(cols2)['p234_r_c'].nlargest(1).reset_index(level=3, drop=True).reset_index()
Out[76]:
city1 plant1_type plant2_type p234_r_c
0 Austin COAL NUKE 3.0
1 Chicago COAL COMBCYCL 0.5
2 Chicago COMBCYCL COAL 5.0
3 Chicago NUKE COMBCYCL 2.0
4 Houston COMBCYCL NUKE 4.0
5 Miami NUKE COAL 1.0
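A side note on level=3 in that call, and on the last part of the question about getting the other city column. When grouping by columns, nlargest appears to append the original row label as the last index level, and that is the level being dropped here. The rough sketch below assumes a reasonably recent pandas. It uses that level to look the full rows back up, and also shows a sort-then-head alternative that never leaves the DataFrame. The names row_labels, with_city2 and top_n_rows are made up for illustration.

```python
import pandas as pd

# Same seven rows as the DF dict in the question, written column-wise.
df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})
cols2 = ["city1", "plant1_type", "plant2_type"]

# Group by columns (no set_index), as in the call above.
top = df.groupby(cols2)["p234_r_c"].nlargest(1)

# Assumption: the last index level holds the original row labels of df.
row_labels = top.index.get_level_values(-1)

# Look the full rows up again, which brings city2 along.
with_city2 = df.loc[row_labels]
print(with_city2)


def top_n_rows(frame, keys, value, n):
    """Hypothetical helper: the n rows with the largest `value` per group, all columns kept."""
    return frame.sort_values(value, ascending=False).groupby(keys).head(n)


alt = top_n_rows(df, cols2, "p234_r_c", 1)
print(alt)
```

Swapping cols2 for ['city2', 'plant1_type', 'plant2_type'] should give the corresponding city1 in the same way.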
Frankly, I don't understand the following behavior:
In [77]: df.set_index(cols2).groupby(level=cols2)['p234_r_c'].nlargest(1)
Out[77]:
city1 plant1_type plant2_type city1 plant1_type plant2_type
Austin COAL NUKE Austin COAL NUKE 3.0
Chicago COAL COMBCYCL Chicago COAL COMBCYCL 0.5
COMBCYCL COAL Chicago COMBCYCL COAL 5.0
NUKE COMBCYCL Chicago NUKE COMBCYCL 2.0
Houston COMBCYCL NUKE Houston COMBCYCL NUKE 4.0
Miami NUKE COAL Miami NUKE COAL 1.0
Name: p234_r_c, dtype: float64
where:
In [78]: cols2
Out[78]: ['city1', 'plant1_type', 'plant2_type']
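One possible reading of the traceback, offered as an assumption and not a confirmed diagnosis: the six index levels in Out[77] carry each name twice, and reset_index refuses to insert a column whose name already exists. The sketch below reproduces only that mechanism on a tiny hand-built Series with a duplicated level name. It says nothing about why the city2 case comes out with three levels. The Series and its values are illustrative, not output from the df above.

```python
import pandas as pd

# Hand-built index with the same level name twice (illustrative data only).
idx = pd.MultiIndex.from_tuples(
    [("Austin", "Austin"), ("Chicago", "Chicago")],
    names=["city1", "city1"],
)
s = pd.Series([3.0, 5.0], index=idx, name="p234_r_c")

# reset_index tries to turn both levels into columns named "city1".
try:
    s.reset_index()
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Dropping one copy by position (not by name) leaves unique names,
# after which reset_index goes through.
fixed = s.droplevel(0).reset_index()
print(fixed)
```

If this reading is right, it would also explain why grouping by columns and dropping the extra level, as in In [76], avoids the error.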