Select multiple groups from pandas groupby object
I am experimenting with the groupby features of pandas, in particular:
gb = df.groupby('model')
gb.hist()
Since gb has 50 groups, the result is quite cluttered; I would like to explore the result only for the first 5 groups.
I found how to select a single group with groups or get_group (How to access pandas groupby dataframe by key), but not how to select multiple groups directly. The best I could do is:
groups = dict(list(gb))
subgroup = pd.concat(list(groups.values())[:4])  # list() is needed on Python 3, where dict views cannot be sliced
subgroup.groupby('model').hist()
Is there a more direct way?
It'd be easier to just filter your df first and then perform the groupby:
In [155]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'model':np.random.randint(1,10,100), 'value':np.random.randn(100)})
first_five = df['model'].sort_values().unique()[:5]  # Series.sort() was removed; sort_values() replaces it
gp = df[df['model'].isin(first_five)].groupby('model')
gp.first()
Out[155]:
          value
model
1     -0.505677
2      1.217027
3     -0.641583
4      0.778104
5     -1.037858
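The same "filter first, then group" idea can also be sketched with GroupBy.ngroup(), which numbers each row by the position of its group. This is only an illustration: the seeded random data and the names first_five_rows and subset_gb are assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like the example above
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "model": rng.integers(1, 10, 100),
    "value": rng.standard_normal(100),
})

gb = df.groupby("model")  # groups are sorted by key by default

# ngroup() labels every row with the ordinal number of its group (0, 1, 2, ...),
# so a boolean mask keeps only the rows that belong to the first five groups.
first_five_rows = df[gb.ngroup() < 5]

subset_gb = first_five_rows.groupby("model")
print(sorted(subset_gb.groups))
```

The mask keeps the rows in their original order, and subset_gb can then be used for hist() or any other groupby operation.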
You can do something like:
new_gb = pandas.concat( [ gb.get_group(group) for i,group in enumerate( gb.groups) if i < 5 ] ).groupby('model')
new_gb.hist()
However, I would approach it differently. You can use the collections.Counter object to get the groups fast:
import collections
import numpy as np  # pandas.np was removed from pandas, so import numpy directly
import pandas
df = pandas.DataFrame.from_dict({'model': np.random.randint(0, 3, 10), 'param1': np.random.random(10), 'param2': np.random.random(10)})
# model param1 param2
#0 2 0.252379 0.985290
#1 1 0.059338 0.225166
#2 0 0.187259 0.808899
#3 2 0.773946 0.696001
#4 1 0.680231 0.271874
#5 2 0.054969 0.328743
#6 0 0.734828 0.273234
#7 0 0.776684 0.661741
#8 2 0.098836 0.013047
#9 1 0.228801 0.827378
model_groups = collections.Counter(df.model)
print(model_groups) #Counter({2: 4, 0: 3, 1: 3})
Now you can iterate over the Counter object like a dictionary, and query the groups you want:
new_df = pandas.concat( [df.query('model==%d'%key) for key,val in model_groups.items() if val < 4 ] ) # for example, but you can select the models however you like
# model param1 param2
#2 0 0.187259 0.808899
#6 0 0.734828 0.273234
#7 0 0.776684 0.661741
#1 1 0.059338 0.225166
#4 1 0.680231 0.271874
#9 1 0.228801 0.827378
Now you can use the built-in pandas.DataFrame.groupby function:
gb = new_df.groupby('model')
gb.hist()
Since model_groups contains all of the groups, you can just pick from it as you wish.
If your model column contains string values (names or something) instead of integers, it will all work the same; just change the query argument from 'model==%d'%key to 'model=="%s"'%key.
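As a possible alternative to the query-based selection above, Series.isin avoids string formatting altogether, so the same line works for integer and string keys alike. A minimal sketch, assuming a small hand-made frame with string model names; the variable wanted is a hypothetical name introduced here.

```python
import collections
import pandas as pd

# Hypothetical data with string-valued models
df = pd.DataFrame({
    "model": ["a", "b", "c", "a", "b", "a", "c", "c", "a", "b"],
    "param1": range(10),
})

model_groups = collections.Counter(df["model"])

# Same example criterion as above: keep the models that occur fewer than 4 times
wanted = [key for key, count in model_groups.items() if count < 4]

# One boolean mask instead of one query() call per key
new_df = df[df["model"].isin(wanted)]
gb = new_df.groupby("model")
print(gb.size())
```

Unlike the concat version, the mask leaves the selected rows in their original order rather than grouping them by key.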
I don't know of a way to use the .get_group() method with more than one group.
You can, however, iterate through the groups.
It is still a bit ugly to do this, but here is one solution with iteration:
limit = 5
i = 0
# gd is the groupby object, e.g. gd = df.groupby('model')
for key, group in gd:
    print(key, group)
    i += 1
    if i >= limit:
        break
You could also do a loop with .get_group(), which imho is a little prettier, but still quite ugly.
for key in list(gd.groups.keys())[:2]:
    print(gd.get_group(key))
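The counter-and-break loop above can be written more compactly with itertools.islice, which stops the iteration after a fixed number of groups. A minimal sketch, assuming a small made-up frame; first_groups is a hypothetical name.

```python
from itertools import islice

import pandas as pd

# Hypothetical data; gd plays the same role as in the loops above
df = pd.DataFrame({"model": [3, 1, 2, 1, 3, 2, 4], "value": range(7)})
gd = df.groupby("model")

limit = 2
# Iterating a groupby object yields (key, sub-DataFrame) pairs;
# islice takes only the first `limit` of them.
first_groups = {key: group for key, group in islice(gd, limit)}

for key, group in first_groups.items():
    print(key)
    print(group)
```

Collecting the pairs in a dict keeps each selected group available by key for later use.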
gbidx=list(gb.indices.keys())[:4]
dfidx=np.sort(np.concatenate([gb.indices[x] for x in gbidx]))
df.loc[dfidx].groupby('model').hist()
gb.indices is faster than gb.groups or list(gb), and I believe concatenating the indices is faster than concatenating DataFrames.
I tried this on my big CSV file of ~416M rows and 13 columns (incl. str), 720MB in size, grouping by more than one column, then changed the column names to those in the question.
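One caveat about the gb.indices approach: gb.indices holds integer positions rather than index labels, so df.loc only lines up with it when the frame has the default 0..n-1 index. A sketch of the same selection using df.iloc, assuming a deliberately non-default index; the frame and the names keys, positions and subset are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index is NOT the default RangeIndex
df = pd.DataFrame(
    {"model": [2, 1, 2, 3, 1, 3], "value": [10, 20, 30, 40, 50, 60]},
    index=list("abcdef"),
)
gb = df.groupby("model")

# Take the first two group keys and gather their row positions
keys = list(gb.indices.keys())[:2]
positions = np.sort(np.concatenate([gb.indices[k] for k in keys]))

# iloc selects by position, which is what gb.indices stores
subset = df.iloc[positions]
print(subset)
```

With the default index, positions and labels coincide, so the df.loc version above gives the same rows.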
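from IPython.display import display  # display() is available by default in Jupyter

def get_groups(group_object):
    # Show every group in the groupby object under a header with its key
    for i in group_object.groups.keys():
        print(f"____{i}____")
        display(group_object.get_group(i))

# get all groups by calling this function
get_groups(any_group_which_you_made)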
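To show only some of the groups instead of all of them, the helper above could take the wanted keys as an argument. This is a hedged variation, not the original author's code: iter_selected_groups and the sample frame are hypothetical, and print is used in place of display so that it also runs outside a notebook.

```python
import pandas as pd

def iter_selected_groups(group_object, keys):
    """Yield (key, DataFrame) pairs for the requested group keys only."""
    for key in keys:
        yield key, group_object.get_group(key)

# Hypothetical data for illustration
df = pd.DataFrame({"model": ["x", "y", "z", "x", "z"], "value": [1, 2, 3, 4, 5]})
gb = df.groupby("model")

selected = dict(iter_selected_groups(gb, ["x", "z"]))
for key, group in selected.items():
    print(f"____{key}____")
    print(group)
```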