Finding the count of unique column elements after using groupby with pandas

Question

I have a dataset set up as follows:

rows = [
    ('us', 0, 'ca', None, 94107, -100),
    ('ca', 1, None, 'bc', 94107, -100),
    ('us', 0, 'ca', None, 94106, 0),
    ('us', 0, 'ca', None, 94107, 0),
    ('ca', 1, None, 'bc', 94107, 0),
    ('ca', 1, None, 'bc', 94107, 0),
    ('us', 0, 'ca', None, 94107, 100),
    ('us', 0, 'ca', None, 94107, 100)
]

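I want to group by the categories (country, state/province, zip), then count how often each value of the "Option" column occurs within each group, and finally convert the result to a dictionary.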

Ideally, I would like the dict to be formatted as:

{
    ('us', 'ca', 94107): {100: 2, -100: 1, 0: 1}, 
    ('us', 'ca', 94106): {0: 1},  
    ('ca', 'bc', 94107): {-100: 1, 0: 2}
}

So far I have the following code:

import pandas as pd

# build the data frame
df = pd.DataFrame(rows, columns=['Country', 'LocFilter', 'State', 'Provence', 'Zip', 'Option'])

# consolidate "State" and "Provence" into "MainProvence" based on "LocFilter"
df['MainProvence'] = df.apply(lambda row: (row['Provence'] if row['LocFilter'] == 1 else row['State']), axis=1)

# group by and find distribution
distribution = df.groupby(by=['Country', 'MainProvence','Zip', 'Option'])['Option'].count()
# print the result
print(distribution)

This gives me the following, which looks right:

Country  MainProvence  Zip    Option
ca       bc            94107  -100      1
                               0        2
us       ca            94106   0        1
                       94107  -100      1
                               0        1
                               100      2
Name: Option, dtype: int64

However, when I convert it to a dictionary:

print(distribution.to_dict())

I get this:

{
    ('us', 'ca', 94107, 100): 2, 
    ('us', 'ca', 94106, 0): 1, 
    ('us', 'ca', 94107, -100): 1, 
    ('ca', 'bc', 94107, 0): 2, 
    ('ca', 'bc', 94107, -100): 1, 
    ('us', 'ca', 94107, 0): 1
}

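That is understandable given how I composed the group-by. I can obviously manipulate the returned dictionary in Python to get the format I want, but is there a way to get this format using pandas?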
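For reference, the plain-Python manipulation mentioned above could look roughly like the sketch below. It starts from the flat dictionary printed earlier; the names `flat` and `nested` are illustrative and not part of the original code.

```python
from collections import defaultdict

# Flat result of distribution.to_dict(), copied from the output above.
flat = {
    ('us', 'ca', 94107, 100): 2,
    ('us', 'ca', 94106, 0): 1,
    ('us', 'ca', 94107, -100): 1,
    ('ca', 'bc', 94107, 0): 2,
    ('ca', 'bc', 94107, -100): 1,
    ('us', 'ca', 94107, 0): 1,
}

# Split each 4-tuple key into a (country, province, zip) key and an option key.
nested = defaultdict(dict)
for (country, province, zip_code, option), count in flat.items():
    nested[(country, province, zip_code)][option] = count

nested = dict(nested)  # plain dict, keyed by location, then by option
print(nested)
```

This works, but the regrouping happens outside pandas, which is what the question is trying to avoid.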

Answer 1

This is quite easy. Try:

distribution.unstack(level=['Option']).to_dict(orient='index')

to get

{('ca', 'bc', 94107): {-100: 1.0, 0: 2.0, 100: nan},
 ('us', 'ca', 94106): {-100: nan, 0: 1.0, 100: nan},
 ('us', 'ca', 94107): {-100: 1.0, 0: 1.0, 100: 2.0}}

Dropping the nan entries at this point should not be much of an inconvenience, I think.
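One way to drop the nan entries is sketched below. It assumes the sample data from the question and rebuilds everything so that it runs on its own. Two things are swapped in for brevity: `groupby(...).size()` replaces the question's `['Option'].count()` (the counts are the same here because "Option" has no missing values), and the `fillna` form replaces the `apply` call. The names `wide` and `nested` are illustrative.

```python
import pandas as pd

rows = [
    ('us', 0, 'ca', None, 94107, -100),
    ('ca', 1, None, 'bc', 94107, -100),
    ('us', 0, 'ca', None, 94106, 0),
    ('us', 0, 'ca', None, 94107, 0),
    ('ca', 1, None, 'bc', 94107, 0),
    ('ca', 1, None, 'bc', 94107, 0),
    ('us', 0, 'ca', None, 94107, 100),
    ('us', 0, 'ca', None, 94107, 100),
]
df = pd.DataFrame(rows, columns=['Country', 'LocFilter', 'State', 'Provence', 'Zip', 'Option'])
df['MainProvence'] = df['State'].fillna(df['Provence'])

# Assumption: size() stands in for ['Option'].count(); same counts on this data.
distribution = df.groupby(['Country', 'MainProvence', 'Zip', 'Option']).size()

# One row per (Country, MainProvence, Zip), one column per Option value.
wide = distribution.unstack(level='Option').to_dict(orient='index')

# Skip the nan cells and turn the remaining float counts back into ints.
nested = {
    (country, province, int(zip_code)): {
        int(option): int(count)
        for option, count in counts.items()
        if pd.notna(count)
    }
    for (country, province, zip_code), counts in wide.items()
}
print(nested)
```

The counts come back as floats because unstack fills the missing combinations with nan, so the cast to int is only safe after the nan cells have been filtered out.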

PS. Consider using:

df['MainProvence'] = df['State'].fillna(df['Provence'])

instead of

df['MainProvence'] = df.apply(lambda row: (row['Provence'] if row['LocFilter'] == 1 else row['State']), axis=1)
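A quick check on the sample data suggests the two forms produce the same column. This is only a sketch: `via_apply`, `via_fillna` and `same` are illustrative names. The forms would differ if a row had "State" filled in while "LocFilter" is 1, because `fillna` ignores "LocFilter" and only falls back to "Provence" where "State" is missing.

```python
import pandas as pd

rows = [
    ('us', 0, 'ca', None, 94107, -100),
    ('ca', 1, None, 'bc', 94107, -100),
    ('us', 0, 'ca', None, 94106, 0),
    ('us', 0, 'ca', None, 94107, 0),
    ('ca', 1, None, 'bc', 94107, 0),
    ('ca', 1, None, 'bc', 94107, 0),
    ('us', 0, 'ca', None, 94107, 100),
    ('us', 0, 'ca', None, 94107, 100),
]
df = pd.DataFrame(rows, columns=['Country', 'LocFilter', 'State', 'Provence', 'Zip', 'Option'])

# Row-wise version from the question: pick the column based on LocFilter.
via_apply = df.apply(
    lambda row: row['Provence'] if row['LocFilter'] == 1 else row['State'],
    axis=1,
)

# Vectorized version: take State, fall back to Provence where State is missing.
via_fillna = df['State'].fillna(df['Provence'])

same = via_apply.tolist() == via_fillna.tolist()
print(same)
```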

PPS. You will need pandas 0.17 or later for the orient kwarg to work in to_dict().

Solution 1
1 Accepted 2015-11-02 00:17:17