Pandas DataFrame: groupby and plot with two different columns

Question

I am a complete beginner in Python. Long story short, I want to group by one column, apply one function to one column, apply another function to another column, and plot the results (the first column on the x-axis, the second column on the y-axis).

I have a pandas DataFrame df which contains many columns. Two of them are tour_id and tour_distance.

tour_id    tour_distance    
      A               10
      A               10
      A               10
      A               10
      B               20
      B               20
      C               40
      C               40
      C               40
      C               40
      C               40
      :                :
      :                :

Since I assume that the longer tour_distance is, the more rows each tour_id has, I want to plot a histogram of tour_distance vs. the row count of each tour_id group.

Question 1: What is the simplest solution for this groupby and plot problem?

Question 2: How can I improve my failed attempt?

My attempt: I thought it would be easier to make a new data frame like this.

tour_id    tour_distance  row_counts
      A               10           4
      B               20           2
      C               40           5
      :                :           :

That way I could use matplotlib like this:

import matplotlib.pyplot as plt
x = df.tour_distance
y = df.row_counts
plt.bar(x,y)

However, I can't build this data frame.

df_tour_distance = df.groupby('tour_id').tour_distance.head(1)
df_tour_distance = pd.DataFrame(df_tour_distance)
df_size = df.groupby('tour_id').tour_distance.size()
df_size = pd.DataFrame(df_size)
df = pd.merge(df_size, df_tour_distance, on='tour_id')

>>> KeyError: 'tour_id'

This also failed:

g = df.groupby('tour_id')
result = g.agg({'Count':lambda x:x.size(), 
            'tour_distance_grouped':lambda x:x.head(1)})
result

>>> KeyError: 'Count'

Answer 1

The problem in your code is that once you group by tour_id, it becomes the index. You have to specify as_index=False or use reset_index() in order to use it as a column. Also, you do not need to compute a separate Series and then merge it back.

You need:

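import matplotlib.pyplot as plt
# Count rows per (tour_id, tour_distance) pair; reset_index() turns the group keys back into columns
g = df.groupby(['tour_id', 'tour_distance']).size().reset_index(name='count')
plt.bar(g['tour_id'], g['count'])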

Output: (bar chart of count per tour_id; image not preserved)
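To illustrate the point about the index, here is a minimal sketch. It assumes a small frame built from the sample rows in the question; the names sizes, counts, distances and summary are made up for this illustration. It shows that the group key ends up as the index after groupby, and that reset_index() or as_index=False turns it back into an ordinary column that can be merged on.

```python
import pandas as pd

# Toy data matching the sample rows in the question (an assumption for illustration)
df = pd.DataFrame({
    "tour_id": ["A"] * 4 + ["B"] * 2 + ["C"] * 5,
    "tour_distance": [10] * 4 + [20] * 2 + [40] * 5,
})

# After groupby(...).size(), tour_id is the index of the result, not a column
sizes = df.groupby("tour_id").size()
print(sizes.index.name)          # tour_id

# reset_index() moves tour_id back into a regular column
counts = sizes.reset_index(name="row_counts")
print(list(counts.columns))      # ['tour_id', 'row_counts']

# as_index=False keeps tour_id as a column from the start
distances = df.groupby("tour_id", as_index=False)["tour_distance"].first()
print(list(distances.columns))   # ['tour_id', 'tour_distance']

# Both frames now have a tour_id column, so merging on it works
summary = pd.merge(counts, distances, on="tour_id")
print(summary)
```

The merge is shown only to mirror the attempt in the question; as noted above, a single groupby is enough in practice.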

Answer 2

This could be implemented somewhat more simply:

import pandas as pd

tour_id = ['A']*4+['B']*2+['C']*5
tour_distance = [10]*4+[20]*2+[40]*5

df = pd.DataFrame({'tour_id': tour_id, 'tour_distance': tour_distance})
df = df.set_index('tour_id')

df2 = pd.DataFrame()
df2['tour_distance'] = df.groupby('tour_id')['tour_distance'].head(1)
df2['row_counts'] = df.groupby('tour_id').count()
print(df2)

Result:

         tour_distance  row_counts
tour_id                           
A                   10           4
B                   20           2
C                   40           5
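As an alternative sketch (not part of either answer), the same table can be built in one step with named aggregation, which assumes pandas 0.25 or later. Each keyword names an output column and points at an existing input column plus a function, which is the shape the agg attempt in the question was reaching for. The toy data and the bar width of 5 are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data matching the sample rows in the question (an assumption for illustration)
df = pd.DataFrame({
    "tour_id": ["A"] * 4 + ["B"] * 2 + ["C"] * 5,
    "tour_distance": [10] * 4 + [20] * 2 + [40] * 5,
})

# Named aggregation: output_column=(input_column, function)
summary = (
    df.groupby("tour_id")
      .agg(
          tour_distance=("tour_distance", "first"),  # one distance per tour
          row_counts=("tour_distance", "size"),      # number of rows per tour
      )
      .reset_index()                                  # tour_id back to a column
)
print(summary)

# tour_distance on the x-axis, row counts on the y-axis
fig, ax = plt.subplots()
bars = ax.bar(summary["tour_distance"], summary["row_counts"], width=5)
ax.set_xlabel("tour_distance")
ax.set_ylabel("row_counts")
```

Call plt.show() afterwards to display the figure. Taking "first" assumes every row of a tour carries the same tour_distance, as in the sample data.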


Answer 1: score 2, accepted, 2018-07-20 17:31:19
Answer 2: score 0, 2018-07-20 17:35:56
