Pandas group Excel data by column and Graph Scatter Plot With Mean
I have a collection of data that I am reading from several Excel files. I can easily read, merge and group the data with pandas. I have two columns of interest in the data: 'Product Type' and 'Test Duration'.
The dataframe containing the data read from the Excel files is called oData.
oDataGroupedByProductType = oData.groupby(['Product Type'])
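For reference, a minimal sketch of the merge-then-group step with the per-group mean. The two small frames are made-up stand-ins for whatever pd.read_excel returns for each file; only the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the frames returned by pd.read_excel(...) per file
frame_a = pd.DataFrame({
    "Product Type": ["A", "B", "A"],
    "Test Duration": [10.0, 20.0, 14.0],
})
frame_b = pd.DataFrame({
    "Product Type": ["B", "C"],
    "Test Duration": [30.0, 5.0],
})

# Merge the per-file frames into a single frame, like oData in the question
oData = pd.concat([frame_a, frame_b], ignore_index=True)

# Mean test duration for each product type
mean_duration = oData.groupby("Product Type")["Test Duration"].mean()
print(mean_duration)
```

The resulting Series is indexed by product type, so it can be reused later to draw one mean marker per category.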
I have used plotly to make a graph as follows, but plotly does not keep the data private, and if I want the data to be private I have to pay. Paying is not an option. How can I make the same graph with pandas and/or matplotlib, but also with the mean for each product type displayed?
As Bound says, you can do it in a few lines with stripplot (example from the seaborn documentation page).
import seaborn as sns
sns.set_style("whitegrid")
tips = sns.load_dataset("tips") # load some sample data
ax = sns.stripplot(x="day", y="total_bill", data=tips)
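The stripplot alone does not show the mean the question asks for. One possible way to add it, sketched here on made-up data (the column names "product" and "value" and the output file name are assumptions), is to compute the group means with pandas and draw them on the same axes. This relies on seaborn placing the categories at x = 0, 1, 2, ... in the order passed via order=.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to display a window
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up sample data standing in for the real measurements
df = pd.DataFrame({
    "product": ["a", "a", "a", "b", "b", "c", "c", "c"],
    "value": [5.0, 7.0, 9.0, 4.0, 8.0, 6.0, 6.5, 7.0],
})

# Fix the category order so points and means line up
order = sorted(df["product"].unique())
means = df.groupby("product")["value"].mean().loc[order]

sns.set_style("whitegrid")
ax = sns.stripplot(x="product", y="value", data=df, order=order)

# Overlay one diamond per category at its mean value
ax.scatter(list(range(len(order))), means.values,
           marker="D", color="black", s=60, zorder=3, label="mean")
ax.legend()
plt.savefig("stripplot_with_means.png")
```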
Suppose you have some dataframe:
In [4]: df.head(20)
Out[4]:
product value
0 c 5.155740
1 c 8.983128
2 c 5.150390
3 a 8.379866
4 c 8.094536
5 c 7.464706
6 b 3.690430
7 a 5.547448
8 a 7.709569
9 c 8.398026
10 a 7.317957
11 b 7.821332
12 b 8.815495
13 c 6.646533
14 c 8.239603
15 c 7.585408
16 a 7.946760
17 c 5.276864
18 c 8.793054
19 b 11.573413
You need to have a numeric value for the product to plot it, so, quick and dirty, just make a new column by mapping numeric values:
In [5]: product_map = {p:r for p,r in zip(df['product'].unique(), range(1, df.values.shape[0]+1))}
In [6]: product_map
Out[6]: {'a': 2, 'b': 3, 'c': 1}
Of course, there are many ways you could achieve this...
Now, make a new column:
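One of those other ways, as a small sketch: pd.factorize assigns integer codes in order of first appearance, so shifting them by one reproduces the same 1-based numbering as the mapping above. The short frame here is made up, using the first few products from the sample.

```python
import pandas as pd

# First few products from the sample frame, in the same order
df = pd.DataFrame({"product": ["c", "c", "c", "a", "c", "c", "b"]})

# factorize returns 0-based codes in order of first appearance
codes, uniques = pd.factorize(df["product"])
df["product_code"] = codes + 1  # shift to 1-based codes

print(df)
```

This avoids building the dictionary by hand and also hands back the labels (uniques), which are useful later for axis tick labels.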
In [8]: df['product_code'] = df['product'].map(product_map)
In [9]: df.head(20)
Out[9]:
product value product_code
0 c 5.155740 1
1 c 8.983128 1
2 c 5.150390 1
3 a 8.379866 2
4 c 8.094536 1
5 c 7.464706 1
6 b 3.690430 3
7 a 5.547448 2
8 a 7.709569 2
9 c 8.398026 1
10 a 7.317957 2
11 b 7.821332 3
12 b 8.815495 3
13 c 6.646533 1
14 c 8.239603 1
15 c 7.585408 1
16 a 7.946760 2
17 c 5.276864 1
18 c 8.793054 1
19 b 11.573413 3
Now, use the plot helper method in pandas, which is basically a wrapper around matplotlib:
In [10]: df.plot(kind='scatter', x = 'product_code', y = 'value')
Out[10]: <matplotlib.axes._subplots.AxesSubplot at 0x12235abe0>
And the output:
Clearly, this was quick and dirty, but it should get you on your way...
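To also show the mean for each product and put the product names back on the x axis, something along these lines might work with plain matplotlib. This is a sketch on made-up data; the marker style and file name are arbitrary choices, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to display a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample data with the same columns as the frame above
df = pd.DataFrame({
    "product": ["c", "c", "a", "b", "a", "b"],
    "value": [5.0, 8.0, 8.0, 4.0, 6.0, 11.0],
})

# Integer codes 0..n-1 give each product an x position (labels are sorted)
codes, labels = pd.factorize(df["product"], sort=True)
means = df.groupby("product")["value"].mean().loc[labels]

fig, ax = plt.subplots()
ax.scatter(codes, df["value"], alpha=0.6, label="measurements")

# One wide horizontal tick per product at its mean value
ax.scatter(list(range(len(labels))), means.values,
           marker="_", s=800, color="red", label="mean")

# Replace the numeric codes with the product names on the axis
ax.set_xticks(list(range(len(labels))))
ax.set_xticklabels(labels)
ax.set_xlabel("product")
ax.set_ylabel("value")
ax.legend()
fig.savefig("scatter_with_means.png")
```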
In case someone else has a very similar problem and wants to see the end results, I ended up using seaborn, as follows:
import seaborn as sns
import matplotlib.pyplot as plt
...
sns.set_style("whitegrid")
# column names as given in the question; seaborn looks them up in data
sns.boxplot(x='Product Type',
            y='Test Duration',
            data=oData)
plt.savefig('Test Duration vs. Product Type.png')
The graph came out as follows. For privacy reasons, I have blurred the product labels on the graph.
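A box plot marks the median, not the mean. If the mean is still wanted, one option (an assumption on my part, not what was used for the graph above) is to pass showmeans=True, which seaborn forwards to matplotlib's box plot drawing. The data below is made up; only the column names follow the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to display a window
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data with the column names from the question
oData = pd.DataFrame({
    "Product Type": ["A", "A", "A", "B", "B", "B"],
    "Test Duration": [10.0, 12.0, 17.0, 20.0, 21.0, 28.0],
})

sns.set_style("whitegrid")
# showmeans=True draws an extra marker at each group's mean
ax = sns.boxplot(x="Product Type", y="Test Duration", data=oData,
                 showmeans=True)
plt.savefig("boxplot_with_means.png")
```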