How to find the count of unique pair values (on different rows and columns) of a dataframe and visualize it in Python?
So, I have the following truncated sample dataset (sales data):
----------------------
Product Hour
PRODUCT_75 12
PRODUCT_75 11
PRODUCT_75 12
PRODUCT_75 12
PRODUCT_63 10
PRODUCT_63 5
PRODUCT_63 5
PRODUCT_12 1
PRODUCT_120 7
PRODUCT_120 5
PRODUCT_120 5
----------------------
Now, I need two things:
(a) A way to find the count of each unique (Product, Hour) pair and, from that, display the highest-selling product at a particular hour of the day. For example, PRODUCT_75 has a count of 3 for hour 12, so, supposing that is the highest-selling product at that hour, I have to return that product name. Similarly, I have to do this for all possible hours (from 0 to 23, which are present in my dataset). For that I need a tentative dataframe like:
--------------------------------
Product Hour Count
PRODUCT_75 12 3
PRODUCT_75 11 1
PRODUCT_75 12 3
PRODUCT_75 12 3
PRODUCT_63 10 2
PRODUCT_63 10 2
PRODUCT_63 5 2
PRODUCT_63 5 2
PRODUCT_12 1 1
PRODUCT_120 7 1
PRODUCT_120 5 3
PRODUCT_120 5 3
PRODUCT_120 5 3
--------------------------------
And, as explained above, display the product with the highest count at each hour of the day (from 0 to 23).
(b) Secondly, is there a way to visualize the distribution of these highest-selling products at other hours? For example, PRODUCT_123 is the highest-selling product at hour 5, so I need to visualize its distribution (how much it sold) in the other hours.
For the above dataset, I need output something like:
Max. Sold Products On A Hourly Basis:
---------------------------
Hour Product Count
1 PRODUCT_12 1
5 PRODUCT_120 3
7 PRODUCT_120 1
10 PRODUCT_63 2
11 PRODUCT_75 1
12 PRODUCT_75 3
---------------------------
Now, for part (a), I've already employed the following code:
res = reshaped.groupby(['Product', 'Hour']).size()
where reshaped is the data frame with these columns.
It does return the count of unique pair values, but I don't know how to proceed after this. I'd be grateful if anyone could guide me.
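One possible way to continue from the `size()` result, offered as a minimal sketch rather than a definitive solution: add the per-row Count column with `transform`, then keep, for each hour, the pair(s) whose count equals that hour's maximum. The sample rows below are an assumption copied from the 13-row tentative table in the question, and the names `counts`, `hourly_max` and `top_per_hour` are illustrative only.

```python
import pandas as pd

# Assumed sample data: the 13 rows of the tentative table in the question.
rows = [
    ("PRODUCT_75", 12), ("PRODUCT_75", 11), ("PRODUCT_75", 12), ("PRODUCT_75", 12),
    ("PRODUCT_63", 10), ("PRODUCT_63", 10), ("PRODUCT_63", 5), ("PRODUCT_63", 5),
    ("PRODUCT_12", 1),
    ("PRODUCT_120", 7), ("PRODUCT_120", 5), ("PRODUCT_120", 5), ("PRODUCT_120", 5),
]
reshaped = pd.DataFrame(rows, columns=["Product", "Hour"])

# Per-row count of each (Product, Hour) pair, as in the tentative table.
reshaped["Count"] = reshaped.groupby(["Product", "Hour"])["Product"].transform("size")

# One row per unique (Product, Hour) pair with its count.
counts = reshaped.groupby(["Product", "Hour"]).size().reset_index(name="Count")

# For each hour, keep the row(s) whose count equals that hour's maximum.
# Ties are kept, so an hour can list more than one product.
hourly_max = counts.groupby("Hour")["Count"].transform("max")
top_per_hour = (
    counts[counts["Count"] == hourly_max]
    .sort_values("Hour")
    .reset_index(drop=True)[["Hour", "Product", "Count"]]
)

print("Max. Sold Products On A Hourly Basis:")
print(top_per_hour.to_string(index=False))
```

Hours with no sales simply do not appear in `top_per_hour`; if a row for every hour from 0 to 23 is needed, the result would have to be reindexed separately.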
The following code provides histograms of the highest-selling products (there may be more than one):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sample data; np.array stores every value as a string, so Hour is cast back to int below
df = pd.DataFrame(np.array([['PRODUCT_75',12 ], ['PRODUCT_75',11], ['PRODUCT_75' ,12],['PRODUCT_75',12],['PRODUCT_63',10],['PRODUCT_63',10],['PRODUCT_63',5],['PRODUCT_63',5],['PRODUCT_12',1],['PRODUCT_120',7],['PRODUCT_120',5],['PRODUCT_120',5],['PRODUCT_120',5]]),
                  columns=['Product','Hour'])
df['Hour'] = df['Hour'].astype('int')

# Count the occurrences of each (Product, Hour) pair
res = df.groupby(['Product', 'Hour']).size().reset_index(name='count')

def histogram(df, product):
    # Histogram of the hours at which the given product was sold
    df[df['Product'] == product]['Hour'].hist()
    plt.suptitle(str(product))
    plt.show()

def highest_selling(res, hour):
    # Restrict to the given hour first, so the maximum is taken within that hour
    # rather than over the whole table
    at_hour = res[res['Hour'] == hour]
    return at_hour.loc[at_hour['count'] == at_hour['count'].max(), 'Product'].to_list()

highest_selling_product = highest_selling(res, 5)
for i in highest_selling_product:
    histogram(df, i)
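As an alternative to the histogram above for part (b), a bar chart with one bar per hour may be easier to read, because the default `hist()` bins do not line up with individual hours. The following is only a sketch under that assumption: the helper names `hourly_sales` and `plot_hourly_sales` are hypothetical, and the sample rows are copied from the answer's data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed sample data, same 13 rows as used in the answer.
rows = [
    ("PRODUCT_75", 12), ("PRODUCT_75", 11), ("PRODUCT_75", 12), ("PRODUCT_75", 12),
    ("PRODUCT_63", 10), ("PRODUCT_63", 10), ("PRODUCT_63", 5), ("PRODUCT_63", 5),
    ("PRODUCT_12", 1),
    ("PRODUCT_120", 7), ("PRODUCT_120", 5), ("PRODUCT_120", 5), ("PRODUCT_120", 5),
]
df = pd.DataFrame(rows, columns=["Product", "Hour"])

def hourly_sales(df, product):
    """Return the number of sales per hour (0-23) for one product, 0 where none."""
    hours = df.loc[df["Product"] == product, "Hour"]
    return hours.value_counts().reindex(range(24), fill_value=0)

def plot_hourly_sales(df, product):
    """Draw one bar per hour showing how often the product sold."""
    sales = hourly_sales(df, product)
    ax = sales.plot(kind="bar", title=f"{product}: sales by hour")
    ax.set_xlabel("Hour")
    ax.set_ylabel("Count")
    plt.tight_layout()
    plt.show()

# PRODUCT_120 is the top seller at hour 5 in the sample data.
sales_120 = hourly_sales(df, "PRODUCT_120")

if __name__ == "__main__":
    plot_hourly_sales(df, "PRODUCT_120")
```

Reindexing over `range(24)` keeps the x-axis identical for every product, so charts for different top sellers can be compared side by side.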