Pandas / Matplotlib bar plot with multi index dataframe

Question

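I have a sorted MultiIndex pandas DataFrame that I need to show as a bar plot. My DataFrame (linked image).

Either I haven't found the solution yet or a simple one doesn't exist, but I need to plot a bar chart of this data with Content and Category on the x-axis and Installs as the bar height.

Put simply, I need to show what each bar is made up of, e.g. Everyone accounts for 20%, Teen for 40%, and so on. I'm not sure this is even possible, because a mean of means wouldn't be valid when the sample sizes differ. That's why I created an Uploads column to compute it, but I haven't got as far as plotting the average.

I think plotting cumulative values would give wrong results.

I need a bar chart with Category as the x-ticks (preferably only the top 10), where each x-tick has one bar per Content value (not always 3; it may be just "Everyone" and "Teen") and the height of each bar is Installs.

Ideally it should look like this: bar chart (linked image)

But each bar should be split into the Content bars for that particular Category.

I tried flattening with DataFrame.unstack(), but it broke the DataFrame's sort order, so I used Cat2 = Cat1.reset_index(level = [0,1]) instead. I still need help with the plotting.

So far I have: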

Cat = Popular.groupby(["Category","Content"]).agg({"Installs": "sum", "Rating Count": "sum"})
Uploads = Popular[["Category","Content"]].value_counts().rename_axis(["Category","Content"]).reset_index(name = "Uploads")
Cat = pd.merge(Cat, Uploads, on = ["Category","Content"])
Cat = Cat.groupby(["Category","Content"]).agg({"Installs": "sum", "Rating Count": "sum", "Uploads": "sum"})

This gives the following:

Result (linked image)

Then I sorted it like this:

Cat1 = Cat.unstack() 
Cat1 = Cat1.sort_index(key = (Cat1["Installs"].sum(axis = 1)/Cat1["Uploads"].sum(axis = 1)).get, ascending = False).stack()

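thanks to one of these solutions (link).

That's all I have.

The dataset is from Kaggle and is over 600 MB. I don't expect anyone to download it, but I'd appreciate at least a simple pointer toward a solution.

P.S. This should also help me split each point in the scatter plot below in the same way, but if not, that's fine.

P.P.S. I don't have enough reputation to post images, so apologies for the links.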

Answer 1

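Edit: added code to compute the percentage of "Installs" within each "Category".

The dataset is big, but you should provide mock data so the example is easy to reproduce, like this: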

import pandas as pd
import numpy as np


categories = ["Productivity", "Arcade", "Business", "Social"]
contents = ["Everyone", "Matute", "Teen"]

index = pd.MultiIndex.from_product(
    [categories, contents], names=["Category", "Content"]
)
installs = np.random.randint(low=100, high=999, size=len(index))

df = pd.DataFrame({"Installs": installs}, index=index)

>>> df

                       Installs
Category     Content
Productivity Everyone       149
             Matute         564
             Teen           301
Arcade       Everyone       926
             Matute         542
             Teen           556
Business     Everyone       879
             Matute         921
             Teen           323
Social       Everyone       329
             Matute         320
             Teen           426

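If you want to compute the percentage of "Installs" within each "Category", use groupby().apply(). (The numbers in the printouts of this answer do not match each other, presumably because the unseeded random mock data was regenerated between runs.)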

>>> df["Installs (%)"] = (
...     df["Installs"]
...     .groupby(by="Category", group_keys=False)
...     .apply(lambda df: df / df.sum() * 100)
... )
>>> df

                       Installs  Installs (%)
Category     Content
Productivity Everyone       513     22.246314
             Matute         839     36.383348
             Teen           954     41.370338
Arcade       Everyone       122     10.581093
             Matute         519     45.013010
             Teen           512     44.405898
Business     Everyone       412     31.164902
             Matute         698     52.798790
             Teen           212     16.036309
Social       Everyone       874     52.555622
             Matute         326     19.603127
             Teen           463     27.841251
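As a side note on the percentage step, the same result can be obtained without a lambda by using groupby().transform("sum"). The following is a minimal sketch under assumptions: the mock frame, the seed, and the names rng, category_total, and share_sums are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Mock data in the same shape as the example above (values are random, seeded
# here only so the sketch is reproducible).
rng = np.random.default_rng(0)
index = pd.MultiIndex.from_product(
    [["Productivity", "Arcade"], ["Everyone", "Teen"]],
    names=["Category", "Content"],
)
df = pd.DataFrame(
    {"Installs": rng.integers(100, 999, size=len(index))}, index=index
)

# transform("sum") broadcasts each category's total back onto every row,
# so the division below is aligned row by row.
category_total = df.groupby(level="Category")["Installs"].transform("sum")
df["Installs (%)"] = df["Installs"] / category_total * 100

# Sanity check: the shares inside each category add up to 100.
share_sums = df.groupby(level="Category")["Installs (%)"].sum()
print(df)
```

Because transform returns a result with the same index as the input, no group_keys handling is needed.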

Then you only need to call .unstack() once:

>>> df = df.unstack()
>>> df

             Installs             Installs (%)
Content      Everyone Matute Teen     Everyone     Matute       Teen
Category
Arcade            499    904  645    24.365234  44.140625  31.494141
Business          856    819  438    40.511122  38.760057  20.728822
Productivity      705    815  657    32.384015  37.436840  30.179146
Social            416    482  238    36.619718  42.429577  20.950704
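Note that the unstacked frame above lists the categories alphabetically, which is the loss of ordering the question mentions. One possible way to restore a custom order is to rank the rows after unstacking and reindex with .loc. This is a sketch on hypothetical data; ranking by total installs is an assumption (the question ranks by installs per upload), and the names wide, order, and top are illustrative.

```python
import pandas as pd

# Hypothetical mock data; Productivity is deliberately listed first.
index = pd.MultiIndex.from_product(
    [["Productivity", "Arcade", "Business"], ["Everyone", "Teen"]],
    names=["Category", "Content"],
)
df = pd.DataFrame({"Installs": [500, 400, 100, 50, 300, 250]}, index=index)

# One row per Category, one column per Content.
wide = df["Installs"].unstack()

# Rank categories by their total installs, largest first, and reorder rows.
order = wide.sum(axis=1).sort_values(ascending=False).index
wide = wide.loc[order]

# Keep only the leading categories (10 in the question; fewer exist here).
top = wide.head(10)
print(top)
```

Any other per-category score (for example installs divided by uploads) can be substituted for the row sum when building order.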

Then make a bar plot of the column(s) you want:

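import matplotlib.pyplot as plt

fig, (ax, ax_percent) = plt.subplots(ncols=2, figsize=(14, 5))

# Left panel: absolute installs, one bar per Content within each Category
df["Installs"].plot(kind="bar", rot=0, ax=ax)
ax.set_ylabel("Installs")

# Right panel: share of each Content within its Category
df["Installs (%)"].plot(kind="bar", rot=0, ax=ax_percent)
ax_percent.set_ylabel("Installs (%)")
ax_percent.set_ylim([0, 100])

plt.show()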
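The question notes that a category does not always have all content ratings. A minimal sketch of how that case could be handled is shown below, assuming hypothetical long-format data (long_df and its values are made up for illustration): unstack(fill_value=0) turns absent combinations into zero-height segments, and the rows are then converted to percentages and drawn as stacked bars.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical long-format data: "Social" has no "Teen" row.
long_df = pd.DataFrame(
    {
        "Category": ["Arcade", "Arcade", "Social"],
        "Content": ["Everyone", "Teen", "Everyone"],
        "Installs": [300, 100, 250],
    }
)

# Absent Category/Content combinations become 0 instead of NaN.
wide = (
    long_df.set_index(["Category", "Content"])["Installs"]
    .unstack(fill_value=0)
)

# Convert each row to percentages of that category's total installs.
percent = wide.div(wide.sum(axis=1), axis=0) * 100

# One stacked bar per Category, one segment per Content.
fig, ax = plt.subplots(figsize=(7, 5))
percent.plot(kind="bar", stacked=True, rot=0, ax=ax)
ax.set_ylabel("Installs (%)")
ax.set_ylim(0, 100)
```

Call plt.show() afterwards to display the figure in an interactive session. Stacking percentages makes every bar reach 100, so the composition of each category is directly comparable.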

Answer 2

ChatGPT answered my question:

import pandas as pd
import matplotlib.pyplot as plt

# create a dictionary of data for the DataFrame
data = {
    'app_name': ['Google Maps', 'Uber', 'Waze', 'Spotify', 'Pandora'],
    'category': ['Navigation', 'Transportation', 'Navigation', 'Music', 'Music'],
    'rating': [4.5, 4.0, 4.5, 4.5, 4.0],
    'reviews': [1000000, 50000, 100000, 500000, 250000]
}

# create the DataFrame
df = pd.DataFrame(data)

# set the 'app_name' and 'category' columns as the index
df = df.set_index(['app_name', 'category'])

# add a new column called "content_rating" to the DataFrame, and assign a content rating to each app
df['content_rating'] = ['Everyone', 'Teen', 'Everyone', 'Everyone', 'Teen']

# Grouping the Data by category and content_rating and getting the mean of reviews
df_grouped = df.groupby(['category','content_rating']).agg({'reviews':'mean'})

# Reset the index to make it easier to plot
df_grouped = df_grouped.reset_index()

# Plotting the stacked bar chart
df_grouped.pivot(index='category', columns='content_rating', values='reviews').plot(kind='bar', stacked=True)

That uses a sample dataset.

What I did with my own data was add a sum column and sort by that sum.

piv = qw1.reset_index()
piv = piv.pivot_table(index='Category', columns='Content', values='per')#.plot(kind='bar', stacked = True)
piv["Sum"] = piv.sum(axis=1)
piv_10 = piv.sort_values(by = "Sum", ascending = False)[["Adult", "Everyone", "Mature", "Teen"]].head(10)

where qw1 is the multi-index DataFrame.

Then all that's left is to plot it:

piv_10.plot.bar(stacked = True, logy = False)
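A possible variation on the top-10 selection, sketched here on a hypothetical stand-in for piv, is to pick the rows with Series.nlargest. This avoids adding a helper "Sum" column and hardcoding the content column names; the data and the names n, top, and piv_top are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the pivoted table `piv`
# (rows: Category, columns: Content).
piv = pd.DataFrame(
    {"Everyone": [60.0, 20.0, 50.0], "Teen": [30.0, 10.0, 45.0]},
    index=pd.Index(["Arcade", "Business", "Social"], name="Category"),
)

n = 2  # the answer above keeps 10 categories

# Categories with the largest row totals, in descending order.
top = piv.sum(axis=1).nlargest(n).index
piv_top = piv.loc[top]
print(piv_top)
```

piv_top can then be passed to .plot.bar(stacked=True) in the same way as piv_10.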
