Combine multiple columns into a unique identifier to separate plot data

Question

I have a Pandas df of roughly 1000 tweet IDs with their lifetimes in seconds (the lifetime is the time between the first and the last retweet). Below is the head of a subset of my df:

tweet_id	lifetime(timedelta)	lifetime(hours)	type1	type2	type3	type4
329664	0 days 05:27:22	5.456111	1	0	0	0
722624	0 days 12:43:43	12.728611	1	1	0	0
866498	2 days 09:00:28	57.007778	0	1	1	0
156801	0 days 03:01:29	3.024722	1	0	0	0
941440	0 days 06:39:58	6.666111	0	1	1	1

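Note 1: The tweet lifetime is shown in two columns (the columns have different dtypes):

The lifetime(timedelta) column shows the tweet lifetime in timedelta64[ns] format,
The lifetime(hours) column shows the tweet lifetime in hours (float64 type). I created this second column by extracting the hours from the lifetime(timedelta) column with: df['lifetime_hours'] = df['lifetime(timedelta)'] / np.timedelta64(1, 'h')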
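
A minimal sketch of that conversion, assuming a timedelta64[ns] column. The two-row `sample` frame is hypothetical and only reuses values from the table above; it shows that dividing by a one-hour timedelta and using `.dt.total_seconds()` give the same float hours.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row sample; the values are copied from the table above.
sample = pd.DataFrame({
    'lifetime(timedelta)': pd.to_timedelta(['0 days 05:27:22', '2 days 09:00:28'])
})

# Dividing a timedelta64[ns] column by a one-hour timedelta yields float hours.
hours_np = sample['lifetime(timedelta)'] / np.timedelta64(1, 'h')

# Equivalent route: total seconds divided by the number of seconds in an hour.
hours_pd = sample['lifetime(timedelta)'].dt.total_seconds() / 3600

print(hours_np.round(6).tolist())
```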

Note 2: A tweet can belong to more than one type. For example, tweet id 329664 is only type1, while tweet id 722624 is both type1 and type2.

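I want to plot the distribution of tweet lifetimes for the different tweet types. I can plot the lifetime distribution for all tweets as follows:

[Figure: bar plot of the tweet lifetime distribution]

[Figure: second plot of the tweet lifetime distribution]

This is how I created the plots above (the bar plot, for example):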

bins = range(0, df['lifetime_hours'].max().astype(int), 3) 
data = pd.cut(df['lifetime_hours'], bins, include_lowest=True)

import matplotlib.pyplot as plt
plt.figure(figsize=(20,4))

data.value_counts().sort_index().plot(kind='bar')

plt.xlabel('Tweets Lifetime(hours)')
plt.ylabel('Number of Tweets Active')
plt.title('Distribution of Tweets lifetime')

My question is: how can I plot the lifetime distributions of the different tweet types in a single plot?

Can anyone help me with this?

Answer 1

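To separate the data by type, there should be an identifier column.
- This can be created by multiplying the 0 and 1 column values by the type column names, and then joining the column values into a single string as a new column.
Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1, seaborn 0.11.2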

Imports and DataFrame

import pandas as pd
import numpy as np
import seaborn as sns

# start data
data = {'tweet_id': [329664, 722624, 866498, 156801, 941440],
        'lifetime(timedelta)': [pd.Timedelta('0 days 05:27:22'), pd.Timedelta('0 days 12:43:43'), pd.Timedelta('2 days 09:00:28'),
                                pd.Timedelta('0 days 03:01:29'), pd.Timedelta('0 days 06:39:58')],
        'type1': [1, 1, 0, 1, 0], 'type2': [0, 1, 1, 0, 1], 'type3': [0, 0, 1, 0, 1], 'type4': [0, 0, 0, 0, 1]}
df = pd.DataFrame(data)

# insert hours columns
df.insert(loc=2, column='lifetime(hours)', value=df['lifetime(timedelta)'].div(pd.Timedelta('1 hour')))

# there can be 15 combinations of types for the 4 type columns
# it's best to rename the columns for ease of use
# rename the type columns; can also use df.rename(...)
cols = ['T1', 'T2', 'T3', 'T4']
df.columns = df.columns[:3].tolist() + cols

# create a new column as a unique identifier for types
types = df[cols].mul(cols).replace('', np.nan).dropna(how='all')
df['Types'] = types.apply(lambda row: ' '.join(row.dropna()), axis=1)

# create a column for the bins
bins = range(0, df['lifetime(hours)'].astype(int).add(4).max(), 3) 
df['Tweets Liftime(hours)'] = pd.cut(df['lifetime(hours)'], bins, include_lowest=True)

# display(df)
   tweet_id lifetime(timedelta)  lifetime(hours)  T1  T2  T3  T4     Types Tweets Liftime(hours)
0    329664     0 days 05:27:22         5.456111   1   0   0   0        T1            (3.0, 6.0]
1    722624     0 days 12:43:43        12.728611   1   1   0   0     T1 T2          (12.0, 15.0]
2    866498     2 days 09:00:28        57.007778   0   1   1   0     T2 T3          (57.0, 60.0]
3    156801     0 days 03:01:29         3.024722   1   0   0   0        T1            (3.0, 6.0]
4    941440     0 days 06:39:58         6.666111   0   1   1   1  T2 T3 T4            (6.0, 9.0]
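
As a possible alternative to multiplying the flags by the column names, the identifier can also be built from a boolean mask. This is only a sketch under the assumption that the type columns hold 0/1 flags; the `flags` frame is a hypothetical sample, not the answer's df.

```python
import pandas as pd

# Hypothetical sample with the same 0/1 indicator layout as the T1..T4 columns.
cols = ['T1', 'T2', 'T3', 'T4']
flags = pd.DataFrame([[1, 0, 0, 0], [1, 1, 0, 0], [0, 1, 1, 1]], columns=cols)

# For each row, keep the names of the columns whose flag is 1 and join them with spaces.
flags['Types'] = (
    flags[cols]
    .astype(bool)
    .apply(lambda row: ' '.join(row[row].index), axis=1)
)

print(flags['Types'].tolist())
```

A row with no type set ends up with an empty string here, so such rows may need to be filtered out before plotting.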

Create a frequency table

ct = pd.crosstab(df['Tweets Liftime(hours)'], df['Types'])

# display(ct)
Types                  T1  T1 T2  T2 T3  T2 T3 T4
Tweets Liftime(hours)                            
(3.0, 6.0]              2      0      0         0
(6.0, 9.0]              0      0      0         1
(12.0, 15.0]            0      1      0         0
(57.0, 60.0]            0      0      1         0
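
The table above counts each combination of types separately. If a tweet should instead be counted once for every type it belongs to, one option is to reshape to long format first. The sketch below assumes that goal, which the question does not state explicitly; `sample`, `bin`, `Type` and `flag` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical sample: a bin label per tweet plus 0/1 type flags.
cols = ['T1', 'T2', 'T3', 'T4']
sample = pd.DataFrame({
    'bin': ['(3, 6]', '(12, 15]', '(6, 9]'],
    'T1': [1, 1, 0], 'T2': [0, 1, 1], 'T3': [0, 0, 1], 'T4': [0, 0, 1],
})

# Long format: one row per (tweet, type) pair, keeping only pairs whose flag is 1.
long_df = sample.melt(id_vars='bin', value_vars=cols, var_name='Type', value_name='flag')
long_df = long_df[long_df['flag'].eq(1)]

# Frequency table with one column per single type rather than per combination.
ct_by_type = pd.crosstab(long_df['bin'], long_df['Type'])
print(ct_by_type)
```

With this layout a tweet tagged T1 and T2 contributes to both columns, so the column totals can exceed the number of tweets.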

Plot

`pandas.DataFrame.plot`

Uses ct

ax = ct.plot(kind='bar', figsize=(20, 5), width=0.1, rot=0)
ax.set(ylabel='Number of Tweets Active', title='Distribution of Tweets Lifetime')
ax.legend(title='Types', bbox_to_anchor=(1, 1), loc='upper left')

`seaborn.catplot`

Uses df without reshaping

p = sns.catplot(kind='count', data=df, x='Tweets Liftime(hours)', height=4, aspect=4, hue='Types')
p.set_xticklabels(rotation=45)
p.fig.subplots_adjust(top=0.9)
p.fig.suptitle('Distribution of Tweets Lifetime')
p.axes[0, 0].set_ylabel('Number of Tweets Active')

Answer 1 accepted 2022-07-07 05:09:53