
Python Pandas - Group by, then plot by category

Very easy pandas question, I'm a beginner.

I have a dataframe 'df' with (for example):

import pandas as pd
df = pd.DataFrame({'time': ['2019-04-23 10:21:00', '2019-04-23 11:14:00', '2019-04-24 11:30'], 
                   'category': ['A', 'B', 'A'],
                   'text': ['njrnfrjn','fmrjfmrfmr','mjrnfjrnmi']})

I just want to:

  • Group by category and dates (daily)
  • Count the number of text messages by category and day
  • Plot all timeseries across days (one timeseries for each category in the same plot)

Thanks

You can try the following:

df.groupby([df.time.dt.floor('d'), "category"]).size().unstack().plot()

Explanations:

  • The first step is to group, as you mentioned. To do this, we use groupby.
  • Inside the groupby, because the times need to be grouped by day, one solution is to use dt.floor on the time column. We pass the argument "d" for days.

    • Also, for the .dt accessor to be available, the time column must have a datetime dtype. If it does not (as in your example, where the times are strings), convert it first with df['time'] = pd.to_datetime(df.time).
  • Once the groups are defined, the size method counts the rows in each group.

  • The next step is to turn the category level (at this point part of the index) into columns. Because we grouped by two keys, the result has a two-level index, so we can use unstack.

  • Finally, call plot on the dataframe. Because the dataframe is well structured, it works without any arguments: one line is drawn for each column, and the index (time) is used as the x-axis.
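
As a minimal sketch, here are the steps above applied to the small dataframe from the question, where time is stored as strings and has to be converted first. Two things are my own assumptions rather than part of the one-liner above: the last timestamp is written with seconds so that all strings share one format, and unstack is given fill_value=0 so that a day with no messages in a category shows 0 instead of NaN. The name counts is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    # seconds added to the last value so every string has the same format
    'time': ['2019-04-23 10:21:00', '2019-04-23 11:14:00', '2019-04-24 11:30:00'],
    'category': ['A', 'B', 'A'],
    'text': ['njrnfrjn', 'fmrjfmrfmr', 'mjrnfjrnmi'],
})

# The time column holds strings, so convert it before using the .dt accessor
df['time'] = pd.to_datetime(df['time'])

counts = (
    df.groupby([df['time'].dt.floor('d'), 'category'])  # group by day and category
      .size()                                           # number of rows per group
      .unstack(fill_value=0)                            # categories become columns
)
print(counts)
```

Calling counts.plot() on this result should then draw one line per category, as described in the last bullet.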


Full code + illustration :

# import modules 
import pandas as pd
import matplotlib.pyplot as plt
# (here random is just for creating dummy data)
from random import randint, choice

# Create dummy data
size = 1000
df = pd.DataFrame({
    'time': pd.to_datetime(["2020/01/{} {}:{}".format(randint(1, 31), randint(0,23), randint(0,59)) for _ in range(size)]),
    'text': ['blablabla...' for _ in range(size)],
    'category': [choice(["A", "B", "C"]) for _ in range(size)]
})
print(df)
#                    time          text category
# 0   2020-01-30 23:15:00  blablabla...        C
# 1   2020-01-16 07:06:00  blablabla...        A
# 2   2020-01-03 18:47:00  blablabla...        A
# 3   2020-01-21 15:45:00  blablabla...        A
# 4   2020-01-10 04:11:00  blablabla...        C
# ..                  ...           ...      ...
# 995 2020-01-12 03:03:00  blablabla...        C
# 996 2020-01-08 10:35:00  blablabla...        B
# 997 2020-01-24 20:51:00  blablabla...        C
# 998 2020-01-05 07:39:00  blablabla...        A
# 999 2020-01-26 16:54:00  blablabla...        A

# See size result
print(df.groupby([df.time.dt.floor('d'), "category"]).size())
# time        category
# 2020-01-01  A            6
#             B           18
#             C            7
# 2020-01-02  A           10
#             B            8
#                         ..
# 2020-01-30  B           16
#             C           11
# 2020-01-31  A           14
#             B           17
#             C           11

# See unstack result
print(df.groupby([df.time.dt.floor('d'), "category"]).size().unstack())
# category     A   B   C
# time
# 2020-01-01   6  18   7
# 2020-01-02  10   8  13
# 2020-01-03  11  11  16
# 2020-01-04   9   5  10
# 2020-01-05  13   9  13
# 2020-01-06  11  11  12
# 2020-01-07  13   7   9
# 2020-01-08   5  16  13
# 2020-01-09  15   6  14
# 2020-01-10  10  11   9
# 2020-01-11   7  16  13
# 2020-01-12  12  13  13
# 2020-01-13  12   5   7
# 2020-01-14  11  10  11
# 2020-01-15  13  14  11
# 2020-01-16   9   8  13
# 2020-01-17   8   9   6
# 2020-01-18  12   5  11
# 2020-01-19   7   8  13
# 2020-01-20  12   9   9
# 2020-01-21   9  13  13
# 2020-01-22  14  11  19
# 2020-01-23  14   6  12
# 2020-01-24   7   8   6
# 2020-01-25  10  12  10
# 2020-01-26   8  12   7
# 2020-01-27  18  11   7
# 2020-01-28  15  10   9
# 2020-01-29  12   7  11
# 2020-01-30  12  16  11
# 2020-01-31  14  17  11

# Perform plot
df.groupby([df.time.dt.floor('d'), "category"]).size().unstack().plot()
plt.show()

Output:

[image not preserved: line plot of the daily counts, one line per category]
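
For comparison, the same day-by-category table can probably also be built with pd.crosstab. This is an alternative I am suggesting, not something from the answer above, and the small dataframe and the names day, via_groupby and via_crosstab are made up for illustration. The sketch builds the table both ways so the two results can be checked against each other.

```python
import pandas as pd

# Small hand-made dataset (hypothetical, only for this comparison)
df = pd.DataFrame({
    'time': pd.to_datetime(['2020-01-01 08:00', '2020-01-01 09:30',
                            '2020-01-02 10:15', '2020-01-02 23:59']),
    'category': ['A', 'B', 'A', 'A'],
    'text': ['blablabla...'] * 4,
})

# Truncate each timestamp to its day
day = df['time'].dt.floor('d')

# Approach from the answer, with missing combinations filled with 0
via_groupby = df.groupby([day, 'category']).size().unstack(fill_value=0)

# Alternative: crosstab counts co-occurrences of day and category directly
via_crosstab = pd.crosstab(day, df['category'])

print(via_groupby)
print(via_crosstab)
```

Either table can be passed to .plot() in the same way; crosstab fills missing day/category combinations with 0 on its own.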
