Python & Pandas - Group by day and count for each day

I am new to pandas and can't work out how to arrange my time series. Here is a sample of it:

date & time of connection
19/06/2017 12:39
19/06/2017 12:40
19/06/2017 13:11
20/06/2017 12:02
20/06/2017 12:04
21/06/2017 09:32
21/06/2017 18:23
21/06/2017 18:51
21/06/2017 19:08
21/06/2017 19:50
22/06/2017 13:22
22/06/2017 13:41
22/06/2017 18:01
23/06/2017 16:18
23/06/2017 17:00
23/06/2017 19:25
23/06/2017 20:58
23/06/2017 21:03
23/06/2017 21:05

This is a sample from a dataset of 130k rows. I tried: df.groupby('date & time of connection')['date & time of connection'].apply(list)

That is not enough, I guess.

I think I should:

  • Create a dictionary with an index from dd/mm/yyyy to dd/mm/yyyy
  • Convert "date & time of connection" from datetime to date
  • Group by the date of "date & time of connection" and count
  • Put the counts into the dictionary?

What do you think of my logic? Do you know of any tutorials? Thank you very much.

You can use dt.floor to convert the timestamps to dates, and then value_counts or groupby with size:

df = (pd.to_datetime(df['date & time of connection'], dayfirst=True)
       .dt.floor('d')
       .value_counts()
       .rename_axis('date')
       .reset_index(name='count'))
print (df)
        date  count
0 2017-06-23      6
1 2017-06-21      5
2 2017-06-19      3
3 2017-06-22      3
4 2017-06-20      2

Or:

s = pd.to_datetime(df['date & time of connection'], dayfirst=True)
df = s.groupby(s.dt.floor('d')).size().reset_index(name='count')
print (df)
  date & time of connection  count
0                2017-06-19      3
1                2017-06-20      2
2                2017-06-21      5
3                2017-06-22      3
4                2017-06-23      6
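For reference, here is a minimal self-contained sketch of the same idea, run on the sample rows from the question. It is not part of the original answer: the explicit `format` string is an assumption based on the dd/mm/yyyy layout shown in the question, and the final dictionary is only there because the question asked for one.

```python
import pandas as pd

# Sample rows copied from the question (day/month/year order).
raw = [
    "19/06/2017 12:39", "19/06/2017 12:40", "19/06/2017 13:11",
    "20/06/2017 12:02", "20/06/2017 12:04",
    "21/06/2017 09:32", "21/06/2017 18:23", "21/06/2017 18:51",
    "21/06/2017 19:08", "21/06/2017 19:50",
    "22/06/2017 13:22", "22/06/2017 13:41", "22/06/2017 18:01",
    "23/06/2017 16:18", "23/06/2017 17:00", "23/06/2017 19:25",
    "23/06/2017 20:58", "23/06/2017 21:03", "23/06/2017 21:05",
]
df = pd.DataFrame({"date & time of connection": raw})

# An explicit format avoids day/month ambiguity (e.g. 01/06/2017).
s = pd.to_datetime(df["date & time of connection"], format="%d/%m/%Y %H:%M")

# Truncate each timestamp to midnight, then count the rows per day.
per_day = s.groupby(s.dt.floor("D")).size()

# Plain dict keyed by ISO date string, as sketched in the question.
counts = {day.strftime("%Y-%m-%d"): int(n) for day, n in per_day.items()}
print(counts)
```

The dictionary step is optional; for further analysis the `per_day` Series is usually more convenient than a plain dict.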

Timings:

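import numpy as np
import pandas as pd

np.random.seed(1542)

# Build a test column of dd/mm/YYYY strings by dropping random rows
# from a regular 37-minute range ('min' is the current alias for 'T').
N = 220000
a = np.unique(np.random.randint(N, size=int(N/2)))
df = pd.DataFrame(pd.date_range('2000-01-01', freq='37min', periods=N)).drop(a)
df.columns = ['date & time of connection']
df['date & time of connection'] = df['date & time of connection'].dt.strftime('%d/%m/%Y %H:%M:%S')
print(df.head())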

In [193]: %%timeit
     ...: df['date & time of connection']=pd.to_datetime(df['date & time of connection'])
     ...: df1 = df.groupby(by=df['date & time of connection'].dt.date).count()
     ...: 
539 ms ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [194]: %%timeit
     ...: df1 = (pd.to_datetime(df['date & time of connection'])
     ...:        .dt.floor('d')
     ...:        .value_counts()
     ...:        .rename_axis('date')
     ...:        .reset_index(name='count'))
     ...: 
12.4 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [195]: %%timeit
     ...: s = pd.to_datetime(df['date & time of connection'])
     ...: df2 = s.groupby(s.dt.floor('d')).size().reset_index(name='count')
     ...: 
17.7 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

First, make sure your column is in datetime format (the sample dates are day-first, hence dayfirst=True):

df['date & time of connection']=pd.to_datetime(df['date & time of connection'], dayfirst=True)

Then you can group the data by date and do a count:

df.groupby(by=df['date & time of connection'].dt.date).count()
Out[10]: 
                           date & time of connection
date & time of connection                           
2017-06-19                                         3
2017-06-20                                         2
2017-06-21                                         5
2017-06-22                                         3
2017-06-23                                         6
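One difference between this grouping key and dt.floor is worth knowing: `.dt.date` produces Python `datetime.date` objects in an object-dtype Series, while `.dt.floor('D')` keeps a datetime64 dtype. That may be part of the reason for the timing gap reported in the other answer, though this sketch does not measure it. The three sample rows below are taken from the question; the explicit `format` is an assumption.

```python
import pandas as pd

s = pd.to_datetime(
    pd.Series(["19/06/2017 12:39", "19/06/2017 13:11", "20/06/2017 12:02"]),
    format="%d/%m/%Y %H:%M",
)

as_date = s.dt.date         # Python datetime.date objects, object dtype
as_floor = s.dt.floor("D")  # still a datetime64 dtype, time set to 00:00

print(as_date.dtype)
print(as_floor.dtype)

# Both keys produce the same per-day counts.
by_date = s.groupby(as_date).size()
by_floor = s.groupby(as_floor).size()
print(by_date.tolist(), by_floor.tolist())
```

Keeping a datetime64 key also means the result can be resampled or sliced by date later without converting again.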

Hey, I found an easy way to do this with resample.

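# Set the date column as the index (it must already be datetime dtype).
df = df.set_index('your_date_column')

# Count rows per calendar day. Once the column is the index it is no longer
# available as df.your_date_column, so count on the resampler itself.
df_counts = df.resample('D').size()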

Your column name is long and contains spaces, though, which makes me cringe a little. I would use dashes instead of spaces.
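A small runnable sketch of the resample approach follows. The column names `connected_at` and `user` and the three rows are hypothetical, chosen so that one day (2017-06-20) has no rows. It passes the column via `on=` instead of setting the index, which is an alternative to the snippet above, not the original author's code.

```python
import pandas as pd

# Hypothetical sample with no rows on 2017-06-20.
df = pd.DataFrame({
    "connected_at": pd.to_datetime(
        ["19/06/2017 12:39", "19/06/2017 12:40", "21/06/2017 09:32"],
        format="%d/%m/%Y %H:%M",
    ),
    "user": ["a", "b", "a"],
})

# resample needs a datetime-like index, or a column named via `on=`.
per_day = df.resample("D", on="connected_at").size()
print(per_day)

# groupby only reports days that actually occur in the data.
grouped = df.groupby(df["connected_at"].dt.floor("D")).size()
print(len(per_day), len(grouped))
```

Unlike groupby, resample builds a continuous daily range, so days without any connection appear with a count of 0.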
