How to Aggregate and Sum in Python (or R) with a Specific Condition
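Objective
I have a dataset, df. I would like to group it by the length column, sum the durations in each group, and display the end time associated with each group: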
length  start                   end                     duration
6330    12/17/2019 10:34:23 AM  12/17/2019 10:34:31 AM  8
57770   12/19/2019 5:19:56 PM   12/17/2019 5:24:19 PM   263
6330    12/17/2019 10:34:54 AM  12/17/2019 10:35:00 AM  6
6330    12/18/2019 4:36:44 PM   12/18/2019 4:37:13 PM   29
57770   12/19/2019 5:24:47 PM   12/19/2019 5:26:44 PM   117
Desired output
length  end                    total Duration
6330    12/18/2019 4:37:13 PM  43
57770   12/19/2019 5:26:44 PM  380
dput output:
structure(list(length = c(6330L, 57770L, 6330L, 6330L, 57770L
), start = structure(c(1L, 4L, 2L, 3L, 5L), .Label = c("12/17/2019 10:34:23 AM",
"12/17/2019 10:34:54 AM", "12/18/2019 4:36:44 PM", "12/19/2019 5:19:56 PM",
"12/19/2019 5:24:47 PM"), class = "factor"), end = structure(c(1L,
3L, 2L, 4L, 5L), .Label = c("12/17/2019 10:34:31 AM", "12/17/2019 10:35:00 AM",
"12/17/2019 5:24:19 PM", "12/18/2019 4:37:13 PM", "12/19/2019 5:26:44 PM"
), class = "factor"), duration = c(8L, 263L, 6L, 29L, 117L)), class = "data.frame", row.names = c(NA,
-5L))
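I have made an attempt, but how do I also display the end value associated with the latest entry for each length? For example, length 6330 has 3 end values, each with a duration attached: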
12/17/2019 10:34:31 AM 8
12/17/2019 10:35:00 AM 6
12/18/2019 4:37:13 PM 29
12/18/2019 4:37:13 PM is the latest end time, so I would like to output the end time,
along with the sum of durations for this particular length value.
Desired output
length  end                    total Duration
6330    12/18/2019 4:37:13 PM  43
57770   12/19/2019 5:26:44 PM  380
This is what I have tried:
import pandas as pd
import numpy as np
df1 = df.groupby('length')['duration'].sum()
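However, this outputs only the length and the total duration. How would I output the length, the latest end time for that length, and the total duration?
Any help is appreciated.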
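In R, we can group by 'length', then use summarise to get the sum of 'duration' and to extract the 'end' element at the position of the max, after converting 'end' to a DateTime class with mdy_hms (from lubridate):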
library(dplyr)
library(lubridate)
df %>%
  group_by(length) %>%
  summarise(duration = sum(duration), end = end[which.max(mdy_hms(end))])
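For readers working in Python, the same which.max idea can be roughly sketched in pandas: parse 'end' only to find the latest row per group, then return the original string from that row. This is an illustrative sketch, not part of the answer above; the names end_parsed, latest_idx and result are made up for the example, and the explicit format string is an assumption based on the sample data.

```python
import pandas as pd

# Sample data copied from the question; 'end' is kept as plain strings.
df = pd.DataFrame({
    "length": [6330, 57770, 6330, 6330, 57770],
    "end": [
        "12/17/2019 10:34:31 AM",
        "12/17/2019 5:24:19 PM",
        "12/17/2019 10:35:00 AM",
        "12/18/2019 4:37:13 PM",
        "12/19/2019 5:26:44 PM",
    ],
    "duration": [8, 263, 6, 29, 117],
})

# Parse 'end' only for comparison; the original strings stay untouched.
# The format string is an assumption that matches the sample rows.
end_parsed = pd.to_datetime(df["end"], format="%m/%d/%Y %I:%M:%S %p")

# Row label of the latest end time within each length group (like which.max).
latest_idx = end_parsed.groupby(df["length"]).idxmax()

# Both pieces are ordered by the sorted group keys, so they line up.
result = pd.DataFrame({
    "duration": df.groupby("length")["duration"].sum(),
    "end": df.loc[latest_idx, "end"].to_numpy(),
}).reset_index()
print(result)
```

Unlike taking the max of a converted column, this keeps 'end' in its original text format in the result.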
In pandas, we can use GroupBy.agg to achieve this, and here we have two options:
df.groupby('length').agg({'duration': 'sum', 'end': 'max'}).reset_index()
   length  duration                 end
0    6330        43 2019-12-18 16:37:13
1   57770       380 2019-12-19 17:26:44
Or, new since pandas 0.25.0, with named aggregation:
df.groupby('length').agg(
end=('end', 'max'),
total_duration=('duration', 'sum')
).reset_index()
   length                 end  total_duration
0    6330 2019-12-18 16:37:13              43
1   57770 2019-12-19 17:26:44             380
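To try this end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question, converts the date columns, and applies the named aggregation. The explicit format string and the variable name out are assumptions made for this example.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "length": [6330, 57770, 6330, 6330, 57770],
    "start": [
        "12/17/2019 10:34:23 AM",
        "12/19/2019 5:19:56 PM",
        "12/17/2019 10:34:54 AM",
        "12/18/2019 4:36:44 PM",
        "12/19/2019 5:24:47 PM",
    ],
    "end": [
        "12/17/2019 10:34:31 AM",
        "12/17/2019 5:24:19 PM",
        "12/17/2019 10:35:00 AM",
        "12/18/2019 4:37:13 PM",
        "12/19/2019 5:26:44 PM",
    ],
    "duration": [8, 263, 6, 29, 117],
})

# Convert both date columns so that 'max' compares points in time.
# The format string is an assumption that matches the sample rows.
for col in ("start", "end"):
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y %I:%M:%S %p")

# Named aggregation: latest end time and total duration per length.
out = (
    df.groupby("length")
      .agg(end=("end", "max"), total_duration=("duration", "sum"))
      .reset_index()
)
print(out)
```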
Note: don't forget to convert the date columns to datetime first:
# infer_datetime_format is deprecated since pandas 2.0; pass the format explicitly instead
df[['start', 'end']] = (
    df[['start', 'end']].apply(lambda x: pd.to_datetime(x, format='%m/%d/%Y %I:%M:%S %p'))
)
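The conversion matters because 'max' on a string column compares text character by character rather than chronologically. A small sketch with two hypothetical timestamps (they are not taken from the question's data) shows how the two can disagree:

```python
import pandas as pd

# Hypothetical timestamps, chosen only to illustrate the pitfall.
raw = pd.Series(["12/18/2019 9:00:00 AM", "12/18/2019 10:00:00 PM"])

# Text comparison: '9' sorts after '1', so the morning entry "wins".
string_max = raw.max()

# Chronological comparison after parsing: the evening entry is later.
parsed_max = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p").max()

print(string_max)
print(parsed_max)
```

With the question's sample rows the string comparison happens to pick the same rows as the datetime comparison, but that is not something to rely on.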
In R, this can be done with a few tidyverse libraries:
library(tidyverse)
df <- tribble(
  ~length, ~start, ~end, ~duration,
  6330, "12/17/2019 10:34:23 AM", "12/17/2019 10:34:31 AM", 8,
  57770, "12/19/2019 5:19:56 PM", "12/17/2019 5:24:19 PM", 263,
  6330, "12/17/2019 10:34:54 AM", "12/17/2019 10:35:00 AM", 6,
  6330, "12/18/2019 4:36:44 PM", "12/18/2019 4:37:13 PM", 29,
  57770, "12/19/2019 5:24:47 PM", "12/19/2019 5:26:44 PM", 117
) %>%
  mutate_at(
    vars(start, end),
    lubridate::mdy_hms
  )
df %>%
  group_by(length) %>%
  summarise(
    end = max(end, na.rm = TRUE),
    duration = sum(duration, na.rm = TRUE)
  )
This gives:
# A tibble: 2 x 3
  length end                 duration
   <dbl> <dttm>                 <dbl>
1   6330 2019-12-18 16:37:13       43
2  57770 2019-12-19 17:26:44      380
The timestamps are shown in ISO format.
I used the default time zone (UTC) when converting the values.