Create groups, breaks and conditions using If Else statements in R or Python

I have a large dataset (2 million records), df, that I am trying to group and split into "breaks" by datetime. I would like to define a group and create these breaks if the following conditions apply. (Because the dataset is large, I do not know the contents of the subject, recipient and length columns in advance.)

 If the edit == "T"
 If the message is ""
 If the folder is "out" or "draft"
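As a minimal sketch of the three conditions above, the following Python predicate flags a single row. It assumes that all three conditions must hold at once and that each row is a dictionary with the hypothetical keys `edit`, `message` and `folder`, holding the values shown in the sample table.

```python
def meets_conditions(row):
    """Return True for an edited row with an empty message in 'out' or 'draft'."""
    return (
        row["edit"] == "T"
        and row["message"].strip() == ""
        and row["folder"] in ("out", "draft")
    )


# Three illustrative rows (only the columns the conditions need)
rows = [
    {"folder": "out", "message": "", "edit": "T"},
    {"folder": "in", "message": "jjjjj", "edit": "F"},
    {"folder": "draft", "message": "", "edit": "T"},
]

flags = [meets_conditions(r) for r in rows]
print(flags)  # [True, False, True]
```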

I would then like to match these groups when the last value of the length column in one group equals the first value of the length column in the next group. For instance, the value 80 connects the groups below, where the edit is T, the folder is out or draft, and the message is blank.

 subject    recipient                  length   folder    message  date                       edit
                                        80      out                1/2/2020 1:00:01 AM        T                                    
                                        80      out                1/2/2020 1:00:05 AM        T                   
hey        sarah@mail.com,g@mail.com    80      out                1/2/2020 1:00:10 AM        T
hey        sarah@mail.com,g@mail.com    80      out                1/2/2020 1:00:15 AM        T
hey        sarah@mail.com,g@mail.com    80      out                1/2/2020 1:00:30 AM        T
some       k                           900      in       jjjjj     1/2/2020 1:00:35 AM        F
some       k                           900      in       jjjjj     1/2/2020 1:00:36 AM        F 
some       k                           900      in       jjjjj     1/2/2020 1:00:37 AM        F
hey        sarah@mail.com,g@mail.com    80    draft                1/2/2020 1:02:00 AM        T
hey        sarah@mail.com,g@mail.com    80    draft                1/2/2020 1:02:05 AM        T    
no         a                          900       in        iii      1/2/2020 1:02:10 AM        F
no         a                          900       in        iii      1/2/2020 1:02:15 AM        F
no         a                          900       in        iii      1/2/2020 1:02:20 AM        F
no         a                          900       in        iii      1/2/2020 1:02:25 AM        F
hey        sarah@mail.com,g@mail.com   80    draft                 1/2/2020 1:03:00 AM        T
hey        sarah@mail.com,g@mail.com   80    draft                 1/2/2020 1:03:20 AM        T

Then I would like to link these groups together if the length on the last row of one block matches the length on the first row of the next block. I have started modifying the code below, but am unsure how to execute this.

This is the desired output:

 Start                  End                        Duration          Group  Subject  Length
 1/2/2020 1:00:01 AM    1/2/2020 1:00:30 AM        29                A      hey       80
 1/2/2020 1:02:00 AM    1/2/2020 1:02:05 AM        5                 A      hey       80
 1/2/2020 1:03:00 AM    1/2/2020 1:03:20 AM        20                A      hey       80

All of these are in the same group, A, because the Length value on the last row of each block matches the Length value on the first row of the next block.
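Since the question also allows Python, the block-and-link rule described above could be sketched as follows, using only the standard library. This is one possible reading, not a definitive implementation: it assumes the dates are month/day/year, that all three conditions must hold for a row to belong to a block, and that a block's Subject is taken from its last row. The `rows` list is a shortened copy of the sample table, and the names `matches`, `blocks` and `summary` are illustrative.

```python
from datetime import datetime
from itertools import groupby
from string import ascii_uppercase

FMT = "%m/%d/%Y %I:%M:%S %p"  # assumption: month/day/year order

# (subject, length, folder, message, date, edit), shortened from the sample table
rows = [
    ("", 80, "out", "", "1/2/2020 1:00:01 AM", "T"),
    ("hey", 80, "out", "", "1/2/2020 1:00:10 AM", "T"),
    ("hey", 80, "out", "", "1/2/2020 1:00:30 AM", "T"),
    ("some", 900, "in", "jjjjj", "1/2/2020 1:00:35 AM", "F"),
    ("hey", 80, "draft", "", "1/2/2020 1:02:00 AM", "T"),
    ("hey", 80, "draft", "", "1/2/2020 1:02:05 AM", "T"),
    ("no", 900, "in", "iii", "1/2/2020 1:02:10 AM", "F"),
    ("hey", 80, "draft", "", "1/2/2020 1:03:00 AM", "T"),
    ("hey", 80, "draft", "", "1/2/2020 1:03:20 AM", "T"),
]


def matches(row):
    """True when edit is T, message is blank and folder is out or draft."""
    _subject, _length, folder, message, _date, edit = row
    return edit == "T" and message == "" and folder in ("out", "draft")


# Step 1: split into blocks of consecutive matching rows
blocks = [list(run) for keep, run in groupby(rows, key=matches) if keep]

# Step 2: keep the same group letter while the last length of one block
# equals the first length of the next block; otherwise start a new letter
summary = []
group_index = 0
prev_last_length = None
for block in blocks:
    first_length, last_length = block[0][1], block[-1][1]
    if prev_last_length is not None and first_length != prev_last_length:
        group_index += 1
    start = datetime.strptime(block[0][4], FMT)
    end = datetime.strptime(block[-1][4], FMT)
    summary.append({
        "Start": block[0][4],
        "End": block[-1][4],
        "Duration": int((end - start).total_seconds()),
        "Group": ascii_uppercase[group_index],  # illustration only: 26 letters
        "Subject": block[-1][0],
        "Length": last_length,
    })
    prev_last_length = last_length

for line in summary:
    print(line)
# Durations are 29, 5 and 20 seconds, all in group "A"
```

For 2 million records a vectorised approach would probably be faster; the loop here is only meant to make the rule explicit.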

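library(tidyverse)
library(lubridate)

df$Date <- lubridate::dmy_hms(df$Date)

df <- mutate_if(df, is.factor, as.character)

# Rows meeting all three conditions: Edit is TRUE, Folder is "out" or "draft", Message is empty
# (this flag is not yet used by the loop below)
df$MATCH <- df$Edit %in% TRUE & df$Folder %in% c("out", "draft") & df$Message == ""
df$GROUP <- ""
df$BREAK_DETECTOR <- ""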
group_count <- 0
break_count <- 0
for (i in 1:nrow(df)) {
  if (i == 1) {
    group_count <- group_count + 1
    df$GROUP[[i]] <- letters[[group_count]]
  }
  if (i > 1) {
    if (df$GROUP[[i - 1]] != "") {
      df$GROUP[[i]] <- df$GROUP[[i - 1]]
    } else {
      group_count <- group_count + 1
      df$GROUP[[i]] <- letters[[group_count]]
    }
  }
  if (i == 1) {
    break_count <- break_count + 1
    df$BREAK_DETECTOR[[i]] <- break_count
  } else {
    # rules for detecting breaks - I chose to make it depend on NA values in the Length field
    if (is.na(df$Length[[i]])) {
      if (!is.na(df$Length[[i - 1]])) { # and only if the previous line isn't also NA for Length
        break_count <- break_count + 1
      }
    }
    df$BREAK_DETECTOR[[i]] <- break_count
  }
}

df2 <- df %>%
  filter(!is.na(Length)) %>%
  group_by(GROUP, BREAK_DETECTOR) %>%
  summarise(
    start = min(Date),
    end = max(Date),
    duration = difftime(end, start, units = "secs"),
    min_subject = min(Subject),
    max_subject = max(Subject),
    min_recipient = min(Recipient),
    max_recipient = max(Recipient),
    min_length = min(Length),
    max_length = max(Length)
  ) %>%
  ungroup()

Here is the dput for this:

structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 4L, 
4L, 4L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L), .Label = c("", 
"hey", "no", "some"), class = "factor"), Recipient = structure(c(1L, 
1L, 5L, 5L, 5L, 4L, 4L, 4L, 5L, 5L, 3L, 3L, 3L, 3L, 5L, 5L, 1L, 
2L), .Label = c("", " ", "a", "k", "sarah@mail.com,gee@mail.com"
), class = "factor"), Length = c(80L, 80L, 80L, 80L, 80L, 900L, 
900L, 900L, 80L, 80L, 900L, 900L, 900L, 900L, 80L, 80L, NA, NA
), Folder = structure(c(4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 2L, 2L, 
3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L), .Label = c("", "draft", "in", 
"out"), class = "factor"), Message = structure(c(1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L), .Label = c("", 
 "jjjjjjj", "llll"), class = "factor"), Date = structure(c(2L, 
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 
17L, 1L, 1L), .Label = c("", "1/2/2020 1:00:01 AM", "1/2/2020 1:00:05 AM", 
"1/2/2020 1:00:10 AM", "1/2/2020 1:00:15 AM", "1/2/2020 1:00:30 AM", 
"1/2/2020 1:00:35 AM", "1/2/2020 1:00:36 AM", "1/2/2020 1:00:37 AM", 
 "1/2/2020 1:02:00 AM", "1/2/2020 1:02:05 AM", "1/2/2020 1:02:10 AM", 
"1/2/2020 1:02:15 AM", "1/2/2020 1:02:20 AM", "1/2/2020 1:02:25 AM", 
"1/2/2020 1:03:00 AM", "1/2/2020 1:03:20 AM"), class = "factor"), 
 Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, 
 TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, NA, NA
 )), class = "data.frame", row.names = c(NA, -18L))

Using dplyr:

library(dplyr)

df %>%
  # Add row number and convert Date to POSIXct
  mutate(row = row_number(),
         Date = lubridate::mdy_hms(Date)) %>%
  # Keep only rows where Edit is TRUE
  filter(Edit) %>%
  # Start a new group wherever the kept row numbers are not consecutive
  group_by(gr = cumsum(c(TRUE, diff(row) > 1))) %>%
  # Get the first and last dates and the difference between them
  summarise(Start = first(Date),
            End = last(Date),
            Duration = difftime(End, Start, units = "secs"),
            Group = "A", Subject = "hey", Length = 80) %>%
  select(-gr)

# A tibble: 3 x 6
#  Start               End                 Duration Group Subject Length
#  <dttm>              <dttm>              <drtn>   <chr> <chr>    <dbl>
#1 2020-01-02 01:00:01 2020-01-02 01:00:30 29 secs  A     hey         80
#2 2020-01-02 01:02:00 2020-01-02 01:02:05  5 secs  A     hey         80
#3 2020-01-02 01:03:00 2020-01-02 01:03:20 20 secs  A     hey         80
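The grouping step in the answer, cumsum(c(TRUE, diff(row) > 1)), may be easier to follow with concrete numbers. The Python sketch below reproduces the same idea on the row numbers that remain after filtering the sample data on Edit (rows 1 to 5, 9 to 10 and 15 to 16). The variable names are illustrative and not part of the answer.

```python
from itertools import accumulate

# Row numbers left after keeping only the rows where Edit is TRUE
kept_rows = [1, 2, 3, 4, 5, 9, 10, 15, 16]

# A new run starts at the first row, or wherever the row number jumps by more than 1
starts = [True] + [b - a > 1 for a, b in zip(kept_rows, kept_rows[1:])]

# The running total of the start flags gives each run its own id
gr = list(accumulate(int(s) for s in starts))
print(gr)  # [1, 1, 1, 1, 1, 2, 2, 3, 3]
```

Each distinct value of `gr` corresponds to one row of the summarised output above.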
