
How to add a variable in `df2` that specifies the number of rows of a specific level of a variable from `df1` using `dplyr` or `data.table`

I have a dataframe df1 that summarizes detections of a fish species over time, obtained with acoustic transmitters (attached to the fish) and acoustic receivers (placed in the area). Each transmitter has two sensors, one measuring activity and the other measuring the fish's depth. A transmitter can send only one kind of data (either activity or depth) at a time, with at least several minutes between signals. In the end, what we get is a dataframe with the time of each detection ( DateTime ), the receiver that detected the individual ( Receiver ), the transmitter that was detected ( Transmitter ) and the type of information that the transmitter sent ( Sensor ). Below is a reproducible example:

df1<-data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 15:01:45","2016-08-01 15:34:07","2016-08-01 16:25:16","2016-08-01 16:29:16","2016-08-01 16:33:16","2016-08-01 16:54:16","2016-08-01 16:58:16","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:28:16","2016-08-01 17:29:28","2016-08-01 17:42:08"),
                Receiver=c( "V6", "V7", "V6", "V6", "V7", "V7", "V6", "V6", "V6", "V7", "V7", "V7", "V6", "V6", "V6", "V9", "V7", "V4" ),
                Transmitter=c(16 , 17, 16, 16, 17, 16, 17, 16, 16, 16, 17, 16, 16, 17, 17, 17, 16, 17),
                Sensor=c("Activity","Depth","Activity","Activity","Depth","Activity","Activity","Depth","Activity","Activity","Activity","Depth","Activity","Activity","Depth","Activity","Activity","Activity"))
df1$DateTime<- as.POSIXct(df1$DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC")

df1

              DateTime Receiver Transmitter   Sensor
1  2016-08-01 12:04:07       V6          16 Activity
2  2016-08-01 12:06:07       V7          17    Depth
3  2016-08-01 13:12:12       V6          16 Activity
4  2016-08-01 14:04:07       V6          16 Activity
5  2016-08-01 15:01:45       V7          17    Depth
.            .                .           .       .
.            .                .           .       .

What I want is to create a dataframe df2 in which this information is arranged differently. I want to use hourly intervals in which each hour covers the half hour before and the half hour after it ( RoundTime ), i.e. each DateTime is rounded to the nearest hour. For every RoundTime and each transmitter ( Transmitter ), I want the number of detections ( Num_det ), the number of different receivers that detected it ( Num_Rec ), the codes of those receivers ( Which_Rec ), the number of detections with Activity info ( n_Activity ) and the number of detections with Depth info ( n_Depth ). I would expect this:

df2
             RoundTime Transmitter Num_det n_Activity n_Depth Num_Rec Which_Rec
1  2016-08-01 12:00:00          16       1          1       0       1        V6
2  2016-08-01 12:00:00          17       1          0       1       1        V7
3  2016-08-01 13:00:00          16       1          1       0       1        V6
4  2016-08-01 13:00:00          17       0          0       0      NA      <NA>
5  2016-08-01 14:00:00          16       1          1       0       1        V6
6  2016-08-01 14:00:00          17       0          0       0      NA      <NA>
7  2016-08-01 15:00:00          16       0          0       0      NA      <NA>
8  2016-08-01 15:00:00          17       1          0       1       1        V7
9  2016-08-01 16:00:00          16       2          1       1       2     V6 V7
10 2016-08-01 16:00:00          17       1          1       0       1        V6
11 2016-08-01 17:00:00          16       5          4       1       2     V6 V7
12 2016-08-01 17:00:00          17       4          3       1       3  V6 V7 V9
13 2016-08-01 18:00:00          16       0          0       0      NA      <NA>
14 2016-08-01 18:00:00          17       1          1       0       1        V4

So far I have df2 with all the variables except n_Activity and n_Depth . Here are the code and the result:

library(lubridate)
library(tidyverse)
df2<-df1 %>% 
   # grouped by rounding the date by hour, Transmitter column
   group_by(RoundTime = round_date(DateTime, "hour"), Transmitter) %>% 
   # get the Num_det as number of rows, add more groups
   # (dplyr >= 1.0.0 spells this argument .add; the older add = TRUE is deprecated)
   group_by(Num_det = n(), 
           which_Rec = toString(sort(unique(Receiver))), .add = TRUE) %>%        
   # get the number of distinct elements of Receiver
   summarise(Num_Rec = n_distinct(Receiver)) %>% 
   ungroup %>% 
   # expand the data to fill the missing combinations 
   complete(RoundTime, Transmitter, fill = list(Num_det = 0))%>% 
   select(RoundTime, Transmitter, Num_det, Num_Rec, which_Rec)

df2
# A tibble: 14 x 5
   RoundTime               Transmitter Num_det Num_Rec which_Rec 
   <dttm>                        <dbl>   <dbl>   <int> <chr>     
 1 2016-08-01 12:00:00.000          16       1       1 V6        
 2 2016-08-01 12:00:00.000          17       1       1 V7        
 3 2016-08-01 13:00:00.000          16       1       1 V6        
 4 2016-08-01 13:00:00.000          17       0      NA NA        
 5 2016-08-01 14:00:00.000          16       1       1 V6        
 6 2016-08-01 14:00:00.000          17       0      NA NA        
 7 2016-08-01 15:00:00.000          16       0      NA NA        
 8 2016-08-01 15:00:00.000          17       1       1 V7        
 9 2016-08-01 16:00:00.000          16       2       2 V6, V7    
10 2016-08-01 16:00:00.000          17       1       1 V6        
11 2016-08-01 17:00:00.000          16       5       2 V6, V7    
12 2016-08-01 17:00:00.000          17       4       3 V6, V7, V9
13 2016-08-01 18:00:00.000          16       0      NA NA        
14 2016-08-01 18:00:00.000          17       1       1 V4     

Does anyone know what code I should add to the above in order to create the variables n_Activity and n_Depth ? A solution with the data.table package would be even better, since my real dataframe has millions of rows and data.table is more efficient.

I guess all you need to do is count the number of "Activity" and "Depth" values per group in your current code. I don't know why you have two group_by calls there; a single group_by followed by one summarise is enough.

library(dplyr)
library(lubridate)

df1 %>% 
  group_by(RoundTime = round_date(DateTime, "hour"), Transmitter) %>% 
  summarise(Num_det = n(), 
            which_Rec = toString(sort(unique(Receiver))),
            Num_Rec = n_distinct(Receiver), 
            n_Activity = sum(Sensor == "Activity"), 
            n_Depth = sum(Sensor == "Depth")) %>%
   ungroup %>% 
   tidyr::complete(RoundTime, Transmitter, 
           fill = list(Num_det = 0, n_Activity = 0, n_Depth = 0))


# A tibble: 14 x 7
#   RoundTime           Transmitter Num_det which_Rec  Num_Rec n_Activity n_Depth
#   <dttm>                    <dbl>   <dbl> <chr>        <int>      <dbl>   <dbl>
# 1 2016-08-01 12:00:00          16       1 V6               1          1       0
# 2 2016-08-01 12:00:00          17       1 V7               1          0       1
# 3 2016-08-01 13:00:00          16       1 V6               1          1       0
# 4 2016-08-01 13:00:00          17       0 NA              NA          0       0
# 5 2016-08-01 14:00:00          16       1 V6               1          1       0
# 6 2016-08-01 14:00:00          17       0 NA              NA          0       0
# 7 2016-08-01 15:00:00          16       0 NA              NA          0       0
# 8 2016-08-01 15:00:00          17       1 V7               1          0       1
# 9 2016-08-01 16:00:00          16       2 V6, V7           2          1       1
#10 2016-08-01 16:00:00          17       1 V6               1          1       0
#11 2016-08-01 17:00:00          16       5 V6, V7           2          4       1
#12 2016-08-01 17:00:00          17       4 V6, V7, V9       3          3       1
#13 2016-08-01 18:00:00          16       0 NA              NA          0       0
#14 2016-08-01 18:00:00          17       1 V4               1          1       0
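
Since the question also asks about data.table, here is a minimal sketch of the same counting logic in data.table syntax. It is not from the original answer and has not been benchmarked on millions of rows. The names dt, grid, res and count_cols are placeholders chosen for illustration, and df1 is rebuilt inside the block only so that it runs on its own.

```r
library(data.table)

# Rebuild the example data from the question so the block is self-contained
times <- c("12:04:07", "12:06:07", "13:12:12", "14:04:07", "15:01:45", "15:34:07",
           "16:25:16", "16:29:16", "16:33:16", "16:54:16", "16:58:16", "17:13:16",
           "17:21:16", "17:23:42", "17:27:16", "17:28:16", "17:29:28", "17:42:08")
df1 <- data.frame(
  DateTime    = as.POSIXct(paste("2016-08-01", times), tz = "UTC"),
  Receiver    = c("V6", "V7", "V6", "V6", "V7", "V7", "V6", "V6", "V6",
                  "V7", "V7", "V7", "V6", "V6", "V6", "V9", "V7", "V4"),
  Transmitter = c(16, 17, 16, 16, 17, 16, 17, 16, 16, 16, 17, 16, 16, 17, 17, 17, 16, 17),
  Sensor      = c("Activity", "Depth", "Activity", "Activity", "Depth", "Activity",
                  "Activity", "Depth", "Activity", "Activity", "Activity", "Depth",
                  "Activity", "Activity", "Depth", "Activity", "Activity", "Activity")
)

dt <- as.data.table(df1)

# Round every detection to the nearest hour (same rounding as the dplyr code)
dt[, RoundTime := lubridate::round_date(DateTime, "hour")]

# One pass per (RoundTime, Transmitter) group: counts and receiver summary
res <- dt[, .(Num_det    = .N,
              n_Activity = sum(Sensor == "Activity"),
              n_Depth    = sum(Sensor == "Depth"),
              Num_Rec    = uniqueN(Receiver),
              Which_Rec  = toString(sort(unique(Receiver)))),
          keyby = .(RoundTime, Transmitter)]

# Build every RoundTime x Transmitter combination and join, so that
# hours without detections appear as rows (the role of tidyr::complete)
grid <- CJ(RoundTime = unique(dt$RoundTime), Transmitter = unique(dt$Transmitter))
res  <- res[grid, on = .(RoundTime, Transmitter)]

# Missing counts become 0; Num_Rec and Which_Rec stay NA as in the expected df2
count_cols <- c("Num_det", "n_Activity", "n_Depth")
setnafill(res, fill = 0L, cols = count_cols)

res
```

The cross join with CJ() plus the fill step stands in for tidyr::complete(); if you do not need rows for hours without detections, the grouped aggregation alone is enough.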
