
Concatenating data frame rows based on column condition

For subsequent discussion, I will refer to the example data frame below:

[image: example data frame]

Now, what I wish to achieve is to group all the packet times that are similar, i.e. all the 7s, 12s, etc. Furthermore, the PacketTime field should contain the difference between the max and min ( max(PacketTime) - min(PacketTime) ), and the FrameLen, IPLen and TCPLen fields should be lists of all the values that correspond to the grouped time. For example, for the 7s group, FrameLen would contain c(304, 276, 276).

My solution for the above is as follows:

df <- packets %>%
  group_by(round(PacketTime)) %>%
  summarise(
    PTime=max(PacketTime)-min(PacketTime),
    FLen=list(FrameLen),
    ILen=list(IPLen),
    Movement=0
  ) %>%
  rename(PacketTime=PTime) %>%
  rename(FrameLen=FLen) %>%
  rename(IPLen=ILen)
df$"round(PacketTime)" <- NULL # Remove the group_by

However, some of these groups cross over (i.e. the 1480s group also includes part of the 1481s group). What makes this a little easier (in some regard) is that the groups are separated by a 5 s timing window (via Python's time.sleep(5)).

How can I achieve the previous result while relying only on the 5 s difference between the groups, so that the crossover is also taken into account?

EDIT: As suggested by Ben, here is the dput() of my data frame df[1:20,]:

structure(list(PacketTime = c(7.083779, 7.147268, 7.147462, 12.084768, 
12.153246, 12.153951, 17.095972, 17.159268, 17.159876, 22.11384, 
22.176926, 22.177467, 27.134427, 27.199108, 27.200064, 32.144442, 
32.208648, 32.20922, 37.144255, 37.205622), FrameLen = c(304L, 
276L, 276L, 304L, 276L, 276L, 304L, 276L, 276L, 304L, 276L, 276L, 
304L, 276L, 276L, 304L, 276L, 276L, 304L, 276L), IPLen = c(300L, 
272L, 272L, 300L, 272L, 272L, 300L, 272L, 272L, 300L, 272L, 272L, 
300L, 272L, 272L, 300L, 272L, 272L, 300L, 272L), TCPLen = c(260L, 
232L, 232L, 260L, 232L, 232L, 260L, 232L, 232L, 260L, 232L, 232L, 
260L, 232L, 232L, 260L, 232L, 232L, 260L, 232L), Movement = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 
20L), class = "data.frame")

Here is a base R solution using aggregate + transform:

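u <- aggregate(
    . ~ PacketTime,
    transform(df,
        # time span (max - min) within each whole-second group
        PTime = ave(PacketTime, trunc(PacketTime),
                    FUN = function(x) diff(range(x))),
        # replace the raw time with its whole-second group key
        PacketTime = trunc(PacketTime)
    ),
    c  # collect each column's values per group
)
# PTime is constant within a group, so collapse it to a single value
dfout <- transform(u, PTime = sapply(PTime, unique))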

which gives

> dfout
  PacketTime      FrameLen         IPLen        TCPLen Movement    PTime
1          7 304, 276, 276 300, 272, 272 260, 232, 232  0, 0, 0 0.063683
2         12 304, 276, 276 300, 272, 272 260, 232, 232  0, 0, 0 0.069183
3         17 304, 276, 276 300, 272, 272 260, 232, 232  0, 0, 0 0.063904
4         22 304, 276, 276 300, 272, 272 260, 232, 232  0, 0, 0 0.063627
5         27 304, 276, 276 300, 272, 272 260, 232, 232  0, 0, 0 0.065637
6         32 304, 276, 276 300, 272, 272 260, 232, 232  0, 0, 0 0.064778
7         37      304, 276      300, 272      260, 232     0, 0 0.061367
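Grouping by trunc(PacketTime) (or round(PacketTime)) keys each burst on its whole second, so a burst that straddles a second boundary could still be split across two groups. A minimal sketch of an alternative, assuming that any gap longer than 1 s between consecutive packets marks the start of a new burst; the gap_threshold value, the Burst column and the shortened sample data are illustrative choices, not part of the original answer:

```r
# First 8 rows of the dput() data from the question, two columns only
packets <- data.frame(
  PacketTime = c(7.083779, 7.147268, 7.147462, 12.084768,
                 12.153246, 12.153951, 17.095972, 17.159268),
  FrameLen = c(304L, 276L, 276L, 304L, 276L, 276L, 304L, 276L)
)

# Assumption: a gap longer than 1 s separates bursts (the sleep is 5 s)
gap_threshold <- 1

# TRUE on the first row and wherever the gap to the previous row is large;
# the running sum of those flags is a burst id
new_burst <- c(TRUE, diff(packets$PacketTime) > gap_threshold)
packets$Burst <- cumsum(new_burst)

# Summarise each burst: time span plus the list of frame lengths
bursts <- split(packets, packets$Burst)
result <- data.frame(Burst = as.integer(names(bursts)))
result$PacketTime <- unname(sapply(bursts, function(b) diff(range(b$PacketTime))))
result$FrameLen <- unname(lapply(bursts, function(b) b$FrameLen))
result
```

Because the key depends only on the spacing between consecutive rows, it does not matter where a burst falls relative to whole seconds. The data must be sorted by PacketTime for diff() to be meaningful.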

One approach is to use seq and cut. Create a sequence from your minimum to maximum times, every 5 seconds. Then, use cut to put your times into intervals. You can use the interval itself as the label, for example (7-12 sec), by omitting the labels argument. Or just use the lower time of the interval (7 sec), as done below.

library(tidyverse)

my_breaks <- seq(trunc(min(packets$PacketTime)), max(packets$PacketTime) + 5, 5)
packets$Interval <- cut(packets$PacketTime, breaks = my_breaks, labels = my_breaks[-length(my_breaks)], right = FALSE)

packets %>%
  group_by(Interval) %>%
  summarise(
    PTime=max(PacketTime)-min(PacketTime),
    FLen=list(FrameLen),
    ILen=list(IPLen),
    Movement=0
  ) %>%
  rename(PacketTime=PTime) %>%
  rename(FrameLen=FLen) %>%
  rename(IPLen=ILen)

Output

# A tibble: 7 x 5
  Interval PacketTime FrameLen  IPLen     Movement
  <fct>         <dbl> <list>    <list>       <dbl>
1 7            0.0637 <int [3]> <int [3]>        0
2 12           0.0692 <int [3]> <int [3]>        0
3 17           0.0639 <int [3]> <int [3]>        0
4 22           0.0636 <int [3]> <int [3]>        0
5 27           0.0656 <int [3]> <int [3]>        0
6 32           0.0648 <int [3]> <int [3]>        0
7 37           0.0614 <int [2]> <int [2]>        0
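To illustrate the two labelling options mentioned in this answer, here is a small sketch on a handful of the question's times; the variable names times, breaks, default_labels and lower_labels are chosen for illustration only:

```r
times  <- c(7.083779, 7.147268, 12.084768, 17.095972)
breaks <- seq(7, 22, by = 5)   # 7, 12, 17, 22

# Without a labels argument, each level is named after its interval;
# right = FALSE makes the intervals closed on the left, open on the right
default_labels <- cut(times, breaks = breaks, right = FALSE)
as.character(default_labels)
# "[7,12)" "[7,12)" "[12,17)" "[17,22)"

# With labels supplied, each level is named after the interval's lower bound
lower_labels <- cut(times, breaks = breaks,
                    labels = breaks[-length(breaks)], right = FALSE)
as.character(lower_labels)
# "7" "7" "12" "17"
```

Either way the result is a factor, so it can be passed straight to group_by().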
