R merging rows in a dataframe

Here is the head of a large data frame:

head(Hdata_soil)
                      X_id           timestamp address rssi batt_v soil_temp_1 soil_temp_2 soil_temp_3 soil_moisture_1
1 565846060dd8e408e3817c58 2015-11-27 12:01:10      A8  -65     NA          NA          NA          NA              NA
2 565846070dd8e408e3817c59 2015-11-27 12:01:11      A8   NA     NA        9.73     -273.15       14.63             647
3 565846cf0dd8e408e3817caf 2015-11-27 12:04:31      A7  -64     NA          NA          NA          NA              NA
4 565846cf0dd8e408e3817cb0 2015-11-27 12:04:31      A7   NA     NA        8.56        9.46        9.64             660
5 565847650dd8e408e3817cf5 2015-11-27 12:07:01      A8  -64     NA          NA          NA          NA              NA
6 565847660dd8e408e3817cf6 2015-11-27 12:07:02      A8   NA     NA        9.82     -273.15       14.29             643

The full data set can be accessed from Dropbox.

As you can see, there are 2 consecutive observations for each address, with timestamps approximately 1 s apart. The variables are split between these 2 observations. How can I go about merging them into one row, keeping the first timestamp?

It would also be great to make sure that this only happens with 2 consecutive observations from the same address.

I would really appreciate it if someone could point me in the right direction with regard to packages/functions to use.

Check out the following code, which should meet your needs. First, the timestamp column is converted to an object of class 'POSIXlt', which allows determining the time difference between single observations. Then foreach is used to loop over all rows (sequentially, via %do%), skipping records that have already been merged into another row during a previous iteration (tracked in the vector 'used'). which, in combination with difftime, identifies consecutive observations (e.g., within 5 s of the currently processed observation). Finally (and only if the 'address' of the current observation is present in the candidate records), the rows are merged, replacing missing values in the currently processed row with values from the consecutive observation.

## load 'foreach' package
library(foreach)

## import and reformat data
Hdata_soil <- read.csv("Hdata_soil.csv", header = TRUE, 
                       stringsAsFactors = FALSE)

## reformat timestamps
timestamps <- strptime(Hdata_soil$timestamp, format = "%Y-%m-%d %H:%M:%S")

## vector with information about merged lines
used <- integer()
dat_out <- foreach(i = 1:length(timestamps), .combine = "rbind") %do% {

  ## skip current iteration if line has already been merged into another line
  if (i %in% used)
    return(NULL)

  ## identify consecutive observation (<5s)
  x <- timestamps[i]
  y <- timestamps[(i+1):length(timestamps)]

  # (subset same or consecutive days to reduce 
  # computation time of 'difftime')
  id_day <- which(as.Date(y) == as.Date(x) | 
                    as.Date(y) == (as.Date(x) + 1))
  y <- y[id_day]

  # (subset records within 5s from current observation)
  id_sec <- which(difftime(y, x, units = "secs") < 5)
  id <- id_day[id_sec]

  ## if consecutive observation(s) exist(s) and include address of 
  ## current observation, perform merge
  if (length(id) > 0 & 
        any(Hdata_soil[i+id, "address"] == Hdata_soil[i, "address"])) {

    Hdata_soil_x <- data.frame(Hdata_soil[i, ])
    for (j in 1:length(id)) {
      Hdata_soil_y <- data.frame(Hdata_soil[i+id[j], ])

      # overwrite all missing values in current line with values 
      # from consecutive line
      Hdata_soil_x[which(is.na(Hdata_soil_x) & !is.na(Hdata_soil_y))] <- 
        Hdata_soil_y[which(is.na(Hdata_soil_x) & !is.na(Hdata_soil_y))]

      # update information about merged lines
      used <- c(used, i, i+id[j])
    }

    # return merged line
    return(Hdata_soil_x)

  ## else return current line as is  
  } else {
    used <- c(used, i)
    return(data.frame(Hdata_soil[i, ]))
  }
}

However, the code takes quite long to run, which seems to be related to difftime.

 >     user   system  elapsed 
 > 2209.504   99.389 2311.996 
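
Much of that time goes into the repeated as.Date() and difftime() calls inside the loop. As a possible speed-up (my own suggestion, not part of the answer above), the timestamps could be converted to numeric seconds once up front, so that the consecutive-observation check becomes a plain vectorized subtraction:

## precompute numeric seconds once, outside the loop
secs <- as.numeric(as.POSIXct(timestamps))

## inside the foreach loop, the day-subsetting and difftime() calls can
## then collapse to a single vectorized comparison for observation i:
##   id <- which(secs[(i+1):length(secs)] - secs[i] < 5)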

I think the following should work. Make a vector of unique address labels. Then, for each address label, extract the relevant rows and use various functions to pick the values you want (e.g. the minimum timestamp, the rssi value that isn't NA, etc.). Use rbind.data.frame to rebuild a data frame from the list at the end.

unad <- unique(Hdata_soil$address)

lst <- lapply(unad, function(ad){
    recs <- Hdata_soil[Hdata_soil$address == ad,]
    X_id <- recs$X_id[1]
    ts <- min(recs$timestamp)
    rssi <- recs$rssi[!is.na(recs$rssi)]
    if(length(rssi) == 0L) rssi <- NA else if(length(rssi) >= 2L) rssi <- mean(rssi) # or something - ensure end up with length 1
    ## remaining observations like rssi
    ## ...
    return(data.frame(X_id = X_id, timestamp = ts, address = ad, rssi = rssi, ...))
})

result <- do.call(rbind.data.frame, lst)
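
To make the outline above concrete, here is a hedged sketch of filling in the remaining columns following the same pattern as rssi (the helper pick_one is my own hypothetical addition; the column names come from the head of the question's data frame):

## hypothetical helper: first non-NA value, or NA if there is none
pick_one <- function(v){
    v <- v[!is.na(v)]
    if(length(v) == 0L) NA else v[1]
}

lst <- lapply(unad, function(ad){
    recs <- Hdata_soil[Hdata_soil$address == ad,]
    data.frame(X_id = recs$X_id[1],
               timestamp = min(recs$timestamp),
               address = ad,
               rssi = pick_one(recs$rssi),
               batt_v = pick_one(recs$batt_v),
               soil_temp_1 = pick_one(recs$soil_temp_1),
               soil_temp_2 = pick_one(recs$soil_temp_2),
               soil_temp_3 = pick_one(recs$soil_temp_3),
               soil_moisture_1 = pick_one(recs$soil_moisture_1),
               stringsAsFactors = FALSE)
})

result <- do.call(rbind.data.frame, lst)

Note that, like the outline above, this groups all rows per address rather than only consecutive pairs, which is what the adaptation below addresses.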

Adaptation in response to a comment:

mtx <- matrix(1:nrow(Hdata_soil), nrow(Hdata_soil), 2)
colnames(mtx) <- c("startR", "endR")

# identify consecutive duplicate addresses and group them into one subset
# (row r's end index absorbs row r + 1, which is then marked NA)
for(r in 1:(nrow(mtx) - 1)){
    if(identical(Hdata_soil$address[r], Hdata_soil$address[r + 1])){
        mtx[r, 2] <- mtx[r, 2] + 1
        mtx[r + 1, ] <- c(NA, NA)
    }
}

# remove NAs - these mark rows whose address was grouped into the previous row
mtx <- mtx[!is.na(mtx[, 1]),]

lst <- lapply(1:nrow(mtx), function(r){
    datsubset <- Hdata_soil[mtx[r, "startR"]:mtx[r, "endR"],, drop = FALSE]
    # aggregate the subset of rows into one row as you please (one option is sketched below)
})

result <- do.call(rbind.data.frame, lst)
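
As one hedged option for the aggregation placeholder above, a helper that keeps the first non-NA value found in each column (collapse_rows is my own hypothetical addition):

## hypothetical helper: collapse a subset of rows into a single row by
## keeping the first non-NA value found in each column
collapse_rows <- function(datsubset){
    as.data.frame(lapply(datsubset, function(col){
        val <- col[!is.na(col)]
        if(length(val) == 0L) NA else val[1]
    }), stringsAsFactors = FALSE)
}

Calling collapse_rows(datsubset) at the placeholder inside the lapply above would then return one merged row per group.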

Note this will need some adaptation if any address occurs three or more times consecutively.

First of all, I think your data needs an extra id column, because address is not unique per row pair, and neither is any other column, considering how you want to group the rows. For the sake of simplicity I will define the id column here as:

df$id <- as.character(c(1,1,2,2,3,3))

Then we can do the following:

# load 'data.table' for setDT()
library(data.table)

# Replace NA's by 0
df[is.na(df)] <- 0

# Extract numeric columns
tokeep <- which(sapply(df,is.numeric))

# Sum numeric columns per id
setDT(df)[,lapply(.SD,sum),by=id,.SDcols = tokeep]

Which yields:

   id rssi soil_temp_1 soil_temp_2 soil_temp_3 soil_moisture_1
1:  1  -65        9.73     -273.15       14.63             647
2:  2  -64        8.56        9.46        9.64             660
3:  3  -64        9.82     -273.15       14.29             643

You can subsequently merge this with the remaining non-numeric columns of your original df, choosing which unique values you want to keep for X_id and timestamp.
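
For example, a minimal sketch of that merge, assuming the per-id sums from above are assigned to a variable (summed and firsts are my own names) and that the first X_id and timestamp per id should be kept:

## assign the per-id sums from above
summed <- setDT(df)[,lapply(.SD,sum),by=id,.SDcols = tokeep]

## keep the first X_id, timestamp and address per id
firsts <- df[,.(X_id = X_id[1], timestamp = timestamp[1],
                address = address[1]),by=id]

## merge the two back together on id
result <- merge(firsts, summed, by = "id")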

This solution using dplyr may work if you are confident that rounding the "timestamp" to the nearest minute will provide a unique identifier in conjunction with "address":

library(readr) # Required only for recreating your data frame
library(dplyr)

Hdata_soil <- readr::read_csv("X_id,timestamp,address,rssi,batt_v,soil_temp_1,soil_temp_2,soil_temp_3,soil_moisture_1
565846060dd8e408e3817c58,27/11/2015 12:01:10,A8,-65,NA,NA,NA,NA,NA
565846070dd8e408e3817c59,27/11/2015 12:01:11,A8,NA,NA,9.73,-273.15,14.63,647
565846cf0dd8e408e3817caf,27/11/2015 12:04:31,A7,-64,NA,NA,NA,NA,NA
565846cf0dd8e408e3817cb0,27/11/2015 12:04:31,A7,NA,NA,8.56,9.46,9.64,660
565847650dd8e408e3817cf5,27/11/2015 12:07:01,A8,-64,NA,NA,NA,NA,NA
565847660dd8e408e3817cf6,27/11/2015 12:07:02,A8,NA,NA,9.82,-273.15,14.29,643")

# Dplyr chain to create new vars, group then summarise
Hdata_soil <- dplyr::mutate(
  Hdata_soil,
  # Convert timestamp to POSIXct
  timestamp = as.POSIXct(strptime(timestamp, format = "%d/%m/%Y %H:%M:%S"))
  # Round to nearest minute
  , timestamp_round = as.POSIXct(round(timestamp, units = "mins"))
) %>% 
  # Group by nearest minute timestamps and address
  dplyr::group_by(timestamp_round, address) %>% 
  # Take minimum non-NA value
  dplyr::summarise_each(
    funs(min(., na.rm = TRUE))
  )

Which yields:

> # Print
> Hdata_soil
Source: local data frame [3 x 10]
Groups: timestamp_round [?]

      timestamp_round address                     X_id           timestamp  rssi batt_v soil_temp_1 soil_temp_2 soil_temp_3 soil_moisture_1
               (time)   (chr)                    (chr)              (time) (int)  (lgl)       (dbl)       (dbl)       (dbl)           (int)
1 2015-11-27 12:01:00      A8 565846060dd8e408e3817c58 2015-11-27 12:01:10   -65     NA        9.73     -273.15       14.63             647
2 2015-11-27 12:05:00      A7 565846cf0dd8e408e3817caf 2015-11-27 12:04:31   -64     NA        8.56        9.46        9.64             660
3 2015-11-27 12:07:00      A8 565847650dd8e408e3817cf5 2015-11-27 12:07:01   -64     NA        9.82     -273.15       14.29             643
