How do I fill in column values based on matching values in a few other columns in R?
The data looks like this:
time <- c('Nov 1st 2014, 17:36:50.000','Nov 1st 2014, 17:36:50.000',
'Nov 1st 2014, 17:36:50.000','Nov 1st 2014, 17:36:50.000', 'Nov 1st 2014, 17:37:50.000','Nov 1st 2014, 17:37:50.000','Nov 1st 2014, 17:37:50.000')
A <- c('20.79','NA','NA','NA','21.8','NA','NA')
B <- c('NA','97.017','94.321','85.014','NA','87.1','67.1')
C <- c('NA','C1','C2','C3','NA','C1','C2')
D <- c('L1','L1','L1','L1','L2','L2','L2')
C1 <- c('NA','NA','NA','NA','NA','NA','NA')
C2 <- c('NA','NA','NA','NA','NA','NA','NA')
C3 <- c('NA','NA','NA','NA','NA','NA','NA')
df <- data.frame(time,A,B,C,D,C1,C2,C3)
I need the output in the format below.
#                         time     A  B  C  D    C1     C2     C3
# 1 Nov 1st 2014, 17:36:50.000 20.79 NA NA L1 97.02 94.321 85.014
# 2 Nov 1st 2014, 17:37:50.000  21.8 NA NA L2  87.1   67.1   47.3
How do I get the data into the above format, with just one row per group, given that the columns "time" and "D" are the same for all the rows in a group?
Thanks in advance!
You can do this with tidyr::spread() to reshape B into C1, C2, C3, and then dplyr::left_join() it with the other columns, assuming a unique date/time.
library(dplyr)
library(tidyr)
df %>%
  select(time, A, B, C, D) %>%
  filter(!is.na(A)) %>%
  left_join(
    df %>%
      select(time, C, B, D) %>%
      spread(C, B) %>%
      select(-`<NA>`),
    by = c("time", "D")
  )
# time A B C D C1 C2 C3
# 1 Nov 1st 2014, 17:36:50.000 20.79 NA <NA> L1 97.017 94.321 85.014
# 2 Nov 1st 2014, 17:37:50.000 21.80 NA <NA> L2 87.100 67.100 47.300
# Data used above: read.table() parses NA as a real missing value.
# Unlike the question's data, this table has an eighth row (C3 for L2).
df <- read.table(text = "time A B C D C1 C2 C3
1 'Nov 1st 2014, 17:36:50.000' 20.79 NA NA L1 NA NA NA
2 'Nov 1st 2014, 17:36:50.000' NA 97.017 C1 L1 NA NA NA
3 'Nov 1st 2014, 17:36:50.000' NA 94.321 C2 L1 NA NA NA
4 'Nov 1st 2014, 17:36:50.000' NA 85.014 C3 L1 NA NA NA
5 'Nov 1st 2014, 17:37:50.000' 21.8 NA NA L2 NA NA NA
6 'Nov 1st 2014, 17:37:50.000' NA 87.1 C1 L2 NA NA NA
7 'Nov 1st 2014, 17:37:50.000' NA 67.1 C2 L2 NA NA NA
8 'Nov 1st 2014, 17:37:50.000' NA 47.3 C3 L2 NA NA NA",
header = TRUE,
stringsAsFactors = FALSE)
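As a variation on the answer above, the same reshape-then-join idea can be sketched with tidyr::pivot_wider(), which newer tidyr releases recommend in place of spread(). This is only a sketch under assumptions: it needs tidyr 1.0.0 or later, it rebuilds the question's seven rows with real NA values rather than "NA" strings, and the names `wide` and `result` are illustrative.

```r
library(dplyr)
library(tidyr)

# The question's seven rows, assuming the "NA" strings are real NAs.
df <- data.frame(
  time = rep(c("Nov 1st 2014, 17:36:50.000",
               "Nov 1st 2014, 17:37:50.000"), times = c(4, 3)),
  A = c(20.79, NA, NA, NA, 21.8, NA, NA),
  B = c(NA, 97.017, 94.321, 85.014, NA, 87.1, 67.1),
  C = c(NA, "C1", "C2", "C3", NA, "C1", "C2"),
  D = rep(c("L1", "L2"), times = c(4, 3)),
  stringsAsFactors = FALSE
)

# Long-to-wide reshape of B: one new column per distinct value of C.
wide <- df %>%
  filter(!is.na(C)) %>%
  select(time, D, C, B) %>%
  pivot_wider(names_from = C, values_from = B)

# Keep the rows that carry A, then attach the reshaped measurements.
result <- df %>%
  filter(!is.na(A)) %>%
  select(time, A, D) %>%
  left_join(wide, by = c("time", "D"))

print(result)
```

Filtering out the rows where C is missing before reshaping avoids the extra `<NA>` column that has to be dropped in the spread() version. With the seven-row data, C3 for L2 stays NA because no such measurement exists.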
If I understand correctly, OP's dataset actually consists of two intermixed datasets:
df
                        time     A      B  C  D C1 C2 C3
1 Nov 1st 2014, 17:36:50.000 20.79     NA NA L1 NA NA NA
2 Nov 1st 2014, 17:36:50.000    NA 97.017 C1 L1 NA NA NA
3 Nov 1st 2014, 17:36:50.000    NA 94.321 C2 L1 NA NA NA
4 Nov 1st 2014, 17:36:50.000    NA 85.014 C3 L1 NA NA NA
5 Nov 1st 2014, 17:37:50.000  21.8     NA NA L2 NA NA NA
6 Nov 1st 2014, 17:37:50.000    NA   87.1 C1 L2 NA NA NA
7 Nov 1st 2014, 17:37:50.000    NA   67.1 C2 L2 NA NA NA
which need to be separated:
library(data.table)
df1 <- setDT(df)[A != "NA", .(time, A, D)]
df1
                         time     A  D
1: Nov 1st 2014, 17:36:50.000 20.79 L1
2: Nov 1st 2014, 17:37:50.000  21.8 L2
and
df2 <- df[A == "NA", .(time, B, C, D)]
df2
                         time      B  C  D
1: Nov 1st 2014, 17:36:50.000 97.017 C1 L1
2: Nov 1st 2014, 17:36:50.000 94.321 C2 L1
3: Nov 1st 2014, 17:36:50.000 85.014 C3 L1
4: Nov 1st 2014, 17:37:50.000   87.1 C1 L2
5: Nov 1st 2014, 17:37:50.000   67.1 C2 L2
The key columns which identify unique subsets of rows are time and D.
Columns C1, C2, and C3 are dropped as they will be created in the next step.
The second dataset is to be reshaped from long to wide format:
wide <- dcast(df2, time + D ~ C, value.var = "B")
wide
                         time  D     C1     C2     C3
1: Nov 1st 2014, 17:36:50.000 L1 97.017 94.321 85.014
2: Nov 1st 2014, 17:37:50.000 L2   87.1   67.1   <NA>
Now both partial results can be joined together:
df1[wide, on = .(time, D)]
                         time     A  D     C1     C2     C3
1: Nov 1st 2014, 17:36:50.000 20.79 L1 97.017 94.321 85.014
2: Nov 1st 2014, 17:37:50.000  21.8 L2   87.1   67.1   <NA>
Note that columns B and C have been dropped from the result as they convey no information.
The steps above can be combined into fewer statements:
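The join only yields one row per group if time and D identify at most one row in the first partial result. A quick check of that assumption might look like the sketch below; the small `df1` here is a hypothetical stand-in rebuilt by hand so the snippet runs on its own, and `n_dup` is an illustrative name.

```r
library(data.table)

# Hypothetical stand-in for df1 (the rows that carry a value in A).
df1 <- data.table(
  time = c("Nov 1st 2014, 17:36:50.000", "Nov 1st 2014, 17:37:50.000"),
  A = c("20.79", "21.8"),
  D = c("L1", "L2")
)

# anyDuplicated() returns 0 when no (time, D) combination repeats,
# otherwise the index of the first duplicated row.
n_dup <- anyDuplicated(df1, by = c("time", "D"))
stopifnot(n_dup == 0L)
```

If the check fails, the join would repeat the wide measurements for every duplicated key, so the duplicates should be resolved first.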
library(data.table)
setDT(df)[, (paste0("C", 1:3)) := NULL]
df[A != "NA"][dcast(df[C != "NA"], time + D ~ C, value.var = "B"), on = .(time, D)]
                         time     A  B  C  D     C1     C2     C3
1: Nov 1st 2014, 17:36:50.000 20.79 NA NA L1 97.017 94.321 85.014
2: Nov 1st 2014, 17:37:50.000  21.8 NA NA L2   87.1   67.1   <NA>
Data, as provided by the OP, with NA values given as strings:
time <- c('Nov 1st 2014, 17:36:50.000','Nov 1st 2014, 17:36:50.000',
'Nov 1st 2014, 17:36:50.000','Nov 1st 2014, 17:36:50.000', 'Nov 1st 2014, 17:37:50.000','Nov 1st 2014, 17:37:50.000','Nov 1st 2014, 17:37:50.000')
A <- c('20.79','NA','NA','NA','21.8','NA','NA')
B <- c('NA','97.017','94.321','85.014','NA','87.1','67.1')
C <- c('NA','C1','C2','C3','NA','C1','C2')
D <- c('L1','L1','L1','L1','L2','L2','L2')
C1 <- c('NA','NA','NA','NA','NA','NA','NA')
C2 <- c('NA','NA','NA','NA','NA','NA','NA')
C3 <- c('NA','NA','NA','NA','NA','NA','NA')
df <- data.frame(time,A,B,C,D,C1,C2,C3)
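Because this data stores missing values as the literal string "NA", functions such as is.na() will not detect them. A minimal sketch of converting them to real NA values follows. It uses a shortened, hypothetical three-row sample instead of the full data and assumes character columns (stringsAsFactors = FALSE).

```r
# Shortened, hypothetical sample with "NA" stored as text.
df <- data.frame(
  A = c("20.79", "NA", "21.8"),
  B = c("NA", "97.017", "NA"),
  C = c("NA", "C1", "NA"),
  stringsAsFactors = FALSE
)

# Replace every literal "NA" string with a real missing value.
df[df == "NA"] <- NA

# Re-detect column types: A and B become numeric, C stays character.
df[] <- lapply(df, type.convert, as.is = TRUE)

str(df)
```

After this conversion, filters such as !is.na(A) behave as expected and the measurements can be treated as numbers rather than text.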