
R - Reshape long to wide, grouping by two variables

I have a dataframe called result with 5 columns (x, y, label, NN.idx and dist), representing respectively the position of an observation in the plane, a label that distinguishes (x,y) duplicates (see my remark below), the index of its nearest neighbour in another dataframe, and the distance to that neighbour.

Remark: Each (x,y) combination may appear one to three times, and if it appears more than once, the occurrences are distinguished by different labels (e.g. rows 1, 4 and 5 in the example below). Also, note that two different points may have the same label, which is a quantity I calculated in earlier data manipulation; e.g. rows 1 and 3 have the same label although they clearly do not represent the same point (x,y).

Here is an example:


result <- data.frame(x=c(0.147674, 0.235356 ,0.095337, 0.147674, 0.147674, 1.000000, 2.000000), y=c(0.132956, 0.150813, 0.087345, 0.132956, 0.132956, 2.000000, 1.000000), label = c(5,6,5,6,7,3,9), NN.idx =c(4325,2703,21282,3460,12,4,10), dist=c(0.02391247,0.03171236,0.01760940,0.03136304, 0.02315468, 0.01567365, 0.02314860))

result

         x        y label NN.idx       dist
1 0.147674 0.132956     5   4325 0.02391247
2 0.235356 0.150813     6   2703 0.03171236
3 0.095337 0.087345     5  21282 0.01760940
4 0.147674 0.132956     6   3460 0.03136304
5 0.147674 0.132956     7     12 0.02315468
6 1.000000 2.000000     3      4 0.01567365
7 2.000000 1.000000     9     10 0.02314860

What I would like to do is reshape this dataframe efficiently (the actual dataframe is much larger) into a wide format where each row corresponds to a unique (x,y) combination, with columns NN.idx_1, NN.idx_2, NN.idx_3, dist_1, dist_2, dist_3 giving the NN.idx and dist for each occurrence of the (x,y) combination in the original dataframe (filled with NA if the (x,y) combination appears only once or twice).

I am relatively new to R and only know the basics, but I think I might have a solution using data.table and dcast as follows:

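library(data.table)
df <- setDT(result)  # converts result to a data.table by reference
df[, NN.counter := seq_len(.N), by = c("x", "y")]  # number the occurrences within each (x, y) group
df <- dcast(df, x + y ~ NN.counter, value.var = c("NN.idx", "dist"))  # one row per (x, y)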

head(df)

          x        y NN.idx_1 NN.idx_2 NN.idx_3     dist_1     dist_2     dist_3
1: 0.095337 0.087345    21282       NA       NA 0.01760940         NA         NA
2: 0.147674 0.132956     4325     3460       12 0.02391247 0.03136304 0.02315468
3: 0.235356 0.150813     2703       NA       NA 0.03171236         NA         NA
4: 1.000000 2.000000        4       NA       NA 0.01567365         NA         NA
5: 2.000000 1.000000       10       NA       NA 0.02314860         NA         NA


My question is the following: is my approach OK? I am not familiar with dcast, and the notation x+y ~ NN.counter makes me wonder whether two different points (x,y) that give the same sum x+y would be treated as different (e.g. rows 6 and 7 of my original dataframe, where x and y are swapped). It seems to work.

Does anyone have a better approach to deal with this duplicate issue, or is mine OK? Also, I don't know whether this is reasonably fast, though I've read that data.table is pretty fast.
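On the x+y concern: in a dcast formula, `+` on the left-hand side separates the grouping columns; it does not add them. A minimal sketch using only the two "swapped" points from rows 6 and 7 (the names `pts`, `val` and `k` are illustrative, not from the post):

```r
library(data.table)

# Two points with the same sum x + y = 3, but different coordinates.
pts <- data.table(x = c(1, 2), y = c(2, 1), val = c(4, 10))

# Occurrence counter within each (x, y) pair, as in the approach above.
pts[, k := seq_len(.N), by = .(x, y)]

# "x + y" means "one row per distinct (x, y) pair", not "per value of the sum".
wide <- dcast(pts, x + y ~ k, value.var = "val")
nrow(wide)  # 2: the points (1, 2) and (2, 1) stay separate
```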

Since x and y are both numeric, you might run into problems with floating-point precision (see R FAQ 7.31 and IEEE-754). While it might work, I don't know that I would strictly rely on it (without a lot of verification). It might be useful (for the purpose of reshaping) to coerce to fixed-length strings (e.g., sprintf("%0.06f", x)) before grouping and dcasting.
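To illustrate the floating-point concern, here is a small sketch in base R. The values 0.1, 0.2 and 0.3 are the usual textbook example, not data from the question. Two numbers that print identically can still differ in their last bits, while their fixed-width string forms compare equal:

```r
a <- 0.1 + 0.2
b <- 0.3

a == b                    # FALSE: the two doubles differ in their last bits
isTRUE(all.equal(a, b))   # TRUE: equal within a numeric tolerance

# Fixed-width string keys built with sprintf() compare equal.
key_a <- sprintf("%0.06f", a)
key_b <- sprintf("%0.06f", b)
key_a == key_b            # TRUE: both are "0.300000"
```

The trade-off is that the chosen number of decimals decides which points count as "the same", so it should be at least as fine as the precision of the real coordinates.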

Here's a thought that implements that workaround. (Note: I'm using magrittr solely to break out the steps with the %>% pipe; it is not required for this to work.)

library(data.table)
library(magrittr)
result <- data.table(x=c(0.147674, 0.235356 ,0.095337, 0.147674, 0.147674, 1.000000, 2.000000), y=c(0.132956, 0.150813, 0.087345, 0.132956, 0.132956, 2.000000, 1.000000), label = c(5,6,5,6,7,3,9), NN.idx =c(4325,2703,21282,3460,12,4,10), dist=c(0.02391247,0.03171236,0.01760940,0.03136304, 0.02315468, 0.01567365, 0.02314860))

result[, c("x_s", "y_s") := lapply(.(x, y), sprintf, fmt = "%0.09f") ]
savexy <- unique(result[, .(x, y, x_s, y_s) ]) # merge back in later with "real" numbers
result2 <- copy(result) %>%
  .[, c("x", "y") := NULL ] %>%
  .[, NN.counter := seq_len(.N), by = c("x_s", "y_s") ] %>%
  dcast(x_s + y_s ~ NN.counter, value.var = c("NN.idx", "dist") ) %>%
  merge(., savexy, by = c("x_s", "y_s"), all.x = TRUE) %>%
  .[, c("x_s", "y_s") := NULL ] %>%
  setcolorder(., c("x", "y"))
result2
#           x        y NN.idx_1 NN.idx_2 NN.idx_3     dist_1     dist_2     dist_3
# 1: 0.095337 0.087345    21282       NA       NA 0.01760940         NA         NA
# 2: 0.147674 0.132956     4325     3460       12 0.02391247 0.03136304 0.02315468
# 3: 0.235356 0.150813     2703       NA       NA 0.03171236         NA         NA
# 4: 1.000000 2.000000        4       NA       NA 0.01567365         NA         NA
# 5: 2.000000 1.000000       10       NA       NA 0.02314860         NA         NA
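One possible form of the verification mentioned earlier is a pair of sanity checks on the reshaped table: it should have one row per distinct (x, y) pair, and every original observation should land in exactly one non-NA cell. This is only a sketch on a cut-down copy of the example data (one value column, and the names `obs`, `wide`, `n_pairs` and `n_filled` are illustrative):

```r
library(data.table)

obs <- data.table(
  x = c(0.147674, 0.235356, 0.095337, 0.147674, 0.147674, 1, 2),
  y = c(0.132956, 0.150813, 0.087345, 0.132956, 0.132956, 2, 1),
  NN.idx = c(4325, 2703, 21282, 3460, 12, 4, 10)
)
obs[, NN.counter := seq_len(.N), by = .(x, y)]
wide <- dcast(obs, x + y ~ NN.counter, value.var = "NN.idx")

# Check 1: one wide row per distinct (x, y) pair.
n_pairs <- uniqueN(obs, by = c("x", "y"))
stopifnot(nrow(wide) == n_pairs)

# Check 2: the number of filled cells equals the number of original rows.
value_cols <- setdiff(names(wide), c("x", "y"))
n_filled <- sum(!is.na(as.matrix(wide[, value_cols, with = FALSE])))
stopifnot(n_filled == nrow(obs))
```

On the real data, the same two checks could be run against both the numeric-key and the string-key versions to see whether they agree.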
