
Create sequence of values based on multiple column values in R

I have a data.frame that is the result of a nearest-neighbor search over points. It has three columns: V1 is the index of the closest point, V2 the second closest, and V3 the third:

search_result <- structure(list(V1 = c(1350L, 1390L, 1411L, 1437L, 1444L, 1895L, 
                                       1895L, 1467L, 1478L, 1500L), 
                                V2 = c(1351L, 1391L, 1410L, 1438L, 
                                       1907L, 1456L, 1456L, 1466L, 1477L, 1499L), 
                                V3 = c(1349L, 1389L, 1940L, 1913L, 1445L, 1894L, 
                                       1894L, 1884L, 1479L, 1501L)), 
                           row.names = c(NA, -10L), 
                           class = "data.frame")

Since I want the closest neighbor, I could simply select V1 as the result. However, I also want the indices to be in order, and V1 has some indices that are out of order. So I want to create a column that holds the value of V1 when it is in order, or otherwise the value of V2 or V3 (with V2 taking priority), so that the order is preserved. In this case the result would be:

     V1   V2   V3 ordered
1  1350 1351 1349    1350
2  1390 1391 1389    1390
3  1411 1410 1940    1411
4  1437 1438 1913    1437
5  1444 1907 1445    1444
6  1895 1456 1894    1456 #take V2 instead
7  1895 1456 1894    1456 #take V2 instead
8  1467 1466 1884    1467
9  1478 1477 1479    1478
10 1500 1499 1501    1500

I tried taking the minimum value across the columns, but later in the dataset there are cases where the maximum value would be the desired one (not the best option, but closer to what is expected). In the example below, there are discontinuities in rows 2, 4, 5 and 6, so I would take the value of V2 (priority) or V3 instead, so that the "order" is maintained:

# it's harder to see the "order" here, but it starts in V1 = 1881

   V1   V2   V3  ordered
1 1881 1470 1880    1881
2 1457 1893 1894    1893 #take V2 instead
3 1907 1444 1906    1907
4 1442 1443 1908    1908 #take V3 instead
5 1433 1918 1432    1918 #take V2 instead
6 1402 1949 1401    1949 #take V2 instead
7 1968 1969 1967    1968
8 1985 1986 1984    1985
9 1992 1993 1991    1992

The full dataset has 2500 points, and the "unordered" values occur in roughly 10% of them, so I can estimate what the "order" is.

Any base, tidyverse or data.table help would be appreciated. Thanks!

It sounds like what you want to do is iterate over the columns returned by the search and, for each row, keep the first value that leaves the indices in order.

Start by assuming the first column is in order. Move to the second column and use it to replace any rows where this is not true. Move to the third column, comparing against your updated ordered column. Continue for all the columns.
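The steps above can also be written as an explicit base R loop. This is only a minimal sketch: fill_in_order is a hypothetical helper name, and it assumes, like the description, that all indices are positive and that at least one column satisfies the ordering for every row.

```r
# Hypothetical helper (not part of the original answer): base R version of the
# column-by-column replacement described above.
fill_in_order <- function(df) {
  ordered <- df[[1]]                      # start by trusting the first column
  for (j in seq_along(df)[-1]) {          # then walk through the remaining columns
    previous <- c(0, head(ordered, -1))   # value of the previous row (0 for row 1)
    out_of_order <- ordered < previous    # rows that break the ascending order
    ordered[out_of_order] <- df[[j]][out_of_order]
  }
  ordered
}

# second data set from the question
df <- data.frame(
  V1 = c(1881, 1457, 1907, 1442, 1433, 1402, 1968, 1985, 1992),
  V2 = c(1470, 1893, 1444, 1443, 1918, 1949, 1969, 1986, 1993),
  V3 = c(1880, 1894, 1906, 1908, 1432, 1401, 1967, 1984, 1991)
)
df$ordered <- fill_in_order(df)
df$ordered
#> [1] 1881 1893 1907 1908 1918 1949 1968 1985 1992
```

Each pass only touches rows that are still smaller than the row before them, so values that are already in order are never overwritten by a lower-priority column.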

There may be a more optimized way of coding this (such as checking whether the answer has converged before iterating over all of the columns), but here is a compact way to achieve it (note that the lag function is dplyr::lag, not stats::lag):

library(dplyr)
library(purrr)

# using the second data set
# assuming at least one column will satisfy the constraints
data.frame(
  V1 = c(1881, 1457, 1907, 1442, 1433, 1402, 1968, 1985, 1992),
  V2 = c(1470, 1893, 1444, 1443, 1918, 1949, 1969, 1986, 1993),
  V3 = c(1880, 1894, 1906, 1908, 1432, 1401, 1967, 1984, 1991)
) %>%
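  dplyr::mutate(
    # fold over the columns from left to right: keep the running value where it
    # is not smaller than the previous row, otherwise take the next column's value
    ordered = purrr::reduce(., ~ ifelse(.x >= dplyr::lag(.x, default = 0), .x, .y))
  )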

#>     V1   V2   V3 ordered
#> 1 1881 1470 1880    1881
#> 2 1457 1893 1894    1893
#> 3 1907 1444 1906    1907
#> 4 1442 1443 1908    1908
#> 5 1433 1918 1432    1918
#> 6 1402 1949 1401    1949
#> 7 1968 1969 1967    1968
#> 8 1985 1986 1984    1985
#> 9 1992 1993 1991    1992

If you're not sure whether the nearest-neighbor search returned enough columns, add one more step that checks whether the ordered column is ascending:

search_results <- data.frame(
  V1 = c(1881, 1457, 1907, 1442, 1433, 1402, 1968, 1785, 1992),
  V2 = c(1470, 1893, 1444, 1443, 1918, 1949, 1969, 1786, 1993),
  V3 = c(1880, 1894, 1906, 1908, 1432, 1401, 1967, 1784, 1991)
) %>%
  dplyr::mutate(
    # same fold as before; row 8 has no candidate that keeps the order
    ordered = purrr::reduce(., ~ ifelse(.x >= dplyr::lag(.x, default = 0), .x, .y))
  )

with(search_results, any(ordered < lag(ordered, default = 0)))
#> [1] TRUE

Created on 2019-07-19 by the reprex package (v0.3.0)
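To see where the order still breaks rather than only whether it does, a base R check along these lines may help. It is a sketch that uses the ordered values the reduce step produces for the data above; bad_rows is just an illustrative name.

```r
# ordered column produced by the reduce step for the data set above
ordered <- c(1881, 1893, 1907, 1908, 1918, 1949, 1968, 1784, 1992)

# TRUE when the vector is not in ascending order
is.unsorted(ordered)
#> [1] TRUE

# rows whose value is smaller than the value in the previous row
bad_rows <- which(diff(ordered) < 0) + 1
bad_rows
#> [1] 8
```

Row 8 is flagged because none of its three candidates (1785, 1786, 1784) is at least as large as the 1968 in row 7, so more neighbors would be needed for that point.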

Since V1 should always be increasing, we can take the first value of V1 as a reference, subtract it from every value in each row from the second row onward, and keep the value that gives the minimum difference. Since we also want to account for priority, one way is to multiply each difference by an increasing weight. In this example I have simply used the integers 1, 2 and 3, so the first difference is multiplied by 1, the second by 2, and so on. More complex ways of assigning priority can be devised if edge cases are found.
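As a worked example of the weighting, take row 6 of the first dataset, where V1 is out of order. The variable names row6 and weighted below are only illustrative.

```r
first_value <- 1350                           # V1 of the first row
row6 <- c(V1 = 1895, V2 = 1456, V3 = 1894)    # row 6 of the first data set

# difference to the reference, multiplied by the priority weights 1, 2, 3
weighted <- (row6 - first_value) * seq_along(row6)
# V1: 545 * 1 = 545, V2: 106 * 2 = 212, V3: 544 * 3 = 1632

names(which.min(weighted))
#> [1] "V2"
```

V2 wins because its weighted difference of 212 is the smallest, even though it carries a larger weight than V1.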

first_value <- search_result$V1[1]
search_result$ordered <- c(first_value, apply(search_result[-1, ], 1, function(x) {
     x <- x[x > first_value]
     x[which.min((x - first_value) * seq_along(x))]
}))

search_result
#     V1   V2   V3 ordered
#1  1350 1351 1349    1350
#2  1390 1391 1389    1390
#3  1411 1410 1940    1411
#4  1437 1438 1913    1437
#5  1444 1907 1445    1444
#6  1895 1456 1894    1456
#7  1895 1456 1894    1456
#8  1467 1466 1884    1467
#9  1478 1477 1479    1478
#10 1500 1499 1501    1500

This also works for the second dataset; call it df:

first_value <- df$V1[1]
df$ordered <- c(first_value, apply(df[-1, ], 1, function(x) {
     x <- x[x > first_value]
     x[which.min((x - first_value) * seq_along(x))]
}))

df
#    V1   V2   V3 ordered
#1 1881 1470 1880    1881
#2 1457 1893 1894    1893
#3 1907 1444 1906    1907
#4 1442 1443 1908    1908
#5 1433 1918 1432    1918
#6 1402 1949 1401    1949
#7 1968 1969 1967    1968
#8 1985 1986 1984    1985
#9 1992 1993 1991    1992
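One edge case worth guarding against: if no value in a row exceeds first_value, the function returns a zero-length result for that row and apply() no longer gives back a plain vector. A possible defensive variant is sketched below. pick_weighted is a hypothetical helper name, and returning NA for such rows is an assumption about the desired behavior, not part of the answer above.

```r
# Hypothetical helper: same weighting idea as above, but returns NA when no
# candidate in the row exceeds the reference value.
pick_weighted <- function(x, reference) {
  candidates <- x[x > reference]
  if (length(candidates) == 0) return(NA_real_)
  scores <- (candidates - reference) * seq_along(candidates)
  unname(candidates[which.min(scores)])
}

# second data set from the question
df <- data.frame(
  V1 = c(1881, 1457, 1907, 1442, 1433, 1402, 1968, 1985, 1992),
  V2 = c(1470, 1893, 1444, 1443, 1918, 1949, 1969, 1986, 1993),
  V3 = c(1880, 1894, 1906, 1908, 1432, 1401, 1967, 1984, 1991)
)

first_value <- df$V1[1]
rest <- seq_len(nrow(df))[-1]                 # every row except the first
df$ordered <- c(first_value, vapply(rest, function(i) {
  pick_weighted(unlist(df[i, c("V1", "V2", "V3")]), first_value)
}, numeric(1)))

df$ordered
#> [1] 1881 1893 1907 1908 1918 1949 1968 1985 1992
```

Using vapply() with numeric(1) makes the result a plain numeric vector, and rows with no usable candidate show up as NA so they can be inspected separately.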
