R - Add a column on a dataset based on values from another dataset
I have a dataset (df1) with a column that contains Remaining_points for each owner.
df1:
Id Owner Remaining_points
00001 John 18
00008 Paul 34
00011 Alba 52
00004 Martha 67
And another one (df2), with different IDs, that contains Points:
Id Points
00025 17
00076 35
00089 51
00092 68
I need to add an Owner column to df2, taking for each row the owner whose Remaining_points in df1 is closest to that row's Points.
Desired data frame:
Id Points Owner
00025 17 John
00076 35 Paul
00089 51 Alba
00092 68 Martha
Could anyone please help me with this? I'm used to working with dplyr, but any solution would be much appreciated.
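df1 <- data.frame(ID = c("00001", "00008", "00011", "00004"),
                  Owner = c("John", "Paul", "Alba", "Martha"),
                  Remaining_points = c(18, 34, 52, 67))
df2 <- data.frame(ID = c("00025", "00076", "00089", "00092"),
                  Points = c(17, 35, 51, 68))
# Absolute differences between every Remaining_points (rows) and every Points (columns);
# for each column, flag the row(s) holding the minimum and collect their indices.
ind <- which(apply(abs(outer(df1$Remaining_points, df2$Points, "-")), 2, function(x) x == min(x)), arr.ind = TRUE)
# Column 1 of ind is the df1 row index of the closest owner for each df2 row.
df2$Owner <- df1$Owner[ind[, 1]]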
df2
ID Points Owner
1 00025 17 John
2 00076 35 Paul
3 00089 51 Alba
4 00092 68 Martha
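One caveat: if two owners are equally close to the same Points value, the `x == min(x)` test above flags more than one row for that column, and the assignment to `df2$Owner` then fails on a length mismatch. A minimal sketch of a tie-tolerant variant, assuming it is acceptable to keep the first of any tied owners, uses `which.min()` on each column (`dist_mat` and `nearest` are names I made up for illustration):

```r
df1 <- data.frame(ID = c("00001", "00008", "00011", "00004"),
                  Owner = c("John", "Paul", "Alba", "Martha"),
                  Remaining_points = c(18, 34, 52, 67))
df2 <- data.frame(ID = c("00025", "00076", "00089", "00092"),
                  Points = c(17, 35, 51, 68))

# Rows correspond to df1, columns to df2; each cell is |Remaining_points - Points|.
dist_mat <- abs(outer(df1$Remaining_points, df2$Points, "-"))

# which.min() returns the index of the first minimum only,
# so every column yields exactly one df1 row, even when there are ties.
nearest <- apply(dist_mat, 2, which.min)

df2$Owner <- df1$Owner[nearest]
df2
```

On the sample data this gives the same owners as the version above, since no column has a tie at its minimum.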
@tacoman's answer works well, but I couldn't resist including a dplyr version. The cross join does a job similar to @tacoman's outer().
df1 <- data.frame(ID_1 = c("00001", "00008", "00011", "00004"),
                  Owner = c("John", "Paul", "Alba", "Martha"),
                  Remaining_points = c(18, 34, 52, 67))
df2 <- data.frame(ID_2 = c("00025", "00076", "00089", "00092"),
                  Points = c(17, 35, 51, 68))
df1 |>
  # Cross join: pair every df1 row with every df2 row (no key is used).
  # Needs dplyr >= 1.1.0; on older versions use dplyr::full_join(df2, by = character()).
  dplyr::cross_join(df2) |>
  dplyr::mutate(
    distance = abs(Points - Remaining_points), # Find the difference in all possibilities
  ) |>
  dplyr::group_by(ID_2) |> # Isolate each ID in its own sub-dataset
  dplyr::mutate(
    rank = dplyr::row_number(distance), # Rank the distances. The closest will be '1'.
  ) |>
  dplyr::filter(rank == 1L) |> # Keep only the closest
  dplyr::ungroup() |>
  dplyr::select(
    ID_2,
    Points,
    Owner
  )
Result:
# A tibble: 4 x 3
ID_2 Points Owner
<chr> <dbl> <chr>
1 00025 17 John
2 00076 35 Paul
3 00089 51 Alba
4 00092 68 Martha
This is the intermediate result (before removing the extra rows and columns):
# A tibble: 16 x 7
# Groups: ID_2 [4]
ID_1 Owner Remaining_points ID_2 Points distance rank
<chr> <chr> <dbl> <chr> <dbl> <dbl> <int>
1 00001 John 18 00025 17 1 1 # <- closest for John
2 00001 John 18 00076 35 17 2
3 00001 John 18 00089 51 33 4
4 00001 John 18 00092 68 50 4
5 00008 Paul 34 00025 17 17 2
6 00008 Paul 34 00076 35 1 1 # <- closest for Paul
7 00008 Paul 34 00089 51 17 3
8 00008 Paul 34 00092 68 34 3
9 00011 Alba 52 00025 17 35 3
10 00011 Alba 52 00076 35 17 3
11 00011 Alba 52 00089 51 1 1 # <- closest for Alba
12 00011 Alba 52 00092 68 16 2
13 00004 Martha 67 00025 17 50 4
14 00004 Martha 67 00076 35 32 4
15 00004 Martha 67 00089 51 16 2
16 00004 Martha 67 00092 68 1 1 # <- closest for Martha
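The rank-and-filter step can also be condensed. As a sketch, assuming dplyr 1.1.0 or later (for `cross_join()`), `slice_min()` keeps the closest row per group directly. Setting `with_ties = FALSE` is my own choice here, so that each ID_2 ends up with exactly one owner even if two owners are equally close; `result` is just an illustrative name.

```r
library(dplyr)

df1 <- data.frame(ID_1 = c("00001", "00008", "00011", "00004"),
                  Owner = c("John", "Paul", "Alba", "Martha"),
                  Remaining_points = c(18, 34, 52, 67))
df2 <- data.frame(ID_2 = c("00025", "00076", "00089", "00092"),
                  Points = c(17, 35, 51, 68))

result <- df1 |>
  cross_join(df2) |>                                   # All owner/ID_2 pairs
  mutate(distance = abs(Points - Remaining_points)) |> # Gap for each pair
  group_by(ID_2) |>
  slice_min(distance, n = 1, with_ties = FALSE) |>     # Closest owner per ID_2
  ungroup() |>
  select(ID_2, Points, Owner)

result
```

This drops the helper `rank` column; the trade-off is that the intermediate ranking is no longer available for inspection.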