R - Add a column on a dataset based on values from another dataset

I have a dataset (df1) with a column containing each owner's Remaining_points:

DF1:

Id      Owner   Remaining_points
00001   John    18
00008   Paul    34
00011   Alba    52
00004   Martha  67

And another one, df2, with different IDs, that contains points:

Id      Points
00025   17
00076   35
00089   51
00092   68

I need to add an Owner column to df2, taking the owner whose Remaining_points in df1 is closest to each Points value.

Desired dataframe:

Id      Points  Owner
00025   17      John
00076   35      Paul
00089   51      Alba
00092   68      Martha

Please, could anyone help me? I usually work with dplyr, but any solution will be greatly appreciated.

df1 <- data.frame(ID = c("00001", "00008", "00011", "00004"),
                  Owner = c("John", "Paul", "Alba", "Martha"),
                  Remaining_points = c(18, 34, 52, 67))

df2 <- data.frame(ID = c("00025", "00076", "00089", "00092"),
                  Points = c(17, 35, 51, 68))

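# For each df2 row (a column of the outer() matrix), locate the df1 row with the smallest absolute difference
ind <- which(apply(abs(outer(df1$Remaining_points,df2$Points, "-")), 2, function(x) x == min(x)), arr.ind = TRUE)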
df2$Owner <- df1$Owner[ind[,1]]
df2
     ID Points  Owner
1 00025     17   John
2 00076     35   Paul
3 00089     51   Alba
4 00092     68 Martha
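
If two owners are equally close to a Points value, the `x == min(x)` comparison returns more than one match for that column, so `ind` no longer has exactly one row per df2 row. Here is a minimal sketch of a tie-safe variant. It assumes that keeping the first closest owner is acceptable, which is my assumption and not something the question states; the names `gap` and `closest` are only illustrative.

```r
df1 <- data.frame(ID = c("00001", "00008", "00011", "00004"),
                  Owner = c("John", "Paul", "Alba", "Martha"),
                  Remaining_points = c(18, 34, 52, 67))

df2 <- data.frame(ID = c("00025", "00076", "00089", "00092"),
                  Points = c(17, 35, 51, 68))

# Rows = df1 owners, columns = df2 rows; each cell is the absolute gap.
gap <- abs(outer(df1$Remaining_points, df2$Points, "-"))

# which.min() returns exactly one index per column (the first minimum),
# so there is always one owner per df2 row, even when distances tie.
closest <- apply(gap, 2, which.min)
df2$Owner <- df1$Owner[closest]
df2
```

On the sample data this gives the same four owners as the output above.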

@tacoman's answer works well, but I couldn't resist including a dplyr version. The cross join does work similar to @tacoman's outer().

df1 <- data.frame(ID_1 = c("00001", "00008", "00011", "00004"),
                  Owner = c("John", "Paul", "Alba", "Martha"),
                  Remaining_points = c(18, 34, 52, 67))

df2 <- data.frame(ID_2 = c("00025", "00076", "00089", "00092"),
                  Points = c(17, 35, 51, 68))


df1 |> 
  dplyr::full_join(df2, by = character()) |>    # Cross join (no key used); dplyr >= 1.1.0 deprecates this form in favor of dplyr::cross_join(df2).
  dplyr::mutate(
    distance  = abs(Points - Remaining_points), # Find the difference in all possibilities
  ) |> 
  dplyr::group_by(ID_2) |>                      # Isolate each ID in its own sub-dataset 
  dplyr::mutate(
    rank      = dplyr::row_number(distance),    # Rank the distances. The closest will be '1'.
  ) |> 
  dplyr::filter(rank == 1L) |>                  # Keep only the closest
  dplyr::ungroup() |> 
  dplyr::select(
    ID_2,
    Points,
    Owner
  )

Result:

# A tibble: 4 x 3
  ID_2  Points Owner 
  <chr>  <dbl> <chr> 
1 00025     17 John  
2 00076     35 Paul  
3 00089     51 Alba  
4 00092     68 Martha

Here is the intermediate result (before the extra rows and columns are dropped):

# A tibble: 16 x 7
# Groups:   ID_2 [4]
   ID_1  Owner  Remaining_points ID_2  Points distance  rank
   <chr> <chr>             <dbl> <chr>  <dbl>    <dbl> <int>
 1 00001 John                 18 00025     17        1     1 # <- closest for John
 2 00001 John                 18 00076     35       17     2
 3 00001 John                 18 00089     51       33     4
 4 00001 John                 18 00092     68       50     4
 5 00008 Paul                 34 00025     17       17     2
 6 00008 Paul                 34 00076     35        1     1 # <- closest for Paul
 7 00008 Paul                 34 00089     51       17     3
 8 00008 Paul                 34 00092     68       34     3
 9 00011 Alba                 52 00025     17       35     3
10 00011 Alba                 52 00076     35       17     3
11 00011 Alba                 52 00089     51        1     1 # <- closest for Alba
12 00011 Alba                 52 00092     68       16     2
13 00004 Martha               67 00025     17       50     4
14 00004 Martha               67 00076     35       32     4
15 00004 Martha               67 00089     51       16     2
16 00004 Martha               67 00092     68        1     1 # <- closest for Martha
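
The rank-and-filter steps in the pipeline above can probably be written more compactly on newer dplyr. The sketch below assumes dplyr >= 1.1.0 (for `cross_join()`) and R >= 4.1 (for `|>`). It also assumes that when two owners tie, keeping the first one is fine, which is what `with_ties = FALSE` does. The name `closest` is only illustrative.

```r
df1 <- data.frame(ID_1 = c("00001", "00008", "00011", "00004"),
                  Owner = c("John", "Paul", "Alba", "Martha"),
                  Remaining_points = c(18, 34, 52, 67))

df2 <- data.frame(ID_2 = c("00025", "00076", "00089", "00092"),
                  Points = c(17, 35, 51, 68))

closest <- df2 |>
  dplyr::cross_join(df1) |>                               # Pair every df2 row with every df1 row
  dplyr::mutate(
    distance = abs(Points - Remaining_points)             # Gap for each pairing
  ) |>
  dplyr::group_by(ID_2) |>                                # One group per df2 ID
  dplyr::slice_min(distance, n = 1, with_ties = FALSE) |> # Keep a single closest owner per group
  dplyr::ungroup() |>
  dplyr::select(ID_2, Points, Owner)

closest
```

`slice_min()` replaces the separate `row_number()` ranking and `filter(rank == 1L)` steps, so no helper `rank` column has to be created and then dropped.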
