简体   繁体   中英

R - Add a column on a dataset based on values from another dataset

I have a dataset (df1) with a column that contains Remaining_points for each owner

Df1:

Id      Owner   Remaining_points
00001   John    18
00008   Paul    34
00011   Alba    52
00004   Martha  67

And another one with different id's that contains points Df2

Id      Points
00025   17
00076   35
00089   51
00092   68

I need to add to df2 a Owner column with most similar Remaining_points on df1

Desired dataframe:

Id      Points  Owner
00025   17      John
00076   35      Paul
00089   51      Alba
00092   68      Martha

Please, could anyone help me on this? I'm used to work with dplyr but any solution would be very appreciated.

df1 <- data.frame(ID = c("00001", "00008", "00011", "00004"),
                  Owner = c("John", "Paul", "Alba", "Martha"),
                  Remaining_points = c(18, 34, 52, 67))

df2 <- data.frame(ID = c("00025", "00076", "00089", "00092"),
                  Points = c(17, 35, 51, 68))

ind <- which(apply(abs(outer(df1$Remaining_points,df2$Points, "-")), 2, function(x) x == min(x)), arr.ind = TRUE)
df2$Owner <- df1$Owner[ind[,1]]
df2
     ID Points  Owner
1 00025     17   John
2 00076     35   Paul
3 00089     51   Alba
4 00092     68 Martha

@tacoman's works well. But I couldn't resist including a dplyr version. The cross join is doing a similar job to @tacoman's outer() .

df1 <- data.frame(ID_1 = c("00001", "00008", "00011", "00004"),
                  Owner = c("John", "Paul", "Alba", "Martha"),
                  Remaining_points = c(18, 34, 52, 67))

df2 <- data.frame(ID_2 = c("00025", "00076", "00089", "00092"),
                  Points = c(17, 35, 51, 68))


df1 |> 
  dplyr::full_join(df2, by = character()) |>    # This is essentially a cross join b/c no key is used.
  dplyr::mutate(
    distance  = abs(Points - Remaining_points), # Find the difference in all possibilities
  ) |> 
  dplyr::group_by(ID_2) |>                      # Isolate each ID in its own sub-dataset 
  dplyr::mutate(
    rank      = dplyr::row_number(distance),    # Rank the distances. The closest will be '1'.
  ) |> 
  dplyr::filter(rank == 1L) |>                  # Keep only the closest
  dplyr::ungroup() |> 
  dplyr::select(
    ID_2,
    Points,
    Owner
  )

Result:

# A tibble: 4 x 3
  ID_2  Points Owner 
  <chr>  <dbl> <chr> 
1 00025     17 John  
2 00076     35 Paul  
3 00089     51 Alba  
4 00092     68 Martha

This is the intermediate result (before removing the extra rows and columns):

# A tibble: 16 x 7
# Groups:   ID_2 [4]
   ID_1  Owner  Remaining_points ID_2  Points distance  rank
   <chr> <chr>             <dbl> <chr>  <dbl>    <dbl> <int>
 1 00001 John                 18 00025     17        1     1 # <- closest for John
 2 00001 John                 18 00076     35       17     2
 3 00001 John                 18 00089     51       33     4
 4 00001 John                 18 00092     68       50     4
 5 00008 Paul                 34 00025     17       17     2
 6 00008 Paul                 34 00076     35        1     1 # <- closest for Paul
 7 00008 Paul                 34 00089     51       17     3
 8 00008 Paul                 34 00092     68       34     3
 9 00011 Alba                 52 00025     17       35     3
10 00011 Alba                 52 00076     35       17     3
11 00011 Alba                 52 00089     51        1     1 # <- closest for Alba
12 00011 Alba                 52 00092     68       16     2
13 00004 Martha               67 00025     17       50     4
14 00004 Martha               67 00076     35       32     4
15 00004 Martha               67 00089     51       16     2
16 00004 Martha               67 00092     68        1     1 # <- closest for Martha

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM