Recode multiple columns to numbers incrementally in R

Question

I have 50 columns of names, but for convenience I show only 4 of them here.

Name1       Name2         Name3      Name4
Rose,Ali    Van,Hall      Ghol,Dam   Murr,kate
Camp,Laura  Ka,Klo        Dan,Dan    Ali,Hoss
Rose,Ali    Van,Hall      Ghol,Dam   Kol,Kan
Murr,Kate   Ismal, Ismal  Sian,Rozi  Nas,Ami
Ghol,Dam    Ka,Klo        Rose,Ali   Nor,Ko
Murr,Kate   Ismal, Ismal  Dan,Dan    Nas,Ami

I want to assign a number to each person, column by column, as one continuous sequence of numbers.

For example, in Name1 the numbers run from 1 to 4. Repeated names get the same number.

In Name2, numbering should start from 5, and so on. This will give me the following table:

Assign1 Assign2 Assign3 Assign4
      1       5       8      12
      2       6       9      13
      1       5       8      14
      3       7      10      15
      4       6      11      17
      3       7       9      15

I would like to do this without a loop, i.e., with something like sapply(dat, function(x) match(x, unique(x))).

Using dplyr or tidyverse would be great.

Answer 1

A tidyverse solution with purrr::accumulate():

library(tidyverse)

df %>%
  mutate(as_tibble(
    accumulate(across(Name1:Name4, ~ match(.x, unique(.x))), ~ .y + max(.x))
  ))

#   Name1 Name2 Name3 Name4
# 1     1     5     8    12
# 2     2     6     9    13
# 3     1     5     8    14
# 4     3     7    10    15
# 5     4     6    11    16
# 6     3     7     9    15
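
To make the two steps in the accumulate() call easier to follow, here is a minimal base R sketch of the same idea. It is an illustration rather than part of the original answer: Reduce(accumulate = TRUE) is swapped in for purrr::accumulate(), and the names codes, shifted and result are hypothetical. The data is copied from the question.

```r
# Sample data copied from the question
df <- data.frame(
  Name1 = c("Rose,Ali", "Camp,Laura", "Rose,Ali", "Murr,Kate", "Ghol,Dam", "Murr,Kate"),
  Name2 = c("Van,Hall", "Ka,Klo", "Van,Hall", "Ismal, Ismal", "Ka,Klo", "Ismal, Ismal"),
  Name3 = c("Ghol,Dam", "Dan,Dan", "Ghol,Dam", "Sian,Rozi", "Rose,Ali", "Dan,Dan"),
  Name4 = c("Murr,kate", "Ali,Hoss", "Kol,Kan", "Nas,Ami", "Nor,Ko", "Nas,Ami"),
  stringsAsFactors = FALSE
)

# Step 1: code each column on its own, so every column starts at 1
codes <- lapply(df, function(x) match(x, unique(x)))

# Step 2: shift each column by the largest value of the
# already-shifted previous column (the running maximum)
shifted <- Reduce(function(prev, cur) cur + max(prev), codes, accumulate = TRUE)
names(shifted) <- names(df)  # set the names explicitly rather than rely on Reduce()

result <- as.data.frame(shifted)
result
```

The first column is left unchanged because Reduce() uses it as the starting value, which mirrors how accumulate() treats the first element.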

Answer 2

Because the values in each column depend on the values in the previous column, the calculations have to be done sequentially. This is probably most succinctly achieved with a loop. Remember that lapply and sapply are simply loops in disguise and won't be quicker than an explicit loop.

Note that your expected output has a mistake in it (there is a 17 that should be 16).

output <- setNames(df, paste0('Assign', seq_along(df)))
                   
for(i in seq_along(output)) {
  output[[i]] <- match(output[[i]], unique(output[[i]]))
  if(i > 1) output[[i]] <- output[[i]] + max(output[[i - 1]])
}

output
#>    Assign1  Assign2  Assign3  Assign4
#> 1        1        5        8       12
#> 2        2        6        9       13
#> 3        1        5        8       14
#> 4        3        7       10       15
#> 5        4        6       11       16
#> 6        3        7        9       15

Edit

If you really want it without an explicit loop, you can do:

res <- sapply(seq_along(df), \(i) match(df[[i]], unique(df[[i]]))) 
res + t(replicate(nrow(df), head(c(0, cumsum(apply(res, 2, max))), -1))) |>
  as.data.frame() |>
  setNames(paste0('Assign', seq_along(df)))
#>   Assign1 Assign2 Assign3 Assign4
#> 1       1       5       8      12
#> 2       2       6       9      13
#> 3       1       5       8      14
#> 4       3       7      10      15
#> 5       4       6      11      16
#> 6       3       7       9      15

Created on 2023-01-13 with reprex v2.0.2
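
The loop-free expression above is dense, so here is a hedged step-by-step sketch of the same arithmetic. It assumes, as the code above does, that the maximum code in a column equals its number of distinct names. The use of sweep() and the names codes, n_distinct, offsets and result are choices made for this illustration, not part of the original answer.

```r
# Sample data copied from the question
df <- data.frame(
  Name1 = c("Rose,Ali", "Camp,Laura", "Rose,Ali", "Murr,Kate", "Ghol,Dam", "Murr,Kate"),
  Name2 = c("Van,Hall", "Ka,Klo", "Van,Hall", "Ismal, Ismal", "Ka,Klo", "Ismal, Ismal"),
  Name3 = c("Ghol,Dam", "Dan,Dan", "Ghol,Dam", "Sian,Rozi", "Rose,Ali", "Dan,Dan"),
  Name4 = c("Murr,kate", "Ali,Hoss", "Kol,Kan", "Nas,Ami", "Nor,Ko", "Nas,Ami"),
  stringsAsFactors = FALSE
)

# Per-column codes, each column starting at 1 (a 6 x 4 integer matrix)
codes <- sapply(df, function(x) match(x, unique(x)))

# Number of distinct names per column
n_distinct <- apply(codes, 2, max)

# Offset for each column: total distinct names in all columns to its left
offsets <- c(0, head(cumsum(n_distinct), -1))

# Add offsets[j] to every entry of column j
result <- sweep(codes, 2, offsets, `+`)
result
```

Because the offsets depend only on the per-column counts, they can be computed up front with cumsum(), which is what removes the need to process the columns one after another.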

Data taken from the question, in reproducible format:

df <- structure(list(Name1 = c("Rose,Ali", "Camp,Laura", "Rose,Ali", 
"Murr,Kate", "Ghol,Dam", "Murr,Kate"), Name2 = c("Van,Hall", 
"Ka,Klo", "Van,Hall", "Ismal, Ismal", "Ka,Klo", "Ismal, Ismal"
), Name3 = c("Ghol,Dam", "Dan,Dan", "Ghol,Dam", "Sian,Rozi", 
"Rose,Ali", "Dan,Dan"), Name4 = c("Murr,kate", "Ali,Hoss", "Kol,Kan", 
"Nas,Ami", "Nor,Ko", "Nas,Ami")), row.names = c(NA, -6L), 
class = "data.frame")

Answer 3

Here is a tidyverse approach:

First, paste the column name after each string in every column, for sorting purposes later. Then pivot the data into a two-column data frame so that we can assign IDs with match. Finally, pivot it back to wide format and unnest the list columns.

library(tidyverse)

df %>% 
  mutate(across(everything(), ~ paste0(.x, "_", cur_column()))) %>% 
  pivot_longer(everything(), names_to = "ab", values_to = "a") %>% 
  arrange(ab) %>% 
  mutate(b = match(a, unique(a)), .keep = "unused") %>% 
  pivot_wider(names_from = "ab", values_from = "b") %>% 
  unnest(everything())

# A tibble: 6 × 4
  Name1 Name2 Name3 Name4
  <int> <int> <int> <int>
1     1     5     8    12
2     2     6     9    13
3     1     5     8    14
4     3     7    10    15
5     4     6    11    16
6     3     7     9    15
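
One way to read the approach above is that the pasted column name makes a name that appears in two columns (for example "Rose,Ali" in Name1 and Name3) count as two different keys. The base R sketch below illustrates that reading under the same sample data. It is not the original answer's code, and the names keys, ids and result are hypothetical.

```r
# Sample data copied from the question
df <- data.frame(
  Name1 = c("Rose,Ali", "Camp,Laura", "Rose,Ali", "Murr,Kate", "Ghol,Dam", "Murr,Kate"),
  Name2 = c("Van,Hall", "Ka,Klo", "Van,Hall", "Ismal, Ismal", "Ka,Klo", "Ismal, Ismal"),
  Name3 = c("Ghol,Dam", "Dan,Dan", "Ghol,Dam", "Sian,Rozi", "Rose,Ali", "Dan,Dan"),
  Name4 = c("Murr,kate", "Ali,Hoss", "Kol,Kan", "Nas,Ami", "Nor,Ko", "Nas,Ami"),
  stringsAsFactors = FALSE
)

# Build "name_column" keys, column by column, as one long vector
keys <- unlist(
  Map(function(x, nm) paste0(x, "_", nm), df, names(df)),
  use.names = FALSE
)

# A single match() over all keys numbers them in order of first appearance
ids <- match(keys, unique(keys))

# Reshape back to the original layout (matrix() fills column by column)
result <- as.data.frame(
  matrix(ids, nrow = nrow(df), dimnames = list(NULL, names(df)))
)
result
```

Since the keys are laid out column by column, numbering by first appearance gives Name1 the lowest IDs and continues through the later columns.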

Data

Taken from @Allan Cameron.

df <- structure(list(Name1 = c("Rose,Ali", "Camp,Laura", "Rose,Ali", 
"Murr,Kate", "Ghol,Dam", "Murr,Kate"), Name2 = c("Van,Hall", 
"Ka,Klo", "Van,Hall", "Ismal, Ismal", "Ka,Klo", "Ismal, Ismal"
), Name3 = c("Ghol,Dam", "Dan,Dan", "Ghol,Dam", "Sian,Rozi", 
"Rose,Ali", "Dan,Dan"), Name4 = c("Murr,kate", "Ali,Hoss", "Kol,Kan", 
"Nas,Ami", "Nor,Ko", "Nas,Ami")), row.names = c(NA, -6L), 
class = "data.frame")

Answer 4

Update: The approach below is not ideal because the IDs are not unique. Sorry.

Using a lookup table with tidyverse:

library(dplyr)
library(tidyr)
library(tibble)  # provides deframe()

lookup <-
  df |> 
  pivot_longer(everything()) |>
  distinct() |>
  arrange(name) |>
  transmute(name = value, value = row_number()) |>
  deframe()

df |>
  mutate(across(everything(), ~ recode(., !!!lookup)))

Output:

  Name1 Name2 Name3 Name4
1     1     5     4    12
2     2     6     9    13
3     1     5     4    14
4     3     7    10    15
5     4     6     1    16
6     3     7     9    15

Data from @Allan Cameron, thanks.
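
To see where the non-unique IDs come from, the following base R sketch rebuilds a comparable lookup vector and lists the names that occur in it more than once. It is an illustration under the same sample data, not the answer's own code, and the names long and dup_names are hypothetical.

```r
# Sample data copied from the question
df <- data.frame(
  Name1 = c("Rose,Ali", "Camp,Laura", "Rose,Ali", "Murr,Kate", "Ghol,Dam", "Murr,Kate"),
  Name2 = c("Van,Hall", "Ka,Klo", "Van,Hall", "Ismal, Ismal", "Ka,Klo", "Ismal, Ismal"),
  Name3 = c("Ghol,Dam", "Dan,Dan", "Ghol,Dam", "Sian,Rozi", "Rose,Ali", "Dan,Dan"),
  Name4 = c("Murr,kate", "Ali,Hoss", "Kol,Kan", "Nas,Ami", "Nor,Ko", "Nas,Ami"),
  stringsAsFactors = FALSE
)

# One row per distinct (column, value) pair, in column order
long <- unique(data.frame(
  column = rep(names(df), each = nrow(df)),
  value  = unlist(df, use.names = FALSE),
  stringsAsFactors = FALSE
))

# Named lookup vector: the names are the strings, the values are the IDs
lookup <- setNames(seq_len(nrow(long)), long$value)

# Names that occur in more than one column appear twice in the lookup
dup_names <- names(lookup)[duplicated(names(lookup))]
dup_names

# Looking up by name returns only the first matching entry
lookup["Ghol,Dam"]
```

Here "Ghol,Dam" and "Rose,Ali" each appear in two columns, so a lookup keyed only by the name resolves them to their Name1 IDs, which is consistent with the 4 and 1 that show up in the Name3 column of the output above.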

Answer 5

A shorter way could be:

library(purrr)  # provides map_chr()
colnames(df) <- map_chr(seq(ncol(df)), function(n) paste0('assign', n))

Answer 1
6, accepted, 2023-01-13 12:03:04

Answer 2
4, 2023-01-13 11:45:21

Answer 3
1, 2023-01-13 11:55:13

Answer 4
1, 2023-01-13 12:07:36

Answer 5
-2, 2023-01-13 11:59:02