How to create a new column based on the mode of some other columns in a data frame in R

Question

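I have a data frame like this:

ID	w1	w2	w3	w4	w5	w6
11	light	light	light	light	light	light
22	light	light	light	light	medium	medium
33	light	light	medium	medium	medium	heavy
44	light	light	medium	NA	NA	NA
55	light	light	medium	medium	NA	NA
66	medium	medium	medium	NA	NA	NA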

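I want to get the frequency counts of light, medium, and heavy across w1-w6 for each ID. I would then add the mode of w1-w6 as a new column.

The target df should look like this:

ID	w1	w2	w3	w4	w5	w6	N_light	N_medium	N_heavy	final
11	light	light	light	light	light	light	6	0	0	light
22	light	light	light	light	medium	medium	4	2	0	light
33	light	light	medium	medium	medium	heavy	2	3	1	medium
44	light	light	medium	NA	NA	NA	2	1	0	light
55	light	light	medium	medium	NA	NA	2	2	0	light
66	medium	medium	medium	NA	NA	NA	0	3	0	medium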

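The real data frame has millions of rows, and I'm struggling to find an efficient way to do this. Any ideas?

I tried the Mode function from the DescTools library inside a for loop on a limited number of rows, but it runs too slowly.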

Answer 1

Here is a tidyverse solution:

library(dplyr)
library(tidyr)

df %>%
  # reshape all columns except `id` to long format:
  pivot_longer(-id) %>%
  # drop missing values so they are not counted:
  filter(!is.na(value)) %>%
  # count how often each value occurs for each `id`:
  count(id, value, name = "N") %>%
  # reshape back to wide format, filling absent combinations with 0:
  pivot_wider(names_from = value, values_from = N,
              names_prefix = "N_", values_fill = 0) %>%
  # join the counts back to the original `df`:
  left_join(df, ., by = "id")
  id     w1     w2     w3     w4 N_light N_medium N_heavy
1  1  light  light  light  light       4        0       0
2  2  light  light  light  light       4        0       0
3  3  light  light medium medium       2        2       0
4  4  light  light   <NA> medium       2        1       0
5  5  light  light medium medium       2        2       0
6  6 medium medium   <NA>  heavy       0        2       1

Edit:

If the ultimate goal is to compute the mode of each of the three new columns, this might be one way to go:

# First, define a function for the mode:

getmode <- function(v) {
  # distinct non-missing values, in order of first appearance
  uniqv <- unique(v[!is.na(v)])
  # tabulate() ignores NA; on ties the value that appears first wins
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Second, compute the counts as before (dplyr and tidyr must be loaded):

df %>%
  # reshape all columns except `id` to long format:
  pivot_longer(-id) %>%
  # drop missing values so they are not counted:
  filter(!is.na(value)) %>%
  # count how often each value occurs for each `id`:
  count(id, value, name = "N") %>%
  # reshape back to wide format, filling absent combinations with 0:
  pivot_wider(names_from = value, values_from = N,
              names_prefix = "N_", values_fill = 0) %>%
  # join the counts back to the original `df`:
  left_join(df, ., by = "id") %>%
  # Third, compute the mode of each count column:
  summarise(across(starts_with("N_"), getmode))
  N_light N_medium N_heavy
1       2        2       0

Data:

df <- structure(list(
  id = 1:6,
  w1 = c("light", "light", "light", "light", "light", "medium"),
  w2 = c("light", "light", "light", "light", "light", "medium"),
  w3 = c("light", "light", "medium", NA, "medium", NA),
  w4 = c("light", "light", "medium", "medium", "medium", "heavy")
), class = "data.frame", row.names = c(NA, -6L))
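
The pipeline above returns the counts but not the per-row mode the question asks for as `final`. A minimal sketch of one way to get it from the same long-format counts is shown below, assuming dplyr 1.0 or later for `slice_max()`. The name `modes` is only illustrative, and with `with_ties = FALSE` a tied `id` (3 and 5 here) keeps just one of its tied values, so this sketch should not be relied on for a particular tie-breaking order.

```r
library(dplyr)
library(tidyr)

# Same example data as in this answer.
df <- data.frame(
  id = 1:6,
  w1 = c("light", "light", "light", "light", "light", "medium"),
  w2 = c("light", "light", "light", "light", "light", "medium"),
  w3 = c("light", "light", "medium", NA, "medium", NA),
  w4 = c("light", "light", "medium", "medium", "medium", "heavy"),
  stringsAsFactors = FALSE
)

modes <- df %>%
  # long format: one row per (id, column) pair
  pivot_longer(-id) %>%
  # missing values do not take part in the count
  filter(!is.na(value)) %>%
  # frequency of each value within each id
  count(id, value, name = "N") %>%
  # keep exactly one most frequent value per id
  group_by(id) %>%
  slice_max(N, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(id, final = value)

# attach the mode to the original data
result <- left_join(df, modes, by = "id")
```

Because this still reshapes every cell into a long table, it may not be the fastest option for millions of rows; the base R answers avoid the reshape.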

Answer 2

In base R you could do this:

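# `dat` is the data frame from the question.
# Cross-tabulate id against the stacked w1..w6 values (table() does not count NA):
a <- table(cbind(dat[1], stack(dat, -1))[1:2])
# Note: max.col() breaks ties at random by default; pass ties.method = "first" for reproducible results.
cbind(dat, as.data.frame.matrix(a), final = colnames(a)[max.col(a)])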

   id     w1     w2     w3     w4     w5     w6 heavy light medium  final
11 11  light  light  light  light  light  light     0     6      0  light
22 22  light  light  light  light medium medium     0     4      2  light
33 33  light  light medium medium medium  heavy     1     2      3 medium
44 44  light  light medium   <NA>   <NA>   <NA>     0     2      1  light
55 55  light  light medium medium   <NA>   <NA>     0     2      2 medium
66 66 medium medium medium   <NA>   <NA>   <NA>     0     0      3 medium
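
The two lines above are dense, so here is a hedged step-by-step sketch of the same idea. The data frame is rebuilt from the question's table, the intermediate name `long` is only illustrative, and `ties.method = "first"` is an addition to the original answer so that a tied row such as ID 55 always resolves the same way. "First" here means the first column of the table, and `table()` orders its columns alphabetically (heavy, light, medium).

```r
# Example data matching the question (NA marks missing values).
dat <- data.frame(
  id = c(11, 22, 33, 44, 55, 66),
  w1 = c("light", "light", "light", "light", "light", "medium"),
  w2 = c("light", "light", "light", "light", "light", "medium"),
  w3 = c("light", "light", "medium", "medium", "medium", "medium"),
  w4 = c("light", "light", "medium", NA, "medium", NA),
  w5 = c("light", "medium", "medium", NA, NA, NA),
  w6 = c("light", "medium", "heavy", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Stack w1..w6 into one long `values` column; `id` is recycled to match.
long <- cbind(dat["id"], stack(dat, select = -id))

# Cross-tabulate id against value; table() drops NA by default.
a <- table(long$id, long$values)

# Most frequent label per row; ties go to the first column of `a`.
final <- colnames(a)[max.col(a, ties.method = "first")]

res <- cbind(dat, as.data.frame.matrix(a), final = final)
```

With this tie rule ID 55 (two light, two medium) resolves to light, which is what the question's target table shows.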

Answer 3

I know this calls for dplyr, but in case others find base R useful, you can simply index and use the *apply functions:

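# distinct non-missing labels found in the w columns
xx <- unique(unlist(df[-1]))
xx <- xx[!is.na(xx)]
# or: xx <- c("light", "medium", "heavy")
newnames <- paste0("N_", xx)

# one count column per label: how many w columns equal that label in each row
df[newnames] <- sapply(xx,
                       function(x) rowSums(df[, -1] == x,
                                           na.rm = TRUE))
# label with the highest count in each row; which.max() takes the first one on ties
df["final"] <- xx[apply(df[newnames], 1, which.max)]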

Output:

  id     w1     w2     w3     w4     w5     w6 N_light N_medium N_heavy  final
1 11  light  light  light  light  light  light       6        0       0  light
2 22  light  light  light  light medium medium       4        2       0  light
3 33  light  light medium medium medium  heavy       2        3       1 medium
4 44  light  light medium   <NA>   <NA>   <NA>       2        1       0  light
5 55  light  light medium medium   <NA>   <NA>       2        2       0  light
6 66 medium medium medium   <NA>   <NA>   <NA>       0        3       0 medium
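
Since the question mentions millions of rows, the row-wise `apply(..., 1, which.max)` call may become the slow step. A possible variant, not part of the original answer, swaps it for `max.col()` with `ties.method = "first"`, which picks the same column as `which.max()` but works on the whole count matrix at once. The data frame below is rebuilt from the question's table, and `counts` is an illustrative name.

```r
# Example data matching the question (NA marks missing values).
df <- data.frame(
  id = c(11, 22, 33, 44, 55, 66),
  w1 = c("light", "light", "light", "light", "light", "medium"),
  w2 = c("light", "light", "light", "light", "light", "medium"),
  w3 = c("light", "light", "medium", "medium", "medium", "medium"),
  w4 = c("light", "light", "medium", NA, "medium", NA),
  w5 = c("light", "medium", "medium", NA, NA, NA),
  w6 = c("light", "medium", "heavy", NA, NA, NA),
  stringsAsFactors = FALSE
)

xx <- c("light", "medium", "heavy")

# Matrix of counts: one column per label, one row per data frame row.
counts <- sapply(xx, function(x) rowSums(df[-1] == x, na.rm = TRUE))
colnames(counts) <- paste0("N_", xx)

# Column index of the row maximum; the first column wins on ties.
final <- xx[max.col(counts, ties.method = "first")]

df <- cbind(df, counts)
df$final <- final
```

The tie order follows the order of `xx`, so listing "light" first makes light win a light/medium tie, as in the output above.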

Answer 1
0 2023-01-26 17:46:32

Answer 2
0 2023-01-26 18:16:51

Answer 3
0 Accepted 2023-01-26 18:35:28
