[英]Reduce grouped data based on maximum of multiple columns
我有像這個例子這樣的數據集,但每個輸入有 1000 個輸入和 1000 個字,每個輸入 x 時間 x 字組合有 30 個值(在 cols Copy1..Copy30 中)
df = read.table(header=T, sep=",", text="
Input,Time,Word,Copy1,Copy2,Copy3,Copy30
ark,1,ark,0.00,0.00,0.00,0.00
ark,1,ad,0.00,0.00,0.00,0.00
ark,1,bark,0.00,0.00,0.00,0.00
ark,50,ark,0.00,0.10,0.05,0.00
ark,50,ad,0.00,0.05,0.03,0.00
ark,50,bark,0.07,0.06,0.00,0.00
ark,100,ark,0.00,0.17,0.55,0.00
ark,100,ad,0.00,0.03,0.11,0.00
ark,100,bark,0.05,0.20,0.00,0.00
bark,1,ark,0.00,0.00,0.00,0.00
bark,1,ad,0.00,0.00,0.00,0.00
bark,1,bark,0.00,0.00,0.00,0.00
bark,50,ark,0.00,0.03,0.09,0.00
bark,50,ad,0.00,0.05,0.03,0.00
bark,50,bark,0.2,0.75,0.00,0.00
bark,100,ark,0.00,0.08,0.32,0.00
bark,100,ad,0.00,0.03,0.11,0.00
bark,100,bark,0.21,0.60,0.00,0.00
") %>% arrange(Input,Time,Word)
df
# Input Time Word Copy1 Copy2 Copy3 Copy30
# 1 ark 1 ad 0.00 0.00 0.00 0
# 2 ark 1 ark 0.00 0.00 0.00 0
# 3 ark 1 bark 0.00 0.00 0.00 0
# 4 ark 50 ad 0.00 0.05 0.03 0
# 5 ark 50 ark 0.00 0.10 0.05 0
# 6 ark 50 bark 0.07 0.06 0.00 0
# 7 ark 100 ad 0.00 0.03 0.11 0
# 8 ark 100 ark 0.00 0.17 0.55 0
# 9 ark 100 bark 0.05 0.20 0.00 0
# 10 bark 1 ad 0.00 0.00 0.00 0
# 11 bark 1 ark 0.00 0.00 0.00 0
# 12 bark 1 bark 0.00 0.00 0.00 0
# 13 bark 50 ad 0.00 0.05 0.03 0
# 14 bark 50 ark 0.00 0.03 0.09 0
# 15 bark 50 bark 0.20 0.75 0.00 0
# 16 bark 100 ad 0.00 0.03 0.11 0
# 17 bark 100 ark 0.00 0.08 0.32 0
# 18 bark 100 bark 0.21 0.60 0.00 0
我想按 Input 和 Word 進行分組,並且對於每個組合,確定哪個 Copy 列具有每個單詞的最大值,然后只為該 Input 保留該 Word 的該列。 對上一個問題的回答讓我成為了其中的一部分。 此代碼標識每個單詞的哪個副本是最大值。
max_copy <- df %>%
pivot_longer(starts_with("Copy"), names_to="copy_name", values_to="copy_value") %>%
group_by(Input, Word) %>%
filter(rank(copy_value, ties.method="first") == n()) %>%
group_by(Input, Time)
max_copy
# A tibble: 6 x 5
# Groups: Input, Time [3]
# Input Time Word copy_name copy_value
# <fct> <int> <fct> <chr> <dbl>
# 1 ark 100 ad Copy3 0.11
# 2 ark 100 ark Copy3 0.55
# 3 ark 100 bark Copy2 0.2
# 4 bark 50 bark Copy2 0.75
# 5 bark 100 ad Copy3 0.11
# 6 bark 100 ark Copy3 0.32
現在我想要做的是使用它來將數據減少到每個輸入的每個單詞的識別副本,這樣結果將是:
# A tibble: 18 x 5
# Groups: Input, Time [6]
# Input Time Word copy_name copy_value
# <fct> <int> <fct> <chr> <dbl>
# 1 ark 1 ad Copy3 0
# 2 ark 1 ark Copy3 0
# 3 ark 1 bark Copy2 0
# 4 ark 50 ad Copy3 0.03
# 5 ark 50 ark Copy3 0.05
# 6 ark 50 bark Copy2 0.06
# 7 ark 100 ad Copy3 0.11
# 8 ark 100 ark Copy3 0.55
# 9 ark 100 bark Copy2 0.2
# 10 bark 1 ad Copy3 0
# 11 bark 1 ark Copy3 0
# 12 bark 1 bark Copy2 0
# 13 bark 50 ad Copy3 0.03
# 14 bark 50 ark Copy3 0.09
# 15 bark 50 bark Copy2 0.75
# 16 bark 100 ad Copy3 0.11
# 17 bark 100 ark Copy3 0.32
# 18 bark 100 bark Copy2 0.6
有沒有一種方法可以像這樣使用 max_copy 數據來減少 df ?
編輯:下面的一些解決方案存在問題。 @akrun 的解決方案如果存在負值(易於處理)或后續副本中存在正值而不是具有最大值的副本(我不知道如何解決此問題),則會中斷。 @AnoushiravanR 的解決方案似乎對這兩種情況都很穩健,@AnilGoyal 的解決方案也是如此。 這是包含這些條件的更新數據集。
df2 = read.table(header=T, sep=",", text="
Input,Time,Word,Copy1,Copy2,Copy3,Copy30
ark,1,ark,0.00,0.00,0.00,-0.01
ark,1,ad,0.00,0.00,0.00,-0.01
ark,1,bark,0.00,0.00,0.00,-0.01
ark,1,bar,0.00,0.00,0.00,-0.01
ark,50,ark,0.00,0.10,0.05,-0.01
ark,50,ad,0.00,0.05,0.03,-0.01
ark,50,bark,0.07,0.06,0.01,-0.01
ark,50,bar,0.07,0.06,0.01,-0.01
ark,100,ark,0.00,0.17,0.55,-0.01
ark,100,ad,0.00,0.03,0.11,-0.01
ark,100,bark,0.05,0.20,0.01,-0.01
ark,100,bar,0.04,0.15,0.01,-0.01
bark,1,ark,0.00,0.00,0.00,-0.01
bark,1,ad,0.00,0.00,0.00,-0.01
bark,1,bark,0.00,0.00,0.00,-0.01
bark,1,bar,0.00,0.00,0.00,-0.01
bark,50,ark,0.00,0.03,0.09,-0.01
bark,50,ad,0.00,0.05,0.03,-0.01
bark,50,bark,0.2,0.75,0.01,0.01
bark,50,bar,0.2,0.7,0.00,-0.01
bark,100,ark,0.00,0.08,0.32,-0.01
bark,100,ad,0.00,0.03,0.11,-0.01
bark,100,bark,0.21,0.60,0.01,-0.01
bark,100,bar,0.15,0.4,0.01,-0.01
") %>% arrange(Input,Time,Word)
df2 所需的 output:
# A tibble: 24 x 5
# Input Time Word copy_name Value
# <fct> <int> <fct> <chr> <dbl>
# 1 ark 1 ad Copy3 0
# 2 ark 1 ark Copy3 0
# 3 ark 1 bar Copy2 0
# 4 ark 1 bark Copy2 0
# 5 ark 50 ad Copy3 0.03
# 6 ark 50 ark Copy3 0.05
# 7 ark 50 bar Copy2 0.06
# 8 ark 50 bark Copy2 0.06
# 9 ark 100 ad Copy3 0.11
# 10 ark 100 ark Copy3 0.55
# 11 ark 100 bar Copy2 0.15
# 12 ark 100 bark Copy2 0.2
# 13 bark 1 ad Copy3 0
# 14 bark 1 ark Copy3 0
# 15 bark 1 bar Copy2 0
# 16 bark 1 bark Copy2 0
# 17 bark 50 ad Copy3 0.03
# 18 bark 50 ark Copy3 0.09
# 19 bark 50 bar Copy2 0.7
# 20 bark 50 bark Copy2 0.75
# 21 bark 100 ad Copy3 0.11
# 22 bark 100 ark Copy3 0.32
# 23 bark 100 bar Copy2 0.4
# 24 bark 100 bark Copy2 0.6
這可以通過summarise
來完成。 使用pivot_longer
重塑為“long”格式后,按“輸入”、“時間”字進行分組,然后根據條件summarise
以創建“復制值”, if
all
值都為 0,則返回 0, else
返回last
“copy_value”的非零值
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = starts_with('Copy'), names_to = 'copy_name',
values_to = 'copy_value') %>%
group_by(Input, Time, Word) %>%
summarise(copy_value = if(all(copy_value == 0)) 0
else last(copy_value[copy_value != 0]), .groups = 'drop')
-輸出
# A tibble: 18 x 4
# Input Time Word copy_value
# * <chr> <int> <chr> <dbl>
# 1 ark 1 ad 0
# 2 ark 1 ark 0
# 3 ark 1 bark 0
# 4 ark 50 ad 0.03
# 5 ark 50 ark 0.05
# 6 ark 50 bark 0.06
# 7 ark 100 ad 0.11
# 8 ark 100 ark 0.55
# 9 ark 100 bark 0.2
#10 bark 1 ad 0
#11 bark 1 ark 0
#12 bark 1 bark 0
#13 bark 50 ad 0.03
#14 bark 50 ark 0.09
#15 bark 50 bark 0.75
#16 bark 100 ad 0.11
#17 bark 100 ark 0.32
#18 bark 100 bark 0.6
如果我們也需要'copy_name',那么在slice
中使用相同的邏輯表達式返回滿足條件的行,即if
all
值都為 0,則返回最后一行( n()
- 無關緊要)或獲取索引copy_value 的last
非零值。 現在,我們通過“輸入”、“單詞”進行分組,並通過將“ mutate
”替換為“copy_value”為max
的相應“copy_name”來改變“copy_name”
df %>%
pivot_longer(cols = starts_with('Copy'), names_to = 'copy_name',
values_to = 'copy_value') %>%
group_by(Input, Time, Word) %>%
arrange(copy_value) %>%
slice(if(all(copy_value <= 0)) n()
else tail(which(copy_value > 0), 1))%>%
group_by(Input, Word) %>%
mutate(copy_name = copy_name[which.max(copy_value)]) %>%
ungroup
-輸出
# A tibble: 18 x 5
# Input Time Word copy_name copy_value
# <chr> <int> <chr> <chr> <dbl>
# 1 ark 1 ad Copy3 0
# 2 ark 1 ark Copy3 0
# 3 ark 1 bark Copy2 0
# 4 ark 50 ad Copy3 0.03
# 5 ark 50 ark Copy3 0.05
# 6 ark 50 bark Copy2 0.06
# 7 ark 100 ad Copy3 0.11
# 8 ark 100 ark Copy3 0.55
# 9 ark 100 bark Copy2 0.2
#10 bark 1 ad Copy3 0
#11 bark 1 ark Copy3 0
#12 bark 1 bark Copy2 0
#13 bark 50 ad Copy3 0.03
#14 bark 50 ark Copy3 0.09
#15 bark 50 bark Copy2 0.75
#16 bark 100 ad Copy3 0.11
#17 bark 100 ark Copy3 0.32
#18 bark 100 bark Copy2 0.6
更新的解決方案
我已經用你的新數據集更新了我的解決方案。 我看不出 output 有什么問題,但如果有什么需要修改的,我很高興知道。
library(dplyr)
library(tidyr)
library(purrr)
df2 %>%
mutate(Copy_value = pmap_dbl(df2 %>% select(Copy1:Copy30), ~ max(c(...))),
Copy_name = pmap(df2 %>% select(Copy1:Copy30), ~
names(c(...)[c(...) == max(c(...))]))) %>%
unnest(Copy_name) %>%
group_by(Input, Word) %>%
mutate(Copy_name = Copy_name[which.max(Copy_value)]) %>%
distinct() %>%
select(-c(Copy1:Copy_value)) %>%
right_join(df2, by = c("Input", "Time", "Word")) %>%
rowwise() %>%
mutate(Copy_value = map_dbl(Copy_name, ~ get({.x}))) %>%
select(-c(Copy1:Copy30))
Output這是新提供的數據集的 output。
Input Time Word Copy_name Copy_value
1 ark 1 ad Copy3 0.00
2 ark 1 ark Copy3 0.00
3 ark 1 bar Copy2 0.00
4 ark 1 bark Copy2 0.00
5 ark 50 ad Copy3 0.03
6 ark 50 ark Copy3 0.05
7 ark 50 bar Copy2 0.06
8 ark 50 bark Copy2 0.06
9 ark 100 ad Copy3 0.11
10 ark 100 ark Copy3 0.55
11 ark 100 bar Copy2 0.15
12 ark 100 bark Copy2 0.20
13 bark 1 ad Copy3 0.00
14 bark 1 ark Copy3 0.00
15 bark 1 bar Copy2 0.00
16 bark 1 bark Copy2 0.00
17 bark 50 ad Copy3 0.03
18 bark 50 ark Copy3 0.09
19 bark 50 bar Copy2 0.70
20 bark 50 bark Copy2 0.75
21 bark 100 ad Copy3 0.11
22 bark 100 ark Copy3 0.32
23 bark 100 bar Copy2 0.40
24 bark 100 bark Copy2 0.60
通過purrr
的另一種方法
df %>%
pivot_longer(cols = starts_with('Copy'), names_to = 'copy_name',
values_to = 'Value') %>%
semi_join(df %>% nest(copy_name = !c(Input, Word)) %>%
mutate(copy_name = map_chr(copy_name,
~ names(.x)[1 + which(.x[-1] == max(.x[-1]), arr.ind = T)[2]])),
by = c("Input", "Word", "copy_name")
)
# A tibble: 18 x 5
Input Time Word copy_name Value
<chr> <int> <chr> <chr> <dbl>
1 ark 1 ad Copy3 0
2 ark 1 ark Copy3 0
3 ark 1 bark Copy2 0
4 ark 50 ad Copy3 0.03
5 ark 50 ark Copy3 0.05
6 ark 50 bark Copy2 0.06
7 ark 100 ad Copy3 0.11
8 ark 100 ark Copy3 0.55
9 ark 100 bark Copy2 0.2
10 bark 1 ad Copy3 0
11 bark 1 ark Copy3 0
12 bark 1 bark Copy2 0
13 bark 50 ad Copy3 0.03
14 bark 50 ark Copy3 0.09
15 bark 50 bark Copy2 0.75
16 bark 100 ad Copy3 0.11
17 bark 100 ark Copy3 0.32
18 bark 100 bark Copy2 0.6
實際上,這個 cn 可以分為兩部分——
purrr::map_chr
在其中找到那些副本的名稱,其中副本值對於任何時間值都是最大值。df %>% nest(copy_name = !c(Input, Word)) %>%
mutate(copy_name = map_chr(copy_name,
~ names(.x)[1 + which(.x[-1] == max(.x[-1]), arr.ind = T)[2]]))
# A tibble: 6 x 3
Input Word copy_name
<chr> <chr> <chr>
1 ark ad Copy3
2 ark ark Copy3
3 ark bark Copy2
4 bark ad Copy3
5 bark ark Copy3
6 bark bark Copy2
semi_join
將透視數據與此數據連接回來,這實際上是一個過濾連接。單個 pipe 中的另一種方法
df %>% nest(data = !c(Input, Word)) %>%
mutate(data = map(data, ~ .x %>%
select(Time, 1+which(.x[-1] == max(.x[-1]), arr.ind = T)[2]) %>%
mutate(copy = names(.)[2]) %>%
rename_with(~'value', 2)
)) %>%
unnest(data)
# A tibble: 18 x 5
Input Word Time value copy
<chr> <chr> <int> <dbl> <chr>
1 ark ad 1 0 Copy3
2 ark ad 50 0.03 Copy3
3 ark ad 100 0.11 Copy3
4 ark ark 1 0 Copy3
5 ark ark 50 0.05 Copy3
6 ark ark 100 0.55 Copy3
7 ark bark 1 0 Copy2
8 ark bark 50 0.06 Copy2
9 ark bark 100 0.2 Copy2
10 bark ad 1 0 Copy3
11 bark ad 50 0.03 Copy3
12 bark ad 100 0.11 Copy3
13 bark ark 1 0 Copy3
14 bark ark 50 0.09 Copy3
15 bark ark 100 0.32 Copy3
16 bark bark 1 0 Copy2
17 bark bark 50 0.75 Copy2
18 bark bark 100 0.6 Copy2
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.