Summarize proportion of categorical variables and assign dominant categorical variable per group
I have a sample set of plots from a map containing polygon features (A, B, and C) and a forest-type raster (Cold, Warm, and Hot).
Plot Polygon Forest
1 A Cold
2 A Cold
3 A Cold
4 A Warm
5 B Cold
6 B Cold
7 C Cold
8 C Warm
9 C Hot
10 C Hot
I want to summarize the proportion of each forest type by polygon and identify the dominant forest type in each polygon. For example:
Polygon Cold Warm Hot Forest_dominant
A 0.75 0.25 0 Cold
B 1 0 0 Cold
C 0.25 0.25 0.5 Hot
This is a bit convoluted, but perhaps:
library(tidyverse)
df <- structure(list(Plot = 1:10, Polygon = c("A", "A", "A", "A", "B",
"B", "C", "C", "C", "C"), Forest = c("Cold", "Cold", "Cold",
"Warm", "Cold", "Cold", "Cold", "Warm", "Hot", "Hot")), class = "data.frame", row.names = c(NA,
-10L))
df %>%
group_by(Polygon, Forest) %>%
summarise(n = n()) %>%
mutate(n = n / sum(n)) %>%
group_by(Polygon) %>%
arrange(Polygon, -n) %>%
mutate(Forest_dominant = first(Forest)) %>%
pivot_wider(names_from = Forest, values_from = n, values_fill = 0) %>%
relocate(Forest_dominant, .after = last_col())
#> `summarise()` has grouped output by 'Polygon'. You can override using the `.groups` argument.
#> # A tibble: 3 × 5
#> # Groups: Polygon [3]
#> Polygon Cold Warm Hot Forest_dominant
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 A 0.75 0.25 0 Cold
#> 2 B 1 0 0 Cold
#> 3 C 0.25 0.25 0.5 Hot
Created on 2021-12-22 by the reprex package (v2.0.1)
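One thing to watch for: taking first(Forest) after sorting returns a single label, so if two forest types share the largest proportion in a polygon, one of them is picked silently. A minimal base R sketch for flagging such polygons follows. The data frame df_tie and its polygons D and E are made up for illustration and are not part of the question's data.

```r
# Hypothetical data: polygon D has a 50/50 tie, polygon E does not
df_tie <- data.frame(
  Polygon = c("D", "D", "E", "E", "E"),
  Forest  = c("Cold", "Warm", "Hot", "Hot", "Cold")
)

# Row-wise proportions of each forest type per polygon
p <- proportions(table(df_tie$Polygon, df_tie$Forest), margin = 1)

# Count how many forest types reach the maximum proportion in each polygon
n_max <- apply(p, 1, function(row) sum(row == max(row)))

# Polygons where the dominant type is ambiguous
tied <- names(n_max)[n_max > 1]
tied
```

How to resolve a tie (report both types, pick by a fixed priority, or mark the polygon as mixed) depends on the analysis, so the sketch only detects the case.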
We can first calculate the proportion within each group, then use pivot_wider:
library(dplyr)
library(tidyr)
df %>%
  group_by(Polygon, Forest) %>%
  summarise(n = n(), .groups = "drop_last") %>%
  # take the Forest label with the largest share, not the share itself
  mutate(proportion = n / sum(n),
         Forest_dominant = Forest[which.max(proportion)]) %>%
  select(-n) %>%
  pivot_wider(
    names_from = Forest,
    values_from = proportion,
    values_fill = 0
  )
Polygon Forest_dominant Cold Warm Hot
<chr> <chr> <dbl> <dbl> <dbl>
1 A Cold 0.75 0.25 0
2 B Cold 1 0 0
3 C Hot 0.25 0.25 0.5
A base R option
reshape(
unique(
transform(
df,
prop = ave(Forest, Polygon, FUN = function(x) table(x)[x] / length(x)),
Forest_dominant = ave(Forest, Polygon, FUN = function(x) names(which.max(table(x))))
)
),
direction = "wide",
idvar = c("Polygon", "Forest_dominant"),
timevar = "Forest"
)
gives
Polygon Forest_dominant prop.Cold prop.Warm prop.Hot
1 A Cold 0.75 0.25 <NA>
5 B Cold 1 <NA> <NA>
7 C Hot 0.25 0.25 0.5
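In the output above, forest types that never occur in a polygon appear as <NA> rather than 0. That is how R prints missing values in character columns, which suggests the prop.* columns are character rather than numeric. A minimal sketch of a cleanup step, assuming that is the case: res below is a hand-typed copy of the printed result, not the object returned by reshape().

```r
# Hand-typed copy of the printed result; the character type of the prop.*
# columns is an assumption based on the <NA> display.
res <- data.frame(
  Polygon         = c("A", "B", "C"),
  Forest_dominant = c("Cold", "Cold", "Hot"),
  prop.Cold       = c("0.75", "1", "0.25"),
  prop.Warm       = c("0.25", NA, "0.25"),
  prop.Hot        = c(NA, NA, "0.5")
)

# Convert every prop.* column to numeric and treat missing combinations as 0
prop_cols <- grep("^prop\\.", names(res), value = TRUE)
res[prop_cols] <- lapply(res[prop_cols], function(x) {
  x <- as.numeric(x)   # character -> numeric
  x[is.na(x)] <- 0     # absent forest type -> proportion 0
  x
})
res
```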
Or a data.table option
library(data.table)
dcast(
setDT(df)[
,
.(cnt = .N), .(Polygon, Forest)
][
,
`:=`(prop = proportions(cnt), Forest_dominant = Forest[which.max(cnt)]),
Polygon
],
Polygon + Forest_dominant ~ Forest,
value.var = "prop",
fill = 0
)
gives
Polygon Forest_dominant Cold Hot Warm
1: A Cold 0.75 0.0 0.25
2: B Cold 1.00 0.0 0.00
3: C Hot 0.25 0.5 0.25
> dput(df)
structure(list(Polygon = c("A", "A", "A", "A", "B", "B", "C",
"C", "C", "C"), Forest = c("Cold", "Cold", "Cold", "Warm", "Cold",
"Cold", "Cold", "Warm", "Hot", "Hot")), row.names = c(NA, -10L
), class = "data.frame")
Using proportions and which.max.
with(df, {
p <- unclass(proportions(table(Polygon, Forest), margin=1))
cbind.data.frame(p, Forest_dominant=colnames(p)[apply(p, 1, which.max)])
})
# Cold Hot Warm Forest_dominant
# A 0.75 0.0 0.25 Cold
# B 1.00 0.0 0.00 Cold
# C 0.25 0.5 0.25 Hot
If you need Polygon as a column, include Polygon=rownames(p) in the cbind.data.frame call.
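As a sketch of that variant, the block below rebuilds the sample data and places Polygon first in the result. The name res and the final row-name reset are additions for illustration, not part of the original answer.

```r
# Sample data from the question (Polygon and Forest columns only)
df <- data.frame(
  Polygon = c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C"),
  Forest  = c("Cold", "Cold", "Cold", "Warm", "Cold", "Cold",
              "Cold", "Warm", "Hot", "Hot")
)

res <- with(df, {
  # Row-wise proportions: one row per polygon, one column per forest type
  p <- unclass(proportions(table(Polygon, Forest), margin = 1))
  cbind.data.frame(
    Polygon = rownames(p),
    p,
    Forest_dominant = colnames(p)[apply(p, 1, which.max)]
  )
})

# Polygon is now a regular column, so the row names are redundant
rownames(res) <- NULL
res
```

Note that which.max returns the first maximum, so a tie between forest types resolves to whichever comes first in column (alphabetical) order.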
library(dplyr, warn.conflicts = FALSE)
df %>%
group_by(Polygon) %>%
summarise({
prop.table(table(Forest)) %>%
as.list %>% as_tibble
}) %>%
mutate(
across(-1, coalesce, 0),
Forest_dominant = across(-1) %>% {names(.)[max.col(.)]}
)
#> # A tibble: 3 × 5
#> Polygon Cold Warm Hot Forest_dominant
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 A 0.75 0.25 0 Cold
#> 2 B 1 0 0 Cold
#> 3 C 0.25 0.25 0.5 Hot
Created on 2021-12-21 by the reprex package (v2.0.1)