![](/img/trans.png)
[英]How to make a weighted frequency table with multiple variables (NOT crossed) in R?
[英]Weighted Frequency Table in R
我希望在 R 中為某個變量 (INTERVIEW_DAY) 創建一個頻率表,但考慮到另一個變量作為權重 (WEIGHT)。
我試圖用包data.table
做到這data.table
。 不過,我很想用 R-Base 包來做。
下面你會找到我想要的表格類型,但仍然沒有加權,這就是我想要學習的。
數據(變量 TUCASEID、INTERVIEW_DAY 和 WEIGHT):
TUCASEID INTERVIEW_DAY WEIGHT
1 2.00301e+13 5 8155462.7
2 2.00301e+13 6 1735322.5
3 2.00301e+13 6 3830527.5
4 2.00301e+13 4 6622023.0
5 2.00301e+13 4 3068387.3
6 2.00301e+13 4 3455424.9
7 2.00301e+13 1 1637826.3
8 2.00301e+13 2 6574426.8
9 2.00301e+13 6 1528296.3
10 2.00301e+13 4 4277052.8
11 2.00301e+13 6 1961482.3
12 2.00301e+13 7 505227.2
13 2.00301e+13 6 2135476.8
14 2.00301e+13 3 5366309.3
15 2.00301e+13 6 1058351.1
使用打包的 data.table 創建表:
df <- setDT(df)
df_freq_table <- df[,.(Freq = .N), by = INTERVIEW_DAY][, Prop := Freq / sum(Freq)][, Cum := cumsum(100 * Prop / sum(Prop))]
我的輸出:df_freq_table[]
INTERVIEW_DAY Freq Prop Cum
1: 5 1 0.06666667 6.666667
2: 6 6 0.40000000 46.666667
3: 4 4 0.26666667 73.333333
4: 1 1 0.06666667 80.000000
5: 2 1 0.06666667 86.666667
6: 7 1 0.06666667 93.333333
7: 3 1 0.06666667 100.000000
在base R
,我們可以使用xtabs/prop.table
。 根據 OP 的代碼, cumsum
是根據 'INTERVIEW_DAY' 中唯一值的出現順序計算的。 因此,要避免sort
ING基於整數值,皈依factor
與levels
規定,獲得sum
“重量”通過“INTERVIEW_DAY”與xtabs
,使用prop.table
返回的比例,然后應用cumsum
上輸出
df$INTERVIEW_DAY <- factor(df$INTERVIEW_DAY, levels = unique(df$INTERVIEW_DAY))
tbl1 <- xtabs(WEIGHT ~ INTERVIEW_DAY, df)
Prop <- prop.table(tbl1)
Cum <- cumsum(100 * Prop / sum(Prop))
Cum
# 5 6 4 1 2 7 3
# 15.71029 39.30705 72.86967 76.02470 88.68935 89.66260 100.00000
out <- data.frame(INTERVIEW_DAY = names(tbl1), Freq = as.numeric(tbl1),
Prop = as.numeric(Prop), Cum = as.numeric(Cum))
row.names(out) <- NULL
out
# INTERVIEW_DAY Freq Prop Cum
#1 5 8155462.7 0.157102906 15.71029
#2 6 12249456.5 0.235967631 39.30705
#3 4 17422888.0 0.335626124 72.86967
#4 1 1637826.3 0.031550297 76.02470
#5 2 6574426.8 0.126646592 88.68935
#6 7 505227.2 0.009732453 89.66260
#7 3 5366309.3 0.103373998 100.00000
如果我們需要加權頻率,請使用count
library(dplyr)
df %>%
mutate(INTERVIEW_DAY = factor(INTERVIEW_DAY, levels = unique(INTERVIEW_DAY))) %>%
count(INTERVIEW_DAY, wt = WEIGHT, sort = FALSE) %>%
mutate(Prop = n / sum(n),
Cum = cumsum(100 * Prop/sum(Prop)))
# A tibble: 7 x 4
# INTERVIEW_DAY n Prop Cum
# <fct> <dbl> <dbl> <dbl>
#1 5 8155463. 0.157 15.7
#2 6 12249456. 0.236 39.3
#3 4 17422888 0.336 72.9
#4 1 1637826. 0.0316 76.0
#5 2 6574427. 0.127 88.7
#6 7 505227. 0.00973 89.7
#7 3 5366309. 0.103 100.
或者使用data.table
library(data.table)
setDT(df)[, .(Freq = sum(WEIGHT)), by = INTERVIEW_DAY
][, Prop := Freq / sum(Freq)][, Cum := cumsum(100 * Prop / sum(Prop))][]
# INTERVIEW_DAY Freq Prop Cum
#1: 5 8155462.7 0.157102906 15.71029
#2: 6 12249456.5 0.235967631 39.30705
#3: 4 17422888.0 0.335626124 72.86967
#4: 1 1637826.3 0.031550297 76.02470
#5: 2 6574426.8 0.126646592 88.68935
#6: 7 505227.2 0.009732453 89.66260
#7: 3 5366309.3 0.103373998 100.00000
df <- structure(list(TUCASEID = c(2.00301e+13, 2.00301e+13, 2.00301e+13,
2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13,
2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13, 2.00301e+13,
2.00301e+13, 2.00301e+13), INTERVIEW_DAY = c(5L, 6L, 6L, 4L,
4L, 4L, 1L, 2L, 6L, 4L, 6L, 7L, 6L, 3L, 6L), WEIGHT = c(8155462.7,
1735322.5, 3830527.5, 6622023, 3068387.3, 3455424.9, 1637826.3,
6574426.8, 1528296.3, 4277052.8, 1961482.3, 505227.2, 2135476.8,
5366309.3, 1058351.1)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.