[英]Dividing a column cell with a different number based on number of observations in a panel long format
I have the following data which is of a panel structure.我有以下面板结构的数据。 I need to normalize each cell so that the observation for a country is divided by total number of observations for that country divided by total number of observations in the panel structure (here 10 - in my data 1100).
我需要对每个单元格进行标准化,以便一个国家的观察结果除以该国家的观察总数除以面板结构中的观察总数(此处为 10 - 在我的数据中为 1100)。 Also I have showcased three countries (AL, UK, FR) but I have 92 in total so I need some general formula (mutate: by = country?).
我还展示了三个国家(AL、UK、FR),但我总共有 92 个,所以我需要一些通用公式(mutate:by = country?)。
This is my data这是我的数据
df1 <- data_frame(Country =
c("AL","AL","AL","AL","AL","AL","AL","AL","AL","AL",
"UK","UK","UK","UK","UK","UK","UK","UK","UK","UK",
"FR","FR","FR","FR","FR","FR","FR","FR","FR","FR"),
Obs = c(NA,NA,2,3,2,3,2,3,2,NA,1,2,1,2,1,2,1,2,1,2,NA,NA,NA,NA,NA,NA,NA,NA,4,NA))
df1
Country Obs
<chr> <dbl>
1 AL NA
2 AL NA
3 AL 2
4 AL 3
5 AL 2
6 AL 3
7 AL 2
8 AL 3
9 AL 2
10 AL NA
11 UK 1
12 UK 2
13 UK 1
14 UK 2
15 UK 1
16 UK 2
17 UK 1
18 UK 2
19 UK 1
20 UK 2
21 FR NA
22 FR NA
23 FR NA
24 FR NA
25 FR NA
26 FR NA
27 FR NA
28 FR NA
29 FR 4
30 FR NA
Now, what I want is to divide each cell with number of observations available for each country / total obs like so,现在,我想要的是将每个单元格与每个国家/地区可用的观察数量/总观测值分开,就像这样,
df2 <- data_frame(Country =
c("AL","AL","AL","AL","AL","AL","AL","AL","AL","AL",
"UK","UK","UK","UK","UK","UK","UK","UK","UK","UK",
"FR","FR","FR","FR","FR","FR","FR","FR","FR","FR"),
Obs = c(NA,NA,2*7/10,3*7/10,2*7/10,3*7/10,2*7/10,3*7/10,2*7/10,
NA,1*10/10,2*10/10,1*10/10,2*10/10,1*10/10,2*10/10,1*10/10,
2*10/10,1*10/10,2*10/10,NA,NA,NA,NA,NA,NA,NA,NA,4*1/10,NA))
df2
Country Obs
<chr> <dbl>
1 AL NA
2 AL NA
3 AL 1.4
4 AL 3.7
5 AL 2.7
6 AL 3.7
7 AL 2.7
8 AL 3.7
9 AL 2.7
10 AL NA
11 UK 1
12 UK 2
13 UK 1
14 UK 2
15 UK 1
16 UK 2
17 UK 1
18 UK 2
19 UK 1
20 UK 2
21 FR NA
22 FR NA
23 FR NA
24 FR NA
25 FR NA
26 FR NA
27 FR NA
28 FR NA
29 FR 0.4
30 FR NA
I am interested in solving the problem obviously BUT I would really really appreciate it if you could show me how to do this for multiple columns as my original data needs this same operation done for many columns where the country tickers (AL, UK, FR in example) remains the same.我显然有兴趣解决这个问题,但是如果你能告诉我如何对多个列执行此操作,我将非常感激,因为我的原始数据需要对许多国家代码(AL、UK、FR in示例)保持不变。
You can do:你可以做:
library(dplyr)
df1 %>%
group_by(Country) %>%
mutate(Obs = Obs * sum(!is.na(Obs))/n()) %>%
ungroup
# Country Obs
# <chr> <dbl>
# 1 AL NA
# 2 AL NA
# 3 AL 1.4
# 4 AL 2.1
# 5 AL 1.4
# 6 AL 2.1
# 7 AL 1.4
# 8 AL 2.1
# 9 AL 1.4
#10 AL NA
# … with 20 more rows
sum(.is.na(Obs))
counts number of non-NA values in the Country
whereas n()
gives the number of rows for the Country
. sum(.is.na(Obs))
计算Country
中非 NA 值的数量,而n()
给出Country
的行数。
For multiple columns -对于多列 -
df1 %>%
group_by(Country) %>%
mutate(across(col1:col4, ~. * sum(!is.na(.))/n())) %>%
ungroup
This will be applied to col1
to col4
in your dataframe.这将应用于 dataframe 中的
col1
到col4
。
Using data.table
使用
data.table
library(data.table)
setDT(df1)[, Obs := Obs * mean(!is.na(Obs)), County]
Or using dplyr
或使用
dplyr
library(dplyr)
df1 %>%
group_by(Country) %>%
mutate(Obs = Obs * mean(!is.na(Obs)))
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.