
Counting the rows in a data frame that satisfy some criteria and grouping them by the unique values in the first column of the data frame

I have data with household IDs, gender and age, as follows:

# Note: .Names sets the column names, so the household column is called ID.
mydata <- structure(list(ID     = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5),
                         GENDER = c(1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1),
                         AGE    = c(50, 45, 3, 15, 25, 5, 32, 30, 10, 28, 64, 16)),
                    .Names = c("ID", "GENDER", "AGE"),
                    class = "data.frame", row.names = c(NA, -12L))

mydata

#    ID GENDER AGE
# 1   1      1  50
# 2   1      2  45
# 3   1      1   3
# 4   1      1  15
# 5   2      1  25
# 6   2      2   5
# 7   3      2  32
# 8   3      1  30
# 9   3      2  10
# 10  4      2  28
# 11  5      1  64
# 12  5      1  16

I have another data frame, let's call it 'output', that has only the unique HH_ID values and some other columns next to it. What I would like to do is add new columns to this data frame that show:

  • "the number of adult females (Gender = 2 and Age >= 18)" (Num_Fem),
  • "the number of adult males (Gender = 1 and Age >= 18)" (Num_Male),
  • "the number of school children (6-18)" (Num_Sch), and
  • "the number of preschool children (0-6)" (Num_PreSch)

for each household. So 'output' should look like this:

#   HH_ID Col1 Col2 ... Num_Fem Num_Male Num_PreSch Num_Sch
# 1     1   ..   ..           1        1          1       1
# 2     2   ..   ..           0        1          1       0
# 3     3   ..   ..           1        1          0       1
# 4     4   ..   ..           1        0          0       0
# 5     5   ..   ..           0        1          0       1

I tried many different functions and packages, but none of them did exactly what I want. I would appreciate any help or comments.

There may be a fancier way to do this, but you can simply use a for loop, as follows:

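# The household column in mydata is ID (set by .Names in the question);
# `output` is the asker's existing data frame, keyed by HH_ID.
mydata  <- as.data.frame(mydata)
Num_Fem <- Num_Male <- Num_PreSch <- Num_Sch <- c()

for (hh in output$HH_ID) {
  # Rows belonging to the current household
  curr_HH    <- mydata[mydata$ID == hh, ]

  # Append this household's count for each category
  Num_Fem    <- c(Num_Fem,    nrow(curr_HH[curr_HH$GENDER == 2 & curr_HH$AGE >= 18, ]))
  Num_Male   <- c(Num_Male,   nrow(curr_HH[curr_HH$GENDER == 1 & curr_HH$AGE >= 18, ]))
  Num_PreSch <- c(Num_PreSch, nrow(curr_HH[curr_HH$AGE < 6, ]))
  Num_Sch    <- c(Num_Sch,    nrow(curr_HH[curr_HH$AGE >= 6 & curr_HH$AGE < 18, ]))
}

# Attach the four count columns to the existing output frame
output <- cbind(output, data.frame(Num_Fem, Num_Male, Num_PreSch, Num_Sch))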


It will give you your expected results:

#   HH_ID Col1 Col2 ... Num_Fem Num_Male Num_PreSch Num_Sch
# 1     1   ..   ..           1        1          1       1
# 2     2   ..   ..           0        1          1       0
# 3     3   ..   ..           1        1          0       1
# 4     4   ..   ..           1        0          0       0
# 5     5   ..   ..           0        1          0       1
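
For what it's worth, the same counts can probably be obtained without an explicit loop. The following is only a sketch in base R, not a definitive version: it assumes the household column in mydata is named ID (as set by .Names in the question), and it uses a made-up one-column `output` keyed by HH_ID, since the real `output` isn't shown. The names `flags` and `counts` are purely illustrative.

```r
# Data from the question
mydata <- data.frame(
  ID     = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5),
  GENDER = c(1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1),
  AGE    = c(50, 45, 3, 15, 25, 5, 32, 30, 10, 28, 64, 16)
)

# Hypothetical stand-in for the asker's `output`: one row per household
output <- data.frame(HH_ID = unique(mydata$ID))

# One logical column per category, one row per person
flags <- with(mydata, cbind(
  Num_Fem    = GENDER == 2 & AGE >= 18,
  Num_Male   = GENDER == 1 & AGE >= 18,
  Num_PreSch = AGE < 6,
  Num_Sch    = AGE >= 6 & AGE < 18
))

# Convert TRUE/FALSE to 1/0, then add up the rows within each household
counts <- rowsum(flags + 0L, group = mydata$ID)

# Look rows up by household ID (the row names of `counts`), not by position
output <- cbind(output, counts[as.character(output$HH_ID), , drop = FALSE])
output
```

Matching on the row names of `counts` means the result does not depend on `output` listing the households in the same order as mydata.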

Hope it helps.

You're already thinking about this in a way that translates well to logical statements (e.g. is this person female and 18 or over), so I'd do it with a series of logical vectors, using the fact that true/false translates to 1/0, so you can sum them.
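
As a minimal illustration of that fact (the vector `ages` is just household 1 from the question, copied here for convenience, and the variable names are made up), comparing a numeric vector against a threshold gives a logical vector, and sum() counts its TRUE values:

```r
# Ages of the four members of household 1 in the question's data
ages <- c(50, 45, 3, 15)

# Element-wise comparison yields a logical vector
is_adult <- ages >= 18

# TRUE is treated as 1 and FALSE as 0, so sum() counts the adults
n_adults <- sum(is_adult)
n_adults
```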

Set up the different categories and create logical columns for each.

library(tidyverse)

mydata %>%
  mutate(adult_female = (GENDER == 2 & AGE >= 18),
         adult_male = (GENDER == 1 & AGE >= 18),
         school = between(AGE, 6, 18),
         preschool = between(AGE, 0, 6))
#>    ID GENDER AGE adult_female adult_male school preschool
#> 1   1      1  50        FALSE       TRUE  FALSE     FALSE
#> 2   1      2  45         TRUE      FALSE  FALSE     FALSE
#> 3   1      1   3        FALSE      FALSE  FALSE      TRUE
#> 4   1      1  15        FALSE      FALSE   TRUE     FALSE
#> 5   2      1  25        FALSE       TRUE  FALSE     FALSE
#> 6   2      2   5        FALSE      FALSE  FALSE      TRUE
#> 7   3      2  32         TRUE      FALSE  FALSE     FALSE
#> 8   3      1  30        FALSE       TRUE  FALSE     FALSE
#> 9   3      2  10        FALSE      FALSE   TRUE     FALSE
#> 10  4      2  28         TRUE      FALSE  FALSE     FALSE
#> 11  5      1  64        FALSE       TRUE  FALSE     FALSE
#> 12  5      1  16        FALSE      FALSE   TRUE     FALSE

Then you can group by household and sum all the columns of type logical.

mydata %>%
  mutate(adult_female = (GENDER == 2 & AGE >= 18),
         adult_male = (GENDER == 1 & AGE >= 18),
         school = between(AGE, 6, 18),
         preschool = between(AGE, 0, 6)) %>%
  group_by(ID) %>%
  summarise_if(is.logical, sum)
#> # A tibble: 5 x 5
#>      ID adult_female adult_male school preschool
#>   <dbl>        <int>      <int>  <int>     <int>
#> 1     1            1          1      1         1
#> 2     2            0          1      0         1
#> 3     3            1          1      1         0
#> 4     4            1          0      0         0
#> 5     5            0          1      1         0

One issue that I'll let you handle: the function between is inclusive of its endpoints. You've described preschool as ages 0 to 6, and school-aged as ages 6 to 18. That means 6-year-olds are counted in both. You probably want to adjust those endpoints, which shouldn't be too hard since it seems you're working with age as an integer.
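
One possible way to adjust the endpoints is sketched below. It assumes that a 6-year-old should count as school-aged and an 18-year-old as an adult; these are the same cut-offs the for-loop answer uses, but the question doesn't actually say, so treat them as an assumption. The sketch replaces between() with explicit half-open comparisons and sums directly inside summarise(); the name `counts` is only for illustration.

```r
library(dplyr)

# Data from the question
mydata <- data.frame(
  ID     = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5),
  GENDER = c(1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1),
  AGE    = c(50, 45, 3, 15, 25, 5, 32, 30, 10, 28, 64, 16)
)

# between() includes both endpoints, so age 6 matches both ranges
between(6, 0, 6)   # TRUE
between(6, 6, 18)  # TRUE

counts <- mydata %>%
  group_by(ID) %>%
  summarise(
    adult_female = sum(GENDER == 2 & AGE >= 18),
    adult_male   = sum(GENDER == 1 & AGE >= 18),
    school       = sum(AGE >= 6 & AGE < 18),  # assumed: 6 up to, not including, 18
    preschool    = sum(AGE < 6)               # assumed: under 6
  )
counts
```

Nobody in the sample data is exactly 6 or 18, so the counts come out the same as above; the difference only shows up once such ages appear.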


 