R: Loop over unique values in a data frame column to create another column based on conditions

My dataset consists of scores and total respondents for questions asked in a survey, over a number of fiscal years (FY13, FY14 & FY15) and in different regions.

My objective is to loop through the FY column, identify when each question was asked in each region, and store this information in a new column.

This is what a reproducible sample looks like:

testdf=data.frame(FY=c("FY13","FY14","FY15","FY14","FY15","FY13","FY14","FY15","FY13","FY15","FY13","FY14","FY15","FY13","FY14","FY15"),
              Region=c(rep("AFRICA",5),rep("ASIA",5),rep("AMERICA",6)),
              QST=c(rep("Q2",3),rep("Q5",2),rep("Q2",3),rep("Q5",2),rep("Q2",3),rep("Q5",3)),
              Very.Satisfied=runif(16,min = 0, max=1),
              Total.Very.Satisfied=floor(runif(16,min=10,max=120)),
              Satisfied=runif(16,min = 0, max=1),
              Total.Satisfied=floor(runif(16,min=10,max=120)),
              Dissatisfied=runif(16,min = 0, max=1),
              Total.Dissatisfied=floor(runif(16,min=10,max=120)),
              Very.Dissatisfied=runif(16,min = 0, max=1),
              Total.Very.Dissatisfied=floor(runif(16,min=10,max=120)))

I start by creating an ID column, concatenating Region & QST:

library(tidyr)
testdf = testdf %>%
unite(ID,c('Region','QST'),sep = "",remove = F)

My Objective

1) For each unique ID, identify whether the given question was asked:

a) Only on one year (either FY13, FY14 or FY15)

b) Over the Past Two Years (FY15 & FY14 only)

c) Over the Past Three Years (FY15 & FY14 & FY13)

d) On FY13 & FY15 Only

My Attempt

For this problem, I tried a for loop: for each unique ID, I first store the unique occurrences of each FY in which the question was asked in a vector v. Then, using an if/else statement, I assign a comment to a newly created column called Tally based on these occurrences.

for (i in unique(testdf$ID))
{
v=unique(testdf$FY)

  if(('FY15' %in% v) & ('FY14' %in% v)) {
      testdf$Tally=='Asked Over The Past Two Years'
  } 
  else if(('FY15' %in% v) & ('FY14' %in% v) & ('FY13' %in% v)) {
       testdf$Tally=='Asked Over The Past Three Years'
  }
  else if(('FY13' %in% v) & ('FY15' %in% v)) {
        testdf$Tally=='Question Asked in FY13 & FY15 Only'
  }
  else { testdf$Tally=='Question Asked Once Only' 
  }

}  

The loop seems to run without throwing an error message, but it doesn't seem to create the new Tally column.

Any help with this will be greatly appreciated.

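The main problem in your code is that the if-else branches perform a comparison (using '==') instead of an assignment (using '<-'), so the Tally column is never created. Two further issues: v is computed from the whole FY column rather than from the rows of the current ID, and the two-year condition is tested before the three-year condition, so the three-year branch can never be reached. Here's a solution that I find more elegant, since it doesn't use a loop: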

require(tidyverse)

testdf %>%
  select(ID, FY) %>%
  unique() %>%
  mutate(is_true = 1) %>%
  spread(key = FY, value = is_true, fill = 0) %>%
  mutate(tally = case_when(
    FY13 == 1 & FY14 == 1 & FY15 == 1 ~ 'Asked Over The Past Three Years',
                FY14 == 1 & FY15 == 1 ~ 'Asked Over the Past Two Years',
    FY13 == 1 &             FY15 == 1 ~ 'Asked in FY13 & FY15 Only',
    TRUE ~ 'Question Asked Once Only'
  ))

Output:

+------------------------------------------------------------+
|          ID FY13 FY14 FY15                           tally |
+------------------------------------------------------------+
| 1  AFRICAQ2    1    1    1 Asked Over The Past Three Years |
| 2  AFRICAQ5    0    1    1   Asked Over the Past Two Years |
| 3 AMERICAQ2    1    1    1 Asked Over The Past Three Years |
| 4 AMERICAQ5    1    1    1 Asked Over The Past Three Years |
| 5    ASIAQ2    1    1    1 Asked Over The Past Three Years |
| 6    ASIAQ5    1    0    1       Asked in FY13 & FY15 Only |
+------------------------------------------------------------+
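
For comparison, the original loop can also be repaired directly. The following is only a sketch on a small stand-in data frame (df and the extra ID "ASIAQ7" are made up for illustration, they are not part of the question's data): it subsets the years per ID, tests the most specific condition first, and assigns with '<-'.

```r
# Small stand-in for testdf: only the columns the loop needs.
# "ASIAQ7" is a hypothetical ID added to exercise the fallback branch.
df <- data.frame(
  ID = c(rep("AFRICAQ2", 3), rep("AFRICAQ5", 2), rep("ASIAQ5", 2), "ASIAQ7"),
  FY = c("FY13", "FY14", "FY15", "FY14", "FY15", "FY13", "FY15", "FY14"),
  stringsAsFactors = FALSE
)
df$Tally <- NA_character_            # create the column before filling it

for (i in unique(df$ID)) {
  rows <- df$ID == i
  v <- unique(df$FY[rows])           # years for this ID only

  # Test the most specific case (all three years) first.
  label <- if (all(c("FY13", "FY14", "FY15") %in% v)) {
    "Asked Over The Past Three Years"
  } else if (all(c("FY14", "FY15") %in% v)) {
    "Asked Over The Past Two Years"
  } else if (all(c("FY13", "FY15") %in% v)) {
    "Question Asked in FY13 & FY15 Only"
  } else {
    "Question Asked Once Only"
  }

  df$Tally[rows] <- label            # assignment with <-, not ==
}

print(unique(df[c("ID", "Tally")]))
```

As in the original attempt, the final else branch also catches any combination that is not listed explicitly (for example FY13 & FY14 only), so it may need its own branch if that case matters.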

No need for a loop:

library(tidyverse)

result <- testdf %>%
    select(QST, Region, FY) %>%   # select by name; positions shift once ID is added
    mutate(Asked = 1) %>%
    spread(FY, Asked)

> result
  QST  Region FY13 FY14 FY15
1  Q2  AFRICA    1    1    1
2  Q2 AMERICA    1    1    1
3  Q2    ASIA    1    1    1
4  Q5  AFRICA   NA    1    1
5  Q5 AMERICA    1    1    1
6  Q5    ASIA    1   NA    1

This table answers all four questions in one go.

If you really want a tally column, expand it like this:

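result %>%
    # count the years with a value; a plain FY13 + FY14 + FY15 would be NA
    # whenever a year is missing, so "Only one year" could never match
    mutate(n_asked = rowSums(cbind(FY13, FY14, FY15), na.rm = TRUE),
           Tally = case_when(n_asked == 1 ~ "Only one year",
                             n_asked == 3 ~ "Past three years",
                             FY14 + FY15 == 2 ~ "Past two years",
                             FY13 + FY15 == 2 ~ "FY13 and FY15 only")) %>%
    select(-n_asked)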

  QST  Region FY13 FY14 FY15              Tally
1  Q2  AFRICA    1    1    1   Past three years
2  Q2 AMERICA    1    1    1   Past three years
3  Q2    ASIA    1    1    1   Past three years
4  Q5  AFRICA   NA    1    1     Past two years
5  Q5 AMERICA    1    1    1   Past three years
6  Q5    ASIA    1   NA    1 FY13 and FY15 only
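
In current tidyr, spread() is superseded by pivot_wider(). A minimal sketch of the same reshaping step, assuming dplyr and tidyr are installed and using a small stand-in data frame (df is made up for illustration), could look like this. Filling the gaps with 0 instead of NA also keeps sums such as FY13 + FY14 + FY15 well defined.

```r
library(dplyr)
library(tidyr)

# Stand-in for the Region/QST/FY columns of testdf.
df <- data.frame(
  FY     = c("FY13", "FY14", "FY15", "FY14", "FY15", "FY13", "FY15"),
  Region = c(rep("AFRICA", 5), rep("ASIA", 2)),
  QST    = c(rep("Q2", 3), rep("Q5", 2), rep("Q5", 2)),
  stringsAsFactors = FALSE
)

wide <- df %>%
  distinct(QST, Region, FY) %>%      # one row per question, region and year
  mutate(Asked = 1) %>%
  pivot_wider(names_from = FY, values_from = Asked,
              values_fill = list(Asked = 0))   # 0 = not asked in that year

wide
```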

Consider base R's ave for the grouped calculation by Region and QST, combined with nested ifelse for the conditional logic:

testdf <- within(testdf, {
                   # 1 if the Region/QST group was asked in that year, else 0
                   FY13 <- ave(ifelse(FY=='FY13', 1, 0), Region, QST, FUN=max)
                   FY14 <- ave(ifelse(FY=='FY14', 1, 0), Region, QST, FUN=max)
                   FY15 <- ave(ifelse(FY=='FY15', 1, 0), Region, QST, FUN=max)

                   Tally <- ifelse(FY13 + FY14 + FY15 == 1, 'Asked Only on One Year',
                            ifelse(FY13 + FY14 + FY15 == 3, 'Asked Over the Past Three Years',
                            ifelse(FY14 + FY15 == 2, 'Asked Over the Past Two Years',
                            ifelse(FY13 + FY15 == 2, 'Asked On FY13 & FY15 Only', NA))))

                   # drop the helper columns again
                   FY13 <- NULL; FY14 <- NULL; FY15 <- NULL
             })

testdf[c("Region", "QST", "FY", "Tally")]

#     Region QST   FY                           Tally
# 1   AFRICA  Q2 FY13 Asked Over the Past Three Years
# 2   AFRICA  Q2 FY14 Asked Over the Past Three Years
# 3   AFRICA  Q2 FY15 Asked Over the Past Three Years
# 4   AFRICA  Q5 FY14   Asked Over the Past Two Years
# 5   AFRICA  Q5 FY15   Asked Over the Past Two Years
# 6     ASIA  Q2 FY13 Asked Over the Past Three Years
# 7     ASIA  Q2 FY14 Asked Over the Past Three Years
# 8     ASIA  Q2 FY15 Asked Over the Past Three Years
# 9     ASIA  Q5 FY13       Asked On FY13 & FY15 Only
# 10    ASIA  Q5 FY15       Asked On FY13 & FY15 Only
# 11 AMERICA  Q2 FY13 Asked Over the Past Three Years
# 12 AMERICA  Q2 FY14 Asked Over the Past Three Years
# 13 AMERICA  Q2 FY15 Asked Over the Past Three Years
# 14 AMERICA  Q5 FY13 Asked Over the Past Three Years
# 15 AMERICA  Q5 FY14 Asked Over the Past Three Years
# 16 AMERICA  Q5 FY15 Asked Over the Past Three Years
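
To see what ave contributes in this approach, here is a toy illustration (grp and flag are made-up vectors, not part of the data above): ave applies FUN within each group and returns a vector as long as its input, so every row receives its group's result.

```r
# Toy data: two groups, with a 0/1 flag per row.
grp  <- c("a", "a", "a", "b", "b")
flag <- c(0, 1, 0, 0, 0)

# The group maximum is 1 as soon as any row in the group has the flag set,
# and that value is repeated for every row of the group.
group_max <- ave(flag, grp, FUN = max)
print(group_max)
```

This is why the Tally value can be computed per Region/QST group and still end up on every original row without a separate merge step.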

Here's a solution using your ID column. (paste0 builds it a little more readably, with an underscore separator: testdf$ID <- paste0(testdf$Region, "_", testdf$QST).)

We reshape testdf to wide format with dcast from the reshape2 package, counting the rows per ID and FY.

library(reshape2)
tmp <- dcast(testdf, ID ~ FY, 
               value.var="QST", fun.aggregate=length)
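
If reshape2 is not available, roughly the same count table can be sketched in base R with table(). This is an alternative shown on a small stand-in data frame (df is made up for illustration), not the code used in the rest of this answer, and it keeps the IDs as row names rather than as a column.

```r
# Stand-in for the ID and FY columns of testdf.
df <- data.frame(
  ID = c(rep("AFRICA_Q2", 3), rep("AFRICA_Q5", 2), rep("ASIA_Q5", 2)),
  FY = c("FY13", "FY14", "FY15", "FY14", "FY15", "FY13", "FY15"),
  stringsAsFactors = FALSE
)

# Count rows per ID and fiscal year; 0 means the question was not asked.
counts <- as.data.frame.matrix(table(df$ID, df$FY))
counts
```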

Now we already know whether the question was asked in the different years. To answer the further questions, we'll do some maths.

tmp <- cbind(tmp, 
             past2=as.numeric(tmp$FY14 + tmp$FY15 == 2 & tmp$FY13 == 0), 
             past3=as.numeric(tmp$FY13 + tmp$FY14 + tmp$FY15 == 3),
             y13_15=as.numeric(tmp$FY13 + tmp$FY15 == 2 & tmp$FY14 == 0))

Columns 5 to 7 (past2, past3, y13_15) now encode the desired Tally information, which we collapse into one string of 0s and 1s per row,

tmp$Tally <- apply(tmp, 1, function(x) paste0(x[5:7], collapse=""))

translate into human language by factor levels,

tmp$Tally <- factor(tmp$Tally, labels=c('Question Asked Once Only',
                                        'Question Asked in FY13 & FY15 Only',
                                        'Asked Over The Past Three Years',
                                        'Asked Over The Past Two Years'))

and merge with the original data frame to achieve the desired result.

Result

> merge(testdf, tmp[c("ID", "Tally")])
             ID   FY    Region QST                              Tally
1     AFRICA_Q2 FY13    AFRICA  Q2    Asked Over The Past Three Years
2     AFRICA_Q2 FY14    AFRICA  Q2    Asked Over The Past Three Years
3     AFRICA_Q2 FY15    AFRICA  Q2    Asked Over The Past Three Years
4     AFRICA_Q5 FY14    AFRICA  Q5      Asked Over The Past Two Years
5     AFRICA_Q5 FY15    AFRICA  Q5      Asked Over The Past Two Years
6    AMERICA_Q2 FY13   AMERICA  Q2    Asked Over The Past Three Years
7    AMERICA_Q2 FY14   AMERICA  Q2    Asked Over The Past Three Years
8    AMERICA_Q2 FY15   AMERICA  Q2    Asked Over The Past Three Years
9    AMERICA_Q5 FY13   AMERICA  Q5    Asked Over The Past Three Years
10   AMERICA_Q5 FY14   AMERICA  Q5    Asked Over The Past Three Years
11   AMERICA_Q5 FY15   AMERICA  Q5    Asked Over The Past Three Years
12 ANTH.CTRY_Q2 FY15 ANTH.CTRY  Q2           Question Asked Once Only
13      ASIA_Q2 FY13      ASIA  Q2    Asked Over The Past Three Years
14      ASIA_Q2 FY14      ASIA  Q2    Asked Over The Past Three Years
15      ASIA_Q2 FY15      ASIA  Q2    Asked Over The Past Three Years
16      ASIA_Q5 FY13      ASIA  Q5 Question Asked in FY13 & FY15 Only
17      ASIA_Q5 FY15      ASIA  Q5 Question Asked in FY13 & FY15 Only
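
A note on the factor step: without an explicit levels argument, factor() sorts the unique patterns ("000", "001", "010", "100") and pairs the labels with them in that order, so the labels would shift if one pattern did not occur in the data. A slightly more defensive sketch, using a made-up pattern vector for illustration, names the levels explicitly:

```r
# Made-up indicator patterns in the order past2, past3, y13_15.
pattern <- c("010", "100", "001", "000", "010")

# Explicit levels keep each label tied to its pattern even if a pattern
# is absent from the data.
tally <- factor(pattern,
                levels = c("000", "001", "010", "100"),
                labels = c("Question Asked Once Only",
                           "Question Asked in FY13 & FY15 Only",
                           "Asked Over The Past Three Years",
                           "Asked Over The Past Two Years"))
as.character(tally)
```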

Data

testdf <- structure(list(FY = c("FY13", "FY14", "FY15", "FY14", "FY15", 
"FY13", "FY14", "FY15", "FY13", "FY15", "FY13", "FY14", "FY15", 
"FY13", "FY14", "FY15", "FY15"), Region = c("AFRICA", "AFRICA", 
"AFRICA", "AFRICA", "AFRICA", "ASIA", "ASIA", "ASIA", "ASIA", 
"ASIA", "AMERICA", "AMERICA", "AMERICA", "AMERICA", "AMERICA", 
"AMERICA", "ANTH.CTRY"), QST = c("Q2", "Q2", "Q2", "Q5", "Q5", 
"Q2", "Q2", "Q2", "Q5", "Q5", "Q2", "Q2", "Q2", "Q5", "Q5", "Q5", 
"Q2")), row.names = c(NA, 17L), class = "data.frame")

