Assign values to dataframe subset in R

Question

I'm having trouble assigning a dataframe to a subset of another. In the example below, the line

ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")

only modifies one column instead of two. I would expect it either to modify no columns or both, not only one. I wrote the function to fill in the PrefName and CountyID columns in dataframe ds where they are NA by joining ds to another dataframe cs.

As you can see if you run it, the test is failing because PrefName is not getting filled in. After doing a bit of debugging, I realized that join() is doing exactly what it is expected to do, but the actual assignment of the result of that join somehow resets PrefName to NA.

# fully copy-paste-run-able (but broken) code                                                    
suppressMessages({                                                          
    library("plyr")                                                         
    library("methods")                                                      
    library("testthat")                                                     
}) 

# Fill in the missing PrefName/CountyIDs in delstat                         
#   - Find the missing values in Delstat                                    
#   - Grab the CityState Primary Record values                              
#   - Match on zipcode to fill in the holes in the delstat data             
#   - Remove any codes that could not be fixed                              
#   - @param ds: delstat dataframe with 6 columns (see test case)           
#   - @param cs: citystate dataframe with 6 columns (see test case) 
getMissingCounties <- function(ds, cs) {                                    

    if (any(is.na(ds$CountyID))) {

        cavities <- which(is.na(ds$CountyID))                               
        fillings <- cs[cs$PrimRec==TRUE, c(1,3,4)]                          

        ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")

        ds <- ds[!is.na(ds$CountyID),]                                      
    }                                                                       

    return(ds)                                                              
}                                                                           
test_getMissingCounties <- function() {                                     

    ds <- data.frame(                                                       
        CityStateKey = c(1,     2,  3,  4  ),                               
        ZipCode      = c(11,    22, 33, 44 ),                               
        Business     = c(1,     1,  1,  1  ),                               
        Residential  = c(1,     1,  1,  1  ),                               
        PrefName     = c("One", NA , NA, NA),                               
        CountyID     = c(111,   NA,  NA, NA))                               

    cs <- data.frame(                                                       
        ZipCode      = c(11,    22,    22,    33,      55    ),             
        Name         = c("eh",  "eh?", "eh?", "eh!?",  "ah." ),             
        PrefName     = c("One", "To",  "Two", "Three", "Five"),             
        CountyID     = c(111,   222,   222,   333,     555   ),             
        PrimRec      = c(TRUE,  FALSE, TRUE,  TRUE,    TRUE  ),             
        CityStateKey = c(1,     2,     2,     3,       5     ))             

    expected <- data.frame(                                                 
        CityStateKey = c(1,     2,     3      ),                            
        ZipCode      = c(11,    22,    33     ),                            
        Business     = c(1,     1,     1      ),                            
        Residential  = c(1,     1,     1      ),                            
        PrefName     = c("One", "Two", "Three"),                            
        CountyID     = c(111,   222,   333    ))                            

    expect_equal(getMissingCounties(ds, cs), expected)                      
}

# run the test
test_getMissingCounties()

The results are:

CityStateKey ZipCode Business Residential PrefName CountyID
       1       11        1          1       One      111
       2       22        1          1      <NA>      222
       3       33        1          1      <NA>      333

Any ideas why PrefName is getting set to NA by the assignment or how to do the assignment so I don't lose data?

Answer 1

The short answer is that you can avoid this problem by making sure that there are no factors in your data frames. You do this by using stringsAsFactors=FALSE in the call(s) to data.frame(...). Note that many of the data import functions, including read.table(...) and read.csv(...), also convert character to factor by default. You can defeat this behavior the same way.
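As a minimal sketch of that fix, here is a cut-down version of your data built with stringsAsFactors=FALSE. Note the assumptions: the columns are trimmed to the three that matter, fillings is typed in by hand rather than derived from cs, and base R's match() stands in for plyr's join() so the snippet has no package dependencies. It is an illustration, not a drop-in replacement for your function.

```r
# Cut-down ds: character PrefName instead of a factor.
ds <- data.frame(ZipCode  = c(11, 22, 33, 44),
                 PrefName = c("One", NA, NA, NA),
                 CountyID = c(111, NA, NA, NA),
                 stringsAsFactors = FALSE)

# Hand-written stand-in for the primary records taken from cs.
fillings <- data.frame(ZipCode  = c(11, 22, 33, 55),
                       PrefName = c("One", "Two", "Three", "Five"),
                       CountyID = c(111, 222, 333, 555),
                       stringsAsFactors = FALSE)

cavities <- which(is.na(ds$CountyID))

# match() gives the row of fillings for each cavity (NA when there is none).
idx  <- match(ds$ZipCode[cavities], fillings$ZipCode)
cols <- c("PrefName", "CountyID")

# Character columns have no level set, so nothing is lost here.
ds[cavities, cols] <- fillings[idx, cols]

# Drop the rows that could not be filled.
ds <- ds[!is.na(ds$CountyID), ]
ds
```

Because PrefName is a plain character vector, the partial assignment has no levels to reconcile and the names come through intact.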

This problem is actually quite subtle, and is also a good example of how R's "silent coercion" between data types creates all sorts of problems.

The data.frame(...) function converts any character vectors to factors by default (this was the default before R 4.0.0). So in your code ds$PrefName is a factor with one level, and cs$PrefName is a factor with 5 levels. So in your assignment statement:

ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")

the 5th column on the LHS is a factor with 1 level, and the 5th column on the RHS is a factor with 5 levels.

Under some circumstances, when you assign values from a factor with more levels into a factor with fewer levels, any value whose level does not exist in the target is set to NA. Consider this:

x <- c("A","B",NA,NA,NA)  # character vector          
y <- LETTERS[1:5]         # character vector
class(x); class(y)
# [1] "character"
# [1] "character"

df <- data.frame(x,y)     # x and y coerced to factor
sapply(df,class)          # df$x and df$y are factors
#        x        y 
# "factor" "factor" 

# assign rows 3:5 of col 2 to col 1
df[3:5,1] <- df[3:5,2]    # fails with a warning
# Warning message:
# In `[<-.factor`(`*tmp*`, iseq, value = 3:5) :
#   invalid factor level, NA generated
df                        # missing levels set to NA
#      x y
# 1    A A
# 2    B B
# 3 <NA> C
# 4 <NA> D
# 5 <NA> E

The example above is equivalent to your assignment statement. However, notice what happens if you assign all of column 2 to column 1.

# assign all of col 2 to col 1
df <- data.frame(x,y)
df[,1] <- df[,2]          # succeeds!!
df
#   x y
# 1 A A
# 2 B B
# 3 C C
# 4 D D
# 5 E E

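This works, because assigning the whole column replaces the factor, levels included, instead of inserting values into the existing one.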
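If you need to keep the factor and still assign to only some rows, one option is to widen the target's level set first. The sketch below reuses the x and y vectors from above; stringsAsFactors=TRUE is passed explicitly so it behaves the same on any R version, and the value is converted with as.character() so the assignment matches on labels rather than on integer codes.

```r
x <- c("A", "B", NA, NA, NA)
y <- LETTERS[1:5]
df <- data.frame(x, y, stringsAsFactors = TRUE)

levels(df$x)   # only "A" and "B", so "C", "D", "E" would become NA

# Add the levels that exist in y but not yet in x.
levels(df$x) <- union(levels(df$x), levels(df$y))

# Assign by label; every label is now a valid level of df$x.
df$x[3:5] <- as.character(df$y[3:5])

as.character(df$x)
```

After the levels are widened, the same partial assignment no longer produces NA values or a warning.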

Finally, a note on debugging: if you are debugging a function, sometimes it is useful to run through the statements line by line at the command line (e.g., in the global environment). If you had done that, you would have seen the warning above, whereas inside the function call it never surfaced.
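If stepping through by hand is not practical, another way to make sure a warning cannot slip past is to catch it explicitly. This is only a sketch: risky() is a hypothetical stand-in for the function under test, not part of your code.

```r
# Hypothetical function that triggers the same kind of factor warning.
risky <- function() {
    f <- factor(c("A", "B"))
    f[2] <- "C"   # "C" is not a level of f, so R warns and stores NA
    f
}

# tryCatch() hands the warning to the handler instead of letting it pass by;
# here the handler just returns the warning text.
result <- tryCatch(risky(), warning = function(w) conditionMessage(w))
result
```

Setting options(warn = 1) prints each warning as soon as it occurs, and options(warn = 2) turns warnings into errors, which is a blunter way to get the same effect while debugging.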

Answer 2

The constraints of the test can be satisfied by reimplementing getMissingCounties with:

merge(ds[1:4], subset(cs, PrimRec)[c(1, 3, 4)], by="ZipCode")

Caveat: the ZipCode column is always emitted first, which differs from your expected result.
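The column order can be restored by indexing the merged result with the original column names. A sketch using the test data from the question, with stringsAsFactors=FALSE added as an assumption so it runs the same on any R version. Keep in mind that merge() here is an inner join that takes PrefName and CountyID from cs for every row, so it is not equivalent to filling only the NA rows; it just happens to give the same result on this test data.

```r
ds <- data.frame(
    CityStateKey = c(1, 2, 3, 4),
    ZipCode      = c(11, 22, 33, 44),
    Business     = c(1, 1, 1, 1),
    Residential  = c(1, 1, 1, 1),
    PrefName     = c("One", NA, NA, NA),
    CountyID     = c(111, NA, NA, NA),
    stringsAsFactors = FALSE)

cs <- data.frame(
    ZipCode      = c(11, 22, 22, 33, 55),
    Name         = c("eh", "eh?", "eh?", "eh!?", "ah."),
    PrefName     = c("One", "To", "Two", "Three", "Five"),
    CountyID     = c(111, 222, 222, 333, 555),
    PrimRec      = c(TRUE, FALSE, TRUE, TRUE, TRUE),
    CityStateKey = c(1, 2, 2, 3, 5),
    stringsAsFactors = FALSE)

# Primary records only, keeping the key and the two columns to fill in.
fillings <- subset(cs, PrimRec)[c("ZipCode", "PrefName", "CountyID")]

# Inner join on ZipCode; rows of ds without a match (44) are dropped.
merged <- merge(ds[1:4], fillings, by = "ZipCode")

# Put the columns back in the order used by ds.
merged <- merged[names(ds)]
merged
```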

But to answer the subassignment question: it breaks because the level sets of PrefName are incompatible between ds and cs. Either avoid using a factor or relevel them. You might have missed R's warning about this, because testthat was somehow suppressing warnings.
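For the "avoid using a factor" route, when the data frame already exists and you cannot change how it was built, the factor columns can be converted to character in place. A small sketch on a two-row toy frame (the data is made up for illustration; stringsAsFactors=TRUE is set explicitly to reproduce the factor column):

```r
ds <- data.frame(ZipCode  = c(11, 22),
                 PrefName = c("One", NA),
                 stringsAsFactors = TRUE)

# Find the factor columns and convert only those to character.
is_fac <- vapply(ds, is.factor, logical(1))
ds[is_fac] <- lapply(ds[is_fac], as.character)

# A value that was never a level can now be assigned without loss.
ds$PrefName[2] <- "Two"
ds
```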


solution1
1 ACCEPTED 2014-09-26 21:14:48

solution2
-1 2014-09-24 21:07:15
