Assigning values to a subset of a data frame in R

Question

I'm having trouble assigning a data frame to a subset of another data frame. In the example below, the line

ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")

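modifies only one column instead of two. I would expect it to modify either both columns or neither, not just one. I wrote the function to fill in the PrefName and CountyID columns of the data frame ds, wherever they are NA, by joining ds to another data frame cs.

As you can see if you run it, the test fails because PrefName is not filled in. After some debugging, I realized that join() does exactly what is expected, but assigning the result of that join somehow sets PrefName back to NA.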

# fully copy-paste-run-able (but broken) code                                                    
suppressMessages({                                                          
    library("plyr")                                                         
    library("methods")                                                      
    library("testthat")                                                     
}) 

# Fill in the missing PrefName/CountyIDs in delstat                         
#   - Find the missing values in Delstat                                    
#   - Grab the CityState Primary Record values                              
#   - Match on zipcode to fill in the holes in the delstat data             
#   - Remove any codes that could not be fixed                              
#   - @param ds: delstat dataframe with 6 columns (see test case)           
#   - @param cs: citystate dataframe with 6 columns (see test case) 
getMissingCounties <- function(ds, cs) {                                    

    if (any(is.na(ds$CountyID))) {

        cavities <- which(is.na(ds$CountyID))                               
        fillings <- cs[cs$PrimRec==TRUE, c(1,3,4)]                          

        ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")

        ds <- ds[!is.na(ds$CountyID),]                                      
    }                                                                       

    return(ds)                                                              
}                                                                           
test_getMissingCounties <- function() {                                     

    ds <- data.frame(                                                       
        CityStateKey = c(1,     2,  3,  4  ),                               
        ZipCode      = c(11,    22, 33, 44 ),                               
        Business     = c(1,     1,  1,  1  ),                               
        Residential  = c(1,     1,  1,  1  ),                               
        PrefName     = c("One", NA , NA, NA),                               
        CountyID     = c(111,   NA,  NA, NA))                               

    cs <- data.frame(                                                       
        ZipCode      = c(11,    22,    22,    33,      55    ),             
        Name         = c("eh",  "eh?", "eh?", "eh!?",  "ah." ),             
        PrefName     = c("One", "To",  "Two", "Three", "Five"),             
        CountyID     = c(111,   222,   222,   333,     555   ),             
        PrimRec      = c(TRUE,  FALSE, TRUE,  TRUE,    TRUE  ),             
        CityStateKey = c(1,     2,     2,     3,       5     ))             

    expected <- data.frame(                                                 
        CityStateKey = c(1,     2,     3      ),                            
        ZipCode      = c(11,    22,    33     ),                            
        Business     = c(1,     1,     1      ),                            
        Residential  = c(1,     1,     1      ),                            
        PrefName     = c("One", "Two", "Three"),                            
        CountyID     = c(111,   222,   333    ))                            

    expect_equal(getMissingCounties(ds, cs), expected)                      
}

# run the test
test_getMissingCounties()

The result is:

CityStateKey ZipCode Business Residential PrefName CountyID
       1       11        1          1       One      111
       2       22        1          1      <NA>      222
       3       33        1          1      <NA>      333

Any ideas why PrefName gets set to NA by the assignment, or how to do the assignment so that I don't lose data?

Answer 1

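The short answer is that you can avoid this problem by making sure there are no factors in your data frames. You do this by passing stringsAsFactors=FALSE in the calls to data.frame(...). Note that many of the data import functions, including read.table(...) and read.csv(...), also convert character to factor by default. You can defeat this behavior the same way.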
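As a minimal sketch of this fix, the block below rebuilds a cut-down version of the test data with stringsAsFactors = FALSE and fills the missing rows by position. The use of match() and the reduced three-column frames are illustrative assumptions, not the original function. Since R 4.0.0 the default of stringsAsFactors is already FALSE, so passing it explicitly matters mainly on older versions.

```r
# Character columns stay character, so there are no factor levels to clash.
ds <- data.frame(
    ZipCode  = c(11, 22, 33),
    PrefName = c("One", NA, NA),
    CountyID = c(111, NA, NA),
    stringsAsFactors = FALSE)

fillings <- data.frame(
    ZipCode  = c(22, 33),
    PrefName = c("Two", "Three"),
    CountyID = c(222, 333),
    stringsAsFactors = FALSE)

# Rows of ds that still need a county
cavities <- which(is.na(ds$CountyID))

# For each missing row, find the row of fillings with the same ZipCode
idx <- match(ds$ZipCode[cavities], fillings$ZipCode)

# Assign both columns at once; with character data nothing is lost
ds[cavities, c("PrefName", "CountyID")] <-
    fillings[idx, c("PrefName", "CountyID")]
ds
```

Because PrefName is a plain character vector here, the subset assignment fills both columns.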

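The problem is actually quite subtle, and it is also a good example of how R's "silent coercion" between data types creates all sorts of problems.

By default (in R versions before 4.0.0), the data.frame(...) function converts any character vector to a factor. So in your code, ds$PrefName is a factor with one level, and cs$PrefName is a factor with five levels. So in your assignment statement: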

ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")

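the fifth column on the LHS is a factor with one level, and the fifth column on the RHS is a factor with five levels.

Under some circumstances, when you assign a factor with more levels to a factor with fewer levels, the values whose levels are missing from the target are set to NA. Consider this: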

x <- c("A","B",NA,NA,NA)  # character vector          
y <- LETTERS[1:5]         # character vector
class(x); class(y)
# [1] "character"
# [1] "character"

df <- data.frame(x,y)     # x and y coerced to factor
sapply(df,class)          # df$x and df$y are factors
#        x        y 
# "factor" "factor" 

# assign rows 3:5 of col 2 to col 1
df[3:5,1] <- df[3:5,2]    # fails with a warning
# Warning message:
# In `[<-.factor`(`*tmp*`, iseq, value = 3:5) :
#   invalid factor level, NA generated
df                        # missing levels set to NA
#      x y
# 1    A A
# 2    B B
# 3 <NA> C
# 4 <NA> D
# 5 <NA> E

The example above is equivalent to your assignment statement. However, notice what happens if you assign all of column 2 to column 1.

# assign all of col 2 to col 1
df <- data.frame(x,y)
df[,1] <- df[,2]          # succeeds!!
df
#   x y
# 1 A A
# 2 B B
# 3 C C
# 4 D D
# 5 E E

This works.
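One way to see why the partial assignment loses values is to compare the levels of the two columns, and one way around it is to convert to character before assigning. The following is a sketch under the assumption that stringsAsFactors = TRUE is passed explicitly, so that it behaves the same on R 4.0.0 and later, where factors are no longer the default.

```r
x <- c("A", "B", NA, NA, NA)
y <- LETTERS[1:5]
df <- data.frame(x, y, stringsAsFactors = TRUE)

levels(df$x)   # only the values that actually occur in x
levels(df$y)   # all five letters

# A partial assignment into a factor keeps the target's existing levels,
# so values outside those levels cannot be stored.
# Converting the target column to character first avoids the level check.
df$x <- as.character(df$x)
df[3:5, "x"] <- as.character(df[3:5, "y"])
df$x
```

Replacing the whole column, as in `df[,1] <- df[,2]`, swaps in the new factor together with its levels, which is why that case succeeds.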

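Finally, a note on debugging: when you are debugging a function, it is sometimes useful to run its statements line by line at the command line (e.g., in the global environment). Had you done that, you would have seen the warning above, whereas in the function call the warning is suppressed.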
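If a warning is not visible when a function runs inside a test, it can also be captured explicitly. The sketch below uses base R's withCallingHandlers(); the helper risky_assign() and the variable seen are hypothetical names introduced only for illustration.

```r
# Hypothetical helper that triggers the same kind of factor warning
risky_assign <- function() {
    f <- factor(c("A", "B", NA))
    f[3] <- "C"   # "C" is not a level of f, so R warns and stores NA
    f
}

# Collect warning messages instead of letting them go unnoticed
seen <- character(0)
result <- withCallingHandlers(
    risky_assign(),
    warning = function(w) {
        seen <<- c(seen, conditionMessage(w))  # record the message
        invokeRestart("muffleWarning")         # then continue normally
    })

seen              # the captured warning text
is.na(result[3])  # the assignment produced NA
```

Alternatively, setting `options(warn = 2)` turns every warning into an error, which makes the offending line stop the function immediately.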

Answer 2

The constraints of the test can be satisfied by reimplementing getMissingCounties as follows:

merge(ds[1:4], subset(subset(cs, PrimRec)[c(1, 3, 4)]), by="ZipCode")

Note: the ZipCode column is always emitted first, which differs from your expected result.
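A sketch of how the merge-based approach could be made to match the expected column order, using a cut-down copy of the test data. The intermediate names fillings and merged are assumptions for illustration. Like the one-liner above, this takes PrefName and CountyID for every row from cs rather than only for the rows where they are missing.

```r
ds <- data.frame(
    CityStateKey = c(1, 2, 3, 4),
    ZipCode      = c(11, 22, 33, 44),
    Business     = 1,
    Residential  = 1)

cs <- data.frame(
    ZipCode  = c(11, 22, 22, 33, 55),
    PrefName = c("One", "To", "Two", "Three", "Five"),
    CountyID = c(111, 222, 222, 333, 555),
    PrimRec  = c(TRUE, FALSE, TRUE, TRUE, TRUE),
    stringsAsFactors = FALSE)

# Keep only primary records and the columns needed for the fill
fillings <- cs[cs$PrimRec, c("ZipCode", "PrefName", "CountyID")]

# merge() defaults to an inner join, so ZipCode 44 (no match) is dropped
merged <- merge(ds, fillings, by = "ZipCode")

# merge() puts the key column first; restore the original column order
merged <- merged[, c("CityStateKey", "ZipCode", "Business",
                     "Residential", "PrefName", "CountyID")]
merged
```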

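But to answer the subassignment question: it breaks because the level sets of PrefName are not compatible between ds and cs. Either avoid using factors or relevel them. You probably missed R's warning about this because testthat somehow suppressed it.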
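If factors have to be kept, one possible way to make the level sets compatible is to rebuild the target factor with the union of both level sets before assigning. This is a sketch with two small stand-in factors a and b, not the columns of the original data frames.

```r
a <- factor(c("One", NA, NA))      # target: a single level, "One"
b <- factor(c("Two", "Three"))     # source: levels "Three" and "Two"

# Give the target every level that occurs in either factor
all_levels <- union(levels(a), levels(b))
a <- factor(a, levels = all_levels)

# Now the assignment finds a matching level for every value
a[2:3] <- as.character(b)
a
```

After the levels are widened, the assignment no longer generates NA values or the "invalid factor level" warning.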


Answer 1
1, accepted, 2014-09-26 21:14:48

Answer 2
-1, 2014-09-24 21:07:15
