
Assign values to a dataframe subset in R

I'm having trouble assigning a dataframe to a subset of another. In the example below, the line

ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")

only modifies one column instead of two. I would expect it to modify either no columns or both, not just one. I wrote the function to fill in the PrefName and CountyID columns in dataframe ds where they are NA by joining ds to another dataframe cs.

As you can see if you run it, the test fails because PrefName is not getting filled in. After a bit of debugging, I realized that join() is doing exactly what it is expected to do, but the actual assignment of the result of that join somehow drops PrefName back to NA.

# fully copy-paste-run-able (but broken) code                                                    
suppressMessages({                                                          
    library("plyr")                                                         
    library("methods")                                                      
    library("testthat")                                                     
}) 

# Fill in the missing PrefName/CountyIDs in delstat                         
#   - Find the missing values in Delstat                                    
#   - Grab the CityState Primary Record values                              
#   - Match on zipcode to fill in the holes in the delstat data             
#   - Remove any codes that could not be fixed                              
#   - @param ds: delstat dataframe with 6 columns (see test case)           
#   - @param cs: citystate dataframe with 6 columns (see test case) 
getMissingCounties <- function(ds, cs) {                                    

    if (any(is.na(ds$CountyID))) {   # only act when some CountyID is missing

        cavities <- which(is.na(ds$CountyID))                               
        fillings <- cs[cs$PrimRec==TRUE, c(1,3,4)]                          

        ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")

        ds <- ds[!is.na(ds$CountyID),]                                      
    }                                                                       

    return(ds)                                                              
}                                                                           
test_getMissingCounties <- function() {                                     

    ds <- data.frame(                                                       
        CityStateKey = c(1,     2,  3,  4  ),                               
        ZipCode      = c(11,    22, 33, 44 ),                               
        Business     = c(1,     1,  1,  1  ),                               
        Residential  = c(1,     1,  1,  1  ),                               
        PrefName     = c("One", NA , NA, NA),                               
        CountyID     = c(111,   NA,  NA, NA))                               

    cs <- data.frame(                                                       
        ZipCode      = c(11,    22,    22,    33,      55    ),             
        Name         = c("eh",  "eh?", "eh?", "eh!?",  "ah." ),             
        PrefName     = c("One", "To",  "Two", "Three", "Five"),             
        CountyID     = c(111,   222,   222,   333,     555   ),             
        PrimRec      = c(TRUE,  FALSE, TRUE,  TRUE,    TRUE  ),             
        CityStateKey = c(1,     2,     2,     3,       5     ))             

    expected <- data.frame(                                                 
        CityStateKey = c(1,     2,     3      ),                            
        ZipCode      = c(11,    22,    33     ),                            
        Business     = c(1,     1,     1      ),                            
        Residential  = c(1,     1,     1      ),                            
        PrefName     = c("One", "Two", "Three"),                            
        CountyID     = c(111,   222,   333    ))                            

    expect_equal(getMissingCounties(ds, cs), expected)                      
}

# run the test
test_getMissingCounties()

The results are:

CityStateKey ZipCode Business Residential PrefName CountyID
       1       11        1          1       One      111
       2       22        1          1      <NA>      222
       3       33        1          1      <NA>      333

Any ideas why PrefName is getting set to NA by the assignment, or how to do the assignment so I don't lose data?

The short answer is that you can avoid this problem by making sure that there are no factors in your data frames. You do this by using stringsAsFactors=FALSE in the call(s) to data.frame(...). Note that many of the data import functions, including read.table(...) and read.csv(...), also convert character to factor by default in R versions before 4.0.0 (from 4.0.0 on, stringsAsFactors defaults to FALSE). You can defeat this behavior the same way.
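As a minimal sketch of that fix, assuming a cut-down version of the question's columns (the small frames below are illustrative, and base R match() stands in for plyr's join()), character columns survive a row-subset assignment:

```r
# Build both frames with character (not factor) name columns.
ds <- data.frame(
    ZipCode  = c(11, 22, 33),
    PrefName = c("One", NA, NA),
    CountyID = c(111, NA, NA),
    stringsAsFactors = FALSE)

fillings <- data.frame(
    ZipCode  = c(22, 33),
    PrefName = c("Two", "Three"),
    CountyID = c(222, 333),
    stringsAsFactors = FALSE)

# Rows of ds that still need values
cavities <- which(is.na(ds$CountyID))

# Look up each missing row's ZipCode in fillings
idx <- match(ds$ZipCode[cavities], fillings$ZipCode)

# Assign both columns for the subset of rows; no factor levels are involved
ds[cavities, c("PrefName", "CountyID")] <- fillings[idx, c("PrefName", "CountyID")]

ds$PrefName
# [1] "One"   "Two"   "Three"
```

Because PrefName is a plain character vector on both sides, there is no level set to mismatch.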

This problem is actually quite subtle, and is also a good example of how R's "silent coercion" between data types creates all sorts of problems.

The data.frame(...) function converts any character vectors to factors by default. So in your code ds$PrefName is a factor with one level, and cs$PrefName is a factor with 5 levels. So in your assignment statement:

ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")

the 5th column on the LHS is a factor with 1 level, and the 5th column on the RHS is a factor with 5 levels.
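To see the mismatch directly, here is a small sketch that rebuilds just the two PrefName columns as factors. They are created explicitly with factor() so the sketch behaves the same on any R version; the names ds_name and cs_name are illustrative and not from the question.

```r
# PrefName as it appears in ds and cs when strings become factors
ds_name <- factor(c("One", NA, NA, NA))
cs_name <- factor(c("One", "To", "Two", "Three", "Five"))

nlevels(ds_name)
# [1] 1
nlevels(cs_name)
# [1] 5

# Levels the target factor does not know about
missing_levels <- setdiff(levels(cs_name), levels(ds_name))
missing_levels
# [1] "Five"  "Three" "To"    "Two"
```

Any value carrying one of these unknown levels has nowhere to go in the 1-level factor.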

Under some circumstances, when you assign a factor with more levels to a factor with fewer levels, values whose levels are missing from the target are set to NA. Consider this:

x <- c("A","B",NA,NA,NA)  # character vector          
y <- LETTERS[1:5]         # character vector
class(x); class(y)
# [1] "character"
# [1] "character"

df <- data.frame(x, y, stringsAsFactors = TRUE)  # x and y coerced to factor (the default before R 4.0.0)
sapply(df,class)          # df$x and df$y are factors
#        x        y 
# "factor" "factor" 

# assign rows 3:5 of col 2 to col 1
df[3:5,1] <- df[3:5,2]    # fails with a warning
# Warning message:
# In `[<-.factor`(`*tmp*`, iseq, value = 3:5) :
#   invalid factor level, NA generated
df                        # missing levels set to NA
#      x y
# 1    A A
# 2    B B
# 3 <NA> C
# 4 <NA> D
# 5 <NA> E

The example above is equivalent to your assignment statement. However, notice what happens if you assign all of column 2 to column 1.

# assign all of col 2 to col 1
df <- data.frame(x, y, stringsAsFactors = TRUE)
df[,1] <- df[,2]          # succeeds!!
df
#   x y
# 1 A A
# 2 B B
# 3 C C
# 4 D D
# 5 E E

This works.
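One way to check what differs between the two cases is to look at the levels of column 1 before and after the whole-column assignment. This is a sketch, with the factors built explicitly so that it does not depend on the stringsAsFactors default:

```r
x <- factor(c("A", "B", NA, NA, NA))   # levels: A, B
y <- factor(LETTERS[1:5])              # levels: A, B, C, D, E
df <- data.frame(x, y)

before <- levels(df$x)
df[, 1] <- df[, 2]                     # replace the entire column
after <- levels(df$x)

before
# [1] "A" "B"
after
# [1] "A" "B" "C" "D" "E"
```

The whole-column form swaps in the new factor together with its levels, whereas the row-subset form has to fit the new values into the old level set.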

Finally, a note on debugging: if you are debugging a function, sometimes it is useful to run through the statements line by line at the command line (e.g., in the global environment). If you did that, you would have gotten the warning above, whereas inside a function call the warnings are suppressed.
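Whatever the reason a warning goes unseen, it can also be surfaced from inside a function call. The sketch below uses tryCatch() to capture it; assign_subset is a hypothetical helper written for this illustration, not part of the question's code.

```r
# Reproduce the partial factor assignment inside a function
assign_subset <- function() {
    df <- data.frame(x = factor(c("A", "B", NA)),
                     y = factor(c("A", "B", "C")))
    df[3, "x"] <- df[3, "y"]   # "C" is not a level of x
    df
}

# tryCatch() hands the first warning to the handler instead of deferring it
msg <- tryCatch(
    { assign_subset(); "no warning" },
    warning = function(w) conditionMessage(w))
msg
```

Setting options(warn = 1) prints warnings as they occur, and options(warn = 2) turns them into errors, which makes the offending line easy to locate with traceback().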

The constraints of the test can be satisfied by reimplementing getMissingCounties with:

merge(ds[1:4], subset(cs, PrimRec)[c(1, 3, 4)], by="ZipCode")

Caveat: the ZipCode column is always emitted first, which differs from your expected result.
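If the column order matters, one possible way to restore it is to index the merged result by names(ds). This is a sketch using the question's test data; note that it takes PrefName and CountyID from cs for every row, not only the missing ones, which happens to agree with the test data.

```r
ds <- data.frame(
    CityStateKey = c(1, 2, 3, 4),
    ZipCode      = c(11, 22, 33, 44),
    Business     = c(1, 1, 1, 1),
    Residential  = c(1, 1, 1, 1),
    PrefName     = c("One", NA, NA, NA),
    CountyID     = c(111, NA, NA, NA),
    stringsAsFactors = FALSE)

cs <- data.frame(
    ZipCode      = c(11, 22, 22, 33, 55),
    Name         = c("eh", "eh?", "eh?", "eh!?", "ah."),
    PrefName     = c("One", "To", "Two", "Three", "Five"),
    CountyID     = c(111, 222, 222, 333, 555),
    PrimRec      = c(TRUE, FALSE, TRUE, TRUE, TRUE),
    CityStateKey = c(1, 2, 2, 3, 5),
    stringsAsFactors = FALSE)

# Inner join on ZipCode; merge() puts the key column first
merged <- merge(ds[1:4], subset(cs, PrimRec)[c(1, 3, 4)], by = "ZipCode")

# Put the columns back in the order used by ds
result <- merged[, names(ds)]
names(result)
```

Rows with no primary record in cs (ZipCode 44 here) drop out because merge() performs an inner join by default.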

But to answer the subassignment question: it breaks because the level sets of PrefName are incompatible between ds and cs. Either avoid using a factor or relevel them. You might have missed R's warning about this, because testthat was somehow suppressing warnings.
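If the columns have to stay factors, one possible approach is to extend the target's levels to the union of both level sets before the subset assignment. This is a sketch on a toy frame, not the asker's data:

```r
# Factors created explicitly so the sketch behaves the same on any R version
x <- factor(c("A", "B", NA, NA, NA))   # levels: A, B
y <- factor(LETTERS[1:5])              # levels: A, B, C, D, E
df <- data.frame(x, y)

# Append the levels that x is missing; existing values keep their codes
levels(df$x) <- union(levels(df$x), levels(df$y))

# The row-subset assignment now finds every incoming value among the levels
df[3:5, "x"] <- df[3:5, "y"]

as.character(df$x)
# [1] "A" "B" "C" "D" "E"
```

The same idea would apply to ds$PrefName: give it the levels of the incoming PrefName column first, then assign into the NA rows.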


 