Is MonetDB.R's `na.omit` broken?

I think there is a bug in the way MonetDB.R filters NAs; see the example code below.

A convenient utility function for running general SQL queries against monet.frame objects:

#' Apply general SQL queries to a monet.frame object and return the 
#' result in a new monet.frame.
#' 
#' @note Likely to break if  \code{attr(data, "query")} contains 
#'      LIMIT or OFFSET statements.
#' 
#' @param _data a monet.frame object
#' @param query an SQL query, using "_DATA_" as the placeholder for the
#'     name of the table underlying the \code{_data}-object.
#' @param keep_order should ORDER BY statements in the original query be kept? 
#'     Will break if columns in the ORDER BY statement are not in the returned 
#'     table.
#' @importFrom stringr str_extract_all 
#' @export   
transform.monet.frame <- function(`_data`, query, keep_order=TRUE, ...){
    stopifnot(require(stringr))
    # random alias for the subquery that wraps the original query
    nm <- paste(sample(letters, 15, replace=TRUE), collapse="")
    oldquery <- attr(`_data`, "query")
    orderby <- character(0)
    if(has_order <- grepl("ORDER BY", oldquery, ignore.case = TRUE)){
        pattern <- "(ORDER BY[[:space:]]+[[:alnum:]]+((,[[:space:]]*[[:alnum:]]+)*))"
        # extract the ORDER BY clause, then strip it from the subquery
        orderby <- str_extract_all(oldquery, regex(pattern, ignore_case = TRUE))[[1]]
        oldquery <- gsub(pattern, "", oldquery, ignore.case = TRUE)
    } 
    # substitute the placeholder with the aliased subquery
    query <- gsub("_DATA_", paste("(", oldquery, ") AS", nm), query)
    if(has_order && keep_order) query <- paste(query, orderby)
    monet.frame(attr(`_data`, "conn"), query)
}

Example:

# library(MonetDB.R); monetdb <- dbConnect( MonetDB.R(), ... etc
set.seed(1212)
tablename <- paste(sample(letters, 10), collapse="")
data  <- data.frame(x=rnorm(100), f=gl(2, 50))

# introduce some NAs ...
data$xna <- data$x
data$xna[1:10] <- NA

dbWriteTable(monetdb, tablename, data)
dm <- monet.frame(monetdb, tablename)   

str(na.omit(dm$xna))
# MonetDB-backed data.frame surrogate
# 1 column, 100 rows
# Query: SELECT xna FROM gcxinabtme WHERE (  NOT (('xna') IS NULL) ) 
# Columns: xna (numeric)

100 rows!?! It should be 90 ...
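As a sanity check that does not involve MonetDB at all, the same data in a plain data.frame gives the expected count. This is only a base R sketch; `local_data` and `n_kept` are names introduced here for illustration.

```r
# Rebuild the example data locally, without any database connection.
set.seed(1212)
local_data <- data.frame(x = rnorm(100), f = gl(2, 50))

# Introduce the same 10 NAs as in the example above.
local_data$xna <- local_data$x
local_data$xna[1:10] <- NA

# Base R na.omit() drops the NA entries, leaving 90 values.
n_kept <- length(na.omit(local_data$xna))
print(n_kept)
```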

nrow(transform(dm, "SELECT xna FROM _DATA_ WHERE (xna IS NOT NULL)"))
# 90 
## as it should be
nrow(transform(dm, "SELECT xna FROM _DATA_ WHERE ('xna' IS NOT NULL)"))
# 100
## so quoting the column name seems to mess this up..   

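I think I understand why column names need to be quoted (so that this also works for non-standard column names, right?), but why does it mess up the query result? Shouldn't these two queries be exactly equivalent? Also, if quoting column names really is necessary, why is the first occurrence of xna not quoted in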

# Query: SELECT xna FROM gcxinabtme WHERE (  NOT (('xna') IS NULL) ) 

I noticed this because it also makes other monet.frame methods misbehave, e.g.:

 quantile(dm$xna, na.rm=TRUE)
 # 0%        25%        50%        75%       100% 
 # NA -0.9974738 -0.3033412  0.4272321  2.6715264 

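EDITED TO ADD:

na.fail seems to be broken as well:

Instead of raising an error, it returns NULL when applied to a column that contains NAs, along with a cryptic warning that, at first glance, suggests there are no NAs at all: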

str(na.fail(dm$xna))
# NULL
# Warning message:
# In monet.frame.internal(attr(x, "conn"), nquery, .is.debug(x), nrow.hint = NA,  :
#   SELECT xna FROM gcxinabtme WHERE ( ('xna') IS NULL )  has zero-row result set.

If there are no NAs, na.fail() should return its argument unchanged according to the generic documentation, but it does not do that either:

str(na.fail(dm$x))
# NULL
# Warning message:
# In monet.frame.internal(attr(x, "conn"), nquery, .is.debug(x), nrow.hint = NA,  :
#   SELECT x FROM gcxinabtme WHERE ( ('x') IS NULL )  has zero-row result set.
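For comparison, the contract described in the generic documentation can be checked on plain vectors in base R. This is a minimal sketch with no MonetDB connection; `clean`, `with_na`, `unchanged` and `failed` are names made up for the illustration.

```r
clean   <- c(1.5, 2.0, 3.25)
with_na <- c(1.5, NA, 3.25)

# Without NAs, na.fail() returns its argument unchanged.
unchanged <- identical(na.fail(clean), clean)

# With NAs, na.fail() signals an error; record whether that happened.
failed <- tryCatch({
    na.fail(with_na)
    FALSE
}, error = function(e) TRUE)

print(c(unchanged = unchanged, failed = failed))
```

Both values come out TRUE for ordinary vectors, which is the behavior one would expect the monet.frame method to mirror.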

Column names should be quoted with double quotes. I will investigate why this is not happening.
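To illustrate the point about quoting: in standard SQL, single quotes delimit a string literal, so `'xna' IS NOT NULL` tests a constant string rather than the column and is true for every row. The base R sketch below mimics that difference and shows one way a query with double-quoted identifiers could be assembled. `quote_ident` and `na_omit_query` are hypothetical helpers written for this example, not functions from MonetDB.R, and the generated SQL has not been run against a MonetDB server here.

```r
# Mimic the two predicates on a small vector with two missing values.
xna <- c(NA, NA, 0.3, -1.2, 2.1)

# Column reference: the test is applied to each value in the column.
rows_column  <- sum(!is.na(xna))

# String literal: the test is applied to the constant "xna" for every row,
# so no row is ever filtered out.
rows_literal <- sum(rep(!is.na("xna"), length(xna)))

# Hypothetical helper: quote a name as an SQL identifier with double quotes,
# doubling any embedded double quote.
quote_ident <- function(name) {
    paste0('"', gsub('"', '""', name, fixed = TRUE), '"')
}

# Hypothetical helper: build the NA-filtering query for one column.
na_omit_query <- function(table, column) {
    paste0("SELECT ", quote_ident(column),
           " FROM ", quote_ident(table),
           " WHERE ", quote_ident(column), " IS NOT NULL")
}

sql <- na_omit_query("gcxinabtme", "xna")
print(c(column = rows_column, literal = rows_literal))
print(sql)
```

With double quotes the name is treated as an identifier, so the same quoting works for non-standard column names as well.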
