Is MonetDB.R's `na.omit` broken?

I think there's a bug in how MonetDB.R filters NAs; see the example code below.

Handy utility function for running general SQL queries on monet.frame objects:

#' Apply general SQL queries to a monet.frame object and return the 
#' result in a new monet.frame.
#' 
#' @note Likely to break if  \code{attr(data, "query")} contains 
#'      LIMIT or OFFSET statements.
#' 
#' @param _data a monet.frame object
#' @param query an SQL query, using "_DATA_" as the placeholder for the
#'     name of the table underlying the \code{_data}-object.
#' @param keep_order should ORDER BY statements in the original query be kept? 
#'     Will break if columns in the ORDER BY statement are not in the returned 
#'     table.
#' @importFrom stringr str_extract_all 
#' @export   
transform.monet.frame <- function(`_data`, query, keep_order=TRUE, ...){
    stopifnot(require(stringr))
    nm <- paste(sample(letters, 15, rep=TRUE), collapse="")
    oldquery <- attr(`_data`, "query")
    if(has_order <- grepl("ORDER BY", oldquery, ignore.case = TRUE)){
        pattern <- "(ORDER BY[[:space:]]+[[:alnum:]]+((,[[:space:]]*[[:alnum:]]+)*))"
        # stringr's ignore.case() is deprecated; regex(ignore_case = TRUE) replaces it
        orderby <- str_extract_all(oldquery, regex(pattern, ignore_case = TRUE))[[1]]
        oldquery <- gsub(pattern, "", oldquery, ignore.case = TRUE)
    } 
    query <- gsub("_DATA_", paste("(", oldquery, ") AS", nm), query)
    # scalar condition, so use && rather than the vectorised &
    if(has_order && keep_order) query <- paste(query, orderby)
    monet.frame(attr(`_data`, "conn"), query)
}

Example:

# library(MonetDB.R); monetdb <- dbConnect( MonetDB.R(), ... etc
set.seed(1212)
tablename <- paste(sample(letters, 10), collapse="")
data  <- data.frame(x=rnorm(100), f=gl(2, 50))

# introduce some NAs ...
data$xna <- data$x
data$xna[1:10] <- NA

dbWriteTable(monetdb, tablename, data)
dm <- monet.frame(monetdb, tablename)   

str(na.omit(dm$xna))
# MonetDB-backed data.frame surrogate
# 1 column, 100 rows
# Query: SELECT xna FROM gcxinabtme WHERE (  NOT (('xna') IS NULL) ) 
# Columns: xna (numeric)

100 rows!? It should be 90...

nrow(transform(dm, "SELECT xna FROM _DATA_ WHERE (xna IS NOT NULL)"))
# 90 
## as it should be
nrow(transform(dm, "SELECT xna FROM _DATA_ WHERE ('xna' IS NOT NULL)"))
# 100
## so quoting the column name seems to mess this up..   

I think I understand why quoting the column name is necessary (so that this also works for non-standard column names, right?), but why would it change the query result? Shouldn't these two queries be perfectly equivalent? Also, if quoting the column names really is necessary, why is the first occurrence of xna not quoted in:

# Query: SELECT xna FROM gcxinabtme WHERE (  NOT (('xna') IS NULL) ) 

I noticed this because it also makes other monet.frame methods behave unexpectedly, e.g.:

 quantile(dm$xna, na.rm=TRUE)
 # 0%        25%        50%        75%       100% 
 # NA -0.9974738 -0.3033412  0.4272321  2.6715264 

EDITED to add:

na.fail seems to be broken as well:

It does not raise an error but instead returns NULL when applied to a column holding NAs, with a cryptic warning that at first glance suggests there are, in fact, no NAs:

str(na.fail(dm$xna))
# NULL
# Warning message:
# In monet.frame.internal(attr(x, "conn"), nquery, .is.debug(x), nrow.hint = NA,  :
#   SELECT xna FROM gcxinabtme WHERE ( ('xna') IS NULL )  has zero-row result set.

If there are no NAs, na.fail() should return its argument unchanged according to the generic's documentation, but it doesn't do that either:

str(na.fail(dm$x))
# NULL
# Warning message:
# In monet.frame.internal(attr(x, "conn"), nquery, .is.debug(x), nrow.hint = NA,  :
#   SELECT x FROM gcxinabtme WHERE ( ('x') IS NULL )  has zero-row result set.
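
For comparison, here is a small sketch of how the default methods behave on plain numeric vectors in base R, which is presumably the behaviour the monet.frame methods are meant to mirror. The vectors `x` and `xna` are stand-ins built the same way as the columns in the example above; they are ordinary R vectors, not monet.frame objects.

```r
set.seed(1212)
x <- rnorm(100)
xna <- x
xna[1:10] <- NA

# no NAs: the argument comes back unchanged
identical(na.fail(x), x)
# TRUE

# NAs present: na.fail() raises an error, caught here so the script keeps running
failed <- inherits(tryCatch(na.fail(xna), error = function(e) e), "error")
failed
# TRUE

# na.omit() drops the 10 missing values
length(na.omit(xna))
# 90
```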

The quoting of the column name should use double quotes. I will investigate why it is not doing so.
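
To spell out why the quote character matters: in standard SQL, single quotes delimit string literals and double quotes delimit identifiers. `'xna' IS NULL` therefore tests the constant string `'xna'`, which is never NULL, so the `NOT (...)` filter keeps all 100 rows and the `na.fail` check always returns zero rows. The following is only a sketch of what double-quoted identifier handling could look like; `quote_ident` and `na_omit_query` are hypothetical helpers written for illustration and are not part of MonetDB.R.

```r
# Hypothetical helper (not from MonetDB.R): wrap an SQL identifier in
# double quotes, doubling any embedded double quote.
quote_ident <- function(x) {
    paste0('"', gsub('"', '""', x, fixed = TRUE), '"')
}

# Hypothetical helper: build the query that na.omit would need for one column.
na_omit_query <- function(table, column) {
    paste0("SELECT ", quote_ident(column),
           " FROM ", quote_ident(table),
           " WHERE ", quote_ident(column), " IS NOT NULL")
}

cat(na_omit_query("gcxinabtme", "xna"), "\n")
# SELECT "xna" FROM "gcxinabtme" WHERE "xna" IS NOT NULL
```

With double quotes the WHERE clause refers to the column itself, so non-standard column names still work and the NULL test applies to the data rather than to a string constant.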
