Reading file with missing values in R
I have a file (its name is stored in fn) with the contents shown below, which I read with the read.table call that follows:
age       CALCIUM   CREATININE   GLUCOSE
64.3573             1.1          488
69.9043   8.1       1.1          472
65.6633   8.6       0.8          461
50.3693   8.1       1.3          418
57.0334   8.7       0.8          NEG
81.4939             1.1          NEG
56.954    9.8       1
76.9298   9.1       0.8          NEG
> tmpData = read.table(fn, header = TRUE, sep= "\t" , na.strings = c('', 'NA', '<NA>'), blank.lines.skip = TRUE)
> tmpData
      age CALCIUM CREATININE GLUCOSE
1 64.3573      NA        1.1     488
2 69.9043     8.1        1.1     472
3 65.6633     8.6        0.8     461
4 50.3693     8.1        1.3     418
5 57.0334     8.7        0.8     NEG
6 81.4939      NA        1.1     NEG
7 56.9540     9.8        1.0    <NA>
8 76.9298     9.1        0.8     NEG
The file is read as shown above, with missing values displayed as NA and <NA>. I guess that the GLUCOSE column is being treated as a factor. Is there an easy way to interpret <NA> as a real NA and to convert any non-numeric values into NA (in this example, NEG into NA)?
You can take advantage of the fact that as.numeric will coerce non-numeric values to NA. In other words, try something like this:
Here's your data:
temp <- structure(list(age = c(64.3573, 69.9043, 65.6633, 50.3693, 57.0334,
81.4939, 56.954, 76.9298), CALCIUM = c(1.1, 8.1, 8.6, 8.1, 8.7,
1.1, 9.8, 9.1), CREATININE = c(NA, 1.1, 0.8, 1.3, 0.8, NA, 1,
0.8), GLUCOSE = structure(c(5L, 4L, 3L, 2L, 6L, 6L, 1L, 6L), .Label = c("",
"418", "461", "472", "488", "NEG"), class = "factor")), .Names = c("age",
"CALCIUM", "CREATININE", "GLUCOSE"), class = "data.frame", row.names = c(NA,
-8L))
And its current structure:
str(temp)
# 'data.frame': 8 obs. of 4 variables:
#  $ age       : num 64.4 69.9 65.7 50.4 57 ...
#  $ CALCIUM   : num 1.1 8.1 8.6 8.1 8.7 1.1 9.8 9.1
#  $ CREATININE: num NA 1.1 0.8 1.3 0.8 NA 1 0.8
#  $ GLUCOSE   : Factor w/ 6 levels "","418","461",..: 5 4 3 2 6 6 1 6
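Convert that last column to numeric. Since it's a factor, we need to convert it to character first; calling as.numeric directly on a factor would return the internal level codes rather than the values. Note the warning. We're actually happy about that, because it tells us the non-numeric entries were turned into NA.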
temp$GLUCOSE <- as.numeric(as.character(temp$GLUCOSE))
# Warning message:
# NAs introduced by coercion
The result:
temp
#       age CALCIUM CREATININE GLUCOSE
# 1 64.3573     1.1         NA     488
# 2 69.9043     8.1        1.1     472
# 3 65.6633     8.6        0.8     461
# 4 50.3693     8.1        1.3     418
# 5 57.0334     8.7        0.8      NA
# 6 81.4939     1.1         NA      NA
# 7 56.9540     9.8        1.0      NA
# 8 76.9298     9.1        0.8      NA
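To see why the intermediate as.character step matters, here is a minimal sketch on a stand-alone factor. The glucose vector is a hypothetical copy of the GLUCOSE column, built with the same level order as in the data above; it does not modify temp.

```r
# Hypothetical factor mirroring the GLUCOSE column (levels given explicitly
# so the codes do not depend on locale sort order)
glucose <- factor(
  c("488", "472", "461", "418", "NEG", "NEG", "", "NEG"),
  levels = c("", "418", "461", "472", "488", "NEG")
)

# Direct conversion returns the internal level codes, not the values
codes <- as.numeric(glucose)

# Going through character recovers the numbers; "NEG" and "" become NA.
# suppressWarnings() hides the expected coercion warning.
values <- suppressWarnings(as.numeric(as.character(glucose)))

print(codes)   # 5 4 3 2 6 6 1 6
print(values)  # 488 472 461 418 NA NA NA NA
```

The codes are just positions in the level list, which is why skipping as.character silently gives plausible-looking but wrong numbers.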
For fun, here's a little function I put together that provides an alternative approach:
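makemeNA <- function(mydf, NAStrings, fixed = TRUE) {
  if (!isTRUE(fixed)) {
    # Regex mode: strip every match, then treat emptied values as NA
    mydf[] <- lapply(mydf, function(x) gsub(NAStrings, "", x))
    NAStrings <- ""
  }
  # Re-infer each column's type, mapping NAStrings to NA.
  # as.is is given explicitly (newer R versions expect it);
  # TRUE keeps any leftover text as character rather than factor.
  mydf[] <- lapply(mydf, function(x) type.convert(
    as.character(x), na.strings = NAStrings, as.is = TRUE))
  mydf
}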
This function lets you specify a regular expression to identify what should be an NA value. With fixed = FALSE, text matching the expression is removed with gsub, and any value left empty becomes NA. I haven't really tested it much, so use the regex feature at your own risk!
Using the same "temp" object as above, try these out to see what the function does:
# Change anything that is just text to NA
makemeNA(temp, "[A-Za-z]", fixed = FALSE)
# Change any exact matches with "NEG" to NA
makemeNA(temp, "NEG")
# Change any matches with 3-digit integers to NA
makemeNA(temp, "^[0-9]{3}$", fixed = FALSE)
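A further option, not part of the answer above, is to declare the marker at read time so the column is parsed as numeric from the start. The sketch below assumes the non-numeric markers (here only "NEG") are known in advance; the raw string is a hypothetical inline copy of a few rows of the tab-separated file, read through the text argument of read.table instead of a file name.

```r
# Hypothetical inline copy of some rows of the tab-separated file;
# two consecutive tabs mark an empty (missing) CALCIUM field
raw <- paste(
  "age\tCALCIUM\tCREATININE\tGLUCOSE",
  "64.3573\t\t1.1\t488",
  "69.9043\t8.1\t1.1\t472",
  "57.0334\t8.7\t0.8\tNEG",
  "81.4939\t\t1.1\tNEG",
  sep = "\n"
)

# Listing "NEG" in na.strings turns it into NA while reading,
# so GLUCOSE is inferred as a numeric (integer) column, not a factor
dat <- read.table(text = raw, header = TRUE, sep = "\t",
                  na.strings = c("", "NA", "NEG"))

str(dat)
```

Note that the <NA> seen in printed output is only how R displays a missing value in a factor or character column; it is already a real NA, so "<NA>" does not need to be listed in na.strings unless that literal text appears in the file. If the set of markers is not known beforehand, the as.numeric(as.character(...)) route is the more general one.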