Reading a file with missing values in R

Question

I have a file, whose name is stored in fn, with the following contents:

age CALCIUM CREATININE  GLUCOSE
64.3573     1.1 488
69.9043 8.1 1.1 472
65.6633 8.6 0.8 461
50.3693 8.1 1.3 418
57.0334 8.7 0.8 NEG
81.4939     1.1 NEG
56.954  9.8 1   
76.9298 9.1 0.8 NEG


> tmpData = read.table(fn, header = TRUE,  sep= "\t" , na.strings = c('', 'NA', '<NA>'),  blank.lines.skip = TRUE)
> tmpData
      age CALCIUM CREATININE GLUCOSE
1 64.3573            NA        1.1     488
2 69.9043           8.1        1.1     472
3 65.6633           8.6        0.8     461
4 50.3693           8.1        1.3     418
5 57.0334           8.7        0.8     NEG
6 81.4939            NA        1.1     NEG
7 56.9540           9.8        1.0    <NA>
8 76.9298           9.1        0.8     NEG

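When the file is read as above, the missing values are replaced with NA and <NA>. I guess the GLUCOSE column is being treated as a factor. Is there a simple way to have <NA> interpreted as a real NA and to convert any non-numeric value to NA (in this example, NEG should become NA)?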

Answer 1

You can take advantage of the fact that as.numeric coerces non-numeric values to NA. In other words, try something like this.

Here is your data:

temp <- structure(list(age = c(64.3573, 69.9043, 65.6633, 50.3693, 57.0334, 
  81.4939, 56.954, 76.9298), CALCIUM = c(1.1, 8.1, 8.6, 8.1, 8.7, 
  1.1, 9.8, 9.1), CREATININE = c(NA, 1.1, 0.8, 1.3, 0.8, NA, 1, 
  0.8), GLUCOSE = structure(c(5L, 4L, 3L, 2L, 6L, 6L, 1L, 6L), .Label = c("", 
  "418", "461", "472", "488", "NEG"), class = "factor")), .Names = c("age", 
  "CALCIUM", "CREATININE", "GLUCOSE"), class = "data.frame", row.names = c(NA, 
  -8L))

And its current structure:

str(temp)
# 'data.frame':  8 obs. of  4 variables:
# $ age       : num  64.4 69.9 65.7 50.4 57 ...
# $ CALCIUM   : num  1.1 8.1 8.6 8.1 8.7 1.1 9.8 9.1
# $ CREATININE: num  NA 1.1 0.8 1.3 0.8 NA 1 0.8
# $ GLUCOSE   : Factor w/ 6 levels "","418","461",..: 5 4 3 2 6 6 1 6

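Convert the last column to numeric. Since it is a factor, we need to convert it to character first. Note the warning; in this case it is exactly what we want.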

temp$GLUCOSE <- as.numeric(as.character(temp$GLUCOSE))
# Warning message:
# NAs introduced by coercion

The result:

temp
#       age CALCIUM CREATININE GLUCOSE
# 1 64.3573     1.1         NA     488
# 2 69.9043     8.1        1.1     472
# 3 65.6633     8.6        0.8     461
# 4 50.3693     8.1        1.3     418
# 5 57.0334     8.7        0.8      NA
# 6 81.4939     1.1         NA      NA
# 7 56.9540     9.8        1.0      NA
# 8 76.9298     9.1        0.8      NA
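
The detour through as.character matters because calling as.numeric directly on a factor returns its internal integer level codes rather than the values printed on screen. A minimal sketch, using a made-up factor g that only resembles the GLUCOSE column:

```r
# Hypothetical factor resembling the GLUCOSE column (not the real data)
g <- factor(c("488", "472", "NEG", ""))

# Direct conversion returns the internal level codes, not the values
codes <- as.numeric(g)

# Going through character recovers the values; non-numeric strings become NA.
# suppressWarnings() hides the expected "NAs introduced by coercion" warning.
values <- suppressWarnings(as.numeric(as.character(g)))

print(codes)
print(values)
```

Here values holds 488 and 472 followed by two NAs, while codes holds small integers that have nothing to do with the glucose readings.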

For fun, I put together a small function that offers another approach:

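makemeNA <- function (mydf, NAStrings, fixed = TRUE) {
  if (!isTRUE(fixed)) {
    # Regex mode: blank out every match, then treat empty strings as NA
    mydf[] <- lapply(mydf, function(x) gsub(NAStrings, "", x))
    NAStrings <- ""
  }
  # Re-infer each column's type, reading the NAStrings values as NA
  mydf[] <- lapply(mydf, function(x) type.convert(
    as.character(x), na.strings = NAStrings))
  mydf
}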

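This function lets you specify a regular expression to identify what should become NA. I haven't really tested it, so use the regular expression feature at your own risk!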
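
To make the two modes concrete, here is a rough sketch of what happens to a single column in each case, applied to a made-up character vector x rather than to the data frame. Unlike the function above, it passes as.is = TRUE so that any text results would stay as character rather than becoming factors.

```r
# Hypothetical column contents (not the real data)
x <- c("488", "NEG", "", "472")

# Exact-match mode: "NEG" is declared as an NA string
exact <- type.convert(x, na.strings = "NEG", as.is = TRUE)

# Regex mode: strip every letter, then declare the empty string as NA
stripped <- gsub("[A-Za-z]", "", x)
regex <- type.convert(stripped, na.strings = "", as.is = TRUE)

print(exact)
print(regex)
```

In both cases the NEG and blank entries end up as NA and the column becomes numeric.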

Using the same temp object from above, try the following to see what the function does:

# Change anything that is just text to NA
makemeNA(temp, "[A-Za-z]", fixed = FALSE)
# Change any exact matches with "NEG" to NA
makemeNA(temp, "NEG")
# Change any matches with 3-digit integers to NA
makemeNA(temp, "^[0-9]{3}$", fixed = FALSE)
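
If NEG is the only non-numeric marker in the file, another option is to declare it as an NA string when reading, so that GLUCOSE never becomes a factor in the first place. A minimal sketch, assuming the file really is tab-separated with empty fields for missing values; the inline txt string is a made-up stand-in for a few rows of the file:

```r
# Hypothetical stand-in for the file: tab-separated, empty fields are missing
txt <- paste(
  "age\tCALCIUM\tCREATININE\tGLUCOSE",
  "64.3573\t\t1.1\t488",
  "57.0334\t8.7\t0.8\tNEG",
  "56.954\t9.8\t1\t",
  sep = "\n"
)

# Listing "NEG" in na.strings makes read.table treat it as missing,
# so every column can be read as a number
dat <- read.table(text = txt, header = TRUE, sep = "\t",
                  na.strings = c("", "NA", "NEG"))

str(dat)
```

With a real file, text = txt would be replaced by the file name fn. This only covers markers you know in advance; the as.numeric(as.character(...)) route above catches any non-numeric value.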

Solution 1
4 2013-02-15 16:00:15
