
Replace NA or else <NA> with something or something else in column of data frame

I have read what seem to be related posts, but I am evidently too much of a noob to understand them or make anything work...

> df
  ID Area Address
1 NA    1    lane
2 11   NA    road
3 12    2    blvd
4 13    5    <NA>

> str(df)
'data.frame':   4 obs. of  3 variables:
 $ ID     : int  NA 11 12 13
 $ Area   : int  1 NA 2 5
 $ Address: Factor w/ 3 levels "blvd","lane",..: 2 3 1 NA

I want to be able -- not just for the data frame above, but for larger data frames having many more rows and many more columns -- to replace, in whatever columns I choose (which I reference by column name), all occurrences of

<NA>

with an element of my choosing from

<NA> , NA, "foo", "", 0

and whatever performs the replacement should not break or give errors when there is no

<NA>

to replace. Likewise, I want to perform an analogous replacement for

NA

in whatever columns I choose, without breakage or errors.

If there are technical reasons why I cannot do what I propose, then what can I do to come as close as possible to the above? I want to stick to data frames (converting to and from something else is OK if the answer is very explicit about how exactly to manage the conversions) and to preserve factors, in the sense that, for example, the Address column is a factor, so after the replacement it should still be a factor.

I expect there are technical reasons why I cannot do what I propose (I am confused to the point of asking the impossible), so I am hoping to come as close as reality permits, and that some kind soul will explain how close I can come to the above, as well as how exactly to get that close.

Please help (do not assume I can possibly understand without a detailed, explicit answer).

Thanks

A character string cannot be inserted into a numeric or integer vector without making the entire vector character, but we can insert a zero in place of NA, and we do that below. Also, for factors of the sort shown in the question, we insert fill (default "foo") as a new level in place of NA.
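The coercion described above can be seen in a small sketch. The vector x here is made up for illustration and is not part of the question's data:

```r
# Illustrative integer vector with one missing value (not from the question).
x <- c(1L, NA, 2L)

# Putting a string into an integer vector coerces every element to character.
as_text <- replace(x, is.na(x), "foo")
class(as_text)

# Putting an integer zero in keeps the vector integer.
as_int <- replace(x, is.na(x), 0L)
class(as_int)
```

This is why the numeric columns below get 0L rather than "foo" or "".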

1) df.orig, shown reproducibly at the end, has integer and factor columns. The following works for those as well as for double columns. For numeric columns (double and integer) we assign 0L so that integer columns are not changed to double; the 0L is automatically coerced to double for double columns. For factors having NA values, we add NA as the last level and then change its label to fill. We also check whether there are any NA levels and, if so, replace them with fill. One would not ordinarily find both situations. You will need to extend the code below if it is necessary to convert other classes not shown in the question.

df <- df.orig

# numeric (integer and double)
isNum <- sapply(df, is.numeric)
na2zero <- function(v, ...) replace(v, is.na(v), 0L)
df[isNum] <- lapply(df[isNum], na2zero)

# factor
isFactor <- sapply(df, is.factor)
na2fill <- function(v, fill = "foo", ...) { 
      # handle NA values
      if (any(is.na(v))) {
         v <- addNA(v)
         levels(v)[nlevels(v)] <- fill
      }
      # handle NA levels
      if (any(is.na(levels(v)))) levels(v)[is.na(levels(v))] <- fill
      v 
}
df[isFactor] <- lapply(df[isFactor], na2fill)

giving:

> df
  ID Area Address
1  0    1    lane
2 11    0    road
3 12    2    blvd
4 13    5     foo
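The question also asks for replacement only in columns chosen by name, each with its own replacement value. Here is a minimal sketch of one way to do that. The helper fill_na and the cols and fills objects are hypothetical names introduced here, and the factor branch appends a level instead of using addNA as in (1):

```r
# Same data as in the question, built directly.
df <- data.frame(
  ID      = c(NA, 11L, 12L, 13L),
  Area    = c(1L, NA, 2L, 5L),
  Address = factor(c("lane", "road", "blvd", NA))
)

# Hypothetical helper: replace NA in one column while keeping its class.
fill_na <- function(v, fill) {
  if (!anyNA(v)) return(v)                 # nothing to replace: return unchanged
  if (is.factor(v)) {
    # add the replacement as a level first so the assignment is valid
    if (!fill %in% levels(v)) levels(v) <- c(levels(v), fill)
    v[is.na(v)] <- fill
    v
  } else {
    replace(v, is.na(v), fill)
  }
}

# Columns to process, referenced by name, with one fill value per column.
cols  <- c("Area", "Address")
fills <- list(Area = 0L, Address = "foo")
df[cols] <- Map(fill_na, df[cols], fills[cols])
```

ID is not listed in cols, so its NA is left alone. Area stays integer and Address stays a factor.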

2) Alternatively, we could use S3 to do it more compactly, where na2zero and na2fill are from (1).

rmNA <- function(v, ...) UseMethod("rmNA")
rmNA.numeric <- na2zero
rmNA.factor <- na2fill
rmNA.default <- function(v, ...) v # do not process other classes

df <- df.orig
df[] <- lapply(df, rmNA)
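Because every method accepts `...`, a different fill can be passed through lapply, and a data frame with no NA at all should pass through unchanged. The sketch below restates the methods inline so that it runs on its own. The small data frames and the fill value "unknown" are made up for illustration:

```r
# Methods equivalent to those in (2), restated so this block is self-contained.
rmNA <- function(v, ...) UseMethod("rmNA")
rmNA.numeric <- function(v, ...) replace(v, is.na(v), 0L)
rmNA.factor <- function(v, fill = "foo", ...) {
  if (anyNA(v)) {
    v <- addNA(v)                      # NA becomes the last level
    levels(v)[nlevels(v)] <- fill      # relabel that level
  }
  v
}
rmNA.default <- function(v, ...) v     # leave other classes alone

# fill is forwarded to the factor method and ignored by the others.
df <- data.frame(ID = c(NA, 11L), Address = factor(c("lane", NA)))
df[] <- lapply(df, rmNA, fill = "unknown")

# A data frame without any NA is processed without error.
clean <- data.frame(ID = 1:2, Address = factor(c("a", "b")))
same <- clean
same[] <- lapply(same, rmNA)
```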

Note: df.orig in reproducible form is:

df.orig <- 
structure(list(ID = c(NA, 11L, 12L, 13L), Area = c(1L, NA, 2L, 
5L), Address = structure(c(2L, 3L, 1L, NA), .Label = c("blvd", 
"lane", "road"), class = "factor")), .Names = c("ID", "Area", 
"Address"), class = "data.frame", row.names = c("1", "2", "3", 
"4"))

If your data frame is named df, as you show in your question, just type:

df[is.na(df)] <- 0

Just be sure of the name of your data frame; if it is not df, replace df with the name you assigned to your data frame.
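One caveat with this one-liner: on a data frame that has a factor column containing NA, such as Address in the question, assigning 0 to that column produces an "invalid factor level" warning and leaves the NA in place. A hedged variant that applies the same idea only to the numeric columns (num_cols is a name introduced here) might look like this:

```r
# Same data as in the question, built directly.
df <- data.frame(
  ID      = c(NA, 11L, 12L, 13L),
  Area    = c(1L, NA, 2L, 5L),
  Address = factor(c("lane", "road", "blvd", NA))
)

# Restrict the replacement to numeric columns; factors are left untouched.
num_cols <- sapply(df, is.numeric)
df[num_cols][is.na(df[num_cols])] <- 0L
```

The factor column still needs separate handling, for example by adding a new level as the other answer does.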


 