
How to fill NA with median?

Example data:

set.seed(1)
df <- data.frame(years=sort(rep(2005:2010, 12)), 
                 months=1:12, 
                 value=c(rnorm(60),NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))

head(df)
  years months      value
1  2005      1 -0.6264538
2  2005      2  0.1836433
3  2005      3 -0.8356286
4  2005      4  1.5952808
5  2005      5  0.3295078
6  2005      6 -0.8204684

Please tell me how I can replace the NAs in df$value with the median of the corresponding month. "value" must contain the median of all previous values for the same month. That is, if the current month is May, "value" must contain the median of all previous values for the month of May.

Or with ave:

df <- data.frame(years=sort(rep(2005:2010, 12)),
months=1:12,
value=c(rnorm(60),NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
df$value[is.na(df$value)] <- with(df, ave(value, months, 
   FUN = function(x) median(x, na.rm = TRUE)))[is.na(df$value)]
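This works because ave returns a vector as long as its input, with every element replaced by its group's summary. Here is a minimal sketch on a toy data frame; the name toy and its values are made up for illustration:

```r
# Toy data: two months, one missing value in month 2 (hypothetical values)
toy <- data.frame(months = c(1, 2, 1, 2, 1, 2),
                  value  = c(10, 1, 20, 3, 30, NA))

# ave() returns one group median per row, aligned with the original rows
group_median <- ave(toy$value, toy$months,
                    FUN = function(x) median(x, na.rm = TRUE))

# Overwrite only the missing entries
missing <- is.na(toy$value)
toy$value[missing] <- group_median[missing]
toy$value
```

The final subsetting step is what leaves the observed values untouched.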

Since there are so many answers, let's see which is fastest.

library(plyr)
library(data.table)
library(rbenchmark)  # provides benchmark()

# Two-step plyr approach: compute per-month medians, then match them back
plyr2 <- function(df){
  medDF <- ddply(df,.(months),summarize,median=median(value,na.rm=TRUE))
  df$value[is.na(df$value)] <- medDF$median[match(df$months,medDF$months)][is.na(df$value)]
  df
}
DT <- data.table(df)
setkey(DT, months)


benchmark(ave = df$value[is.na(df$value)] <- 
  with(df, ave(value, months, 
               FUN = function(x) median(x, na.rm = TRUE)))[is.na(df$value)],
          tapply = df$value[61:72] <- 
            with(df, tapply(value, months, median, na.rm=TRUE)),
          sapply = df[61:72, 3] <- sapply(split(df[1:60, 3], df[1:60, 2]), median),
          plyr = ddply(df, .(months), transform, 
                       value=ifelse(is.na(value), median(value, na.rm=TRUE), value)),
          plyr2 = plyr2(df),
          data.table = DT[,value := ifelse(is.na(value), median(value, na.rm=TRUE), value), by=months],
          order = "elapsed")
        test replications elapsed relative user.self sys.self user.child sys.child
3     sapply          100   0.209 1.000000     0.196    0.000          0         0
1        ave          100   0.260 1.244019     0.244    0.000          0         0
6 data.table          100   0.271 1.296651     0.264    0.000          0         0
2     tapply          100   0.271 1.296651     0.256    0.000          0         0
5      plyr2          100   1.675 8.014354     1.612    0.004          0         0
4       plyr          100   2.075 9.928230     2.004    0.000          0         0

I would have bet that data.table was the fastest.

[Matthew Dowle] The task being timed here takes at most 0.02 seconds (2.075/100). data.table considers that insignificant. Try setting replications to 1 and increasing the data size instead. Timing the fastest of 3 runs is also a common rule of thumb.
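A rough sketch of that advice in base R, assuming a made-up data set of one million rows with about 10% missing values. The names big and fill_ave are hypothetical, and the timings will depend on your machine:

```r
# Larger, hypothetical data set: 1,000,000 rows, 10% of them set to NA
set.seed(1)
n <- 1e6
big <- data.frame(months = sample(1:12, n, replace = TRUE),
                  value  = rnorm(n))
big$value[sample(n, n / 10)] <- NA

# Fill NAs with the per-month median using ave()
fill_ave <- function(d) {
  med <- ave(d$value, d$months, FUN = function(x) median(x, na.rm = TRUE))
  miss <- is.na(d$value)
  d$value[miss] <- med[miss]
  d
}

# Time three single runs and keep the fastest elapsed time
timings <- vapply(1:3,
                  function(i) system.time(fill_ave(big))[["elapsed"]],
                  numeric(1))
fastest <- min(timings)
fastest
```

The same pattern can wrap any of the other approaches, so they are compared on data large enough for the differences to matter.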

You want to use the is.na function to test for missing values:

df$value[is.na(df$value)] <- median(df$value, na.rm=TRUE)

This says: for every element where df$value is NA, replace it with the right-hand side. You need the na.rm=TRUE piece, or else the median function will return NA.
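A quick check of the na.rm behaviour on a made-up vector:

```r
x <- c(1, NA, 3)
median(x)                # NA, because a missing value is present
median(x, na.rm = TRUE)  # 2, the median of the remaining values 1 and 3
```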

To do this month by month, there are many choices, but I think plyr has the simplest syntax:

library(plyr)
ddply(df, 
      .(months), 
      transform, 
      value=ifelse(is.na(value), median(value, na.rm=TRUE), value))

You can also use data.table. This is an especially good choice if your data is large:

library(data.table)
DT <- data.table(df)
setkey(DT, months)

DT[,value := ifelse(is.na(value), median(value, na.rm=TRUE), value), by=months]

There are many other ways, but those are two!

Here's the most robust solution I can think of. It ensures the years are ordered correctly, and it correctly computes the median over all previous values of each month even when several years have missing values.

# first, reshape your data so it is years by months:
library(reshape2)
tmp <- dcast(years ~ months, data=df)  # convert data to years x months
tmp <- tmp[order(tmp$years),]          # order years
# now calculate the running median on each month
library(caTools)
# function to replace NA with rolling median
tmpfun <- function(x) {
  ifelse(is.na(x), runquantile(x, k=length(x), probs=0.5, align="right"), x)
}
# apply tmpfun to each column and convert back to data.frame
tmpmed <- as.data.frame(lapply(tmp, tmpfun))
# reshape back to long and convert 'months' back to integer
res <- melt(tmpmed, "years", variable.name="months")
res$months <- as.integer(gsub("^X","",res$months))
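For comparison, the "median of all previous values for the same month" rule can also be sketched in base R without reshaping. The helper name fill_prev_median is hypothetical, and the sketch assumes that values filled in earlier count as "previous values" for later gaps, which is one possible reading of the question:

```r
# Replace each NA with the median of the earlier values for the same month.
# `fill_prev_median` is a hypothetical helper, not part of any package.
fill_prev_median <- function(d) {
  d <- d[order(d$years, d$months), ]           # chronological order
  for (i in which(is.na(d$value))) {
    earlier <- seq_len(i - 1)
    # earlier rows of the same month; values filled earlier are reused (assumption)
    prev <- d$value[earlier][d$months[earlier] == d$months[i]]
    prev <- prev[!is.na(prev)]
    if (length(prev) > 0) d$value[i] <- median(prev)  # otherwise stays NA
  }
  d
}

# Made-up example: one month observed over five years, two gaps
toy <- data.frame(years  = 2005:2009,
                  months = 5,
                  value  = c(1, 3, NA, 7, NA))
fill_prev_median(toy)$value
```

In the toy data the 2007 gap becomes median(1, 3) = 2, and the 2009 gap becomes median(1, 3, 2, 7) = 2.5.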

There is another way to do this with dplyr.

If you want to replace the NAs in every column with that column's median, do:

library(dplyr)
df %>% 
   mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))

If you want to do this for only a subset of columns (such as "value" in OP's example), do:

df %>% 
  mutate_at(vars(value), ~ifelse(is.na(.), median(., na.rm = TRUE), .))
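Note that both calls above use the overall column median. If the median should be taken per month, as in the question, one possible sketch is to group first; the toy data below is made up for illustration:

```r
library(dplyr)

# Toy data: two months, one missing value in month 2 (hypothetical values)
toy <- data.frame(months = c(1, 2, 1, 2, 1, 2),
                  value  = c(10, 1, 20, 3, 30, NA))

filled <- toy %>%
  group_by(months) %>%                 # medians are computed within each month
  mutate(value = ifelse(is.na(value), median(value, na.rm = TRUE), value)) %>%
  ungroup()
filled$value
```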

Sticking with base R, you can also try the following:

medians = sapply(split(df[1:60, 3], df[1:60, 2]), median)
df[61:72, 3] = medians
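The hard-coded row ranges work for the example data because the 12 missing rows happen to be in month order. A sketch that does not depend on row positions, on made-up toy data, looks up each missing row's median by month name instead:

```r
# Toy data with the gaps in arbitrary month order (hypothetical values)
toy <- data.frame(months = c(1, 2, 3, 1, 2, 3, 3, 1),
                  value  = c(10, 1, 5, 30, 3, 7, NA, NA))

# Per-month medians of the observed values; names are the month labels
obs <- !is.na(toy$value)
medians <- sapply(split(toy$value[obs], toy$months[obs]), median)

# Look up each missing row's median by month name rather than by position
toy$value[!obs] <- medians[as.character(toy$months[!obs])]
toy$value
```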

This is a way using plyr. It is not very pretty, but I think it does what you want:

library("plyr")

# Make a separate dataframe with month as first column and median as second:
medDF <- ddply(df,.(months),summarize,median=median(value,na.rm=TRUE))

# Replace `NA` values in `df$value` with medians from the second data frame
# match() here ensures that the medians are entered in the correct elements.
df$value[is.na(df$value)] <- medDF$median[match(df$months,medDF$months)][is.na(df$value)]
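To see what match() contributes, here is a small standalone illustration with made-up lookup values:

```r
lookup_months <- c(1, 2, 3)     # hypothetical lookup keys
lookup_median <- c(20, 2, 6)    # hypothetical medians for those keys
row_months    <- c(3, 1, 2, 3)  # month of each data row

# Position of each row's month within the lookup keys
idx <- match(row_months, lookup_months)
idx                  # 3 1 2 3
lookup_median[idx]   # 6 20 2 6
```

Indexing the medians with idx gives one median per data row, which is why the result can then be subset with is.na(df$value).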
