How can I fill in NA values based on the next real value but divide that value between the preceding NAs?

Please note: this is a highly simplified explanation of where the data comes from, but the source of the data is irrelevant to the coding question.

I have a data set created by collecting water in a tube every day. I can't go and measure the tube every day (but the tube keeps filling), so there are gaps in the water value records. This dummy data set shows where this has happened: on day 5 and on days 7 to 9. Because this is a dummy data set, I have assumed that 500 ml of water goes into the tube each day (the real data set is a lot messier!).

Dummy data:

day<-c(1,2,3,4,5,6,7,8,9,10,11,12)
value<-c(500,500,500,500,NA,1000,NA,NA,NA,2000,500,500)
df<-data.frame(day,value)

Data explanation: I collected every day for days 1 to 4, so the value for each day is 500 ml. I missed day 5, so the value is NA. I collected on day 6, so the value is 1000 ml (the water from days 5 and 6 combined). I missed days 7, 8 and 9, so those values are NA. I collected on day 10, giving a value of 2000 ml for the 4 days, and then collected every day for the last two days.

I would like to fill in the NA gaps by taking the value of the next 'real' measurement and dividing it equally between the NA days and the day of that measurement. Yes, I am assuming that when I have not made a measurement the process is constant, so I can divide that measurement equally between the days.
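As a small illustration of that rule, here is the arithmetic for the gap that ends on day 10 of the dummy data. The names `reading`, `n_missed` and `per_day` are made up for this sketch only.

```r
# Gap ending on day 10 of the dummy data
reading  <- 2000   # value measured on day 10
n_missed <- 3      # days 7, 8 and 9 are NA

# The reading is shared by the missed days plus the measured day
per_day <- reading / (n_missed + 1)
per_day
```

Each of days 7 to 10 would then receive `per_day`.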

This is what the output data should look like:

day<-c(1,2,3,4,5,6,7,8,9,10,11,12)
corrected.value<-c(500,500,500,500,500,500,500,500,500,500,500,500)
corrected.df<-data.frame(day,corrected.value)

Again, this is just a dummy data set; otherwise the easiest way would be to replace NA with 500 using ' value[is.na(value)] <- 500 '. In the real data set, though, the values can be 457.6, 779, 376, etc. I also tried to write a loop but keep getting stuck... Any ideas on how I can do this?

Help is greatly appreciated.

Here's a possible solution:

# Create test Data: 
# note that this is slightly different from your input
# but in this way you can better verify that it works as expected
day<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
value<-c(NA,500,500,500,NA,3000,NA,NA,NA,5000,500,500,NA,NA,NA)
df<-data.frame(day,value)


# "Cleansing" starts here :
RLE <- rle(is.na(df$value))

# we cannot do anything if last values are NAs, we'll just keep them in the data.frame
if(tail(RLE$values,1)){
  RLE$lengths <- head(RLE$lengths,-1)
  RLE$values <- head(RLE$values,-1)
}

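# index of the measurement that closes each run of NAs
afterNA <- cumsum(RLE$lengths)[RLE$values] + 1
# index of the first NA in each run
firstNA <- (cumsum(RLE$lengths)- RLE$lengths + 1)[RLE$values]
# number of days sharing the measurement (the NAs plus the measured day)
occurences <- afterNA - firstNA + 1
# equal share per day
replacements <- df$value[afterNA] / occurences

# overwrite each NA run and its closing measurement with the per-day share
df$value[unlist(Map(f=seq.int,firstNA,afterNA))] <- rep.int(replacements,occurences)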
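For comparison, the same idea can be sketched with a grouping vector instead of `rle()`. This is not part of the solution above, only an alternative under the same assumption (each measurement is shared equally with the NAs directly before it); the names `grp` and `filled` are introduced here for illustration.

```r
# Same test data as above
value <- c(NA, 500, 500, 500, NA, 3000, NA, NA, NA, 5000, 500, 500, NA, NA, NA)

# Counting measurements from the end gives every NA run the same id
# as the measurement that closes it; trailing NAs end up in group 0
grp <- rev(cumsum(rev(!is.na(value))))

# Within each group the last element is the measurement (or NA for the
# trailing group), so dividing it by the group size gives the per-day share
filled <- ave(value, grp,
              FUN = function(x) rep(x[length(x)] / length(x), length(x)))
filled
```

Trailing NAs stay NA because the last element of their group is itself NA.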

Result:

> df
   day value
1    1   250
2    2   250
3    3   500
4    4   500
5    5  1500
6    6  1500
7    7  1250
8    8  1250
9    9  1250
10  10  1250
11  11   500
12  12   500
13  13    NA
14  14    NA
15  15    NA
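One way to sanity-check a result like this is to confirm that redistributing the measurements did not change the total amount of water. The sketch below copies the input and output values shown above into two vectors, `original` and `filled`, whose names are chosen here for illustration.

```r
# Input values and the result printed above
original <- c(NA, 500, 500, 500, NA, 3000, NA, NA, NA, 5000, 500, 500, NA, NA, NA)
filled   <- c(250, 250, 500, 500, 1500, 1500, 1250, 1250, 1250, 1250, 500, 500, NA, NA, NA)

# The measurements are only spread over more days, so the totals should agree
total_before <- sum(original, na.rm = TRUE)
total_after  <- sum(filled, na.rm = TRUE)
isTRUE(all.equal(total_before, total_after))

# Only the trailing days, which have no later measurement, remain NA
which(is.na(filled))
```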
