Match dates from list of data frames in R

I have a list of 100+ time series data frames, my.list, with daily observations for each product in its own data frame. Some observations are missing, and the data frame has no row at all for that date. I would like to update each data frame in this list so that every date appears, with NA as the value when there is no record for that date.

Dates:

start = as.Date('2016/04/08')
full <- seq(start, by='1 days', length=10)

Sample time series data:

d1 <- data.frame(Date = seq(start, by ='2 days',length=5), Sales = c(5,10,15,20,25))
d2 <- data.frame(Date = seq(start, by= '1 day', length=10),Sales = c(1, 2, 3,4,5,6,7,8,9,10))
my.list <- list(d1, d2)

I want to merge all of the dates in full into each data frame, and if no match exists then Sales is NA:

   my.list

[[d1]]
Date    Sales
2016-04-08    5
2016-04-09    NA
2016-04-10    10
2016-04-11    NA
2016-04-12    15
2016-04-13    NA
2016-04-14    20
2016-04-15    NA
2016-04-16    25
2016-04-17    NA


[[d2]]
Date    Sales
2016-04-08    1
2016-04-09    2
2016-04-10    3
2016-04-11    4
2016-04-12    5
2016-04-13    6
2016-04-14    7
2016-04-15    8
2016-04-16    9
2016-04-17    10

If I understand correctly, the OP wants to update each of the data frames in my.list to contain one row for each date given in the vector of dates full.

Base R

In base R, merge() can be used, as already mentioned by Hack-R. However, the answer below expands this to work on all data frames in the list:

# create a data frame from the vector of full dates
full.df <- data.frame(Date = full)
# apply merge on each dataframe in the list
lapply(my.list, merge, y = full.df, all.y = TRUE)
[[1]]
         Date Sales
1  2016-04-08     5
2  2016-04-09    NA
3  2016-04-10    10
4  2016-04-11    NA
5  2016-04-12    15
6  2016-04-13    NA
7  2016-04-14    20
8  2016-04-15    NA
9  2016-04-16    25
10 2016-04-17    NA

[[2]]
         Date Sales
1  2016-04-08     1
2  2016-04-09     2
3  2016-04-10     3
4  2016-04-11     4
5  2016-04-12     5
6  2016-04-13     6
7  2016-04-14     7
8  2016-04-15     8
9  2016-04-16     9
10 2016-04-17    10
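The call above only prints the result. As a minimal sketch, the padded data frames can be written back into the list and checked. Naming the list elements d1 and d2 is an assumption made here for readability (the question's list is unnamed); lapply() carries the names over to the result.

```r
# Rebuild the sample data from the question
start <- as.Date("2016-04-08")
full <- seq(start, by = "1 day", length.out = 10)
d1 <- data.frame(Date = seq(start, by = "2 days", length.out = 5),
                 Sales = c(5, 10, 15, 20, 25))
d2 <- data.frame(Date = seq(start, by = "1 day", length.out = 10),
                 Sales = 1:10)

# Assumption: the list elements are named, which the original list is not
my.list <- list(d1 = d1, d2 = d2)

# Right join of each data frame onto the full date vector,
# overwriting the list with the padded versions
full.df <- data.frame(Date = full)
my.list <- lapply(my.list, merge, y = full.df, all.y = TRUE)

# Every element should now have one row per date in `full`
sapply(my.list, nrow)

# Dates without a record show up as NA in Sales
sum(is.na(my.list$d1$Sales))
```

Because merge() sorts by the key column by default, the rows come back in date order without an extra sorting step.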

Caveat

The answer assumes that full covers the overall range of Date across all data frames in the list. Because of all.y = TRUE, any row whose Date is not in full is dropped from the result.

To avoid any mishaps, the overall range of Date can be retrieved from the available data in my.list:

overall_date_range <- Reduce(range, lapply(my.list, function(x) range(x$Date)))
full <- seq(overall_date_range[1], overall_date_range[2], by = "1 days")
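The risk described in the caveat can be illustrated with a short sketch. It uses a deliberately too-short full (a hypothetical value, not the one from the question) and a simple coverage check before merging; the variable name covers_all is chosen here for illustration.

```r
# Rebuild the sample data from the question
start <- as.Date("2016-04-08")
d1 <- data.frame(Date = seq(start, by = "2 days", length.out = 5),
                 Sales = c(5, 10, 15, 20, 25))
d2 <- data.frame(Date = seq(start, by = "1 day", length.out = 10),
                 Sales = 1:10)
my.list <- list(d1, d2)

# Hypothetical `full` that stops before the last recorded dates
full <- seq(start, by = "1 day", length.out = 5)

# TRUE only if every Date in every data frame appears in `full`
covers_all <- all(sapply(my.list, function(x) all(x$Date %in% full)))
covers_all

# With all.y = TRUE, rows of d2 outside `full` are dropped without a warning
nrow(merge(d2, data.frame(Date = full), all.y = TRUE))
```

Running such a check before the merge makes the loss of rows visible instead of silent.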

Using rbindlist()

Alternatively, the list of data frames, which are identical in structure, can be stored in one large data frame. An additional column indicates which product each row belongs to. The homogeneous structure simplifies subsequent operations.

The code below uses the rbindlist() function from the data.table package to create a large data.table. CJ() (cross join) creates all combinations of dates and product ids, which are then joined with the data to fill in the missing dates:

library(data.table)
all_products <- rbindlist(my.list, idcol = "product.id")[
  CJ(product.id = unique(product.id), Date = seq(min(Date), max(Date), by = "1 day")), 
  on = .(Date, product.id)]
all_products
    product.id       Date Sales
 1:          1 2016-04-08     5
 2:          1 2016-04-09    NA
 3:          1 2016-04-10    10
 4:          1 2016-04-11    NA
 5:          1 2016-04-12    15
 6:          1 2016-04-13    NA
 7:          1 2016-04-14    20
 8:          1 2016-04-15    NA
 9:          1 2016-04-16    25
10:          1 2016-04-17    NA
11:          2 2016-04-08     1
12:          2 2016-04-09     2
13:          2 2016-04-10     3
14:          2 2016-04-11     4
15:          2 2016-04-12     5
16:          2 2016-04-13     6
17:          2 2016-04-14     7
18:          2 2016-04-15     8
19:          2 2016-04-16     9
20:          2 2016-04-17    10
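In the output above, product.id is just the position of each data frame in the list. If the list is named, rbindlist() uses the names as ids instead. The sketch below assumes the elements are named d1 and d2, which is not the case in the question's my.list.

```r
library(data.table)

# Rebuild the sample data from the question
start <- as.Date("2016-04-08")
d1 <- data.frame(Date = seq(start, by = "2 days", length.out = 5),
                 Sales = c(5, 10, 15, 20, 25))
d2 <- data.frame(Date = seq(start, by = "1 day", length.out = 10),
                 Sales = 1:10)

# Assumption: the list is named; the names become the product ids
my.list <- list(d1 = d1, d2 = d2)

# Same join as above: stack the data frames, then join onto
# all combinations of product id and date
all_products <- rbindlist(my.list, idcol = "product.id")[
  CJ(product.id = unique(product.id),
     Date = seq(min(Date), max(Date), by = "1 day")),
  on = .(Date, product.id)]

# product.id now holds the list names rather than 1 and 2
unique(all_products$product.id)
```

Named ids tend to be easier to read in grouped summaries than positional integers, especially with 100+ products.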

Subsequent operations can be grouped by product.id, e.g., to determine the number of valid sales records for each product:

all_products[!is.na(Sales), .(valid.sales.data = .N), by = product.id]
   product.id valid.sales.data
1:          1                5
2:          2               10

Or, the total sales per product:

all_products[, .(total.sales = sum(Sales, na.rm = TRUE)), by = product.id]
   product.id total.sales
1:          1          75
2:          2          55

If required for some reason, the result can be converted back to a list by

split(all_products, by = "product.id")
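By default, each piece returned by split() still carries the product.id column. As a sketch, the keep.by argument of data.table's split() method can drop it, so that each element has the same Date and Sales columns as the original data frames. The name product_list is chosen here for illustration.

```r
library(data.table)

# Rebuild the sample data and the combined table from above
start <- as.Date("2016-04-08")
d1 <- data.frame(Date = seq(start, by = "2 days", length.out = 5),
                 Sales = c(5, 10, 15, 20, 25))
d2 <- data.frame(Date = seq(start, by = "1 day", length.out = 10),
                 Sales = 1:10)
my.list <- list(d1, d2)

all_products <- rbindlist(my.list, idcol = "product.id")[
  CJ(product.id = unique(product.id),
     Date = seq(min(Date), max(Date), by = "1 day")),
  on = .(Date, product.id)]

# keep.by = FALSE removes the grouping column from each piece
product_list <- split(all_products, by = "product.id", keep.by = FALSE)

# The list is named after the product ids
names(product_list)

# Each element has only the original columns
names(product_list[[1]])
```

The elements are data.tables; as.data.frame() can be applied to each one if plain data frames are needed.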
