R data.table: referencing a data table column via an externally assigned column name

Question

I am reading in a large number of files with monthly price information per product.

I want to obtain a data table merging all these files.

The key of this table will be two columns: a product identifier and the date.

A third column then contains the retail price.

In the source files each price column has a name of the format RETAILPRICE_[dd.mm.yyyy].

To prevent my final data table from containing a large number of columns, I need to rename the column with the retail price and create a new column containing the date.

The following code runs into an error because data.table does not understand the external reference to one of its columns.

# this is how I obtain the list of files that have to be read in
# list the files
# files <- list.files(path = "path",
#                    pattern = "^Publications.*$",
#                    full.names = T)

# the data looks like this, although it is contained in an excel file.
# sample data
ProdID <- c(836187, 2398159, 2398165, 2398171, 2398188, 1800180, 2320105, 2320128, 2320140, 2320163, 1714888, 2516340)
RETAILPRICE_01.01.2003 <- c(12.50, 43.50, 65.50, 45.60, 69.45, 21.30, 81.15, 210.70, 405.00, 793.60, 116.50, 162.60)
Publications_per_2003.01.01 <- data.table(ProdID,RETAILPRICE_01.01.2003)

# uncomment if you want to write this to excel
# using .xls on purpose, because that's what they used back in the days
# xlsx::write.xlsx(Publications_per_2003.01.01,
#    "Publications_per_2003.01.01.xls",
#    row.names = F)
# files <- list.files(path = "path",
#                    pattern = "^Publications.*$",
#                    full.names = T)

# create data table
price_list <- data.table(
                 prodID = character(),
                 date = character(),
                 retail_price = numeric())


price_list <- lapply(files, function(x){

  # obtain date from file name
  # date in file name has the structure yyyy.mm.dd
  # while in the column name date has the structure
  # dd.mm.yyyy
  date <- substr(sapply(strsplit(x,"_"),"[",3),1,10)

  # obtain day, month and year separately
  day <- substr(date,9,10)
  month <- substr(date,6,7)
  year <- substr(date,1,4)

  # store the name of the column containing the retail price
  priceVar <- as.name(paste0("RETAILPRICE_",day,".",month,".",year))

  # read the xls file with the price info and in one go
  # keep only the relevant columns
  file <- data.table(read_excel(x))[
    ,.(prodID= as.character(ProdID),
       retail_price = priceVar,
       date = as.character(gsub("\\.","-",date)))#,with = F
    ]

  # merge the new file with the existing data table
  price_list <- merge(price_list,file,"ProdID")
})

This results in the error message:

Error in rep(x[[i]], length.out = mn) : 
  attempt to replicate an object of type 'symbol'

If I comment out the line

retail_price = priceVar,

there's no error.

So the problem lies in the reference to the column, which somehow is not working.

I also tried:

priceVar <- as.name(paste0("RETAILPRICE_",day,".",month,".",year))

file <- data.table(read_excel(x))

setnames(file, priceVar, "retail_price")

but I get the error (column name modified to fit the example):

Error in setnames(file, priceVar, "retail_price") : 
  Items of 'old' not found in column names: RETAILPRICE_dd.mm.yyyy.

If anyone could enlighten me I would be eternally grateful.

Answer 1

It would help if you provided a sample of the data you are working with, so we could run your code against it. Also, I read your code, and on this line:

price_list <- merge(prijslijst,file,"ProdID")

You never define the variable "prijslijst", so maybe the problem is there.

Answer 2

In this case it will be much easier to work with plain data frames rather than data.tables.

price_list <- lapply(files, function(x){
  date <- substr(sapply(strsplit(x,"_"),"[",3),1,10)

  day <- substr(date,9,10)
  month <- substr(date,6,7)
  year <- substr(date,1,4)

  # make it a character, not a name
  priceVar <- paste0("RETAILPRICE_",day,".",month,".",year)

  one_df <- readxl::read_excel(x)[, c("ProdID", priceVar)]
  colnames(one_df) <- c("prodID", "retail_price")
  one_df$prodID = as.character(one_df$prodID) # NB: as.integer would be much more efficient, but be careful for values above 2.0e9
  one_df$date = as.character(gsub("\\.","-",date))

  one_df
})

# Watch out: this will pile up the records from all files
# In your initial code you were using merge(...) which computes the intersection
price_list <- do.call(rbind, price_list)

# Optional:
data.table::setDT(price_list)
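
If you would rather stay in data.table throughout, the same idea (keep the column name as a character string, not a symbol) can be sketched roughly as below. This is only a sketch under assumptions: `sheets` and `tidy_one` are hypothetical names introduced for illustration, and the two small in-memory tables stand in for what `read_excel()` would return, so the example runs without any Excel files. `setnames()` is documented to take character names (or positions) for the old columns, so the string built with `paste0()` can be passed to it directly.

```r
library(data.table)

# Hypothetical stand-ins for two files that would normally be read
# with readxl::read_excel(); the list names play the role of file names.
sheets <- list(
  "Publications_per_2003.01.01.xls" = data.table(
    ProdID = c(836187, 2398159),
    RETAILPRICE_01.01.2003 = c(12.50, 43.50)
  ),
  "Publications_per_2003.02.01.xls" = data.table(
    ProdID = c(836187, 2398159),
    RETAILPRICE_01.02.2003 = c(12.75, 44.00)
  )
)

# Hypothetical helper: reshape one table to (prodID, date, retail_price).
tidy_one <- function(file_name, dt) {
  # third "_"-separated piece of the file name starts with yyyy.mm.dd
  file_date <- substr(strsplit(basename(file_name), "_")[[1]][3], 1, 10)

  # column name as a plain character string, in dd.mm.yyyy order
  price_col <- paste0("RETAILPRICE_",
                      substr(file_date, 9, 10), ".",
                      substr(file_date, 6, 7), ".",
                      substr(file_date, 1, 4))

  dt <- copy(dt)  # avoid renaming the caller's table by reference
  setnames(dt, c("ProdID", price_col), c("prodID", "retail_price"))
  dt[, prodID := as.character(prodID)]
  dt[, date := gsub(".", "-", file_date, fixed = TRUE)]
  dt[, .(prodID, date, retail_price)]
}

# Stack the per-file tables (rows are appended, as with rbind above)
price_list <- rbindlist(Map(tidy_one, names(sheets), sheets))
setkey(price_list, prodID, date)
```

With real files, the second argument of `tidy_one` would be something like `as.data.table(readxl::read_excel(file_name))`. Inside `j`, `get(price_col)` is another way to look up a column from a character name if you prefer not to rename.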
