R data.table: referring to data table column with externally assigned column name

I am reading in a large number of files with monthly price information per product.

I want to obtain a data table merging all these files.

The key of this table is going to be the two columns holding the product identifier and the date.

A third column then contains the retail price.

In the source files each price column has a name of the format RETAILPRICE_[dd.mm.yyyy].

To prevent my final data table from containing a large number of columns, I need to rename the retail price column and create a new column containing the date.

The following code runs into an error because data.table does not understand the external reference to one of its columns.

# this is how I obtain the list of files that have to be read in
# list the files
# files <- list.files(path = "path",
#                    pattern = "^Publications.*$",
#                    full.names = T)

# the data looks like this, although it is contained in an excel file.
# sample data
library(data.table)
library(readxl)
# plain vectors (not a list) so that ProdID becomes an ordinary column
ProdID <- c(836187, 2398159, 2398165, 2398171, 2398188, 1800180, 2320105, 2320128, 2320140, 2320163, 1714888, 2516340)
RETAILPRICE_01.01.2003 <- c(12.50, 43.50, 65.50, 45.60, 69.45, 21.30, 81.15, 210.70, 405.00, 793.60, 116.50, 162.60)
Publications_per_2003.01.01 <- data.table(ProdID,RETAILPRICE_01.01.2003)

# uncomment if you want to write this to excel
# using .xls on purpose, because that's what they used back in the days
# xlsx::write.xlsx(Publications_per_2003.01.01,
#    "Publications_per_2003.01.01.xls",
#    row.names = F)
# files <- list.files(path = "path",
#                    pattern = "^Publications.*$",
#                    full.names = T)

# create data table
price_list <- data.table(
                 prodID = character(),
                 date = character(),
                 retail_price = numeric())


price_list <- lapply(files, function(x){

  # obtain date from file name
  # date in file name has the structure yyyy_mm_dd
  # while in the column name date has the structure
  # dd.mm.yyyy
  date <- substr(sapply(strsplit(x,"_"),"[",3),1,10)

  # obtain day, month and year separately
  day <- substr(date,9,10)
  month <- substr(date,6,7)
  year <- substr(date,1,4)

  # store the name of the column containing the retail price
  priceVar <- as.name(paste0("RETAILPRICE_",day,".",month,".",year))

  # read the xls file with the price info and in one go
  # keep only the relevant columns
  file <- data.table(read_excel(x))[
    ,.(prodID= as.character(ProdID),
       retail_price = priceVar,
       date = as.character(gsub("\\.","-",date)))#,with = F
    ]

  # merge the new file with the existing data table
  price_list <- merge(price_list,file,"ProdID")
})

This results in the error message:

Error in rep(x[[i]], length.out = mn) : 
  attempt to replicate an object of type 'symbol'

If I comment out the part

retail_price = priceVar,

there's no error.

So the problem lies in the reference to the column, which somehow is not working.

I also tried

priceVar <- as.name(paste0("RETAILPRICE_",day,".",month,".",year))

file <- data.table(read_excel(x))

setnames(file, priceVar, "retail_price")

but I get the error (column name modified to fit the example):

Error in setnames(file, priceVar, "retail_price") : 
  Items of 'old' not found in column names: RETAILPRICE_dd.mm.yyyy.

If anyone could enlighten me I would be eternally grateful.

It would be nice if you provided a sample of the data you're working with so we could try your code on it. Also, I read your code, and on this line:

price_list <- merge(prijslijst,file,"ProdID")

You never mentioned the variable "prijslijst", so maybe the problem is there.

In this case it will be much easier to work with plain data frames rather than data.tables.

price_list <- lapply(files, function(x){
  date <- substr(sapply(strsplit(x,"_"),"[",3),1,10)

  day <- substr(date,9,10)
  month <- substr(date,6,7)
  year <- substr(date,1,4)

  # make it a character, not a name
  priceVar <- paste0("RETAILPRICE_",day,".",month,".",year)

  one_df <- readxl::read_excel(x)[, c("ProdID", priceVar)]
  colnames(one_df) <- c("prodID", "retail_price")
  one_df$prodID = as.character(one_df$prodID) # NB: as.integer would be much more efficient, but be careful for values above 2.0e9
  one_df$date = as.character(gsub("\\.","-",date))

  one_df
})

# Watch out: this will pile up the records from all files
# In your initial code you were using merge(...) which computes the intersection
price_list <- do.call(rbind, price_list)

# Optional:
data.table::setDT(price_list)
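
If you would rather stay inside data.table, here is a minimal sketch of the same idea. It keeps priceVar as a character string and looks the column up with get() inside j, then stacks the pieces with rbindlist(). The `sheets` list is a hypothetical stand-in for the Excel files (so the example runs without read_excel), and its file names and prices are made up for illustration.

```r
library(data.table)

# Hypothetical stand-ins for two files that have already been read in;
# the list names mimic the file naming used in the question.
sheets <- list(
  "Publications_per_2003.01.01.xls" = data.table(
    ProdID = c(836187, 2398159),
    RETAILPRICE_01.01.2003 = c(12.50, 43.50)
  ),
  "Publications_per_2003.02.01.xls" = data.table(
    ProdID = c(836187, 2398159),
    RETAILPRICE_01.02.2003 = c(12.75, 44.00)
  )
)

price_list <- rbindlist(lapply(names(sheets), function(x) {
  # "Publications_per_2003.01.01.xls" -> "2003.01.01"
  date <- substr(strsplit(x, "_")[[1]][3], 1, 10)

  # column name as a character string: RETAILPRICE_dd.mm.yyyy
  priceVar <- paste0("RETAILPRICE_",
                     substr(date, 9, 10), ".",
                     substr(date, 6, 7), ".",
                     substr(date, 1, 4))

  # get() resolves the string to the column of that name inside j
  sheets[[x]][, .(prodID = as.character(ProdID),
                  retail_price = get(priceVar),
                  date = gsub(".", "-", date, fixed = TRUE))]
}))

# key on product identifier and date, as described in the question
setkey(price_list, prodID, date)
```

In the original attempt, priceVar held a symbol, so j received the symbol object itself rather than the column it names, which is consistent with the "object of type 'symbol'" error. setnames() likewise expects the old names as a character vector, so setnames(file, priceVar, "retail_price") should be given the string version of priceVar.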
