
R: how to loop over CSV files and extract information from each file by matching rows in a reference data table

I have a reference table of stock symbols (20,000 rows)


and a folder of CSV files, each named after a stock symbol, for example ZTS.csv. Each CSV file contains the price history of that symbol.


The end goal is to track and visualise the performance of all stocks. Because of the sheer size of the reference table and the CSV files, I think the most sensible approach is to select the needed information from each CSV file and add it to the reference table.

For example, I would want to take a row from the reference table: symbol ZTS, showdate 2017-01-09.

Then read the ZTS.csv file, find the rows whose date matches the showdate, and add the open/high/low/close price columns.

Then loop this.

Due to size restrictions, I have uploaded sample data here on Google Drive: https://drive.google.com/drive/folders/1G3os67b2i2VfGHnvR6NX8qk1ECuVawGJ?usp=sharing

# read in the reference data
df <- read.csv("reference table.csv", header = TRUE)

# directory holding the per-symbol CSV files
wd <- "/Users/m/Desktop/project/price_data_csv"

# full.names = TRUE returns complete paths, so the files can be read
# even when wd is not the current working directory
files_in_wd <- list.files(wd, pattern = "\\.csv$", full.names = TRUE)

# find stuff to match

# create an empty list and read in all files from wd
mylist <- list()
for (i in seq_along(files_in_wd)) {
  mylist[[i]] <- read.csv(file = files_in_wd[i], header = TRUE)
}

I'm stuck on how to do the matching and create the combined table. Thank you.

I'd recommend using data.table because, as @r2evans mentions, it handles grouping well and, if your data is large, it is very fast.

Using your sample data, the code below should hopefully get you started (I've prefixed the data.table functions with data.table:: to help indicate where the package is being used). You can use the provided function on an individual symbol, or try running it all at once (I'm not sure how big your data actually is).

library(data.table)

data_dir <- "~/Downloads/Testing/"
reference_table <- data.table::fread(paste0(data_dir, "reference table.csv"))

prepare_symbol_table <- function(sym, ref) {
  # This check is only necessary if calling individually
  if(data.table::uniqueN(ref$symbol) > 1) 
    ref <- ref[symbol == sym]

  symbol_csv <- data.table::fread(paste0(data_dir, sym, ".csv"))
  data.table::merge.data.table(ref, symbol_csv, by.x = c("showdate"), by.y = c("date"))  
}

# merge a single symbol table
yum_table <- prepare_symbol_table("YUM", reference_table)

# all merged at once, reading individual CSVs by matching the symbol column from
# the reference table
all_symbols_merged <- reference_table[, {
  # equivalent inline version:
  # symbol_csv <- data.table::fread(paste0(data_dir, symbol, ".csv"))
  # data.table::merge.data.table(.SD, symbol_csv, by.x = c("showdate"), by.y = c("date"))
  # .BY is a list holding the current group's value, so pass the symbol itself
  prepare_symbol_table(.BY$symbol, .SD)
}, by = c("symbol")]
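If reading one CSV per group turns out to be slow, or some symbols in the reference table have no CSV file, an alternative is to read every file once, stack them with the symbol taken from the file name, and do a single join on symbol and date. The following is only a sketch under assumptions: `demo_dir`, `ref`, `prices` and `merged` are hypothetical names, the prices are made-up toy values written to a temporary folder, and the column names (`symbol`, `showdate`, `date`, `open`, `high`, `low`, `close`) are assumed to match the sample data.

```r
library(data.table)

# Hypothetical toy data in a temporary folder (not the asker's real files)
demo_dir <- file.path(tempdir(), "price_demo")
dir.create(demo_dir, showWarnings = FALSE)

fwrite(data.table(date  = c("2017-01-09", "2017-01-10"),
                  open  = c(53.0, 53.5), high = c(54.0, 54.5),
                  low   = c(52.5, 53.0), close = c(53.5, 54.0)),
       file.path(demo_dir, "ZTS.csv"))
fwrite(data.table(date  = c("2017-01-09", "2017-01-10"),
                  open  = c(63.0, 63.5), high = c(64.0, 64.5),
                  low   = c(62.5, 63.0), close = c(63.5, 64.0)),
       file.path(demo_dir, "YUM.csv"))

# Toy reference table; "ABC" deliberately has no price file
ref <- data.table(symbol   = c("ZTS", "YUM", "ABC"),
                  showdate = c("2017-01-09", "2017-01-10", "2017-01-09"))

# Read each CSV once; name each path by its symbol (file name without .csv)
csv_paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
names(csv_paths) <- tools::file_path_sans_ext(basename(csv_paths))

# Stack all price tables; idcol stores the list names in a "symbol" column
prices <- rbindlist(lapply(csv_paths, fread), idcol = "symbol")

# Make both date columns the same type before joining
prices[, date := as.IDate(date)]
ref[, showdate := as.IDate(showdate)]

# One join on symbol + date; all.x = TRUE keeps reference rows without prices
merged <- merge(ref, prices,
                by.x = c("symbol", "showdate"),
                by.y = c("symbol", "date"),
                all.x = TRUE)
print(merged)
```

Reference rows with no matching file or date (here the hypothetical "ABC") are kept with NA prices, which makes missing data easy to spot. Whether stacking everything fits in memory depends on how large the real price files are.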
