How to extract information and perform the same operation on multiple similar files in R?
I have several hundred files, each of which contains prices for a particular stock. I want to loop through them, calculate the log return for each, and add it as a column in a data frame containing the log returns for all of the stocks.
Say I have three CSVs named "a.csv", "b.csv" and "c.csv" that look something like the following. (The numbers are totally fabricated; the point is that the dates are not necessarily the same and the files are not the same length, but they all have the same column names.)
a.csv:
Date Adj.Close
1/1/2001 5
1/2/2001 5.25
1/3/2001 5.17
1/4/2001 5.09
1/5/2001 5.83
b.csv:
Date Adj.Close
3/17/2005 17.85
3/18/2005 19.20
3/19/2005 18.55
3/20/2005 18.45
c.csv:
Date Adj.Close
5/9/1995 25.39
5/10/1995 25
5/11/1995 25.83
5/12/1995 24.99
5/13/1995 28
5/16/1995 27.17
5/17/1995 26.95
I know how to calculate log returns for one file (the code below works fine for one file):
setwd('my_wd')
data <- read.csv('a.csv')
attach(data)
n = dim(data)[1]
log_rtn = diff(log(Adj.Close))
That gives me a list of the log returns for the first csv. What I want to do (in pseudocode) is:
for file in my_wd:
data <- file_name.csv
attach(data)
n = dim(data)[1]
file_name_log_rtn = diff(log(Adj.Close))
in order to return lists of log returns named in the same way as the csv files. In pseudo-output, something like this (each named after its file):
a_log_rtn:
0.048790164, -0.015355388,-0.015594858,0.13573917
b_log_rtn:
0.072906771, -0.03444049,-0.005405419
c_log_rtn:
-0.015479571,0.032660782,-0.033060862,0.113728765,-0.030091087,-0.008130126
Foreword: Do not use attach. You have nothing to gain from it and it is potentially harmful.
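A minimal sketch of the single-file calculation without attach. It uses the made-up a.csv prices from the question, built inline so that it runs on its own; the variable name prices is only illustrative.

```r
# Made-up prices from the question's a.csv, built inline so the example is self-contained
prices <- data.frame(
  Date = c("1/1/2001", "1/2/2001", "1/3/2001", "1/4/2001", "1/5/2001"),
  Adj.Close = c(5, 5.25, 5.17, 5.09, 5.83)
)

# Refer to the column explicitly instead of attaching the data frame
log_rtn <- diff(log(prices$Adj.Close))

# with() gives the same result and keeps the column lookup local to this one call
log_rtn_with <- with(prices, diff(log(Adj.Close)))
```

Both forms leave the search path untouched, so a later object that happens to be called Adj.Close cannot silently shadow the column.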
Without access to your files I have not tested the code below, but I would do something along these lines.
The trick is to use lapply to process all the files in a loop. I use it twice: once to read in the data, and a second time to create a new column with the log returns.
olddir <- setwd('my_wd')
# match only file names that end in ".csv"
files_list <- list.files(pattern = "\\.csv$")
data_list <- lapply(files_list, read.csv)
data_list <- lapply(data_list, function(DF){
  # the first row has no previous price, so its return is NA
  DF[["log_rtn"]] <- c(NA, diff(log(DF[["Adj.Close"]])))
  DF
})
# reset the old directory if you want
#setwd(olddir)
Note that the column log_rtn will have NA as its first value. You can change this to 0 if you want, but I believe that NA makes more sense, since the first price has no earlier price from which to compute a return.
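The question also asks for one data frame holding the log returns of all stocks. One possible way to get there, sketched under assumptions rather than taken from the answer: give each stock a Date column plus a return column named after its file, then full-join the pieces on Date with merge. The temporary directory, the two small files, and the names tmp_dir, rtn_list and all_rtn are made up for illustration, and the dates are assumed to be in month/day/year form as in the question.

```r
# Write two small made-up files to a temporary directory so the sketch runs anywhere
tmp_dir <- file.path(tempdir(), "log_rtn_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(Date = c("1/1/2001", "1/2/2001", "1/3/2001"),
                     Adj.Close = c(5, 5.25, 5.17)),
          file.path(tmp_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(Date = c("1/2/2001", "1/3/2001", "1/4/2001"),
                     Adj.Close = c(17.85, 19.20, 18.55)),
          file.path(tmp_dir, "b.csv"), row.names = FALSE)

files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
stock_names <- tools::file_path_sans_ext(basename(files))

# One two-column data frame per stock: Date plus a return column named after the file
rtn_list <- Map(function(f, nm) {
  DF <- read.csv(f)
  out <- data.frame(
    Date = as.Date(as.character(DF[["Date"]]), format = "%m/%d/%Y"),
    rtn = c(NA, diff(log(DF[["Adj.Close"]])))
  )
  names(out)[2] <- paste0(nm, "_log_rtn")
  out
}, files, stock_names)

# Full outer join on Date; a date missing from one file becomes NA in that stock's column
all_rtn <- Reduce(function(x, y) merge(x, y, by = "Date", all = TRUE), rtn_list)
```

With several hundred files, repeated merging may become slow; keeping the list of per-stock data frames, or stacking them in long format, are alternatives worth considering.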
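# full.names = TRUE so that read.csv can find the files from any working directory
allfiles=list.files(path_to_the_files_here,pattern = "\\.csv$",full.names = TRUE)
# adds the log price as a new column; take diff() of it to get the log returns
listdata=lapply(allfiles,function(x)transform(read.csv(x),log_Adj.Close=log(Adj.Close)))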
If you want, you can assign these to objects in the global environment, each named after its file:
list2env(setNames(listdata,tools::file_path_sans_ext(basename(allfiles))),envir = .GlobalEnv)
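When objects are named after files, the way the name is derived from the path matters. The sketch below compares a regular expression that captures only the single character before ".csv" with dropping the directory and extension; the two paths are hypothetical.

```r
# Hypothetical file paths, used only to show how names are derived
paths <- c("prices/a.csv", "prices/msft.csv")

# Captures only the one character immediately before ".csv"
short_names <- gsub(".*(.)(\\.csv)", "\\1", paths)

# Drops the directory and the extension, keeping the whole base name
full_names <- tools::file_path_sans_ext(basename(paths))
```

For one-letter file names such as a.csv the two agree; for longer names only the second keeps the full stock name.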
Put the files in a directory, say it is called csv_dir.
csv_list <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
# name each path after its file so the results can be looked up by file name
names(csv_list) <- basename(csv_list)
log_diffs <- lapply(csv_list, function(t) {
  tcsv <- read.csv(t)
  diff(log(tcsv$Adj.Close))
})
This will produce a list log_diffs with what you want. To see the results from a particular file you can use log_diffs[["a.csv"]], for example. If you want to put all the results in one big data frame, with one column for the file name and another for the log differences, you could do the following:
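log_diffs <- lapply(csv_list, function(t) {
  tcsv <- read.csv(t)
  # the file name is recycled to the length of the log differences
  data.frame(file = basename(t),
             log.diff = diff(log(tcsv$Adj.Close)),
             stringsAsFactors = FALSE)
})
# stack the per-file data frames into one
csv_log_diffs <- do.call(rbind, log_diffs)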
If your csv files are very large, you could consider using read_csv from the readr package; it is typically faster than read.csv and provides a progress bar.
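A small sketch of swapping in read_csv, assuming the readr package may or may not be installed (it falls back to read.csv if it is not). The temporary file and its contents are made up for illustration.

```r
# Made-up data written to a temporary file so the sketch is runnable
tmp_file <- tempfile(fileext = ".csv")
write.csv(data.frame(Date = c("1/1/2001", "1/2/2001", "1/3/2001"),
                     Adj.Close = c(5, 5.25, 5.17)),
          tmp_file, row.names = FALSE)

if (requireNamespace("readr", quietly = TRUE)) {
  # col_types = cols() lets readr guess the column types without printing the spec
  prices <- readr::read_csv(tmp_file, col_types = readr::cols())
} else {
  prices <- read.csv(tmp_file)  # base R fallback when readr is not available
}

log_rtn <- diff(log(prices[["Adj.Close"]]))
```

One difference to watch for: read.csv rewrites headers into syntactically valid names by default (a header written as "Adj Close" becomes Adj.Close), while read_csv keeps headers as written, so the column name may need adjusting when switching.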