
Apply an R script over multiple .txt files in a folder

I am extremely new to building functions and loops. I have looked at previous questions similar to my issue, but I can't seem to find a solution to my problem. My goal is to extract climate data from a webpage like this:

https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations=NE3065&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019

where I will use this data to calculate growing degree days for a crop growth model. I have had success pulling data using a for loop.

uticaNE <- "https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations=NE8745&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019"

friendNE <- "https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations=NE3065&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019"

location.urls <- c(uticaNE, friendNE)
location.meso.files <- c("uticaNe.txt", "friendNE.txt")

for(i in seq_along(location.urls)){
  download.file(location.urls[i], location.meso.files[i], method="libcurl")
} 

I will have around 20 locations that I will be pulling data from daily. What I want to do is apply a task to each file where I calculate Fahrenheit, GDD, etc., and save the output of each file separately.

This is the code I currently have.

files <- list.files(pattern="*.txt", full.names=TRUE, recursive=FALSE)

func <- for (i in 1:length(files)){
  df <- read.table(files[i], skip=10, stringsAsFactors = FALSE)
  colnames(df) <- c("year", "day", "solrad", "maxC", "minC", "precipmm")
  df$year <- as.f(df$year)
  df$day <- as.factor(df$day)
  df$maxF <- (df$maxC * (9/5) + 32)
  df$minF <- (df$minC * (9/5) + 32)
  df$GDD <- (((df$maxF + df$minF)/2)-50)
  df$GDD[df$GDD <= 0] <- 0
  df$GDD.cumulateive <- cumsum(df$GDD)
  df$precipmm.cumulative <- cumsum(df$precipmm)
  return(df)
  write.table(df, path="./output", quote=FALSE, row.names=FALSE, col.names=TRUE)
}

data <- apply(files, func)

Any help would be greatly appreciated.

-ML

Instead of using base R, you can install the tidyverse library ( https://www.tidyverse.org/ ). With it you can load the link into a data frame as TSV (tab-separated values) using the read_tsv function.

library(readr)  # provides read_tsv()
dataframe <- read_tsv(url("http://some.where.net/"))

Then create a loop in R and do the calculations.

something <- c('link1','link2') # vector in R
for(i in something){
  # do the calculations for each link here
}

At the end, you save the data frame to a file using

write_csv(dataframe, file = "c:\\myname\\yourfile.csv")
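Putting the three pieces together, here is a minimal sketch of this approach. The links and the output path are placeholders, and it assumes each link really serves tab-separated values; for the Mesonet files in the question, which carry a multi-line header, the base-R answer below is a closer fit.

library(tidyverse)

# placeholder links; swap in the real station URLs
links <- c("http://some.where.net/one", "http://some.where.net/two")

for (i in seq_along(links)) {
  dataframe <- read_tsv(url(links[i]))   # read each link as TSV
  # ... add the Fahrenheit/GDD calculations here ...
  write_csv(dataframe, file = paste0("c:\\myname\\station", i, ".csv"))  # one output per link
}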

Here is an approach using base R and lapply() with an anonymous function to download the data, read it into a data frame, add the conversions to Fahrenheit and the cumulative precipitation, and write the output files.

First, we create the list of weather stations for which we will download data.

# list of 10 stations
stationList <- c("NE3065","NE8745","NE0030","NE0050","NE0130",
                 "NE0245","NE0320","NE0355","NE0375","NE0420")

Here we create two URL fragments: one for the URL content before the station identifier, and another for the URL content after the station identifier.

urlFragment1 <- "https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations="
urlFragment2 <- "&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year"
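As a quick sanity check, pasting one station ID between the fragments should reproduce a URL of the same shape as the ones in the question (shown truncated here):

theURL <- paste0(urlFragment1, "NE3065", urlFragment2)
theURL
# "https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations=NE3065&year1=2020...&scenario_year"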

Next, we create input and output directories: one to store the downloaded climate input files, and another for the output files.

# create input and output file directories if they do not already exist 
if(!dir.exists("./data")) dir.create("./data")
if(!dir.exists("./data/output")) dir.create("./data/output")

The lapply() function uses paste0() to add the station names to the URL fragments we created above, enabling us to automate the download and subsequent operations against each input file.

stationData <- lapply(stationList, function(x){
     # build the request URL for this station and download the raw file
     theURL <- paste0(urlFragment1, x, urlFragment2)
     download.file(theURL, paste0("./data/", x, ".txt"), method="libcurl")
     # skip the 11 header rows, then name the columns
     df <- read.table(paste0("./data/", x, ".txt"), skip=11, stringsAsFactors = FALSE)
     colnames(df) <- c("year", "day", "solrad", "maxC", "minC", "precipmm")
     df$year <- as.factor(df$year)
     df$day <- as.factor(df$day)
     # convert to Fahrenheit, compute GDD, and accumulate GDD and precipitation
     df$maxF <- (df$maxC * (9/5) + 32)
     df$minF <- (df$minC * (9/5) + 32)
     df$GDD <- (((df$maxF + df$minF)/2) - 50)
     df$GDD[df$GDD <= 0] <- 0
     df$GDD.cumulative <- cumsum(df$GDD)
     df$precipmm.cumulative <- cumsum(df$precipmm)
     df$station <- x
     write.table(df, file=paste0("./data/output/", x, ".txt"), quote=FALSE,
                 row.names=FALSE, col.names=TRUE)
     df  # return the data frame to lapply()
})
# add names to the data frames returned by lapply()
names(stationData) <- stationList
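Because the list elements are now named, an individual station's data frame can be pulled out directly, for example:

head(stationData[["NE3065"]])  # first rows of the NE3065 data frame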

...and here is the output: a directory containing one file for each station listed in the stationList object.

Finally, here is the data that has been written to the ./data/output/NE3065.txt file.

year day solrad maxC minC precipmm maxF minF GDD GDD.cumulative precipmm.cumulative station
2020 1 8.992 2.2 -5 0 35.96 23 0 0 0 NE3065
2020 2 9.604 5.6 -3.9 0 42.08 24.98 0 0 0 NE3065
2020 3 4.933 5.6 -3.9 0 42.08 24.98 0 0 0 NE3065
2020 4 8.699 3.9 -7.2 0 39.02 19.04 0 0 0 NE3065
2020 5 9.859 6.1 -7.8 0 42.98 17.96 0 0 0 NE3065
2020 6 10.137 7.2 -5 0 44.96 23 0 0 0 NE3065
2020 7 8.754 6.1 -4.4 0 42.98 24.08 0 0 0 NE3065
2020 8 10.121 7.8 -5 0 46.04 23 0 0 0 NE3065
2020 9 9.953 7.2 -5 0 44.96 23 0 0 0 NE3065
2020 10 8.905 7.2 -5 0 44.96 23 0 0 0 NE3065
2020 11 0.416 -3.9 -15.6 2.29 24.98 3.92 0 0 2.29 NE3065
2020 12 10.694 -4.4 -16.1 0 24.08 3.02 0 0 2.29 NE3065
2020 13 1.896 -4.4 -11.1 0.51 24.08 12.02 0 0 2.8 NE3065
2020 14 0.851 0 -7.8 0 32 17.96 0 0 2.8 NE3065
2020 15 11.043 -1.1 -8.9 0 30.02 15.98 0 0 2.8 NE3065
2020 16 10.144 -2.8 -17.2 0 26.96 1.04 0 0 2.8 NE3065
2020 17 10.75 -5.6 -17.2 3.05 21.92 1.04 0 0 5.85 NE3065

Note that there are 11 rows of header data in the input files, so one must set the skip= argument in read.table() to 11, not 10 as was used in the OP.
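If in doubt about the header length, one way to verify it is to inspect the first lines of a downloaded file before settling on the skip= value, for example:

readLines("./data/NE3065.txt", n = 12)  # the 11 header lines plus the first data row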

Enhancing the code

The last line in the anonymous function returns the data frame to the parent environment, resulting in a list of 10 data frames stored in the stationData object. Since we assigned the station name to a column in each data frame, we can combine the data frames into a single data frame for subsequent analysis, using do.call() with rbind() as follows.

combinedData <- do.call(rbind,stationData)

Since this code was run on January 17th, the resulting data frame contains 170 observations: 17 observations for each of the 10 stations whose data we downloaded.
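A quick check of the combined data frame confirms the row counts per station:

table(combinedData$station)  # should show 17 rows for each of the 10 stations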

At this point the data can be analyzed by station, such as finding the average year-to-date precipitation by station.

> aggregate(precipmm ~ station,combinedData,mean)
   station   precipmm
1   NE0030 0.01470588
2   NE0050 0.56764706
3   NE0130 0.32882353
4   NE0245 0.25411765
5   NE0320 0.28411765
6   NE0355 1.49411765
7   NE0375 0.55235294
8   NE0420 0.13411765
9   NE3065 0.34411765
10  NE8745 0.47823529
> 
