Apply an R script to multiple .txt files in a folder

Question

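I am very new to writing functions and loops. I have looked at previous questions similar to mine, but I can't seem to find a solution to my problem. My goal is to pull climate data from a web page like this one: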

https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations=NE3065&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019

I will use these data to calculate growing degree days (GDD) for a crop growth model. I have already managed to pull the data using a for loop.

uticaNE <- "https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations=NE8745&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019"

friendNE <- "https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations=NE3065&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019"

location.urls <- c(uticaNE, friendNE)
location.meso.files <- c("uticaNe.txt", "friendNE.txt")

for(i in seq_along(location.urls)){
  download.file(location.urls[i], location.meso.files[i], method="libcurl")
}

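Each day I will have about 20 locations to pull data from. What I want to do is apply the calculations (Fahrenheit conversion, GDD, and so on) to each file and save each file's output separately.

This is the code I have so far.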

files <- list.files(pattern="*.txt", full.names=TRUE, recursive=FALSE)

  func <- for (i in 1:length(files)){
  df <- read.table(files[i], skip=10, stringsAsFactors = 
  FALSE)
  colnames(df) <- c("year", "day", "solrad", "maxC", 
  "minC", "precipmm")
  df$year <- as.factor(df$year)
  df$day <- as.factor(df$day)
  df$maxF <- (df$maxC * (9/5) + 32)
  df$minF <- (df$minC * (9/5) + 32)
  df$GDD <- (((df$maxF + df$minF)/2)-50)
  df$GDD[df$GDD <= 0] <- 0
  df$GDD.cumulateive <- cumsum(df$GDD)
  df$precipmm.cumulative <- cumsum(df$precipmm)
  return(df)
  write.table(df, path="./output", quote=FALSE, 
  row.names=FALSE, col.names=TRUE)
}

data <- apply(files, func)

Any help would be greatly appreciated.

-ML

Answer 1

You can install the tidyverse library instead of using base R: https://www.tidyverse.org/. With it you can use the read_tsv function to load the link into a data frame as TSV (tab-separated values).

library(tidyverse) # provides read_tsv() and write_csv() via readr
dataframe<-read_tsv(url("http://some.where.net/"))

Then create a loop in R and do the calculations.

something<-c('link1','link2') #vector in R
for(i in something){
 #make sure to indent with one space
}

Finally, save the data frame to a file with the following command:

write_csv(dataframe, file = "c:\\myname\\yourfile.csv")
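
To make the three steps concrete, here is a minimal sketch of how they might fit together. It uses base R rather than tidyverse and runs against files that are already on disk. The helper name process_file, the gdd_in and gdd_out folders, and the two tiny headerless sample files are all assumptions made for illustration; real downloads carry header lines and would need a non-zero skip.

```r
# Hypothetical helper: read one six-column climate file and add derived columns.
process_file <- function(path, skip = 0) {
  df <- read.table(path, skip = skip, stringsAsFactors = FALSE,
                   col.names = c("year", "day", "solrad", "maxC", "minC", "precipmm"))
  df$maxF <- df$maxC * 9 / 5 + 32
  df$minF <- df$minC * 9 / 5 + 32
  df$GDD <- pmax((df$maxF + df$minF) / 2 - 50, 0)  # negative values become 0
  df$GDD.cumulative <- cumsum(df$GDD)
  df$precipmm.cumulative <- cumsum(df$precipmm)
  df
}

# Illustrative input (invented values): two small files in a temporary folder.
in_dir <- file.path(tempdir(), "gdd_in")
out_dir <- file.path(tempdir(), "gdd_out")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
writeLines(c("2020 1 9.0 20 10 0", "2020 2 9.5 30 20 1.5"), file.path(in_dir, "siteA.txt"))
writeLines(c("2020 1 8.0 2.2 -5 0", "2020 2 8.5 25 15 2"), file.path(in_dir, "siteB.txt"))

# Loop over every .txt file, process it, and write one output file per input.
files <- list.files(in_dir, pattern = "\\.txt$", full.names = TRUE)
results <- list()
for (f in files) {
  df <- process_file(f)
  write.table(df, file = file.path(out_dir, basename(f)),
              quote = FALSE, row.names = FALSE)
  results[[basename(f)]] <- df
}
```

Keeping the per-file work in a function means the loop body stays short, and the same function could be passed to lapply() instead of being called from a for loop.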

Answer 2

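Here is an approach that uses base R and lapply() with an anonymous function to download the data, read it into a data frame, add the Fahrenheit conversions and cumulative precipitation, and write an output file.

First, we create the list of weather stations whose data we will download.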

# list of 10 stations
stationList <- c("NE3065","NE8745","NE0030","NE0050","NE0130",
                 "NE0245","NE0320","NE0355","NE0375","NE0420")

Here we create two URL fragments, one for the URL content before the station identifier and one for the URL content after it.

urlFragment1 <- "https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations="
urlFragment2 <- "&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019"
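
As a small aside on how the fragments combine: paste0() is vectorized, so a single call can build the full URL for every station at once. The sketch below uses the variable names prefix, suffix, and stations, which are illustrative and not part of the answer, with fragment text copied from the URLs in the question.

```r
# Station IDs and URL fragments; the fragment text mirrors the question's URLs.
stations <- c("NE3065", "NE8745")
prefix <- "https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations="
suffix <- "&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019"

# One call builds a URL per station; naming the vector makes lookups easy.
urls <- paste0(prefix, stations, suffix)
names(urls) <- stations
```

Printing urls[["NE8745"]] is a quick way to check a constructed URL against one that is known to work before starting any downloads.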

Next, we create the input and output directories, one to store the downloaded climate input files and another for the output files.

# create input and output file directories if they do not already exist 
if(!dir.exists("./data")) dir.create("./data")
if(!dir.exists("./data/output")) dir.create("./data/output")

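The lapply() function uses paste0() to insert the station name between the URL fragments created above, which lets us automate the download and the subsequent operations for each input file.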

stationData <- lapply(stationList,function(x){
     theURL <-paste0(urlFragment1,x,urlFragment2)
     download.file(theURL,
                   paste0("./data/",x,".txt"),method="libcurl")
     df <- read.table(paste0("./data/",x,".txt"), skip=11, stringsAsFactors = 
                           FALSE)
     colnames(df) <- c("year", "day", "solrad", "maxC", 
                       "minC", "precipmm")
     df$year <- as.factor(df$year)
     df$day <- as.factor(df$day)
     df$maxF <- (df$maxC * (9/5) + 32)
     df$minF <- (df$minC * (9/5) + 32)
     df$GDD <- (((df$maxF + df$minF)/2)-50)
     df$GDD[df$GDD <= 0] <- 0
     df$GDD.cumulative <- cumsum(df$GDD)
     df$precipmm.cumulative <- cumsum(df$precipmm)
     df$station <- x
     write.table(df,file=paste0("./data/output/",x,".txt"), quote=FALSE, 
                 row.names=FALSE, col.names=TRUE)
     df
})
# add names to the data frames returned by lapply()
names(stationData) <- stationList

...and the output: a directory containing one file for each station listed in the stationList object.

Finally, here are the data written to the ./data/output/NE3065.txt file.

year day solrad maxC minC precipmm maxF minF GDD GDD.cumulative precipmm.cumulative station
2020 1 8.992 2.2 -5 0 35.96 23 0 0 0 NE3065
2020 2 9.604 5.6 -3.9 0 42.08 24.98 0 0 0 NE3065
2020 3 4.933 5.6 -3.9 0 42.08 24.98 0 0 0 NE3065
2020 4 8.699 3.9 -7.2 0 39.02 19.04 0 0 0 NE3065
2020 5 9.859 6.1 -7.8 0 42.98 17.96 0 0 0 NE3065
2020 6 10.137 7.2 -5 0 44.96 23 0 0 0 NE3065
2020 7 8.754 6.1 -4.4 0 42.98 24.08 0 0 0 NE3065
2020 8 10.121 7.8 -5 0 46.04 23 0 0 0 NE3065
2020 9 9.953 7.2 -5 0 44.96 23 0 0 0 NE3065
2020 10 8.905 7.2 -5 0 44.96 23 0 0 0 NE3065
2020 11 0.416 -3.9 -15.6 2.29 24.98 3.92 0 0 2.29 NE3065
2020 12 10.694 -4.4 -16.1 0 24.08 3.02 0 0 2.29 NE3065
2020 13 1.896 -4.4 -11.1 0.51 24.08 12.02 0 0 2.8 NE3065
2020 14 0.851 0 -7.8 0 32 17.96 0 0 2.8 NE3065
2020 15 11.043 -1.1 -8.9 0 30.02 15.98 0 0 2.8 NE3065
2020 16 10.144 -2.8 -17.2 0 26.96 1.04 0 0 2.8 NE3065
2020 17 10.75 -5.6 -17.2 3.05 21.92 1.04 0 0 5.85 NE3065

Note that the input file has 11 rows of header data, so the skip= argument of read.table() must be set to 11, not the 10 used in the OP.
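
A hard-coded header length stops working if the number of header lines ever changes. One possible alternative is to count the lines that come before the first purely numeric row. This is a sketch under the assumption that data rows consist only of whitespace-separated numbers and that no header line does; the helper name count_header_lines and the sample file contents are invented for illustration and are not the actual download format.

```r
# Hypothetical helper: number of lines before the first all-numeric row.
count_header_lines <- function(path) {
  lines <- readLines(path)
  pattern <- "^[[:space:]]*-?[0-9.]+([[:space:]]+-?[0-9.]+)+[[:space:]]*$"
  numeric_row <- grepl(pattern, lines)
  if (!any(numeric_row)) stop("no numeric data rows found in ", path)
  which(numeric_row)[1] - 1
}

# Invented sample: three header lines followed by two data rows.
tmp <- tempfile(fileext = ".txt")
writeLines(c("! made-up header comment",
             "year day radn maxt mint rain",
             "() () (MJ/m^2) (oC) (oC) (mm)",
             "2020 1 8.992 2.2 -5 0",
             "2020 2 9.604 5.6 -3.9 0"), tmp)

# Use the detected count instead of a fixed skip value.
n_skip <- count_header_lines(tmp)
df <- read.table(tmp, skip = n_skip, stringsAsFactors = FALSE)
```

With this in place, the detected value can be passed to skip= in the anonymous function in place of the literal 11.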

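Enhancing the code

The last line of the anonymous function returns the data frame to the parent environment, producing a list of 10 data frames that is stored in the stationData object. Because we assign the station name to a column in each data frame, we can combine them into a single data frame for subsequent analysis, using do.call() and rbind() as follows.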

combinedData <- do.call(rbind,stationData)

Because this code was run on January 17, the resulting data frame contains 170 observations, or 17 observations for each of the 10 stations whose data we downloaded.
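
The combining step can be tried on toy data without downloading anything. In this sketch the list toy stands in for stationData; its names and values are invented.

```r
# Toy stand-in for stationData: a named list of two small data frames.
toy <- list(
  A = data.frame(day = 1:2, precipmm = c(0, 1.5), station = "A"),
  B = data.frame(day = 1:2, precipmm = c(2, 0.5), station = "B")
)

# do.call() passes each list element as a separate argument to rbind().
combined <- do.call(rbind, toy)
nrow(combined)       # 4 rows: 2 per station
rownames(combined)   # "A.1" "A.2" "B.1" "B.2", built from the list names
```

Because rbind() derives the row names from the list names, naming the list elements by station keeps each row traceable even without the station column.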

At this point the data can be analyzed by station, for example finding the average year-to-date precipitation per station.

> aggregate(precipmm ~ station,combinedData,mean)
   station   precipmm
1   NE0030 0.01470588
2   NE0050 0.56764706
3   NE0130 0.32882353
4   NE0245 0.25411765
5   NE0320 0.28411765
6   NE0355 1.49411765
7   NE0375 0.55235294
8   NE0420 0.13411765
9   NE3065 0.34411765
10  NE8745 0.47823529
>
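
The mean above is the average of the daily precipitation values. If a per-station total is wanted instead, the same formula interface works with sum. The sketch below uses a small invented data frame in place of combinedData.

```r
# Toy data standing in for combinedData (values invented).
toy <- data.frame(
  station  = c("A", "A", "A", "B", "B", "B"),
  precipmm = c(0, 2.5, 0.5, 1, 0, 4)
)

# Total precipitation per station, returned as a data frame.
totals <- aggregate(precipmm ~ station, toy, sum)

# The same totals with tapply(), which returns a named vector.
totals_vec <- tapply(toy$precipmm, toy$station, sum)
```

aggregate() is convenient when the result feeds into further data frame operations, while tapply() gives a compact named vector for quick lookups such as totals_vec[["B"]].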

Answer 1
0 2020-01-17 21:08:36

Answer 2
0 Accepted 2020-01-18 14:25:33
