简体   繁体   English

R:合并> 1000个.dat文件

[英]R: Merge >1000 .dat files

I am trying to merge >30,000 individual .dat files of climate data across the UK (with sequential names of met*****.dat) into one single data file that I can then extract annual values from. 我正在尝试将全英国超过30,000个单独的.dat气候数据文件(序列名称为met *****。dat)合并到一个单一的数据文件中,然后从中提取年度值。 The data is formatted such that the first row in each file contains the lat long for that cell and then underneath that are the climate variables measured: 数据经过格式化,以使每个文件的第一行包含该单元格的经度,然后在其下是所测量的气候变量:

54.78   -7.06
 iGrid     x      y      year     month   SRad        Tmin       Tmax          vp        wind        rain        sunshine
19028      56     139    1971       1        2.22        1.25        6.43        7.29       12.33      133.47        1.72
19028      56     139    1971       2        3.94        1.26        6.74        7.36       11.11      106.64        1.85
19028      56     139    1971       3        7.30        0.95        7.56        7.05       12.77       57.26        2.57
19028      56     139    1971       4       12.09        2.29       10.07        7.71       11.02      112.81        4.24
19028      56     139    1971       5       15.51        4.85       13.07        9.20       13.94      103.08        5.02
19028      56     139    1971       6       16.18        6.36       14.09        9.98       12.44      115.03        4.76
19028      56     139    1971       7       15.59        9.25       17.56       12.83       10.45      107.18        4.73
19028      56     139    1971       8       11.92        9.20       15.98       12.70       11.39      136.40        3.35
19028      56     139    1971       9        9.69        7.85       16.31       12.72       12.24       41.76        4.11
19028      56     139    1971      10        5.43        6.96       12.88       10.56       16.55       99.72        2.90
19028      56     139    1971      11        2.55        2.78        8.36        8.39       17.98      195.24        1.56
19028      56     139    1971      12        1.51        3.29        8.02        8.68       16.74       54.63        0.89
19028      56     139    1972       1        1.96       -0.30        4.79        6.78       16.55      168.39        1.07
19028      56     139    1972       2        4.23        0.29        5.78        6.97       15.97      114.47        2.33
19028      56     139    1972       3        7.80        0.53        7.54        7.71       14.02      165.09        3.14
19028      56     139    1972       4       12.51        2.80       10.32        8.26       13.39      116.30        4.63
19028      56     139    1972       5       14.49        4.46       11.50        8.98       15.54      187.31        4.17
19028      56     139    1972       6       16.95        5.36       12.90        9.50       12.45      123.17        5.39
19028      56     139    1972       7       15.35        8.61       16.81       12.63        7.41      105.49        4.53
19028      56     139    1972       8       11.40        8.04       15.17       11.66       11.85       86.44        2.88
19028      56     139    1972       9        9.60        5.54       14.03       10.19        9.96       24.70        4.01
19028      56     139    1972      10        5.29        6.04       12.06       10.27       12.92       79.35        2.70
19028      56     139    1972      11        2.49        1.74        7.09        8.02       14.79      169.47        1.43
19028      56     139    1972      12        1.55        1.80        7.45        8.00       17.66      153.31        1.01
19028      56     139    1973       1        1.73        2.16        6.32        7.98       15.25      202.89        0.52
19028      56     139    1973       2        4.08        0.27        5.47        7.17       14.71      126.04        2.07
19028      56     139    1973       3        8.05        0.90        8.30        7.24       13.64       49.84        3.43
19028      56     139    1973       4       12.54        1.98        9.05        7.27       13.02       86.45        4.66
19028      56     139    1973       5       14.55        4.71       12.37        9.41       13.18      102.20        4.22
19028      56     139    1973       6       16.05        7.42       16.17       11.17       10.93       64.45        4.65
19028      56     139    1973       7       13.72        9.66       16.39       12.82        8.16      111.27        3.18
19028      56     139    1973       8       12.23        9.62       17.07       12.93       10.37       97.53        3.62
19028      56     139    1973       9        9.11        7.95       15.02       11.65       12.63      116.15        3.49
19028      56     139    1973      10        5.23        4.51       10.78        9.45       10.27       71.68        2.61
19028      56     139    1973      11        2.37        1.84        7.63        7.91       14.48      162.53        1.17
19028      56     139    1973      12        1.56        1.19        6.79        7.72       14.65      174.89        1.03

Reading previous answers ( This and this ) has helped me arrive at this code: 阅读先前的答案( Thisthis )帮助我找到了以下代码:

library(data.table)

filenames <- list.files(path="/mydirectory",
                    pattern = ".dat", full.names=T)

temp <- lapply(filenames, fread, sep=",")
data <- rbindlist(temp)

write.table(data,"metcombi.dat", sep=",")

Which spits out a huge .dat file (the input files occupy ~2.5GB) but at the end of each file it repeats the column headings and erases the lat long values. 这会吐出一个巨大的.dat文件(输入文件约占2.5GB),但是在每个文件的末尾,它会重复列标题并删除lat long值。

1) How do I stop it from repeating the column headers for each file I'm merging? 1)如何阻止我为要合并的每个文件重复列标题? 2) How can I extract the lat long values from the first row in each file and move them to new columns? 2)如何从每个文件的第一行提取经纬度值并将其移至新列?

Thank you! 谢谢!

Consider expanding your lapply() to do a quick read of first line, read in .dat file into dataframe, and then split lat/long first lines to columns: 考虑扩展lapply()来快速读取第一行,将.dat文件读入数据帧,然后将经/纬度的第一行拆分为列:

filenames <- list.files(path="/mydirectory",
                        pattern=".dat", full.names=TRUE)

dfList <- lapply(filenames, function(f){
    # FIRST LINES
    con <- file(description=f, open="r")
    latlong <- readLines(con, warn=FALSE)[1]
    close(con)

    # DATA FRAME
    df <- read.table(f, skip=1, header=TRUE)
    df$lat <- strsplit(latlong, "\\s+")[[1]][1]
    df$lon <- strsplit(latlong, "\\s+")[[1]][2]

    return(df)
})

data <- rbindlist(dfList)

write.csv(data, "metcombi.dat")

From what I understand of the problem above, I would suggest parsing the files in two phases: 据我对上述问题的了解,我建议分两个阶段解析文件:

First, grab the first line from each file, eg: 首先,从每个文件中获取第一行,例如:

headers <- data.frame()

for (filename in filenames) {
    fp = open(filename)
    headers <- rbind(headers, readLines(fp, n=1))
    close(fp)
}

Next, use your same code above to loop over the files and grab the main data from each of them, but skip the header line using something like header=TRUE or skip=1 : 接下来,使用与上述相同的代码循环遍历文件并从每个文件中获取主要数据,但是使用header=TRUEskip=1类的内容跳过标题行:

...
temp <- lapply(filenames, fread, sep=",", header=TRUE)
data <- rbindlist(temp)

The only thing left to do is to add the lat,long information back in from the first loop. 剩下要做的唯一一件事就是从第一个循环中重新添加经纬度信息。 I'm not sure what the format is, but this may be as simple as a cbind() call on the two data.frames. 我不确定格式是什么,但这可能就像对两个data.frames的cbind()调用一样简单。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM