R: Read specific columns from txt files whose column headers differ slightly (different spacing) and bind them?

Question

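I have many txt files that contain the same kind of numeric data in columns separated by ;. However, some files have column headers with spaces and some do not (they were created by different people). Some also have extra columns that I don't want.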

For example, one file might have a header such as:

ASomeName; BSomeName; C(someName%)

while another file's header might be:

A Some Name; B Some Name; C(someName%); D some name

How can I remove the spaces from the names before calling the "read" command?

#These are the files I have

filenames<-list.files(pattern = "*.txt",recursive = TRUE,full.names = TRUE)%>%as_tibble()

#These are the columns I would like:

colSelect=c("Date","Time","Timestamp" ,"PM2_5(ug/m3)","PM10(ug/m3)","PM01(ug/m3)","Temperature(C)",  "Humidity(%RH)", "CO2(ppm)")

#This is how I read them if they have the same columns

ldf <- vroom::vroom(filenames, col_select = colSelect,delim=";",id = "sensor" )%>%janitor::clean_names()

Header-cleaning script

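I wrote a destructive script that reads in the whole file, cleans the spaces out of the header, deletes the file, and rewrites it under the same file name (vroom sometimes complains that it cannot open X thousands of files). Not an efficient way of doing things.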

cleanHeaders<-function(filename){
  d<-vroom::vroom(filename,delim=";")%>%janitor::clean_names()
  #print(head(d))
  if (file.exists(filename)) {
    #Delete file if it exists
    file.remove(filename)
  }
  vroom::vroom_write(d,filename,delim = ";")
}

lapply(filenames,cleanHeaders)

Answer 1

fread's select argument accepts integer indices. If the desired columns are always in the same positions, your job is done.

library(data.table)

# positions of the wanted columns (must be the same in every file)
colIndexes = c(1,3,4,7,9,18,21)
data = lapply(filenames, fread, select = colIndexes)

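I think vroom has this feature as well, but since you are already selecting the columns you want, I don't think lazily evaluating your character columns would help at all, so I'd suggest sticking with data.table.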

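For a more robust solution, however, since you have no control over the structure of the tables: you can read one row from each file, capture and clean the column names, and then match them against a cleaned version of the colSelect vector.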

library(data.table)
library(janitor)
library(purrr)

filenames <- list.files(pattern = "*.txt",
                        recursive = TRUE,
                        full.names = TRUE)

# read the first row of data to capture and clean the column names
# (fread's argument is `nrows`; `nrow` only works through partial matching)
clean_col_names <- function(filename){
  colnames(janitor::clean_names(fread(filename, nrows = 1)))
}

clean_column_names <- map(.x = filenames, 
                          .f = clean_col_names)

# clean the colSelect vector
colSelect <- janitor::make_clean_names(c("Date",
                                         "Time",
                                         "Timestamp" ,
                                         "PM2_5(ug/m3)",
                                         "PM10(ug/m3)",
                                         "PM01(ug/m3)",
                                         "Temperature(C)",
                                         "Humidity(%RH)",
                                         "CO2(ppm)"))

# match each set of column names against the clean colSelect
select_indices <- map(.x = clean_column_names, 
                      .f = function(cols) match(colSelect, cols))

# use map2 to read only the matched column indexes for each file
data <- purrr::map2(.x = filenames, 
                    .y = select_indices, 
                    ~fread(input = .x, select = .y))

(purrr could easily be replaced here with a traditional lapply; I chose purrr because its formula notation is cleaner.)
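The answer stops at a list of tables, while the question also asks to bind them. A minimal sketch of that last step follows, assuming every file is ;-delimited with one header line. The helper name `read_selected` and the variable `wanted` are hypothetical, not part of the original answer. Two things it adds: `match()` returns `NA` for a wanted column that a file lacks, so those are dropped before reading, and each table is renamed to the cleaned names so the columns line up across files.

```r
library(data.table)
library(janitor)

# Hypothetical helper: read only the wanted columns that a file actually has.
read_selected <- function(filename, wanted) {
  # nrows = 0 reads just the header, giving an empty table with column names
  header  <- names(fread(filename, sep = ";", nrows = 0))
  cleaned <- janitor::make_clean_names(header)

  idx <- match(wanted, cleaned)
  idx <- sort(idx[!is.na(idx)])   # drop missing columns, keep file order

  dt <- fread(filename, sep = ";", select = idx)
  setnames(dt, cleaned[idx])      # same names in every file
  dt
}

wanted <- janitor::make_clean_names(c("Date", "Time", "Timestamp",
                                      "PM2_5(ug/m3)", "PM10(ug/m3)",
                                      "PM01(ug/m3)", "Temperature(C)",
                                      "Humidity(%RH)", "CO2(ppm)"))

filenames <- list.files(pattern = "\\.txt$", recursive = TRUE,
                        full.names = TRUE)

tables <- lapply(filenames, read_selected, wanted = wanted)
names(tables) <- filenames

# idcol records the source file; fill = TRUE pads absent columns with NA
combined <- rbindlist(tables, use.names = TRUE, fill = TRUE, idcol = "sensor")
```

Binding by name rather than by position means a file with the columns in a different order still lines up. If the same column is guessed as different types in different files, passing `colClasses` to `fread` may be needed before binding.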

Solution 1
1 Accepted 2021-04-01 15:16:02
