Is there a more efficient way to split a column?

Question

When I run this read.table call, some values are not imported correctly:

hs.industry <- read.table("https://download.bls.gov/pub/time.series/hs/hs.industry", header = TRUE, fill = TRUE, sep = "\t", quote = "", stringsAsFactors = FALSE)

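Specifically, some values in the industry_code column contain the industry_code and the industry_name concatenated into a single value (I'm not sure why). Given that every industry code is 4 digits, my approach to splitting and correcting them is: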

for (i in 1:nrow(hs.industry)) {
  if (isTRUE(nchar(hs.industry$industry_code[i]) > 4)) {
    hs.industry$industry_name[i] <- gsub("[[:digit:]]","",hs.industry$industry_code[i])
    hs.industry$industry_code[i] <- gsub("[^0-9]", "",hs.industry$industry_code[i])
  }
}

I feel this is pretty trivial, but I'm not sure which approach would be better.

Thanks!

Answer 1

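The problem is that lines 29 and 30 of the file (rows 28 and 29 if you don't count the header) are malformed: they use 4 spaces instead of a proper tab character. Some extra data cleaning is needed.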
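To make the diagnosis concrete, here is a minimal sketch of how such malformed lines could be spotted in the raw text. The sample lines and the 4-digit codes are made up for illustration (only the two stone industry names come from this thread); with the real file you would pass the output of readLines instead.

```r
# Hypothetical sample mimicking the file layout: a header plus three rows.
# The last two rows use four spaces instead of a tab after the code.
raw_lines <- c(
  "industry_code\tindustry_name",
  "1000\tExample industry",
  "2000    Dimension stone",
  "3000    Crushed and broken stone"
)

# Flag lines where a leading 4-digit code is followed by spaces, not a tab
bad <- grepl("^[0-9]{4} +", raw_lines)

# Positions of the malformed lines within raw_lines
which(bad)
```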

Use readLines to read the raw text, correct the formatting error, then read in the cleaned table:

# read in each line of the file as an element of a character vector
hs.industry <- readLines('https://download.bls.gov/pub/time.series/hs/hs.industry')

# replace any run of 4 or more spaces with a tab character
# (' {4,}' matches spaces only; '\\W{4,}' would match any non-word characters)
hs.industry <- gsub(' {4,}', '\t', hs.industry)

# collapse the vector into one string, with lines separated by a newline (\n)
hs.industry <- paste(hs.industry, collapse = '\n')

# read in the cleaned table
hs.industry <- read.table(text = hs.industry, sep = '\t', header = TRUE, quote = '')
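The same clean-then-parse idea can be tried offline on a small sample. This is only a sketch: the sample lines and codes are hypothetical, and colClasses = "character" is an added assumption to keep the codes as strings so their length can be checked.

```r
# Hypothetical raw lines; the last two use four spaces instead of a tab
raw_lines <- c(
  "industry_code\tindustry_name",
  "1000\tExample industry",
  "2000    Dimension stone",
  "3000    Crushed and broken stone"
)

# Replace each run of 4 or more spaces with a tab
fixed <- gsub(" {4,}", "\t", raw_lines)

# Parse the repaired text; read every column as character
tbl <- read.table(text = paste(fixed, collapse = "\n"), sep = "\t",
                  header = TRUE, quote = "", colClasses = "character")

# Sanity check: every code should now be exactly 4 characters long
all(nchar(tbl$industry_code) == 4)
```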

Answer 2

You don't need to loop over every row. Just identify the problematic entries and apply gsub only to those:

replace_indx <- which(nchar(hs.industry$industry_code) > 4)
hs.industry$industry_name[replace_indx] <- gsub("\\d+\\s+", "", hs.industry$industry_code[replace_indx])
hs.industry$industry_code[replace_indx] <- gsub("\\D+", "", hs.industry$industry_code[replace_indx])

I also used "\\d+\\s+" to improve the string replacement, so that the whitespace is removed as well:

gsub("[[:digit:]]","",hs.industry$industry_code[replace_indx])
# [1] "    Dimension stone"          "    Crushed and broken stone"

gsub("\\d+\\s+", "", hs.industry$industry_code[replace_indx])
# [1] "Dimension stone"          "Crushed and broken stone"
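As a variation on the same idea, the split could also be done with one anchored pattern and capture groups, so that only the leading 4-digit code is treated as the code. This is a sketch on a hypothetical data frame (the codes are made up), not part of the original answer.

```r
# Hypothetical data frame reproducing the problem: two codes carry the name
df <- data.frame(
  industry_code = c("1000", "2000    Dimension stone",
                    "3000    Crushed and broken stone"),
  industry_name = c("Example industry", "", ""),
  stringsAsFactors = FALSE
)

# Rows whose code is longer than the expected 4 characters
idx <- which(nchar(df$industry_code) > 4)

# Group 1: the leading 4-digit code; group 2: everything after the whitespace
pattern <- "^(\\d{4})\\s+(.*)$"

# Fill in the name first, while the combined value is still available
df$industry_name[idx] <- sub(pattern, "\\2", df$industry_code[idx])
df$industry_code[idx] <- sub(pattern, "\\1", df$industry_code[idx])
```

Because the pattern is anchored at the start of the string, digits that happen to appear inside an industry name would stay in the name rather than being stripped or merged into the code.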

2 solutions

Solution 1
4, accepted, 2017-03-06 18:45:48

Solution 2
1, 2017-03-06 18:44:10
