Is there a more efficient way to split columns?

Question

A few values do not import correctly when performing this read.table:

hs.industry <- read.table("https://download.bls.gov/pub/time.series/hs/hs.industry", header = TRUE, fill = TRUE, sep = "\t", quote = "", stringsAsFactors = FALSE)

Specifically, there are a few values where the industry_code and industry_name are joined as a single value in the industry_code column (not sure why). Given that each industry_code is 4 digits, my approach to split and correct them is:

for (i in 1:nrow(hs.industry)) {
  if (isTRUE(nchar(hs.industry$industry_code[i]) > 4)) {
    hs.industry$industry_name[i] <- gsub("[[:digit:]]","",hs.industry$industry_code[i])
    hs.industry$industry_code[i] <- gsub("[^0-9]", "",hs.industry$industry_code[i])
  }
}

I feel this is terribly inefficient, but I'm not sure what approach would be better.

Thanks!

Answer 1

The problem is that lines 29 and 30 of the file (rows 28 and 29, if we're not counting the header) have a formatting error: they use 4 spaces instead of a proper tab character. A bit of extra data cleaning is needed.

Use readLines to read in the raw text, correct the formatting error, and then read in the cleaned table:

# read in each line of the file as an element of a character vector
hs.industry <- readLines('https://download.bls.gov/pub/time.series/hs/hs.industry')

# replace any run of 4 or more spaces with a tab character
# (' {4,}' matches spaces only; '\\W{4,}' would also match runs of tabs or punctuation)
hs.industry <- gsub(' {4,}', '\t', hs.industry)

# collapse the vector into one string, with lines separated by a newline character (\n)
hs.industry <- paste(hs.industry, collapse = '\n')

# read in the cleaned table
hs.industry <- read.table(text = hs.industry, sep = '\t', header = TRUE, quote = '')
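To try the cleanup without downloading anything, here is a minimal sketch on a small inline sample. The rows, codes, and the two-column layout are made up for illustration (only the column names industry_code and industry_name come from the question), and the variable names raw_lines, fixed_lines, and sample_industry are hypothetical. It also assumes the codes should stay character strings, hence colClasses.

```r
# Hypothetical sample standing in for the downloaded file; the codes are
# invented and only two columns are used for brevity.
raw_lines <- c(
  "industry_code\tindustry_name",
  "1000\tMetal mining",
  "1411    Dimension stone",
  "1422    Crushed and broken stone"
)

# Replace runs of four or more spaces with a single tab
fixed_lines <- gsub(" {4,}", "\t", raw_lines)

# read.table() reads a character vector passed through its text argument,
# treating each element as one line
sample_industry <- read.table(text = fixed_lines, sep = "\t", header = TRUE,
                              quote = "", stringsAsFactors = FALSE,
                              colClasses = "character")

print(sample_industry)
```

Because read.table accepts a character vector in text, the paste(..., collapse = '\n') step is optional in this sketch; keeping it, as the answer does, should give the same table.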

Answer 2

You should not have to loop through each row. Instead, identify only the problematic entries and apply gsub to just those:

replace_indx <- which(nchar(hs.industry$industry_code) > 4)
hs.industry$industry_name[replace_indx] <- gsub("\\d+\\s+", "", hs.industry$industry_code[replace_indx])
hs.industry$industry_code[replace_indx] <- gsub("\\D+", "", hs.industry$industry_code[replace_indx])

I also used "\\d+\\s+" to improve the string replacement, so that the spaces after the code are removed as well:

gsub("[[:digit:]]","",hs.industry$industry_code[replace_indx])
# [1] "    Dimension stone"          "    Crushed and broken stone"

gsub("\\d+\\s+", "", hs.industry$industry_code[replace_indx])
# [1] "Dimension stone"          "Crushed and broken stone"
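A possible variation, sketched here under the question's stated assumption that every industry_code is exactly 4 digits: anchor the pattern to the start of the string and take the code with substr, so that digits appearing inside an industry name are left alone. The data frame demo, its rows, and the name "Group 2 minerals" are hypothetical and chosen only to show that case; they are not taken from the BLS file.

```r
# Hypothetical data frame; the rows and codes are invented for illustration
demo <- data.frame(
  industry_code = c("1000", "1411    Dimension stone", "1499    Group 2 minerals"),
  industry_name = c("Metal mining", "", ""),
  stringsAsFactors = FALSE
)

# Rows where the code and the name were read into the same column
joined <- which(nchar(demo$industry_code) > 4)

# Drop only the leading 4-digit code and the whitespace that follows it
demo$industry_name[joined] <- sub("^\\d{4}\\s+", "", demo$industry_code[joined])

# Keep the first four characters as the code
demo$industry_code[joined] <- substr(demo$industry_code[joined], 1, 4)

print(demo)
```

With the unanchored patterns, a name such as the hypothetical "Group 2 minerals" would lose its "2" and the code would gain an extra digit. For the two affected rows shown above, both approaches should give the same result.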

Answer 1: score 4, accepted, 2017-03-06 18:45:48

Answer 2: score 1, 2017-03-06 18:44:10
