
Is there a more efficient way to split columns?

There are a few values that do not import correctly when performing this read.table:

hs.industry <- read.table("https://download.bls.gov/pub/time.series/hs/hs.industry", header = TRUE, fill = TRUE, sep = "\t", quote = "", stringsAsFactors = FALSE)

Specifically, there are a few values where the industry_code and industry_name are joined as a single value in the industry_code column (not sure why). Given that each industry_code is 4 digits, my approach to split and correct them is:

for (i in 1:nrow(hs.industry)) {
  if (isTRUE(nchar(hs.industry$industry_code[i]) > 4)) {
    hs.industry$industry_name[i] <- gsub("[[:digit:]]","",hs.industry$industry_code[i])
    hs.industry$industry_code[i] <- gsub("[^0-9]", "",hs.industry$industry_code[i])
  }
}

I feel this is terribly inefficient, but I'm not sure what approach would be better.

Thanks!

The problem is that lines 29 and 30 of the file (rows 28 and 29, if we're not counting the header) have a formatting error: they use 4 spaces instead of a proper tab character. A bit of extra data cleaning is needed.
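One way to locate lines like these without reading the file by eye is to count the tab-separated fields on each line and flag any line whose count differs from the header's. The following is only a sketch on a toy stand-in for the raw file: the sample codes and the `raw_lines`, `n_fields`, and `bad_lines` names are made up for illustration.

```r
# Toy stand-in for the raw file: tab-separated, except that line 3 uses
# 4 spaces between the first two fields (sample values are invented).
raw_lines <- c(
  "industry_code\tindustry_name\tdisplay_level",
  "1011\tIron ores\t1",
  "1411    Dimension stone\t1",
  "1521\tGeneral contractors\t1"
)

# Count the tab-separated fields on every line
con <- textConnection(raw_lines)
n_fields <- count.fields(con, sep = "\t", quote = "")
close(con)

# Lines whose field count differs from the header's are suspect
bad_lines <- which(n_fields != n_fields[1])
bad_lines
```

On the real file, the same check could be run on the output of readLines for the URL.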

Use readLines to read in the raw text, correct the formatting error, and then read in the cleaned table:

# read in the file as a character vector, one element per line
hs.industry <- readLines('https://download.bls.gov/pub/time.series/hs/hs.industry')

# replace any run of 4 or more spaces with a tab character
# (' {4,}' matches literal spaces only; '\\W{4,}' would match any non-word characters)
hs.industry <- gsub(' {4,}', '\t', hs.industry)

# collapse the vector into one string, with lines separated by a newline character (\n)
hs.industry <- paste(hs.industry, collapse = '\n')

# read in the cleaned table
hs.industry <- read.table(text = hs.industry, sep = '\t', header = TRUE, quote = '')
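The same cleaning steps can be tried offline on a small sample before touching the real file. This is a minimal sketch, assuming the malformed rows use literal spaces as the separator; the sample codes and the `raw_lines`, `fixed_lines`, and `toy` names are hypothetical.

```r
# Toy sample: two rows use 4 spaces instead of a tab after the code
# (codes are invented for illustration).
raw_lines <- c(
  "industry_code\tindustry_name\tdisplay_level",
  "1011\tIron ores\t1",
  "1411    Dimension stone\t1",
  "1422    Crushed and broken stone\t1"
)

# Replace runs of 4 or more literal spaces with a single tab
fixed_lines <- gsub(" {4,}", "\t", raw_lines)

# read.table() accepts a character vector via `text`, one element per line
toy <- read.table(text = fixed_lines, sep = "\t", header = TRUE,
                  quote = "", stringsAsFactors = FALSE)
str(toy)
```

Because `text` accepts a character vector directly, the paste/collapse step is optional here; collapsing into a single string works as well.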

You should not have to loop through each row. Instead, identify only the problematic entries and apply gsub to just those:

replace_indx <- which(nchar(hs.industry$industry_code) > 4)
hs.industry$industry_name[replace_indx] <- gsub("\\d+\\s+", "", hs.industry$industry_code[replace_indx])
hs.industry$industry_code[replace_indx] <- gsub("\\D+", "", hs.industry$industry_code[replace_indx])

I also used "\\d+\\s+" to improve the string replacement, so that the spaces after the code are removed as well:

gsub("[[:digit:]]","",hs.industry$industry_code[replace_indx])
# [1] "    Dimension stone"          "    Crushed and broken stone"

gsub("\\d+\\s+", "", hs.industry$industry_code[replace_indx])
# [1] "Dimension stone"          "Crushed and broken stone"
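A possible variation, sketched here on a toy data frame, is to use one anchored pattern with capture groups for both columns, so that digits appearing inside an industry name would be left alone. It assumes, as the question states, that every code is exactly 4 digits; the sample codes and the `toy`, `bad`, and `pattern` names are made up for illustration.

```r
# Toy data frame: two rows have code and name joined in industry_code
# (codes are invented for illustration).
toy <- data.frame(
  industry_code = c("1011", "1411    Dimension stone",
                    "1422    Crushed and broken stone"),
  industry_name = c("Iron ores", "", ""),
  stringsAsFactors = FALSE
)

# Rows where the code column holds more than the 4-digit code
bad <- nchar(toy$industry_code) > 4

# Pattern: 4 leading digits, whitespace, then the rest of the string as the name
pattern <- "^(\\d{4})\\s+(.*)$"

# Fill in the name first, since it is derived from the still-joined code column
toy$industry_name[bad] <- sub(pattern, "\\2", toy$industry_code[bad])
toy$industry_code[bad] <- sub(pattern, "\\1", toy$industry_code[bad])
toy
```

Both assignments are vectorized over the flagged rows, so no explicit loop is needed.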
