How to read a text file whose variables are not stored on the same row, and that lacks a standard delimiter from column to column, into R?
I am trying to read a text file ( https://www.bls.gov/bdm/us_age_naics_00_table5.txt ) into R, but I am not sure how to go about parsing it. As you can see, the column names (years) are not all located on the same row, and the spacing between data is not consistent from column to column. I am familiar with using read.csv() and read.delim(), but I'm not sure how to go about reading a complex file like this one.
Here is a manual parse:
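library(readr)
# Read the file line by line
string = read_lines(file = "https://www.bls.gov/bdm/us_age_naics_00_table5.txt")
string = string[nchar(string) != 0]  # drop empty lines
string = string[-c(1, 2)]            # the first two lines don't contain information
string = string[string != " "]       # drop lines holding only a single space
string = string[-151]                # footnote
# Put each block of 30 lines (one set of year columns) in its own matrix column
sMatrix = matrix(string, nrow = 30)
# Parse each block as a whitespace-separated table. Collapsing the block into one
# newline-separated string makes readr treat it as literal data, not as file paths.
dfList = lapply(1:ncol(sMatrix), function(x) read_table(paste(sMatrix[, x], collapse = "\n")))
df = do.call(cbind, dfList)
df = df[, !duplicated(colnames(df))] # removes columns with duplicate names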
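The block-splitting idea can be tried without downloading anything. The following is a minimal sketch on made-up sample lines, not the BLS file itself: the data, the `industry` column name, and `block_size` are all assumptions for illustration, and base R's `read.table()` is swapped in for `readr::read_table()` so the example has no package dependency.

```r
# Made-up sample: two stacked blocks of 3 lines each (header + 2 data rows).
# Each block repeats the "industry" column and adds new year columns.
lines <- c(
  "industry   2000    2001",
  "Mining     1,200   1,350",
  "Retail     _       8,410",
  "industry   2002    2003",
  "Mining     1,410   1,390",
  "Retail     8,500   8,620"
)

block_size <- 3  # assumed number of lines per block in this toy example

# One matrix column per block, mirroring sMatrix in the answer
blocks <- matrix(lines, nrow = block_size)

# Parse each block as a whitespace-separated table, keeping every value as text
parsed <- lapply(seq_len(ncol(blocks)), function(j) {
  read.table(text = blocks[, j], header = TRUE,
             colClasses = "character", check.names = FALSE)
})

# Bind the blocks side by side and drop the repeated "industry" column
wide <- do.call(cbind, parsed)
wide <- wide[, !duplicated(colnames(wide))]
print(wide)
```

For the real file, the block length and the lines to discard would have to be checked against the downloaded text, as the answer does by hand.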
If you then want to recode "_" as NA and format the numbers:
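df[df == "_"] = NA
# Strip thousands separators; keep the columns as character for now
df = as.data.frame(sapply(df, function(x) gsub(",", "", x)), stringsAsFactors = FALSE)
# TRUE for columns that convert to numeric without producing any new NAs (column 1, for example, does not)
i <- sapply(df, function(x) !any(is.na(suppressWarnings(as.numeric(na.omit(x))))))
df[i] = lapply(df[i], as.numeric)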
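To see what the cleanup step does in isolation, here is a small sketch on a made-up data frame. The name `toy` and its columns are hypothetical stand-ins for the parsed table, and the logic is one possible way to write the same conversion, not necessarily the only one.

```r
# Made-up data frame standing in for the parsed table (all columns are text)
toy <- data.frame(
  industry = c("Mining", "Retail"),
  y2000    = c("1,200", "_"),
  y2001    = c("1,350", "8,410"),
  stringsAsFactors = FALSE
)

toy[toy == "_"] <- NA                               # recode the placeholder as NA
toy[] <- lapply(toy, function(x) gsub(",", "", x))  # strip thousands separators

# A column is treated as numeric if converting its non-missing values yields no NA
is_num <- vapply(toy, function(x) {
  !anyNA(suppressWarnings(as.numeric(x[!is.na(x)])))
}, logical(1))

toy[is_num] <- lapply(toy[is_num], as.numeric)
str(toy)
```

The label column stays as character because "Mining" and "Retail" cannot be converted, while the year columns become numeric with the former "_" entry left as NA.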