I am trying to extract the values from a website. The extracted values look like this.
"3000 ---- ---- ---- ---- '1 UNCH '1"
"4600 ---- ---- ---- ---- '1 UNCH '1"
"4800 ---- ---- ---- ---- '1 UNCH '1"
"5000 ---- ---- ---- ---- '1 UNCH '1 300"
"5200 ---- ---- ---- ---- '1 UNCH '1"
"5400 ---- ---- ---- ---- '1 UNCH '1"
"5600 ---- ---- ---- ---- '1 UNCH '1 10"
"5800 ---- ---- ---- ---- '1 UNCH '1 1"
"6000 ---- ---- ---- ---- '1 UNCH '1 5461"
"6200 ---- ---- ---- ---- '1 UNCH '1 54"
"6400 ---- ---- ---- ---- '1 UNCH '1 2009"
"6600 ---- ---- ---- ---- '1 UNCH '1 124"
"6800 ---- ---- ---- ---- '1 UNCH '1 410"
"7000 ---- ---- ---- ---- '1 -'1 '2 10704"
"7200 ---- ---- '2A ---- '2 -'1 '3 9927"
"7400 ---- ---- ---- ---- '3 UNCH '3 7869"
"7600 ---- ---- ---- ---- '4 UNCH '4 30 13596"
"7800 ---- ---- ---- ---- '5 -'1 '6 109 16030"
"8000 '7 '7 '7 '7 '7 -'1 467 1'0 731 26912"
"8200 1'4 1'4 1'3 ---- 1'2 -'2 119 1'4 222 11030"
"8400 2'2 2'2 2'0 2'0 1'7 -'4 426 2'3 172 15743"
"8600 3'1 3'3 2'7 3'0A 3'0 -'4 66 3'4 330 18964"
Some rows have fewer column values than others. I want to create a data frame with 11 columns in which the missing values remain blank. When I split the values on whitespace, the rows with fewer values end up misaligned and their values get repeated. Here is the code I have tried.
cc=gsub("\\s+"," ",df)
cc=data.frame(cc)
cc = data.frame(do.call('rbind', strsplit(as.character(cc),' ',fixed=TRUE)))
Update, original question has changed.
It looks like your data is in fixed-width format. You can use read.fwf (see ?read.fwf), though its use depends somewhat on how reliable your data source is. If the place you are getting your data from had a specification as to how the data would always be formatted (e.g. "11 columns of width 10 characters each"), that would be helpful.
# pad out each line to the same length
maxlen <- max(sapply(df, nchar)) # it's 110 for your data, it seems
df <- sprintf(paste0("%-", maxlen, "s"), df)
read.fwf(textConnection(df),
widths=c(4, 11, 10, 10, 11, 9, 8, 12, 11, 12, 12))
The widths I've picked are appropriate to the data you provided; you will have to determine sensible values for yourself based on what you expect.
You could just use indexing to put NAs in the empty spots, e.g. (1:9)[1:11] will select the first 9 elements (being 1:9) and then put two NA on the end to pad it out to 11 elements long.
# assuming df is such that df[1] is the first line, df[2] is the second etc
tmp <- strsplit(df, '\\s+')
ncols <- max(sapply(tmp, length)) # could do max(lengths(tmp)) if you have a new
# enough R. Or if you already know there are
# at most 9 columns just set it to 9 directly
cc <- do.call('rbind', lapply(tmp, '[', i=seq_len(ncols)))
cc <- data.frame(cc)
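As a quick check of the padding idea, here is a minimal sketch run on three of the lines from the question. It assumes the input is a character vector with one line per element, with whitespace as shown in the post. Note that this approach only pads at the end of a row, so a value that belongs in a later column shifts left whenever an earlier column is blank.

```r
# Three sample lines copied from the question
lines <- c("3000 ---- ---- ---- ---- '1 UNCH '1",
           "5000 ---- ---- ---- ---- '1 UNCH '1 300",
           "8000 '7 '7 '7 '7 '7 -'1 467 1'0 731 26912")

tmp   <- strsplit(lines, "\\s+")
ncols <- max(lengths(tmp))   # lengths() needs R >= 3.2.0
# Indexing past the end of a vector yields NA, which pads the short rows
cc <- data.frame(do.call(rbind, lapply(tmp, `[`, seq_len(ncols))),
                 stringsAsFactors = FALSE)
```

In the second row the value 300 lands in column 9 simply because it is the ninth token, whether or not that is the column it came from on the website. If the position matters, a fixed-width approach is probably the safer choice.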
You could use the constant column positions: each column covers the characters start:end. If columns are missing at the end of a line, those columns are filled with NA. The variable "line" holds a single line of the extracted file.
start <- c(1,6,17, 27,37,47,57,65,77,88,100)
end <- c(5,16,26,36,46,56,64,76,87,99,110)
columns <- list()
for(j in 1:length(start)){
if(start[j] <= nchar(line)){
columns[[j]] <- substr(line, start[j],end[j])
}
else{
columns[[j]] <- NA # line is too short to contain this column
}
}
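The loop above handles a single line. To build the whole data frame, the same idea can be wrapped in a function and applied to every line. The following is only a sketch: split_fixed is a hypothetical helper name, the two input lines are synthetic (made up to match the column positions, not real data), and it treats all-blank fields as NA.

```r
# Column boundaries taken from the answer above
start <- c(1, 6, 17, 27, 37, 47, 57, 65, 77, 88, 100)
end   <- c(5, 16, 26, 36, 46, 56, 64, 76, 87, 99, 110)

# split_fixed is a hypothetical helper: cut one line into fixed-width fields
split_fixed <- function(line, start, end) {
  fields <- trimws(substring(line, start, end))  # "" where the line is too short
  fields[fields == ""] <- NA                     # treat blank fields as missing
  fields
}

# Synthetic fixed-width lines, for illustration only
lines <- c(sprintf("%-5s%-11s", "3000", "----"),
           sprintf("%-5s%-11s%-10s", "8000", "'7", "'7"))
cc <- data.frame(do.call(rbind, lapply(lines, split_fixed, start, end)),
                 stringsAsFactors = FALSE)
```

Because substring() returns an empty string for positions past the end of the input, blank fields in the middle of a line and missing fields at the end are handled the same way.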