
Convert a character vector to a data frame or matrix with a fixed number of columns

I am trying to extract the values from a website. The extracted values look like this.

"3000       ----      ----      ----      ----        '1    UNCH                     '1"                        
"4600       ----      ----      ----      ----        '1    UNCH                     '1"                        
"4800       ----      ----      ----      ----        '1    UNCH                     '1"                        
"5000       ----      ----      ----      ----        '1    UNCH                     '1                     300"
"5200       ----      ----      ----      ----        '1    UNCH                     '1"                        
"5400       ----      ----      ----      ----        '1    UNCH                     '1"                        
"5600       ----      ----      ----      ----        '1    UNCH                     '1                      10"
"5800       ----      ----      ----      ----        '1    UNCH                     '1                       1"
"6000       ----      ----      ----      ----        '1    UNCH                     '1                    5461"
"6200       ----      ----      ----      ----        '1    UNCH                     '1                      54"
"6400       ----      ----      ----      ----        '1    UNCH                     '1                    2009"
"6600       ----      ----      ----      ----        '1    UNCH                     '1                     124"
"6800       ----      ----      ----      ----        '1    UNCH                     '1                     410"
"7000       ----      ----      ----      ----        '1     -'1                     '2                   10704"
"7200       ----      ----        '2A     ----        '2     -'1                     '3                    9927"
"7400       ----      ----      ----      ----        '3    UNCH                     '3                    7869"
"7600       ----      ----      ----      ----        '4    UNCH                     '4          30       13596"
"7800       ----      ----      ----      ----        '5     -'1                     '6         109       16030"
"8000         '7        '7        '7        '7        '7     -'1         467        1'0         731       26912"
"8200        1'4       1'4       1'3      ----       1'2     -'2         119        1'4         222       11030"
"8400        2'2       2'2       2'0       2'0       1'7     -'4         426        2'3         172       15743"
"8600        3'1       3'3       2'7       3'0A      3'0     -'4          66        3'4         330       18964"

Some rows have fewer column values than others. I want to create a data frame with 11 columns, and the values that are blank should remain blank. When I split the values on whitespace, the values in the shorter rows end up in the wrong columns and get repeated. Here is the code I have tried.

  cc=gsub("\\s+"," ",df)
  cc=data.frame(cc)
  cc = data.frame(do.call('rbind', strsplit(as.character(cc),' ',fixed=TRUE)))

Update: the original question has changed.

It looks like your data is in fixed-width format. You can use ?read.fwf , though how well it works depends on how reliable your data source is. If the place you are getting your data from has a specification of how the data will always be formatted (e.g. "11 columns of width 10 characters each"), that would be helpful.

# pad out each line to the same length
maxlen <- max(sapply(df, nchar)) # it's 110 for your data, it seems
df <- sprintf(paste0("%-", maxlen, "s"), df)
read.fwf(textConnection(df),
         widths=c(4, 11, 10, 10, 11,  9,  8, 12, 11, 12, 12))

The widths I've picked are appropriate to the data you provided; you will have to determine sensible values for yourself based on what you expect.
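As a quick way to check a set of widths before running them on the real data, here is a minimal sketch on made-up lines. The toy lines and the widths c(4, 6, 6, 6) are assumptions for illustration only, not the layout of your data. It shows that a field left blank by padding comes back as NA in a numeric column.

```r
# Toy lines (hypothetical): 4 columns of widths 4, 6, 6, 6.
# The first line is missing its last column.
raw_lines <- c(sprintf("%4s%6s%6s",    "3000", "----", "'1"),
               sprintf("%4s%6s%6s%6s", "5000", "----", "'1", "300"),
               sprintf("%4s%6s%6s%6s", "8000", "'7",   "'7", "26912"))

# Pad every line to the same length, as in the code above
maxlen <- max(nchar(raw_lines))
padded <- sprintf(paste0("%-", maxlen, "s"), raw_lines)

con <- textConnection(padded)
out <- read.fwf(con, widths = c(4, 6, 6, 6),
                strip.white = TRUE,        # trim the padding in text columns
                stringsAsFactors = FALSE)  # keep text columns as character
close(con)

out
```

If a value shows up split across two columns, or two values land in one column, the widths need adjusting.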


You could just use indexing to put NAs in the empty spots, e.g. (1:9)[1:11] selects the 9 elements of 1:9 and then appends two NAs, padding the result out to 11 elements.

# assuming df is such that df[1] is the first line, df[2] is the second etc
tmp <- strsplit(df, '\\s+')
ncols <- max(sapply(tmp, length)) # could do max(lengths(tmp)) if you have a new
                                  # enough R. Or if you already know there are
                                  # at most 9 columns just set it to 9 directly
cc <- do.call('rbind', lapply(tmp, '[', i=seq_len(ncols)))
cc <- data.frame(cc)
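One thing to keep in mind with this approach: the padding only ever goes on the end of a row. A small sketch with made-up lines (the values are hypothetical) shows what happens when the blank fields are in the middle of a line rather than at the end.

```r
# Two toy lines: the second has blanks in the middle columns
lines <- c("8000  467  731  26912",
           "7000            10704")

tmp   <- strsplit(lines, "\\s+")
ncols <- max(lengths(tmp))

# Indexing past the end of a vector yields NA, so short rows are padded
cc <- do.call(rbind, lapply(tmp, `[`, seq_len(ncols)))
cc
```

Here the value 10704 ends up in the second column of its row instead of the fourth, because splitting on whitespace cannot tell which columns were blank. If blanks can occur anywhere but at the end, a position-based approach is safer.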

You could try using the constant positions of your columns: each column covers the characters start:end. If columns are missing at the end of a line, they are filled with NA. The variable "line" contains a single line of the extracted file.

start <- c(1,6,17, 27,37,47,57,65,77,88,100)
end   <- c(5,16,26,36,46,56,64,76,87,99,110)

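columns <- list()
for (j in seq_along(start)) {
    if (start[j] <= nchar(line)) {
        # column j lies (at least partly) within this line
        columns[[j]] <- substr(line, start[j], end[j])
    } else {
        # the line ends before column j starts: fill with NA
        columns[[j]] <- NA
    }
}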
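To go from a single line to a whole data frame, the same idea can be wrapped in a function and applied to every line. The following is a minimal sketch under some assumptions: split_fixed is a hypothetical helper name, and the three-column start/end positions and toy lines are made up for illustration. It uses the vectorised substring() in place of the loop and also turns fields that contain only spaces into NA.

```r
# Hypothetical 3-column layout: characters 1-5, 6-11 and 12-17
start <- c(1, 6, 12)
end   <- c(5, 11, 17)

# Hypothetical helper: cut one line at the given positions
split_fixed <- function(line, start, end) {
  fields <- trimws(substring(line, start, end))  # "" when the line is too short
  fields[!nzchar(fields)] <- NA_character_       # empty or blank fields become NA
  fields
}

# Toy lines; the first one has no third column
lines <- c(sprintf("%-5s%6s",    "3000", "'1"),
           sprintf("%-5s%6s%6s", "5000", "'1", "300"))

# vapply returns one column per line, so transpose to get one row per line
mat <- t(vapply(lines, split_fixed, character(length(start)),
                start = start, end = end, USE.NAMES = FALSE))
cc  <- data.frame(mat, stringsAsFactors = FALSE)
cc
```

All columns come back as character; convert the numeric ones afterwards (for example with as.numeric) once you know which they are.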
