
Splitting a Large Data File in R using Strsplit and R Connection

Hi, I am trying to read a large data file into R. It is a tab-delimited file, but the first two columns each hold multiple pieces of data separated by a "|". The file looks like:

A|1   B|2   0.5  0.4
C|3   D|4   0.9  1

I only care about the first value in each of the first two columns, plus the third and fourth columns. In the end I want one vector per line that looks like:

A  B  0.5  0.4

I am using a connection to read in the file:

con <- file("inputfile.txt", open = "r")
lines <- readLines(con)

which gives me:

lines[1]
[1] "A|1\tB|2\t0.5\t0.4"

Then I use strsplit to split each line on the tabs:

linessplit <- strsplit(lines, split="\t")

which gives me:

linessplit[1]
[[1]]
[1] "A|1" "B|2" "0.5" "0.4"

When I try the following to split "A|1" into "A" "1":

line1 <- linessplit[1]
l1 <- strsplit(line1[1], split = "|")

I get:

"Error in strsplit(line1[1], split = "|") : non-character argument"

Does anyone have a way in which I can fix this? Thanks!

Since you provided an approach, I will explain the errors in your code, even though a different approach may suit your problem better. Setting aside personal taste about code, the problems are:

  1. You have to extract the first element of the list with double brackets: line1[[1]]. Single brackets (line1[1]) return a list of length one, which strsplit rejects as a non-character argument.
  2. The split argument accepts regular expressions. | is a metacharacter, so if you supply it unescaped it is not read literally. You must escape it as "\\|" or (as suggested by @nongkrong) use the fixed = TRUE argument, which matches the string exactly as written, without its metacharacter meaning.

The final code is l1 <- strsplit(line1[[1]], split = "\\|")
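To make the two problems concrete, here is a minimal sketch using one sample line from the question. The names wrapped, fields, msg, naive and fixed_up are only illustrative and do not appear in the original code.

```r
# One sample line from the question, split on tabs as in the original code.
lines <- "A|1\tB|2\t0.5\t0.4"
linessplit <- strsplit(lines, split = "\t")

wrapped <- linessplit[1]    # single brackets: still a list of length 1
fields  <- linessplit[[1]]  # double brackets: character vector of four fields

# Passing the list to strsplit() raises the error from the question;
# tryCatch() captures the message instead of stopping the script.
msg <- tryCatch(strsplit(wrapped, split = "|"),
                error = function(e) conditionMessage(e))

# With a character vector the call runs, but the unescaped | is read as a
# regex alternation of two empty patterns, so every character is separated.
naive <- strsplit(fields[1], split = "|")[[1]]

# Escaping the pipe splits on the literal character instead.
fixed_up <- strsplit(fields[1], split = "\\|")[[1]]
```

Printing naive and fixed_up side by side shows why both the double brackets and the escaping are needed.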

As a final personal suggestion, you might consider an lapply solution, which splits every field of every line at once:

lapply(linessplit, strsplit, split = "|", fixed = TRUE)
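The lapply call above returns a nested list, so one more step is needed to get the vectors asked for in the question. A possible sketch follows; first_piece is a hypothetical helper introduced here, not part of the original answer, and it assumes every line has exactly four tab-separated fields.

```r
# Sample lines from the question; in practice these would come from readLines().
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1")
linessplit <- strsplit(lines, split = "\t", fixed = TRUE)

# Hypothetical helper: return the text before the first | in a single field.
first_piece <- function(field) {
  strsplit(field, split = "|", fixed = TRUE)[[1]][1]
}

# For each line keep the first piece of fields 1 and 2, and fields 3 and 4 as-is.
result <- lapply(linessplit, function(fields) {
  c(first_piece(fields[1]), first_piece(fields[2]), fields[3], fields[4])
})
```

Each element of result is a character vector such as "A" "B" "0.5" "0.4"; wrap the numeric fields in as.numeric() if numbers are needed.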

Here is my solution to your original problem, that is, splitting the lines

"A|1\tB|2\t0.5\t0.4"
"C|3\tD|4\t0.9\t1"

into

A  B  0.5  0.4
C  D  0.9  1

Below is my code:

lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")
lines

library(reshape2)
linessplit <- colsplit(lines, pattern="\t", names=c(1:4))
linessplit

split_n_select <- function(x, sel=c(1), pat="\\|", nam=c(1:2)){
  tmp <- t(colsplit(x, pattern=pat, names=nam))
  tmp[sel,]
}

linessplit2 <- sapply(linessplit, split_n_select)
linessplit2

Let's break it down:

  1. Read original data into lines

     lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")
     lines

    Results:

     [1] "A|1\tB|2\t0.5\t0.4" "C|3\tD|4\t0.9\t1" "E|5\tF|6\t0.7\t0.2"
  2. Load the reshape2 library to get the colsplit function, then use it with pattern "\t" to split lines into 4 columns named 1, 2, 3, 4.

     library(reshape2)
     linessplit <- colsplit(lines, pattern="\t", names=c(1,2,3,4))
     linessplit

    Results:

         1   2   3   4
     1 A|1 B|2 0.5 0.4
     2 C|3 D|4 0.9 1.0
     3 E|5 F|6 0.7 0.2
  3. Let's build a function that takes a vector of fields, splits each field on "|", and selects the piece we want.

    Pass the first row of linessplit to colsplit

     tmp <- colsplit(linessplit[1,], pattern="\\|", names=c(1:2))
     tmp

    Results:

         1  2
     1   A  1
     2   B  2
     3 0.5 NA
     4 0.4 NA

    Take the transpose

     tmp <- t(colsplit(linessplit[1,], pattern="\\|", names=c(1:2)))
     tmp

    Results:

       [,1] [,2] [,3]  [,4]
     1 "A"  "B"  "0.5" "0.4"
     2 " 1" " 2" NA    NA

    Select the first row:

     tmp[1,]

    Results:

     [1] "A" "B" "0.5" "0.4"

    Wrap the steps above in a function split_n_select:

     split_n_select <- function(x, sel=c(1), pat="\\|", nam=c(1:2)){
       tmp <- t(colsplit(x, pattern=pat, names=nam))
       tmp[sel,]
     }
  4. Use sapply to apply split_n_select to linessplit. Note that sapply iterates over the columns of a data frame, so the function receives one column at a time.

     linessplit2 <- sapply(linessplit, split_n_select)
     linessplit2

    Results:

          1   2   3     4
     [1,] "A" "B" "0.5" "0.4"
     [2,] "C" "D" "0.9" "1"
     [3,] "E" "F" "0.7" "0.2"
  5. You can also select the second piece (the value after "|") by adding sel=c(2)

     linessplit2 <- sapply(linessplit, split_n_select, sel=c(2))
     linessplit2

    Results:

          1   2   3  4
     [1,] "1" "2" NA NA
     [2,] "3" "4" NA NA
     [3,] "5" "6" NA NA
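If you would rather not depend on reshape2, a similar result can be sketched in base R. This is an alternative technique, not the method used above: it assumes the file is plain tab-separated text with no header row, and text = lines stands in for the real file path. V1 to V4 are the default column names that read.delim assigns.

```r
# Sample data from the answer above.
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")

# Parse the tab-separated text into a data frame with columns V1..V4.
dat <- read.delim(text = lines, header = FALSE, stringsAsFactors = FALSE)

# Delete the first | and everything after it in the first two columns.
dat$V1 <- sub("\\|.*$", "", dat$V1)
dat$V2 <- sub("\\|.*$", "", dat$V2)
```

With this approach the third and fourth columns stay numeric rather than being converted to character.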

Change

line1 <- linessplit[1]
l1 <- strsplit(line1[1], split = "|")

to

line1 <- linessplit[[1]] # double brackets return the character vector instead of a list
l1 <- strsplit(line1[1], split = "[|]") # I added square brackets so | is matched literally
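As a quick check, the bracket form should behave the same as the escaped and fixed forms mentioned in the other answer. The sketch below compares the three on the fields of the first sample line; the variable names are only illustrative.

```r
# Fields of the first sample line, already split on tabs.
fields <- c("A|1", "B|2", "0.5", "0.4")

by_class  <- strsplit(fields, split = "[|]")               # character class
by_escape <- strsplit(fields, split = "\\|")               # escaped metacharacter
by_fixed  <- strsplit(fields, split = "|", fixed = TRUE)   # literal match, no regex

# All three produce the same list of split fields.
same <- identical(by_class, by_escape) && identical(by_class, by_fixed)
```

Fields without a pipe, such as "0.5", come back unchanged as length-one vectors.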
