Hi, I am trying to read a large data file into R. It is a tab-delimited file, but the first two columns each contain multiple pieces of data separated by a "|". The file looks like:
A|1 B|2 0.5 0.4
C|3 D|4 0.9 1
I only care about the first value in each of the first two columns, plus the third and fourth columns. In the end I want a vector for each line that looks like:
A B 0.5 0.4
I am using a connection to read in the file:
con <- file("inputfile.txt", open = "r")
lines <- readLines(con)
which gives me:
lines[1]
[1] "A|1\tB|2\t0.5\t0.4"
then I am using strsplit to split the tab delimited file:
linessplit <- strsplit(lines, split="\t")
which gives me:
linessplit[1]
[1] "A|1" "B|2"
[3] "0.5" "0.4"
When I try the following to split "A|1" into "A" "1":
line1 <- linessplit[1]
l1 <- strsplit(line1[1], split = "|")
I get:
"Error in strsplit(line1[1], split = "|") : non-character argument"
Does anyone have a way in which I can fix this? Thanks!
Since you provided an approach, I will explain the errors in your code, even though another approach might suit your problem better. Putting aside personal taste about code, the problems are:
First, you have to use line1[[1]] instead of line1[1]. strsplit returns a list, so linessplit[1] is a list, and indexing it with single brackets gives a list again. Double brackets extract the character vector that strsplit expects, which is why you get the "non-character argument" error.
Second, the split argument accepts regular expressions. If you supply | which is a metacharacter, it won't be read literally. You must escape it with \\| or (as suggested by @nongkrong) use the fixed = TRUE argument, which matches the string exactly as is (that is, without its meaning as a metacharacter). The final code is l1 <- strsplit(line1[[1]], split = "\\|")
As a final personal suggestion, you might consider an lapply solution:
lapply(linessplit, strsplit, split = "|", fixed = TRUE)
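The two fixes above can be put together in a short base R sketch. The sample data mirrors the lines in the question, and first_pieces is a hypothetical helper name introduced only for illustration; this is one possible way to get the desired vectors, not the only one.

```r
# Two sample lines in the same shape as the question's file (assumed data).
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1")

# strsplit() returns a list with one character vector per input line.
linessplit <- strsplit(lines, split = "\t", fixed = TRUE)

is.list(linessplit[1])         # single brackets keep the list wrapper
is.character(linessplit[[1]])  # double brackets extract the character vector

# Split every field of the first line on a literal "|".
strsplit(linessplit[[1]], split = "|", fixed = TRUE)

# Hypothetical helper: keep only the first piece of each field.
first_pieces <- function(fields) {
  pieces <- strsplit(fields, split = "|", fixed = TRUE)
  vapply(pieces, `[`, character(1), 1)
}

# Apply the helper to every line; each element is one vector per line.
result <- lapply(linessplit, first_pieces)
result[[1]]
```

Fields without a "|", such as "0.5", come back unchanged, so the same helper works for all four columns.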
Here is my solution to your original problem, which is to
split the lines
"A|1\tB|2\t0.5\t0.4"
"C|3\tD|4\t0.9\t1"
into
A B 0.5 0.4
C D 0.9 1
Below is my code:
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")
lines
library(reshape2)
linessplit <- colsplit(lines, pattern="\t", names=c(1:4))
linessplit
split_n_select <- function(x, sel=c(1), pat="\\|", nam=c(1:2)){
  tmp <- t(colsplit(x, pattern=pat, names=nam))
  tmp[sel,]
}
linessplit2 <- sapply(linessplit, split_n_select)
linessplit2
Let's break it down:
Read the original data into lines
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")
lines
Results:
[1] "A|1\tB|2\t0.5\t0.4" "C|3\tD|4\t0.9\t1" "E|5\tF|6\t0.7\t0.2"
Load the reshape2 library to get the colsplit function, then use it with the pattern "\t" to split lines into 4 columns named 1, 2, 3, 4.
library(reshape2)
linessplit <- colsplit(lines, pattern="\t", names=c(1,2,3,4))
linessplit
Results:
    1   2   3   4
1 A|1 B|2 0.5 0.4
2 C|3 D|4 0.9 1.0
3 E|5 F|6 0.7 0.2
Let's make a function that takes a set of fields, splits each one on "|", and selects the piece we want.
Pass the first row of linessplit to colsplit
tmp <- colsplit(linessplit[1,], pattern="\\|", names=c(1:2))
tmp
Results:
    1  2
1   A  1
2   B  2
3 0.5 NA
4 0.4 NA
Take the transpose
tmp <- t(colsplit(linessplit[1,], pattern="\\|", names=c(1:2)))
tmp
Results:
  [,1] [,2] [,3]  [,4]
1 "A"  "B"  "0.5" "0.4"
2 " 1" " 2" NA    NA
Select first row:
tmp[1,]
Results:
[1] "A" "B" "0.5" "0.4"
Turn the steps above into a function split_n_select:
split_n_select <- function(x, sel=c(1), pat="\\|", nam=c(1:2)){
  tmp <- t(colsplit(x, pattern=pat, names=nam))
  tmp[sel,]
}
Use sapply to apply split_n_select to each column of linessplit (sapply iterates over the columns of a data frame)
linessplit2 <- sapply(linessplit, split_n_select)
linessplit2
Results:
     1   2   3     4
[1,] "A" "B" "0.5" "0.4"
[2,] "C" "D" "0.9" "1"
[3,] "E" "F" "0.7" "0.2"
You can also select the second row, the values after the "|", by adding sel=c(2)
linessplit2 <- sapply(linessplit, split_n_select, sel=c(2))
linessplit2
Results:
     1   2   3  4
[1,] "1" "2" NA NA
[2,] "3" "4" NA NA
[3,] "5" "6" NA NA
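For comparison, the same result can be sketched in base R without reshape2. This is not the method used in this answer, only an alternative under the assumption that the data is tab-delimited with no header row; the name df and the choice of read.delim plus sub are illustrative.

```r
# Same sample lines as above.
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")

# Parse the tab-delimited text into a data frame.
# With a real file, pass the file name instead of text = lines.
df <- read.delim(text = lines, header = FALSE, stringsAsFactors = FALSE)

# In the first two columns, drop everything from the first "|" onward.
df[1:2] <- lapply(df[1:2], function(col) sub("\\|.*$", "", col))
df
```

Here the third and fourth columns stay numeric, whereas the sapply approach returns a character matrix.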
Change
line1 <- linessplit[1]
l1 <- strsplit(line1[1], split = "|")
to
line1 <- linessplit[1]
l1 <- strsplit(line1[[1]], split = "[|]") # square brackets around |, and [[1]] to extract the character vector from the list
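As a quick check, the three ways of matching a literal "|" mentioned in this thread should give the same result. The sketch below uses an assumed sample vector x, and the variable names are illustrative.

```r
# Fields of one line, as produced by splitting on tabs (assumed sample).
x <- c("A|1", "B|2", "0.5", "0.4")

by_class  <- strsplit(x, split = "[|]")              # character class
by_escape <- strsplit(x, split = "\\|")              # escaped metacharacter
by_fixed  <- strsplit(x, split = "|", fixed = TRUE)  # literal match, no regex

# All three variants agree.
all_equal <- identical(by_class, by_escape) && identical(by_escape, by_fixed)
all_equal

# For contrast: an unescaped "|" is treated as a regular expression,
# so the string is not split at the pipe alone.
strsplit("A|1", split = "|")[[1]]
```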