I have a dataset containing 485k strings (1.1 GB). Each string is about 700 characters long and holds about 250 variables (1-16 characters per variable), but there are no separators between the variables. The length of each variable is known. What is the best way to split the data and mark the field boundaries with a comma (,)?
For example: I have strings like:
0123456789012...
1234567890123...
and an array of lengths: 5,3,1,4,...
then I should get this:
01234,567,8,9012,...
12345,678,9,0123,...
Could anyone help me with this? Python or R tools are preferred.
In R, read.fwf would work:
# inputs
x <- c("0123456789012...", "1234567890123...")
widths <- c(5,3,1,4)
read.fwf(textConnection(x), widths, colClasses = "character")
giving:
     V1  V2 V3   V4
1 01234 567  8 9012
2 12345 678  9 0123
If numeric rather than character columns are desired, drop the colClasses argument (note that leading zeros, as in 01234, are then lost).
Try this in R:
x <- "0123456789012"
y <- c(5,3,1,4)
# field ends are the cumulative widths; each start is the previous end + 1
ends <- cumsum(y)
starts <- c(1, head(ends, -1) + 1)
output <- paste(substring(x, starts, ends), collapse = ",")
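Since the question also mentions Python, here is a minimal sketch of the same cumulative-offset idea using only the standard library. The function name split_fixed is made up for illustration and does not come from any of the answers.

```python
from itertools import accumulate

def split_fixed(line, widths):
    """Cut `line` into consecutive fields of the given widths (hypothetical helper)."""
    ends = list(accumulate(widths))   # end offset of each field: 5, 8, 9, 13
    starts = [0] + ends[:-1]          # start offset of each field: 0, 5, 8, 9
    return [line[s:e] for s, e in zip(starts, ends)]

fields = split_fixed("0123456789012", [5, 3, 1, 4])
joined = ",".join(fields)
print(joined)  # 01234,567,8,9012
```

Python slices are zero-based and exclude the end offset, so the starts are simply the previous ends, with no +1 adjustment as in R.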
One option in R is
str1 <- '0123456789012'
len <- c(5,3,1,4)
indx1 <- c(1, cumsum(len)[-length(len)]+1)
indx2 <- cumsum(len)
toString(vapply(seq_along(len), function(i)
    substr(str1, indx1[i], indx2[i]), character(1)))
#[1] "01234, 567, 8, 9012"
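For the full 1.1 GB dataset it may be preferable to stream the data line by line instead of holding everything in memory. The following Python sketch assumes one record per line. The name mark_lines and the in-memory StringIO objects are stand-ins for illustration, not part of the original answers.

```python
import io
from itertools import accumulate

def mark_lines(src, dst, widths, sep=","):
    """Read fixed-width records from `src` and write separated records to `dst`."""
    ends = list(accumulate(widths))
    bounds = list(zip([0] + ends[:-1], ends))  # (start, end) pair per field
    for line in src:                           # one record at a time, constant memory
        line = line.rstrip("\r\n")
        dst.write(sep.join(line[s:e] for s, e in bounds) + "\n")

# Small in-memory demonstration; real use would pass open file objects instead.
src = io.StringIO("0123456789012\n1234567890123\n")
dst = io.StringIO()
mark_lines(src, dst, [5, 3, 1, 4])
result = dst.getvalue()
print(result, end="")
```

Characters beyond the sum of the widths are dropped silently, so the widths should add up to the full record length. If a field could itself contain the separator, the csv module would be a safer way to write the output.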