The best way to mark (split?) dataset in each string

Question

I have a dataset containing 485k strings (1.1 GB). Each string is about 700 characters long and holds about 250 variables (1-16 characters per variable), but it has no delimiters. The length of each variable is known. What is the best way to modify the data so that the variables are separated by a comma (,)?

For example, I have strings like:

0123456789012...
1234567890123...

and an array of lengths: 5,3,1,4,... Then I should get something like this:

01234,567,8,9012,...
12345,678,9,0123,...

Could anyone help me with this? Python or R tools are preferred.

Answer 1

pandas can load this using read_fwf:

In [321]:

import io
import pandas as pd

t="""0123456789012..."""
pd.read_fwf(io.StringIO(t), widths=[5,3,1,4], header=None)
Out[321]:
      0    1  2     3
0  1234  567  8  9012

This gives you a DataFrame in which each field is a separate column that you can access individually.
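
The output above drops the leading zero of 01234 because read_fwf infers numeric column types. A minimal sketch of keeping every field as text and writing the comma-separated result follows. It assumes the two sample lines from the question, truncated to the 13 characters covered by the widths; the `sample` string and the `to_csv` step are illustrative additions, not part of the original answer.

```python
import io

import pandas as pd

# Sample rows from the question, truncated to the 13 characters covered by the widths.
sample = "0123456789012\n1234567890123\n"
widths = [5, 3, 1, 4]

# dtype=str keeps every field as text, so "01234" is not turned into 1234.
df = pd.read_fwf(io.StringIO(sample), widths=widths, header=None, dtype=str)

# Write the rows back out with commas between the fields.
csv_text = df.to_csv(header=False, index=False)
print(csv_text)
```

With a real file, the path would replace the `io.StringIO` object, and `to_csv` could be given an output path instead of returning a string.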

Answer 2

In R read.fwf would work:

# inputs
x <- c("0123456789012...", "1234567890123... ")
widths <- c(5,3,1,4)

read.fwf(textConnection(x), widths, colClasses = "character")

giving:

     V1  V2 V3   V4
1 01234 567  8 9012
2 12345 678  9 0123

If numeric rather than character columns were desired then drop the colClasses argument.

Answer 3

Try this in R:

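x <- "0123456789012"
y <- c(5,3,1,4)

# start positions: 1, then one past each cumulative end (dropping the last)
pieces <- substring(x, c(1, cumsum(y)[-length(y)] + 1), cumsum(y))
# collapse (not sep) joins the elements of a single vector
output <- paste(pieces, collapse = ",")
# [1] "01234,567,8,9012"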

Answer 4

One option in R is

indx1 <- c(1, cumsum(len)[-length(len)]+1)
indx2 <- cumsum(len)
toString(vapply(seq_along(len), function(i)
         substr(str1, indx1[i], indx2[i]), character(1)))
#[1] "01234, 567, 8, 9012"

data

str1 <- '0123456789012'
len <- c(5,3,1,4)
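
For comparison, the same cumulative-offset idea can be sketched in plain Python, the asker's other preferred language. This is only an illustration under assumptions: the helper name `split_fixed` is hypothetical, and every line is assumed to be at least as long as the sum of the widths.

```python
from itertools import accumulate


def split_fixed(line, widths, sep=","):
    """Join the fixed-width fields of `line` with `sep` (hypothetical helper)."""
    ends = list(accumulate(widths))  # cumulative end offsets, e.g. 5, 8, 9, 13
    starts = [0] + ends[:-1]         # each field starts where the previous one ended
    return sep.join(line[s:e] for s, e in zip(starts, ends))


widths = [5, 3, 1, 4]
for line in ["0123456789012", "1234567890123"]:
    print(split_fixed(line, widths))
```

Because each line is handled independently, a loop like this could read the 1.1 GB file line by line and write each converted line out immediately, rather than holding the whole dataset in memory.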

4 answers

solution1
1 ACCEPTED 2015-04-22 14:05:31

solution2
1 2015-04-22 14:18:40

solution3
1 2015-04-22 14:23:22

solution4
0 2015-04-22 14:18:29
