How to split/parse long strings into tabular data with R data.table/data.frame?

I have an R data.table with a column of strangely formatted data which I need to parse. For each row, there is a column identity which is in the following format:

identity
cat:211:93|dog:616:58|bird:1270:46|fish:2068:31|horse:614:1|cow:3719:1012

Each entry has the format name:total_number:count_number, and entries are separated by |

An example of the data.table is as follows:

library(data.table)

foo = data.table(name = c('Luna', 'Bob', 'Melissa'), 
    number = c(23, 37, 33), 
    identity = c('cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12', 'bird:1270:35|fish:2068:11|horse:614:44|cow:319:21', 'fish:72:41'))

print(foo)
name        number    identity
'Luna'      23        cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12
'Bob'       37        bird:1270:35|fish:2068:11|horse:614:44|cow:319:21
'Melissa'   33        fish:72:41

My problem is how to parse these lines such that each animal name in identity becomes a new column, and the value in that column is the fraction count_number/total_number.

The correct format is as follows:

name        number    cat        dog        bird        fish         horse        cow
'Luna'      23        0.2990354  0.1124031  0.02026432  0.02444795   0.001945525  0.03761755
'Bob'       37        NA         NA         0.02755906  0.005319149  0.07166124   0.06583072
'Melissa'   33        NA         NA         NA          0.5694444    NA           NA

How could I parse these rows, given I know the 'names' of the columns beforehand?

I think there should be some way to use data.table::tstrsplit(), e.g.

tstrsplit(foo$identity, "|", fixed=TRUE)

(I'm happy to use a data.frame or dplyr as well.)

You can probably split by |, melt, then split by : again before calculating ratio and reshaping to your desired format.

library(data.table)
#step 4: reshape into desired wide format
dcast(
    #step 1: split by | and get the elements into a column
    foo[, melt(tstrsplit(identity, "\\|")), by=.(name, number)][,
        #step 2: split by : to get count_number and total_number
        tstrsplit(value, ":"), by=.(name, number)][,
            #step 3: calculate ratio
            ratio := as.numeric(V3) / as.numeric(V2)],
    name + number ~ V1, value.var="ratio")

output:

      name number       bird       cat        cow       dog        fish       horse
1:     Bob     37 0.02755906        NA 0.06583072        NA 0.005319149 0.071661238
2:    Luna     23 0.02026432 0.2990354 0.03761755 0.1124031 0.024447950 0.001945525
3: Melissa     33         NA        NA         NA        NA 0.569444444          NA
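
Since the question says the column names are known beforehand, here is a minimal sketch of a variant that uses unlist() instead of melt() and then restores the row and column order from the question. The names animals, long, wide, item, animal, total and count are only illustrative and do not come from the original post.

```r
library(data.table)

foo <- data.table(
  name = c("Luna", "Bob", "Melissa"),
  number = c(23, 37, 33),
  identity = c(
    "cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12",
    "bird:1270:35|fish:2068:11|horse:614:44|cow:319:21",
    "fish:72:41"
  )
)

# Column names known beforehand (assumed to match the question's desired output)
animals <- c("cat", "dog", "bird", "fish", "horse", "cow")

# One row per "animal:total:count" item
long <- foo[, .(item = unlist(strsplit(identity, "|", fixed = TRUE))),
            by = .(name, number)]

# Split each item into its three fields and compute count / total
long[, c("animal", "total", "count") := tstrsplit(item, ":", fixed = TRUE)]
long[, ratio := as.numeric(count) / as.numeric(total)]

# Reshape to wide; dcast sorts rows by name and columns alphabetically
wide <- dcast(long, name + number ~ animal, value.var = "ratio")

# Restore the row order of foo and the known column order
wide <- wide[foo[, .(name, number)], on = .(name, number)]
setcolorder(wide, c("name", "number", animals))

print(wide)
```

Note that setcolorder() as written assumes every name in animals appears at least once in the data; an animal that never occurs would have to be added as an NA column first.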

Addressing OP's comment in a more general way: You have to design a solution to your problem first before coding. Picture in your mind what kind of output you are expecting in each step of your solution. Then let the console be your TA and documentation be your lecturer.

For example, in the first step of your solution you split by |, so you run the following in the console:

foo[, tstrsplit(identity, "|", fixed=TRUE)]

What are you expecting? What do you see? Are name and number missing? Add them with by=.

foo[, tstrsplit(identity, "|", fixed=TRUE), by=.(name, number)]

Then, what do you get? An error? Can you fix it? Maybe read the documentation again? If you still cannot solve it, maybe search for it online? Remember what you are trying to achieve with this step: how do you get the pieces into a single column? Maybe you find something like the following:

foo[, unlist(tstrsplit(identity, "|", fixed=TRUE)), by=.(name, number)]

Then, move on to the next step.
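
As a rough illustration of what "the next step" could look like in the console, the sketch below takes the single-column result from the previous call and checks the split by : and the ratio on it. The names step1, animal, total and count are made up for this example; V1 is the default name data.table gives to the unnamed result column.

```r
library(data.table)

foo <- data.table(
  name = c("Luna", "Bob", "Melissa"),
  number = c(23, 37, 33),
  identity = c(
    "cat:311:93|dog:516:58|bird:2270:46|fish:1268:31|horse:514:1|cow:319:12",
    "bird:1270:35|fish:2068:11|horse:614:44|cow:319:21",
    "fish:72:41"
  )
)

# Result of the previous step: one row per item, in a column named V1
step1 <- foo[, unlist(tstrsplit(identity, "|", fixed = TRUE)), by = .(name, number)]
print(step1)

# Next step: split V1 by ":" into three columns;
# type.convert = TRUE turns the two number fields into integers
step1[, c("animal", "total", "count") :=
        tstrsplit(V1, ":", fixed = TRUE, type.convert = TRUE)]
str(step1)

# Step after that: compute the ratio and inspect it before reshaping
step1[, ratio := count / total]
print(step1[, .(name, number, animal, ratio)])
```

Inspecting each intermediate table this way makes it easier to see where a step goes wrong before chaining everything into one expression.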
