Extracting unique value sequences from DF column with R

I have the following data frame:

Col1 Col2
1    A
1    B
1    C
2    A
2    B
2    C
3    D
3    B
3    C
3    F
4    A
4    B
4    C

I'd like to extract the unique sequence vectors (bus line stop sequences) from Col2 (the actual stops of a particular bus route), where each sequence is defined by Col1 (the respective bus route IDs), in R. Multiple occurrences of identical sequences are unimportant. So the desired outputs are:

A, B, C (in cases of Col1=1, 2 and 4) and D, B, C, F (in case of Col1=3)

You could split up the vector of bus stops according to the vector of route IDs. This will return a list of character vectors, on which you can call unique to remove the duplicated vectors (keeping the first occurrence).

Calling toString on each of these vectors through sapply will then convert the list of vectors to a vector of comma-separated strings.

res <- sapply(unique(split(df$Col2, df$Col1)), toString)
print(res)
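To make the steps above easier to follow, here is a minimal sketch that rebuilds the data frame from the question and runs the same split / unique / toString chain one step at a time. The intermediate names (by_route, distinct, keys, shared) are only illustrative, and the last two lines are an optional extra, not part of the one-liner above, that reports which route IDs share each sequence.

```r
# Rebuild the data frame from the question (Col2 kept as character)
df <- data.frame(
  Col1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  Col2 = c("A", "B", "C", "A", "B", "C", "D", "B", "C", "F", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# One character vector of stops per route ID (a named list)
by_route <- split(df$Col2, df$Col1)

# Drop repeated vectors, keeping the first occurrence of each
distinct <- unique(by_route)

# Collapse each remaining vector into a comma-separated string
res <- sapply(distinct, toString)
print(res)  # holds "A, B, C" and "D, B, C, F"

# Optional: group the route IDs by the sequence they follow
keys <- sapply(by_route, toString)
shared <- split(names(keys), keys)
print(shared)
```

Note that unique() compares whole vectors, so two routes only count as the same sequence if they visit the same stops in the same order.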

From your question I understand that you want the unique sequences for each Col1 ID. To test this I changed your data a bit so that the same ID appears in several separate runs (and I used the data.table package). What you could try is the following:

require(data.table)
df <- fread('Col1 Col2
              1    A
              1    B
              1    C
              2    A
              2    B
              2    C
              1    A
              1    B
              1    C
              3    D
              3    B
              3    C
              3    F
              1    A
              1    F
              1    C
              4    A
              4    B
              4    C')

In your case, if your data frame is called df, just call setDT(df) to turn it into a data table. Then select the unique sequences in Col2 from this data table with:

df[, .(list(Col2), Col1), by = rleid(Col1)][,.(Sequence = unique(V1)), by = Col1]

Which gives:

    Col1 Sequence
1:    1    A,B,C
2:    1    A,F,C
3:    2    A,B,C
4:    3  D,B,C,F
5:    4    A,B,C

What the command does is the following: First, for every run of consecutive identical IDs in Col1, I collect the sequence in Col2 (the rleid function assigns a separate group number to each such run). Then I select the unique sequences for each Col1 value.
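For readers without data.table, the same two steps can be sketched in base R. This is only an approximation of the command above, not the author's code: rle() stands in for rleid(), the variable names are made up for illustration, and the sequences are joined with toString, so they print as "A, B, C" rather than "A,B,C".

```r
# Route IDs and stops in the order they appear in the modified data
col1 <- c(1, 1, 1, 2, 2, 2, 1, 1, 1, 3, 3, 3, 3, 1, 1, 1, 4, 4, 4)
col2 <- c("A", "B", "C", "A", "B", "C", "A", "B", "C",
          "D", "B", "C", "F", "A", "F", "C", "A", "B", "C")

# Stand-in for rleid(): give each run of consecutive equal IDs its own number
runs <- rle(col1)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Step 1: one stop sequence per run
seqs <- sapply(split(col2, run_id), toString)

# Step 2: pair each run's sequence with its route ID and keep unique pairs
res <- unique(data.frame(Col1 = runs$values,
                         Sequence = unname(seqs),
                         stringsAsFactors = FALSE))
res <- res[order(res$Col1), ]  # sort by route ID, as in the output above
print(res)
```

Route 1 appears in three separate runs here, two of which are A, B, C, so it ends up with two distinct sequences, matching the five rows shown above.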
