
Extracting unique value sequences from DF column with R

I have the following data frame:

Col1 Col2
1    A
1    B
1    C
2    A
2    B
2    C
3    D
3    B
3    C
3    F
4    A
4    B
4    C

I'd like to extract the unique sequence vectors (bus line stop sequences) from Col2 (the actual stops of a particular bus route) in R, where each sequence is defined by Col1 (the respective bus route IDs). Multiple occurrences of identical sequences are unimportant. So, the desired outputs are:

A, B, C (for Col1 = 1, 2 and 4) and D, B, C, F (for Col1 = 3)

You could split up the vector of bus stops according to the vector of route IDs. This returns a list of character vectors, on which you can call unique to remove the duplicated vectors (keeping the first occurrence).

Calling toString on each of these vectors through sapply then converts the list of vectors to a vector of comma-separated strings.

res <- sapply(unique(split(df$Col2, df$Col1)), toString)
print(res)
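
To make the three steps visible, here is a minimal self-contained sketch that rebuilds the question's data frame and runs the same split / unique / toString chain one step at a time. The intermediate names stops_by_route and unique_seqs are only illustrative, and stringsAsFactors = FALSE is an assumption made so that Col2 stays a character vector.

```r
# Data frame from the question (Col2 kept as character, not factor)
df <- data.frame(
  Col1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  Col2 = c("A", "B", "C", "A", "B", "C", "D", "B", "C", "F", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Step 1: one character vector of stops per route ID
stops_by_route <- split(df$Col2, df$Col1)

# Step 2: drop repeated sequences, keeping the first occurrence of each
unique_seqs <- unique(stops_by_route)

# Step 3: collapse each remaining sequence into one comma-separated string
res <- sapply(unique_seqs, toString)
print(res)  # two strings: "A, B, C" and "D, B, C, F"
```

Note that unique drops the list names, so the result no longer records which route IDs shared a sequence.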

From your question I understand that you want the unique sequences for each Col1 ID. To test this I changed your data a bit (and I used the data.table package). You could try the following:

require(data.table)
df <- fread('Col1 Col2
              1    A
              1    B
              1    C
              2    A
              2    B
              2    C
              1    A
              1    B
              1    C
              3    D
              3    B
              3    C
              3    F
              1    A
              1    F
              1    C
              4    A
              4    B
              4    C')

In your case, if your data frame is called df, just call setDT(df) to turn it into a data table. Then select the unique sequences in Col2 from this data table with:

df[, .(list(Col2), Col1), by = rleid(Col1)][,.(Sequence = unique(V1)), by = Col1]

Which gives:

    Col1 Sequence
1:    1    A,B,C
2:    1    A,F,C
3:    2    A,B,C
4:    3  D,B,C,F
5:    4    A,B,C

The command works as follows: first, for every run of consecutive identical IDs in Col1, I collect the sequence in Col2 (the rleid function identifies these runs). Then I select the unique sequences for each Col1 value.
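
For readers without data.table, the same run-based idea can be sketched in base R. This is an alternative technique, not the command above: rle stands in for rleid, and the names runs, run_id, seqs and routes are illustrative. It uses the modified test data from this answer.

```r
# Modified test data from this answer (route 1 appears in several runs)
df <- data.frame(
  Col1 = c(1, 1, 1, 2, 2, 2, 1, 1, 1, 3, 3, 3, 3, 1, 1, 1, 4, 4, 4),
  Col2 = c("A", "B", "C", "A", "B", "C", "A", "B", "C",
           "D", "B", "C", "F", "A", "F", "C", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Number each run of consecutive identical IDs (the role rleid plays above)
runs   <- rle(df$Col1)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# One stop sequence per run, collapsed to a string, plus each run's route ID
seqs   <- unname(sapply(split(df$Col2, run_id), paste, collapse = ","))
routes <- runs$values

# Keep each (route ID, sequence) pair once, then order by route ID
res <- unique(data.frame(Col1 = routes, Sequence = seqs, stringsAsFactors = FALSE))
res <- res[order(res$Col1), ]
rownames(res) <- NULL
print(res)
```

On this data the result has the same five (Col1, Sequence) rows as the data.table output shown above: route 1 keeps both A,B,C and A,F,C, because uniqueness is checked per route rather than across all routes.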

