Extracting unique value sequences from a data frame column in R

Question

I have the following data frame:

I'd like to extract unique sequence vectors (bus line stop sequences) from Col2 (the actual stops of a particular bus route) in R, where each sequence is defined by Col1 (the respective bus route IDs). Multiple occurrences of identical sequences are unimportant. So, the desired outputs are:

A, B, C (in cases of Col1 = 1, 2 and 4) and D, B, C, F (in case of Col1 = 3)

Answer 1

You could split up the vector of bus stops according to the vector of route IDs. This will return a list of character vectors, on which you can call unique to remove the duplicated vectors (keeping the first occurrence).

Calling toString on each of these vectors through sapply will then convert the list of vectors to a vector of comma-separated strings.

res <- sapply(unique(split(df$Col2, df$Col1)), toString)
print(res)
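
To make the three steps easier to follow, here is a minimal sketch that runs them one at a time. The data frame below is an assumption: the original data is not shown, so it is reconstructed from the desired output in the question, and the intermediate names by_route and unique_routes are only illustrative.

```r
# Hypothetical data reconstructed from the desired output in the question;
# the original data frame is not shown.
df <- data.frame(
  Col1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  Col2 = c("A", "B", "C", "A", "B", "C", "D", "B", "C", "F", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Step 1: one character vector of stops per route ID
by_route <- split(df$Col2, df$Col1)

# Step 2: drop duplicated vectors, keeping the first occurrence
unique_routes <- unique(by_route)

# Step 3: collapse each remaining vector into one comma-separated string
res <- sapply(unique_routes, toString)
print(res)
```

With this sample data, res holds the two strings "A, B, C" and "D, B, C, F". Note that split groups by route ID, so if the same ID appeared in several separate blocks of rows, those blocks would be merged into one vector.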

Answer 2

From your question I understood that you want the unique sequences for each Col1 ID. In order to test this, I changed your data a bit (and I used the data.table package). What you could try is the following:

require(data.table)
df <- fread('Col1 Col2
              1    A
              1    B
              1    C
              2    A
              2    B
              2    C
              1    A
              1    B
              1    C
              3    D
              3    B
              3    C
              3    F
              1    A
              1    F
              1    C
              4    A
              4    B
              4    C')

In your case, if your data frame is called df, just call setDT(df) to turn it into a data.table. Then select the unique sequences in Col2 from this data.table with:

df[, .(list(Col2), Col1), by = rleid(Col1)][,.(Sequence = unique(V1)), by = Col1]

Which gives:

    Col1 Sequence
1:    1    A,B,C
2:    1    A,F,C
3:    2    A,B,C
4:    3  D,B,C,F
5:    4    A,B,C

The command does the following: First, for every run of consecutive identical IDs in Col1, I collect the sequence in Col2 (I use the rleid function to identify continuous IDs in Col1). Then, I select the unique sequences for each Col1 value.
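
To see what rleid contributes, the following sketch applies it to a small pair of vectors. The vectors ids and stops are hypothetical and only mirror the shape of the test data above, with route 1 appearing in two separate runs.

```r
library(data.table)

# Hypothetical route IDs in which route 1 appears in two separate runs
ids   <- c(1, 1, 1, 2, 2, 2, 1, 1, 1)
stops <- c("A", "B", "C", "A", "B", "C", "A", "F", "C")

# rleid() starts a new counter value every time the ID changes
run <- rleid(ids)
print(run)

# Splitting by run (instead of by ID) keeps the two trips of route 1 apart
trips <- split(stops, run)
print(sapply(trips, toString))
```

Here run is 1 1 1 2 2 2 3 3 3, so the second block of route 1 (A, F, C) stays a sequence of its own instead of being appended to the first block.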

Answer 1
2 2016-12-09 10:30:31

Answer 2
0 2016-12-09 11:53:38
