Extracting unique value sequences from DF column with R
I have the following data frame:
Col1 Col2
1 A
1 B
1 C
2 A
2 B
2 C
3 D
3 B
3 C
3 F
4 A
4 B
4 C
I'd like to extract unique sequence vectors (bus line stop sequences) from Col2 (the actual stops of a particular bus route) in R, where each sequence is defined by Col1 (the respective bus route ID). Multiple occurrences of identical sequences are unimportant. So, the desired outputs are:
A, B, C
(in cases of Col1=1, 2 and 4) and
D, B, C, F
(in case of Col1=3)
You could split up the vector of bus stops according to the vector of route IDs. This will return a list of character vectors, on which you can call unique to remove the duplicated vectors (keeping the first occurrence). Calling toString on each of these vectors through sapply will then convert the list of vectors to a vector of comma-separated strings.
res <- sapply(unique(split(df$Col2, df$Col1)), toString)
print(res)
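The one-liner above assumes a data frame named df already exists. As a minimal sketch, the same steps can be written out one at a time on the question's data; the intermediate names stops_by_route, unique_seqs and routes_by_seq are made up here for illustration, and the last step (listing which route IDs share each sequence) is an optional extra rather than part of the original answer.

```r
# Rebuild the question's data frame; Col2 is kept as character, not factor
df <- data.frame(
  Col1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  Col2 = c("A", "B", "C", "A", "B", "C", "D", "B", "C", "F", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Step 1: one character vector of stops per route ID
stops_by_route <- split(df$Col2, df$Col1)

# Step 2: drop duplicated vectors, keeping the first occurrence
unique_seqs <- unique(stops_by_route)

# Step 3: collapse each remaining vector into a comma-separated string
res <- sapply(unique_seqs, toString)
print(res)

# Optional extra (hypothetical helper step): which route IDs share each sequence
routes_by_seq <- split(names(stops_by_route), sapply(stops_by_route, toString))
print(routes_by_seq)
```

Note that this treats all rows with the same Col1 value as one sequence, regardless of whether they are adjacent in the data frame.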
From your question I understand that you want the unique sequences for each Col1 ID. To test this, I changed your data a bit (and I used the data.table package). What you could try is the following:
library(data.table)  # provides fread(), setDT() and rleid()
df <- fread('Col1 Col2
1 A
1 B
1 C
2 A
2 B
2 C
1 A
1 B
1 C
3 D
3 B
3 C
3 F
1 A
1 F
1 C
4 A
4 B
4 C')
In your case, if your data frame is called df, just run setDT(df) to turn it into a data table. Then select the unique sequences in Col2 from this data table with:
df[, .(list(Col2), Col1), by = rleid(Col1)][,.(Sequence = unique(V1)), by = Col1]
Which gives:
Col1 Sequence
1: 1 A,B,C
2: 1 A,F,C
3: 2 A,B,C
4: 3 D,B,C,F
5: 4 A,B,C
The command works as follows: first, for every run of consecutive identical IDs in Col1, I collect the sequence in Col2 (the rleid function identifies these runs of consecutive IDs). Then I select the unique sequences for each Col1 value.
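To make the run-based grouping concrete, here is a rough base R sketch of the same idea without data.table. It uses rle as a stand-in for rleid, which is a substitution for illustration and not what the answer itself runs; the variable names (col1, col2, run_id, seq_per_run) are likewise made up here. The input is the modified test data from this answer.

```r
# Modified test data: route ID 1 occurs in several separate runs
col1 <- c(1, 1, 1, 2, 2, 2, 1, 1, 1, 3, 3, 3, 3, 1, 1, 1, 4, 4, 4)
col2 <- c("A", "B", "C", "A", "B", "C", "A", "B", "C",
          "D", "B", "C", "F", "A", "F", "C", "A", "B", "C")

# Number each run of consecutive equal IDs (base R stand-in for rleid())
runs <- rle(col1)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# One comma-separated stop sequence per run
seq_per_run <- sapply(split(col2, run_id), toString)

# Pair each run's route ID with its sequence, then keep each pair once
res <- unique(data.frame(
  Col1 = runs$values,
  Sequence = unname(seq_per_run),
  stringsAsFactors = FALSE
))

# Sort by route ID so the rows are grouped per route
res <- res[order(res$Col1), ]
print(res)
```

Grouping by run rather than by ID is what lets route 1 keep both A, B, C and A, F, C as separate sequences; grouping by Col1 alone would merge all of its rows into one long vector.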