Clustering similar strings based on another column in R
I have a large data frame that shows the distance between strings and their counts.
For example, in row 1 you see the distance between apple and pple, as well as the number of times I have counted apple (counts_col1 = 100) and the number of times I have counted pple (counts_col2 = 2).
library(tidyverse)
df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
col2 = c("pple","app","app", "bananna", "banan", "banan"),
distance = c(1,2,3,1,1,2),
counts_col1 = c(100,100,2,200,200,2),
counts_col2 = c(2,50,50,2,20,20))
df
#> # A tibble: 6 × 5
#> col1 col2 distance counts_col1 counts_col2
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 apple pple 1 100 2
#> 2 apple app 2 100 50
#> 3 pple app 3 2 50
#> 4 banana bananna 1 200 2
#> 5 banana banan 1 200 20
#> 6 bananna banan 2 2 20
Created on 2022-03-15 by the reprex package (v2.0.1)
Now I want to cluster the apples and the bananas around the string that has the maximum count in each group, which is apple (100) and banana (200). I want my data to look something like this:
cluster elements sum_counts
apple apple 152
NA pple NA
NA app NA
banana banana 222
NA bananna NA
NA banan NA
The format of the output does not have to be like this. I am really struggling to break down this problem and cluster the groups. Any help or comments are really appreciated!
Here is one approach. I first add a group identifier for the sets (I presume you have this in your actual data). Then, after reshaping to a longer dataset, I group by this id and identify the word that has the largest count. I then use an inner join between the initial df and the resulting key rows that hold the largest-count word, summarize, and rename. I push all the variants into a list column.
df <- df %>% mutate(id=c(1,1,1,2,2,2))
df %>% inner_join(
rbind(
df %>% select(id,distance,col=col1, counts=counts_col1),
df %>% select(id,distance,col=col2, counts=counts_col2)
) %>%
group_by(id) %>%
slice_max(counts) %>%
distinct(col),
by=c("col1"="col")
) %>%
group_by(col1) %>%
summarize(variants = list(c(col2, cur_group()$col1)),  # col2 variants plus the cluster word itself
total = min(counts_col1) + sum(counts_col2)) %>%
rename_all(~c("cluster", "elements", "sum_counts"))
# A tibble: 2 x 3
cluster elements sum_counts
<chr> <list> <dbl>
1 apple <chr [3]> 152
2 banana <chr [3]> 222
A similar approach in data.table (it also depends on having that id column):
library(data.table)
setDT(df)
df[rbind(
df[,.(id,col=col1,counts=counts_col1)],
df[,.(id,col=col2,counts=counts_col2)]
)[order(-counts),.SD[1], by=id],on=.(col1=col)][
, .(elements=list(c(col2,.BY$cluster)),
sum_counts = min(counts_col1) + sum(counts_col2)),
by=.(cluster=col1)]
cluster elements sum_counts
<char> <list> <num>
1: banana bananna,banan,banana 222
2: apple pple,app,apple 152
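Both snippets above rely on a hand-written id column. If the real data does not have one, one possible way to derive it is sketched below. It assumes that any two words appearing in the same row belong to the same group, and it uses connected components from igraph to label those groups. The helper name word_id is made up for this illustration.

```r
# Rebuild the example data so the sketch is self-contained.
df <- tibble::tibble(
  col1 = c("apple", "apple", "pple", "banana", "banana", "bananna"),
  col2 = c("pple", "app", "app", "bananna", "banan", "banan"),
  distance = c(1, 2, 3, 1, 1, 2),
  counts_col1 = c(100, 100, 2, 200, 200, 2),
  counts_col2 = c(2, 50, 50, 2, 20, 20)
)

# Treat every row as an undirected edge between col1 and col2.
g <- igraph::graph_from_data_frame(
  as.data.frame(df[, c("col1", "col2")]),
  directed = FALSE
)

# Assumption: words that are linked, directly or indirectly, share one id.
# word_id is a hypothetical lookup vector: word -> component number.
word_id <- as.integer(igraph::components(g)$membership)
names(word_id) <- igraph::V(g)$name

# Label each row by the component of its col1 word.
df$id <- unname(word_id[df$col1])
df$id
```

On real data with many noisy links this assumption may merge groups that should stay apart, so it is worth inspecting the component sizes before using the ids.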
You can try using random walk clustering from igraph:
count_df <- data.table::melt(
data.table::as.data.table(df),
measure = list(c("col1", "col2"), c("counts_col1", "counts_col2")),
value.name = c("col", "counts")
) %>%
select(col, counts) %>%
unique()
df %>%
igraph::graph_from_data_frame(directed = FALSE) %>%
igraph::walktrap.community(weights = igraph::E(.)$distance) %>%
# igraph::components() %>%
igraph::membership() %>%
split(names(.), .) %>%
map_dfr(
~tibble(col = .x) %>%
semi_join(count_df, ., by = "col") %>%
arrange(desc(counts)) %>%
summarise(cluster = first(col), elements = list(col), sum_count = sum(counts))
)
cluster elements sum_count
1 apple apple, app, pple 152
2 banana banana, banan, bananna 222
This works on this toy example, but I think your example is too simple and probably does not reflect your main problem. Or it might be even easier if you are interested in finding connected components (if two words are connected, they are in the same cluster). Then you would need to replace walktrap.community with components.
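As a rough sketch of the connected-components variant just described, the version below rebuilds the example data and labels the words with igraph::components() instead of walktrap. The intermediate names (g, comp, result) are chosen for this illustration, and the step structure is an assumption rather than the answerer's exact code.

```r
library(dplyr)

df <- tibble::tibble(
  col1 = c("apple", "apple", "pple", "banana", "banana", "bananna"),
  col2 = c("pple", "app", "app", "bananna", "banan", "banan"),
  distance = c(1, 2, 3, 1, 1, 2),
  counts_col1 = c(100, 100, 2, 200, 200, 2),
  counts_col2 = c(2, 50, 50, 2, 20, 20)
)

# One row per distinct word together with its count.
count_df <- bind_rows(
  select(df, col = col1, counts = counts_col1),
  select(df, col = col2, counts = counts_col2)
) %>%
  distinct()

# Every row of df is an undirected edge; linked words share a component.
g <- igraph::graph_from_data_frame(
  as.data.frame(select(df, col1, col2)),
  directed = FALSE
)
comp <- igraph::components(g)

result <- tibble::tibble(
  col = igraph::V(g)$name,
  id  = as.integer(comp$membership)
) %>%
  inner_join(count_df, by = "col") %>%
  group_by(id) %>%
  arrange(desc(counts), .by_group = TRUE) %>%   # most frequent word first
  summarise(
    cluster    = first(col),    # the word with the largest count names the cluster
    elements   = list(col),
    sum_counts = sum(counts)
  )

result
```

Unlike walktrap, this ignores the distance column entirely: a single link of any distance is enough to merge two words, which may or may not be what the real data needs.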