Clustering similar strings based on another column in R

I have a large data frame that shows the distance between strings and their counts.

For example, in row 1, you see the distance between apple and pple, as well as the number of times I counted apple (counts_col1 = 100) and pple (counts_col2 = 2).

library(tidyverse)

df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
                 col2 = c("pple","app","app", "bananna", "banan", "banan"), 
             distance = c(1,2,3,1,1,2),
          counts_col1 = c(100,100,2,200,200,2),
          counts_col2 = c(2,50,50,2,20,20))
df    
#> # A tibble: 6 × 5
#>   col1    col2    distance counts_col1 counts_col2
#>   <chr>   <chr>      <dbl>       <dbl>       <dbl>
#> 1 apple   pple           1         100           2
#> 2 apple   app            2         100          50
#> 3 pple    app            3           2          50
#> 4 banana  bananna        1         200           2
#> 5 banana  banan          1         200          20
#> 6 bananna banan          2           2          20

Created on 2022-03-15 by the reprex package (v2.0.1)

Now I want to cluster the apple-like and the banana-like strings, naming each cluster after the string with the highest count, which is apple (100) and banana (200). I want my data to look something like this:

cluster   elements  sum_counts
 apple      apple    152
  NA        pple      NA
  NA         app      NA
 banana     banana   222
  NA       bananna    NA
  NA         banan    NA

The format of the output does not have to be like this. I am really struggling to break down this problem and cluster the groups. Any help or comment is really appreciated!

Here is one approach, where I initially add a group identifier for the sets (I presume you have this in your actual set). Then, after reshaping the data into a longer format, I group by this id and identify the "word" that has the largest count. I then use an inner join between the initial df and the resulting set of key rows that hold the largest-count word, then summarize and rename. I push all the variants into a list column.

df <- df %>% mutate(id=c(1,1,1,2,2,2))

df %>% inner_join(
   rbind(
    df %>% select(id,distance,col=col1, counts=counts_col1),
    df %>% select(id,distance,col=col2, counts=counts_col2)
  ) %>% 
  group_by(id) %>% 
  slice_max(counts) %>% 
  distinct(col), 
  by=c("col1"="col")
) %>% 
  group_by(col1) %>% 
  summarize(variants = list(c(col2, cur_group()$col1)),
            total = min(counts_col1) + sum(counts_col2)) %>% 
  rename_all(~c("cluster", "elements", "sum_counts"))

# A tibble: 2 x 3
  cluster elements  sum_counts
  <chr>   <list>         <dbl>
1 apple   <chr [3]>        152
2 banana  <chr [3]>        222
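If a long layout closer to the one in the question is preferred, the list column can be expanded with tidyr::unnest(). This is only a minimal sketch: `res` and `long` are hypothetical names, and the values in `res` are typed in by hand to mirror the summary shown above rather than computed from df.

```r
library(tidyverse)

# Hand-built stand-in for the summarised result (hypothetical name `res`)
res <- tibble(
  cluster    = c("apple", "banana"),
  elements   = list(c("pple", "app", "apple"),
                    c("bananna", "banan", "banana")),
  sum_counts = c(152, 222)
)

# Expand the list column: one row per (cluster, element) pair;
# cluster and sum_counts are repeated on every row of their group
long <- res %>% unnest(cols = elements)
long
```

Each cluster then occupies three rows, which can be filtered or joined like any other tibble.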

A similar approach in data.table (also depends on having that id column):

library(data.table)

# Convert df to a data.table by reference
setDT(df)
df[rbind(
  df[,.(id,col=col1,counts=counts_col1)],
  df[,.(id,col=col2,counts=counts_col2)]
)[order(-counts),.SD[1], by=id],on=.(col1=col)][
  ,  .(elements=list(c(col2,.BY$cluster)),
       sum_counts = min(counts_col1) + sum(counts_col2)),
  by=.(cluster=col1)]


   cluster             elements sum_counts
    <char>               <list>      <num>
1:  banana bananna,banan,banana        222
2:   apple       pple,app,apple        152

You can try using random walk clustering from igraph:

count_df <- data.table::melt(
  data.table::as.data.table(df), 
  measure = list(c("col1", "col2"), c("counts_col1", "counts_col2")),
  value.name = c("col", "counts")
) %>%
  select(col, counts) %>%
  unique()

df %>%
  igraph::graph_from_data_frame(directed = FALSE) %>%
  igraph::walktrap.community(weights = igraph::E(.)$distance) %>%
  # igraph::components() %>%
  igraph::membership() %>%
  split(names(.), .) %>%
  map_dfr(
    ~tibble(col = .x) %>% 
      semi_join(count_df, ., by = "col") %>% 
      arrange(desc(counts)) %>%
      summarise(cluster = first(col), elements = list(col), sum_count = sum(counts))
  )

  cluster               elements sum_count
1   apple       apple, app, pple       152
2  banana banana, banan, bananna       222

This works on this toy example, but I think your example is too simple and probably does not reflect your main problem. Or it might be even easier if you are only interested in finding connected components (if two words are connected, they are in the same cluster). Then you would need to replace walktrap.community with components.
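The connected-components variant could be sketched roughly as follows. This is not the answerer's code but a minimal illustration under a few assumptions: it uses the df from the question, treats every row as an edge regardless of its distance, assumes each string always appears with the same count, and the names `g`, `memb`, and `result` are made up for the example.

```r
library(tidyverse)

# Data from the question
df <- tibble(
  col1 = c("apple", "apple", "pple", "banana", "banana", "bananna"),
  col2 = c("pple", "app", "app", "bananna", "banan", "banan"),
  distance = c(1, 2, 3, 1, 1, 2),
  counts_col1 = c(100, 100, 2, 200, 200, 2),
  counts_col2 = c(2, 50, 50, 2, 20, 20)
)

# One row per distinct string with its count
count_df <- bind_rows(
  select(df, col = col1, counts = counts_col1),
  select(df, col = col2, counts = counts_col2)
) %>%
  distinct()

# Every row of df becomes an undirected edge between col1 and col2
g <- igraph::graph_from_data_frame(as.data.frame(df), directed = FALSE)

# Component id for each vertex (string)
memb <- tibble(
  col = igraph::V(g)$name,
  component = as.integer(igraph::components(g)$membership)
)

# Name each component after its most frequent string and sum the counts
result <- memb %>%
  left_join(count_df, by = "col") %>%
  group_by(component) %>%
  arrange(desc(counts), .by_group = TRUE) %>%
  summarise(cluster = first(col), elements = list(col),
            sum_counts = sum(counts))
result
```

On real data it would probably make sense to drop rows above some distance threshold before building the graph, since otherwise any chain of loosely related strings ends up in one component.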
