
R: Top N elements of each group without duplicates

For a class project I have a set of tweets categorized into three types of speech: hate, regular, and offensive. My goal is to eventually train a classifier to predict the correct type of tweet from the data.

I have a tibble of the data in a tidy format (one word per row) containing each word's TF-IDF score. I have censored the offensive language with asterisks:

> tfidf_words
# A tibble: 34,717 x 7
   speech tweet_id word       n    tf   idf tf_idf
   <fct>     <int> <chr>  <int> <dbl> <dbl>  <dbl>
 1 hate   24282747 reason     1 0.25   5.69  1.42 
 2 hate   24282747 usd        1 0.25   8.73  2.18 
 3 hate   24282747 bunch      1 0.25   5.60  1.40 
 4 hate   24282747 ******     1 0.25   5.21  1.30 
 5 hate   24284443 sand       1 0.5    4.76  2.38 
 6 hate   24284443 ******     1 0.5    2.49  1.24 
 7 hate   24324552 madden     1 0.111  8.73  0.970
 8 hate   24324552 call       1 0.111  4.11  0.456
 9 hate   24324552 ******     1 0.111  2.05  0.228
10 hate   24324552 set        1 0.111  5.90  0.655
# ... with 34,707 more rows

To limit the size of my training feature space I want to get the first "n" unique words of each speech type based on their TF-IDF scores.

My vocabulary is a vector of all the unique words chosen for my feature space, defined as vocabulary <- unique(feature_space$word)

In my program I use SENTIMENT_SIZE to define how many words of each speech type I want in my model.

I have tried both this:

feature_space <- tfidf_words %>%
  arrange(desc(tf_idf)) %>%
  group_by(speech) %>%
  slice(1:SENTIMENT_SIZE) %>%
  ungroup() %>%
  arrange(tweet_id)

and this:

feature_space <- tfidf_words %>%
  group_by(speech) %>%
  top_n(n = SENTIMENT_SIZE, wt = tf_idf) %>%
  ungroup()
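One detail worth noting about method 2: top_n() keeps ties, so a group can contribute more than n rows whenever several words share the boundary score. A minimal sketch with toy data (all names and values here are made up, not from the real dataset):

```r
library(dplyr)

# Toy data: group "a" has a tie at the score boundary
toy <- tibble(
  speech = c("a", "a", "a", "b", "b"),
  word   = c("w1", "w2", "w3", "w4", "w5"),
  tf_idf = c(3, 2, 2, 5, 1)
)

picked <- toy %>%
  group_by(speech) %>%
  top_n(n = 2, wt = tf_idf) %>%
  ungroup()

nrow(picked)  # 5: group "a" contributes 3 rows, not 2, because top_n() keeps ties
```

Combined with the unique() call when building the vocabulary, ties and cross-group repeats are why neither method lands on exactly SENTIMENT_SIZE words per group.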

These both "sort of" work, but neither handles duplicates the way I would like. If, for example, I set SENTIMENT_SIZE to 100, I would want 100 unique words from each speech type, for a total of 300 words.

Instead, we have this result for method 1:

> length(vocabulary)
[1] 248

And this result for method 2:

> length(vocabulary)
[1] 293

How can I:

  1. Ensure that no duplicate words are chosen in each speech group, and...
  2. Ensure that the words chosen in each group are different from the words in the other groups?

Here I assume you are looking for the unique words within each group of speech:

tfidf_words %>%
  arrange(desc(tf_idf)) %>%
  group_by(speech) %>%
  distinct(word, .keep_all = TRUE)

Thanks to @A.Suliman, I think I have something that works now.

feature_space <- tfidf_words %>%
  arrange(desc(tf_idf)) %>%
  distinct(word, .keep_all = TRUE) %>% # remove duplicate words across the whole dataset
  group_by(speech) %>%
  slice(1:SENTIMENT_SIZE) %>% # grab the first n rows of each speech category
  ungroup()

This should always produce the expected number of words in my vocabulary, because it preemptively removes any chance of a tie.
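A quick way to convince yourself: run the pipeline on toy data where one word appears in two speech groups (the names and scores below are invented for illustration). Because distinct() runs before group_by(), the shared word is kept only for the group where its TF-IDF is highest, so no word can appear in two groups:

```r
library(dplyr)

SENTIMENT_SIZE <- 2

# Toy stand-in for tfidf_words: "x" appears in two speech groups
toy <- tibble(
  speech = c("hate", "hate", "hate", "offensive", "offensive", "offensive"),
  word   = c("x", "y", "z", "x", "p", "q"),
  tf_idf = c(9, 8, 1, 7, 6, 5)
)

feature_space <- toy %>%
  arrange(desc(tf_idf)) %>%
  distinct(word, .keep_all = TRUE) %>% # "x" survives only in "hate", where it scores higher
  group_by(speech) %>%
  slice(1:SENTIMENT_SIZE) %>%
  ungroup()

feature_space$word  # "x" "y" "p" "q": no word is shared between groups
```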

