
R: Top N elements of each group without duplicates

For a class project I have a set of tweets categorized into 3 types of speech: hate, regular, and offensive. My goal is to eventually train a classifier to predict the correct type of tweet from the data.

I have a tibble of the data in a tidy format (one word per row) containing each word's TF-IDF score. I have censored the offensive language with asterisks:

> tfidf_words
# A tibble: 34,717 x 7
   speech tweet_id word       n    tf   idf tf_idf
   <fct>     <int> <chr>  <int> <dbl> <dbl>  <dbl>
 1 hate   24282747 reason     1 0.25   5.69  1.42 
 2 hate   24282747 usd        1 0.25   8.73  2.18 
 3 hate   24282747 bunch      1 0.25   5.60  1.40 
 4 hate   24282747 ******     1 0.25   5.21  1.30 
 5 hate   24284443 sand       1 0.5    4.76  2.38 
 6 hate   24284443 ******     1 0.5    2.49  1.24 
 7 hate   24324552 madden     1 0.111  8.73  0.970
 8 hate   24324552 call       1 0.111  4.11  0.456
 9 hate   24324552 ******     1 0.111  2.05  0.228
10 hate   24324552 set        1 0.111  5.90  0.655
# ... with 34,707 more rows

To limit the size of my training feature space, I want to keep the top n unique words of each speech type, ranked by their TF-IDF scores.

My vocabulary is a vector of all the unique words chosen for my feature space, defined as vocabulary <- unique(feature_space$word)

In my program I use SENTIMENT_SIZE to define how many words of each speech type I want in my model.
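For context, a minimal setup sketch (the value 100 just matches the example further down, and library(dplyr) is assumed for all the pipes in this post):

library(dplyr)

# Number of words to keep per speech type (100 is only the example value used below)
SENTIMENT_SIZE <- 100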

I have tried both this:

feature_space <- tfidf_words %>%
  arrange(desc(tf_idf)) %>%
  group_by(speech) %>%
  slice(1:SENTIMENT_SIZE) %>%
  ungroup() %>%
  arrange(tweet_id)

and this:

feature_space <- tfidf_words %>%
  group_by(speech) %>%
  top_n(n = SENTIMENT_SIZE, wt = tf_idf) %>%
  ungroup()

Both of these "sort of" work, but neither handles duplicates the way I would like. If, for example, I set SENTIMENT_SIZE to 100, I would expect 100 unique words from each speech type, for a total of 300 words.

Instead, we have this result for method 1:

> length(vocabulary)
[1] 248

And this result for method 2:

> length(vocabulary)
[1] 293
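Presumably the shortfall happens because the same word can be picked more than once: it may appear in several tweets within a group, and it may score highly in more than one speech type, so unique() collapses those rows. A quick check for cross-group overlap (a sketch, assuming tfidf_words as shown above) would be:

tfidf_words %>%
  distinct(speech, word) %>%    # one row per (speech, word) pair
  count(word, sort = TRUE) %>%  # in how many speech types does each word occur?
  filter(n > 1)                 # words shared by two or more groups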

How can I:

  1. Ensure that no duplicate words are chosen in each speech group, and...
  2. Ensure that the words chosen in each group are different from the words in the other groups?

Here I assume you are looking for the unique words within each speech group:

tfidf_words %>%
  arrange(desc(tf_idf)) %>%
  group_by(speech) %>%
  distinct(word, .keep_all = TRUE)
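This keeps, for every speech group, one row per word (the highest-scoring one, since the data are sorted by tf_idf first). To also cap each group at SENTIMENT_SIZE rows, you could append a slice(); note that with distinct() applied inside group_by(), the same word can still be selected by more than one speech type, which the follow-up below fixes by moving distinct() before group_by(). A sketch:

tfidf_words %>%
  arrange(desc(tf_idf)) %>%
  group_by(speech) %>%
  distinct(word, .keep_all = TRUE) %>%  # one row per word within each group
  slice(1:SENTIMENT_SIZE) %>%           # top n unique words per group
  ungroup()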

Thanks to @A. Suliman, I think I have something that works now.

feature_space <- tfidf_words %>%
  arrange(desc(tf_idf)) %>% 
  distinct(word, .keep_all = TRUE) %>% # remove duplicate words, keeping each word's highest-scoring row
  group_by(speech) %>%
  slice(1:SENTIMENT_SIZE) %>% # take the first n words of each speech category
  ungroup()

This should always produce the expected number of words in my vocabulary: because duplicate words are removed before the groups are sliced, each speech type contributes exactly SENTIMENT_SIZE distinct words (provided it has at least that many), and no word can appear in more than one group.
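For completeness, a sketch of the same idea written with slice_max() (available in dplyr 1.0+; with_ties = FALSE guards against ties inflating the per-group count), followed by a couple of sanity checks:

feature_space <- tfidf_words %>%
  arrange(desc(tf_idf)) %>%
  distinct(word, .keep_all = TRUE) %>% # assign each word to its single highest-scoring row
  group_by(speech) %>%
  slice_max(tf_idf, n = SENTIMENT_SIZE, with_ties = FALSE) %>%
  ungroup()

vocabulary <- unique(feature_space$word)

# Each group should contribute SENTIMENT_SIZE rows
# (assuming it has at least that many distinct words)
feature_space %>% count(speech, name = "n_words")

# And every word is unique overall, so this should equal 3 * SENTIMENT_SIZE
length(vocabulary)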
