
R: Top N elements of each group without duplicates

For a class project I have a set of tweets categorized into 3 types of speech: hate, regular, and offensive. My goal is to eventually train a classifier to predict the correct type of tweet from the data.

I have a tibble of the data in a tidy format (one word per row) containing each word's TF-IDF score. I have censored the offensive language with asterisks:

> tfidf_words
# A tibble: 34,717 x 7
   speech tweet_id word       n    tf   idf tf_idf
   <fct>     <int> <chr>  <int> <dbl> <dbl>  <dbl>
 1 hate   24282747 reason     1 0.25   5.69  1.42 
 2 hate   24282747 usd        1 0.25   8.73  2.18 
 3 hate   24282747 bunch      1 0.25   5.60  1.40 
 4 hate   24282747 ******     1 0.25   5.21  1.30 
 5 hate   24284443 sand       1 0.5    4.76  2.38 
 6 hate   24284443 ******     1 0.5    2.49  1.24 
 7 hate   24324552 madden     1 0.111  8.73  0.970
 8 hate   24324552 call       1 0.111  4.11  0.456
 9 hate   24324552 ******     1 0.111  2.05  0.228
10 hate   24324552 set        1 0.111  5.90  0.655
# ... with 34,707 more rows

To limit the size of my training feature space, I want to keep the top n unique words of each speech type, ranked by their TF-IDF scores.

My vocabulary is a vector of all the unique words chosen for my feature space, defined as vocabulary <- unique(feature_space$word)

In my program I use SENTIMENT_SIZE to define how many words of each speech type I want in my model.
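For context, a minimal setup sketch (the value 100 just matches the example further down, and library(dplyr) is assumed for all the pipes in this post):

library(dplyr)

# Number of words to keep per speech type (100 is only the example value used below)
SENTIMENT_SIZE <- 100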

I have tried both this:

feature_space <- tfidf_words %>%
  arrange(desc(tf_idf)) %>%
  group_by(speech) %>%
  slice(1:SENTIMENT_SIZE) %>%
  ungroup() %>%
  arrange(tweet_id)

and this:

feature_space <- tfidf_words %>%
  group_by(speech) %>%
  top_n(n = SENTIMENT_SIZE, wt = tf_idf) %>%
  ungroup()

Both of these "sort of" work, but neither handles duplicates the way I would like. If, for example, I set SENTIMENT_SIZE to 100, I would expect 100 unique words from each speech type, for a total of 300 words.

Instead, we have this result for method 1:

> length(vocabulary)
[1] 248

And this result for method 2:

> length(vocabulary)
[1] 293
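Presumably the shortfall happens because the same word can be picked more than once: it may appear in several tweets within a group, and it may score highly in more than one speech type, so unique() collapses those rows. A quick check for cross-group overlap (a sketch, assuming tfidf_words as shown above) would be:

tfidf_words %>%
  distinct(speech, word) %>%    # one row per (speech, word) pair
  count(word, sort = TRUE) %>%  # in how many speech types does each word occur?
  filter(n > 1)                 # words shared by two or more groups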

How can I:

  1. Ensure that no duplicate words are chosen in each speech group, and...
  2. Ensure that the words chosen in each group are different from the words in the other groups?

Here I assume you are looking for the unique words within each speech group:

tfidf_words %>%
  arrange(desc(tf_idf)) %>%
  group_by(speech) %>%
  distinct(word, .keep_all = TRUE)
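This keeps, for every speech group, one row per word (the highest-scoring one, since the data are sorted by tf_idf first). To also cap each group at SENTIMENT_SIZE rows, you could append a slice(); note that with distinct() applied inside group_by(), the same word can still be selected by more than one speech type, which the follow-up below fixes by moving distinct() before group_by(). A sketch:

tfidf_words %>%
  arrange(desc(tf_idf)) %>%
  group_by(speech) %>%
  distinct(word, .keep_all = TRUE) %>%  # one row per word within each group
  slice(1:SENTIMENT_SIZE) %>%           # top n unique words per group
  ungroup()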

Thanks to @A. Suliman, I think I have something that works now.

feature_space <- tfidf_words %>%
  arrange(desc(tf_idf)) %>% 
  distinct(word, .keep_all = TRUE) %>% # remove duplicate words, keeping each word's highest-scoring row
  group_by(speech) %>%
  slice(1:SENTIMENT_SIZE) %>% # take the first n words of each speech category
  ungroup()

This should always produce the expected number of words in my vocabulary: because duplicate words are removed before the groups are sliced, each speech type contributes exactly SENTIMENT_SIZE distinct words (provided it has at least that many), and no word can appear in more than one group.
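For completeness, a sketch of the same idea written with slice_max() (available in dplyr 1.0+; with_ties = FALSE guards against ties inflating the per-group count), followed by a couple of sanity checks:

feature_space <- tfidf_words %>%
  arrange(desc(tf_idf)) %>%
  distinct(word, .keep_all = TRUE) %>% # assign each word to its single highest-scoring row
  group_by(speech) %>%
  slice_max(tf_idf, n = SENTIMENT_SIZE, with_ties = FALSE) %>%
  ungroup()

vocabulary <- unique(feature_space$word)

# Each group should contribute SENTIMENT_SIZE rows
# (assuming it has at least that many distinct words)
feature_space %>% count(speech, name = "n_words")

# And every word is unique overall, so this should equal 3 * SENTIMENT_SIZE
length(vocabulary)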
