
Filter rows in dataset for distinct words in R

Goal: To filter rows in a dataset so that only distinct words remain. At the moment, I have used inner_join to retain the rows present in two datasets, which has left duplicate rows in the joined dataset.

Attempt 1: I have tried to use distinct to retain only those rows which are unique, but this has not worked. I may be using it incorrectly.

This is my code so far (the output was attached as a PNG image):


# join warriner emotion lemmas by `word` column in collocations data frame to see how many word matches there are

warriner2 <- dplyr::inner_join(warriner, coll, by = "word") # join data; retain only rows in both sets (works both ways)
warriner2 <- distinct(warriner2)
warriner2

coll2 <- dplyr::semi_join(coll, warriner, by = "word") # join all rows in a that have a match in b

# There are 8166 lemma matches (including double-ups)
# There are XXX unique lemma matches

You can try:

library(dplyr)

warriner2 <- inner_join(warriner, coll, by = "word") %>%
                distinct(word, .keep_all = TRUE)
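The difference between the two calls can be seen on a small example. This is only a sketch: `warriner_demo` and `coll_demo` are hypothetical stand-ins for the real `warriner` and `coll` data frames, and the `valence` and `collocate` columns are assumed for illustration.

```r
library(dplyr)

# Hypothetical stand-ins for the `warriner` and `coll` data frames
warriner_demo <- tibble(word = c("happy", "sad"), valence = c(8.5, 2.1))
coll_demo <- tibble(
  word      = c("happy", "happy", "sad", "angry"),
  collocate = c("birthday", "ending", "story", "mob")
)

joined <- inner_join(warriner_demo, coll_demo, by = "word")

# distinct() with no columns compares whole rows; every row here differs
# in `collocate`, so nothing is dropped
all_cols <- distinct(joined)

# distinct(word, .keep_all = TRUE) keeps only the first row for each word
by_word <- distinct(joined, word, .keep_all = TRUE)

nrow(all_cols)  # 3
nrow(by_word)   # 2
```

If the joined rows differ in any column other than `word`, a bare `distinct()` will therefore appear to do nothing, which may be what happened in the question.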

To clarify Ronak's answer further, here is an example with some mock data. Note that you can just use distinct() at the end of the pipe to keep distinct rows, if that's what you want. Your error might well have occurred because you performed two operations and assigned the result to the same name (warriner2) both times.

library(dplyr)

# Here's a couple sample tibbles
name <- c("cat", "dog", "parakeet")

df1 <- tibble(
        x = sample(5, 99, replace = TRUE),
        y = sample(5, 99, replace = TRUE),
        name = rep(name, times = 33))
df2 <- tibble(
        x = sample(5, 99, replace = TRUE),
        y = sample(5, 99, replace = TRUE),
        name = rep(name, times = 33))

# It's much less confusing if you do this in one pipe
p <- df1 %>%
        inner_join(df2, by = "name") %>%
        distinct()
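For the two counts mentioned in the question (all matches versus unique matches), something like the following may help. It is a rough sketch on a hypothetical `matches` tibble, not the asker's actual data.

```r
library(dplyr)

# Hypothetical join result in which some words appear more than once
matches <- tibble(word = c("happy", "happy", "sad", "sad", "sad", "calm"))

total_matches  <- nrow(matches)             # all rows, duplicates included
unique_matches <- n_distinct(matches$word)  # number of different words

# Per-word tallies show which words were duplicated by the join
per_word <- count(matches, word)

total_matches   # 6
unique_matches  # 3
```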
