How to find and replace values between two data frames in R

Question

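I have a data frame from tidytext that contains the individual words from some survey free-response comments. It has just under 500,000 rows. Being free-response data, it is full of typos. textclean::replace_misspellings took care of almost 13,000 misspelled words, but there were still about 700 unique misspellings that I identified manually.

I now have a second table with two columns: the first is the misspelling and the second is the correction.

For example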

allComments <- data.frame("Number" = 1:5, "Word" = c("organization","orginization", "oragnization", "help", "hlp"))
misspellings <- data.frame("Wrong" = c("orginization", "oragnization", "hlp"), "Right" = c("organization", "organization", "help"))

How do I replace all of the values of allComments$Word that match misspellings$Wrong with the corresponding misspellings$Right?

I feel like this is probably pretty basic and my R ignorance is showing...

Answer 1

You can use match to find, for each word in allComments$Word, its index in misspellings$Wrong, and then use this index to subset misspellings$Right.

tt <- match(allComments$Word, misspellings$Wrong)
allComments$Word[!is.na(tt)]  <- misspellings$Right[tt[!is.na(tt)]]
allComments
#  Number         Word
#1      1 organization
#2      2 organization
#3      3 organization
#4      4         help
#5      5         help

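If allComments$Word is a factor (the default for string columns in data.frame() before R 4.0.0), convert it to character before running the replacement; otherwise a correction that is not already one of its levels is assigned as NA: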

allComments$Word <- as.character(allComments$Word)
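The conversion and the match step can be wrapped into one small helper. This is only a sketch: the function name fix_misspellings and its argument names are hypothetical, and it assumes each misspelling appears only once in the lookup table, since match returns the first hit only.

```r
# Hypothetical helper: replace every word found in `wrong` with the entry
# at the same position in `right`; all other words are left untouched.
fix_misspellings <- function(words, wrong, right) {
  # Work on character vectors so factor levels cannot get in the way
  words <- as.character(words)
  wrong <- as.character(wrong)
  right <- as.character(right)

  idx <- match(words, wrong)  # position in `wrong`, NA if not a known typo
  hit <- !is.na(idx)
  words[hit] <- right[idx[hit]]
  words
}

allComments <- data.frame(Number = 1:5,
                          Word = c("organization", "orginization",
                                   "oragnization", "help", "hlp"))
misspellings <- data.frame(Wrong = c("orginization", "oragnization", "hlp"),
                           Right = c("organization", "organization", "help"))

allComments$Word <- fix_misspellings(allComments$Word,
                                     misspellings$Wrong,
                                     misspellings$Right)
allComments$Word
```

Because the helper converts its inputs itself, it should behave the same whether the columns arrive as factors or as character vectors.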

Answer 2

Here is another base R solution using replace()

allComments <- within(allComments, 
                      Word <- replace(Word,
                                      which(!is.na(match(Word,misspellings$Wrong))),
                                      na.omit(misspellings$Right[match(Word,misspellings$Wrong)])))

such that

> allComments
  Number         Word
1      1 organization
2      2 organization
3      3 organization
4      4         help
5      5         help
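The call above evaluates match() twice. A sketch of the same replace() idea that computes the index once might look like the following; the variable names idx and hit are illustrative, and the columns are assumed to be character vectors (hence the explicit stringsAsFactors = FALSE).

```r
allComments <- data.frame(Number = 1:5,
                          Word = c("organization", "orginization",
                                   "oragnization", "help", "hlp"),
                          stringsAsFactors = FALSE)
misspellings <- data.frame(Wrong = c("orginization", "oragnization", "hlp"),
                           Right = c("organization", "organization", "help"),
                           stringsAsFactors = FALSE)

# Position of each word in the lookup table, NA when it is not a known typo
idx <- match(allComments$Word, misspellings$Wrong)
# Rows of allComments that need a correction
hit <- which(!is.na(idx))

# replace(x, list, values) returns x with x[list] set to values
allComments$Word <- replace(allComments$Word, hit, misspellings$Right[idx[hit]])
allComments$Word
```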

Answer 3

library(dplyr)

# Attach the correction (if any) to each word, then prefer it over the original
allComments %>%
  left_join(misspellings, by = c("Word" = "Wrong")) %>%
  mutate(Word = coalesce(as.character(Right), Word))
#   Number         Word        Right
# 1      1 organization         <NA>
# 2      2 organization organization
# 3      3 organization organization
# 4      4         help         <NA>
# 5      5         help         help

Of course, you can drop the Right column once you are done.
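As a sketch of that last step, the helper column can be dropped in the same pipeline with select(). The result name fixed is illustrative, and the columns are assumed to be character vectors, so the as.character() call is not needed here.

```r
library(dplyr)

allComments <- data.frame(Number = 1:5,
                          Word = c("organization", "orginization",
                                   "oragnization", "help", "hlp"),
                          stringsAsFactors = FALSE)
misspellings <- data.frame(Wrong = c("orginization", "oragnization", "hlp"),
                           Right = c("organization", "organization", "help"),
                           stringsAsFactors = FALSE)

fixed <- allComments %>%
  left_join(misspellings, by = c("Word" = "Wrong")) %>%
  # coalesce() takes the first non-missing value, so Right wins when present
  mutate(Word = coalesce(Right, Word)) %>%
  # Remove the helper column that the join added
  select(-Right)

fixed
```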

Answer 4

Here is a data.table solution:

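library(data.table)
setDT(allComments)
setDT(misspellings)
# Left join: keep every comment row and attach the correction where one exists
df <- merge(allComments, misspellings, all.x = TRUE, by.x = "Word", by.y = "Wrong")
# := updates Word by reference wherever a correction was found
df[!is.na(Right), Word := Right]
df <- df[, c("Number", "Word")]
# merge() sorts by the join column, so restore the original row order
df <- df[order(Number)]
df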

#    Number         Word
#1:      1  organization
#2:      2  organization
#3:      3  organization
#4:      4          help
#5:      5          help
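A different data.table idiom, not used in the answer above, is an update join, which skips the intermediate merged table and the re-sorting. A minimal sketch, assuming both tables hold character columns and each misspelling appears only once in the lookup table:

```r
library(data.table)

allComments <- data.table(Number = 1:5,
                          Word = c("organization", "orginization",
                                   "oragnization", "help", "hlp"))
misspellings <- data.table(Wrong = c("orginization", "oragnization", "hlp"),
                           Right = c("organization", "organization", "help"))

# For every row of misspellings, find the rows of allComments whose Word
# equals Wrong and overwrite Word in place with that row's Right value.
# The i. prefix refers to a column of the table passed as i (misspellings).
allComments[misspellings, on = .(Word = Wrong), Word := i.Right]

allComments
```

Because := modifies allComments by reference, the row order is left as it was and no reassignment is needed.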

4 solutions

Solution 1
5 Accepted 2020-01-07 15:06:55

Solution 2
3 2020-01-07 15:36:27

Solution 3
2 2020-01-07 15:10:12

Solution 4
1 2020-01-07 15:43:59