Compare columns of two different tables and replace specific words in a string R
Can someone give me some advice? I am trying to compare two columns. One column contains address strings, and the other comes from a table of country names. Some of the country names in the addresses are in English, and I want to replace them with the German terms. I am also very limited in which additional packages I can use, because I want to run the script in a database. My code doesn't really work: it replaces only one row.
df1
DE
Europa | Deutschland | München
Europa | England | London
Europa | Germany | Berlin
Europa | Italy | Venedig
df2
GE EN
Deutschland Germany
Italien Italy
England UK
Result: df1
DE
Europa | Deutschland | München
Europa | England | London
Europa | Deutschland | Berlin
Europa | Italien | Venedig
I tried the following code:
df1 <- data.frame("DE" = c("Europa | Deutschland | München", "Europa | England | London", "Europa | Germany | Berlin ", "Europa | Italy | Venedig"))
df2 <- data.frame("GE" = c("Deutschland", "Italien", "England"), "EN" = c("Germany", "Italy", "UK"))
df1[] <- lapply(df1, as.character)
df2[] <- lapply(df2, as.character)
for(i in seq_along(df1)) df1$DE <- gsub(df2$EN, df2$GE, df1$DE, fixed = FALSE)
You should add [i] in the for loop and use fixed = TRUE, since you are matching fixed patterns rather than regular expressions. The loop should also run over the rows of df2 rather than over df1. See the modified code:
for(i in seq_along(df2$EN)) {
  df1$DE <- gsub(df2$EN[i], df2$GE[i], df1$DE, fixed = TRUE)
}
df1$DE
## [1] "Europa | Deutschland | München"
## [2] "Europa | England | London"
## [3] "Europa | Deutschland | Berlin "
## [4] "Europa | Italien | Venedig"
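To see why the original loop changed only one row, the following minimal sketch may help. It uses the same data as the question; the variable out is introduced here for illustration only. seq_along() on a data frame runs over its columns (df1 has one), and gsub() uses only the first element when it is given a vector of patterns.

```r
df1 <- data.frame(DE = c("Europa | Deutschland | München",
                         "Europa | England | London",
                         "Europa | Germany | Berlin ",
                         "Europa | Italy | Venedig"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(GE = c("Deutschland", "Italien", "England"),
                  EN = c("Germany", "Italy", "UK"),
                  stringsAsFactors = FALSE)

# seq_along() on a data frame counts columns, so the original loop ran once
seq_along(df1)

# With a vector as pattern, gsub() uses only the first element ("Germany")
# and warns about it; suppressWarnings() hides that warning here.
out <- suppressWarnings(gsub(df2$EN, df2$GE, df1$DE, fixed = TRUE))
out
```

Only the row containing "Germany" changes, and "Italy" is left untouched, which matches the behaviour described in the question.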
P.S. You can use stringsAsFactors = FALSE in data.frame() to get strings instead of factors:
df1 <- data.frame("DE" = c("Europa | Deutschland | München",
"Europa | England | London",
"Europa | Germany | Berlin ",
"Europa | Italy | Venedig"),
stringsAsFactors = FALSE)
df2 <- data.frame("GE" = c("Deutschland", "Italien", "England"),
"EN" = c("Germany", "Italy", "UK"),
stringsAsFactors = FALSE)
Here is a solution based on merge and replacement. The column is split because I only want to replace the names in the second field. If we use gsub with a for loop, matching words in the other fields may also be replaced. df4 is the final output.
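The risk of replacing words in other fields can be illustrated with a small sketch. The address below is hypothetical (it is not part of the question's data), and the names addr, naive, parts and fixed_addr are introduced only for this example.

```r
# Hypothetical address: the third field also contains the word "Germany"
addr <- "Europa | Germany | New Germany"

# A plain gsub() replaces every occurrence, including the one in the third field
naive <- gsub("Germany", "Deutschland", addr, fixed = TRUE)
naive

# Splitting first restricts the replacement to the second field
parts <- trimws(strsplit(addr, split = "|", fixed = TRUE)[[1]])
if (parts[2] == "Germany") parts[2] <- "Deutschland"
fixed_addr <- paste(parts, collapse = " | ")
fixed_addr
```

With the sample data from the question this situation does not occur, so both approaches give the same result there.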
Step 1: Separate the column in df1 by |.
df1_1 <- as.data.frame(do.call(rbind, lapply(strsplit(df1$DE, split = "\\|"), trimws)),
stringsAsFactors = FALSE)
Step 2: Merge df1_1 and df2.
df3 <- merge(df1_1, df2, by.x = "V2", by.y = "EN", all.x = TRUE)
Step 3: Replace the values if the GE column is not NA.
df3$V2 <- ifelse(!is.na(df3$GE), df3$GE, df3$V2)
Step 4: Collapse all columns and prepare the final output.
df3$DE <- apply(df3[, c("V1", "V2", "V3")], 1, paste, collapse = " | ")
df4 <- df3[, "DE", drop = FALSE]
df4
# DE
# 1 Europa | Deutschland | München
# 2 Europa | England | London
# 3 Europa | Deutschland | Berlin
# 4 Europa | Italien | Venedig
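One thing to keep in mind: merge() sorts its result by the key column by default, so the rows are not guaranteed to stay in the order of df1 (with this sample data the two orders happen to coincide). If row order matters, a lookup with match() is one base-R alternative. This is only a sketch, assuming the same df1 and df2 as in the DATA section; the names parts, idx and res are introduced here and are not part of the original answer.

```r
df1 <- data.frame(DE = c("Europa | Deutschland | München",
                         "Europa | England | London",
                         "Europa | Germany | Berlin ",
                         "Europa | Italy | Venedig"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(GE = c("Deutschland", "Italien", "England"),
                  EN = c("Germany", "Italy", "UK"),
                  stringsAsFactors = FALSE)

# Split each address into its three fields and trim surrounding whitespace
parts <- do.call(rbind, lapply(strsplit(df1$DE, split = "|", fixed = TRUE), trimws))

# Position of each country in the English column (NA if there is no translation)
idx <- match(parts[, 2], df2$EN)

# Replace only where a translation exists; the row order is unchanged
parts[!is.na(idx), 2] <- df2$GE[idx[!is.na(idx)]]

# Paste the fields back together
res <- data.frame(DE = apply(parts, 1, paste, collapse = " | "),
                  stringsAsFactors = FALSE)
res
```

Like the merge approach, this touches only the second field, and it needs no additional packages.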
DATA
df1 <- data.frame("DE" = c("Europa | Deutschland | München", "Europa | England | London", "Europa | Germany | Berlin ", "Europa | Italy | Venedig"),
stringsAsFactors = FALSE)
df2 <- data.frame("GE" = c("Deutschland", "Italien", "England"),
"EN" = c("Germany", "Italy", "UK"),
stringsAsFactors = FALSE)