Compare columns of two different tables and replace specific words in a string in R

Can someone give me some advice? I am trying to compare two columns. One column holds address strings, and the other belongs to a lookup table of country names. Some country names in the addresses are in English, and I want to replace them with the German term. I am also very limited in which additional packages I can use, since I want to run the script inside a database. My code doesn't really work: it only replaces one row.

df1

DE
Europa | Deutschland | München
Europa | England     | London
Europa | Germany     | Berlin
Europa | Italy       | Venedig

df2

GE              EN
Deutschland     Germany
Italien         Italy
England         UK

Result: df1

DE
Europa | Deutschland | München
Europa | England     | London
Europa | Deutschland | Berlin
Europa | Italien     | Venedig

I tried the following code:

df1 <- data.frame("DE" = c("Europa | Deutschland | München", "Europa | England | London", "Europa | Germany | Berlin ", "Europa | Italy | Venedig"))
df2 <- data.frame("GE" = c("Deutschland", "Italien", "England"), "EN" = c("Germany", "Italy", "UK"))
df1[] <- lapply(df1, as.character)
df2[] <- lapply(df2, as.character)

for(i in seq_along(df1)) df1$DE <- gsub(df2$EN, df2$GE, df1$DE, fixed = FALSE)

You should index with [i] inside the for loop, iterate over the rows of df2 rather than the columns of df1, and use fixed = TRUE, since the patterns are fixed strings and not regular expressions. The modified code:

for(i in seq_along(df2$EN)) {
    df1$DE <- gsub(df2$EN[i], df2$GE[i], df1$DE, fixed = TRUE)
}
df1$DE

## [1] "Europa | Deutschland | München"
## [2] "Europa | England | London"     
## [3] "Europa | Deutschland | Berlin "
## [4] "Europa | Italien | Venedig" 
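To see why the original call changed only one row: gsub() is not vectorised over pattern and replacement, so when it is given whole columns it uses only their first elements. A minimal sketch on a shortened copy of the example data (the names x, en and ge are illustrative only):

```r
x  <- c("Europa | Germany | Berlin", "Europa | Italy | Venedig")
en <- c("Germany", "Italy")
ge <- c("Deutschland", "Italien")

# Whole vectors as pattern/replacement: only en[1] and ge[1] are used.
# R signals this with a warning, silenced here to keep the output short.
res_vec <- suppressWarnings(gsub(en, ge, x, fixed = TRUE))

# One call per lookup row: every pair is applied in turn.
res_loop <- x
for (i in seq_along(en)) {
  res_loop <- gsub(en[i], ge[i], res_loop, fixed = TRUE)
}

res_vec   # "Italy" is left untouched
res_loop  # both country names are translated
```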

P.S. You can use stringsAsFactors = FALSE in data.frame() to get strings instead of factors (this is the default from R 4.0.0 onwards):

df1 <- data.frame("DE" = c("Europa | Deutschland | München",
                           "Europa | England | London", 
                           "Europa | Germany | Berlin ",
                           "Europa | Italy | Venedig"),
                  stringsAsFactors = FALSE)

df2 <- data.frame("GE" = c("Deutschland", "Italien", "England"), 
                  "EN" = c("Germany", "Italy", "UK"),
                  stringsAsFactors = FALSE)

Here is a solution based on merge and replace. I split the column because I only want to replace the names in the second field. With gsub in a for-loop, matching words in the other fields could also be replaced. df4 is the final output.
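As a small illustration of that risk, consider a made-up address (not part of the question's data) whose city field happens to contain an English country name. The variable names here are hypothetical:

```r
# Hypothetical address: the third field contains the word "Germany".
addr <- "Amerika | USA | Germany Valley"

# A plain substring replacement also rewrites the city field.
replaced <- gsub("Germany", "Deutschland", addr, fixed = TRUE)
replaced

# Splitting first lets us test the country field alone.
parts <- trimws(strsplit(addr, "|", fixed = TRUE)[[1]])
country_is_english <- parts[2] == "Germany"
country_is_english
```

Here the country field is "USA", so nothing should have been replaced at all.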

Step 1: Separate the column in df1 by "|".

df1_1 <- as.data.frame(do.call(rbind, lapply(strsplit(df1$DE, split = "\\|"), trimws)),
                       stringsAsFactors = FALSE)

Step 2: Merge df1_1 and df2

df3 <- merge(df1_1, df2, by.x = "V2", by.y = "EN", all.x = TRUE)

Step 3: Replace the values where the GE column is not NA.

df3$V2 <- ifelse(!is.na(df3$GE), df3$GE, df3$V2)

Step 4: Collapse all columns. Prepare the final output.

df3$DE <- apply(df3[, c("V1", "V2", "V3")], 1, paste, collapse = " | ")
df4 <- df3[, "DE", drop = FALSE] 

df4
#                               DE
# 1 Europa | Deutschland | München
# 2      Europa | England | London
# 3  Europa | Deutschland | Berlin
# 4     Europa | Italien | Venedig
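One thing to keep in mind: merge() sorts the result by the key column by default, so the rows only come back in df1's order here because the country names happen to be sorted already. A possible base-R variant that keeps the original row order uses match() instead of merge(). This is only a sketch: translate_row is a hypothetical helper, and it assumes every address has the country in its second field.

```r
df1 <- data.frame(DE = c("Europa | Deutschland | München",
                         "Europa | England | London",
                         "Europa | Germany | Berlin ",
                         "Europa | Italy | Venedig"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(GE = c("Deutschland", "Italien", "England"),
                  EN = c("Germany", "Italy", "UK"),
                  stringsAsFactors = FALSE)

# Split each address into trimmed fields.
parts <- lapply(strsplit(df1$DE, "|", fixed = TRUE), trimws)

# Hypothetical helper: translate the second field if it is in df2$EN.
translate_row <- function(p) {
  idx <- match(p[2], df2$EN)          # NA when there is no English match
  if (!is.na(idx)) p[2] <- df2$GE[idx]
  paste(p, collapse = " | ")
}

# vapply keeps one result per input row, in the original order.
df1$DE <- vapply(parts, translate_row, character(1))
df1
```

As in the merge version, trimws() also drops the trailing space after "Berlin".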

DATA

df1 <- data.frame("DE" = c("Europa | Deutschland | München", "Europa | England | London", "Europa | Germany | Berlin ", "Europa | Italy | Venedig"),
                  stringsAsFactors = FALSE)

df2 <- data.frame("GE" = c("Deutschland", "Italien", "England"), 
                  "EN" = c("Germany", "Italy", "UK"),
                  stringsAsFactors = FALSE)

