
Replace values in one column based on another dataframe in R

I have a dataframe with over 20k observations. One of the columns is "city names" (df$city). There are over 600 unique city names. Some of them are misspelled.

Example of my dataframe:

> df$city
[1] "BOSTN" "LOS ANGELOS" "NYC" "CHICAGOO" 
[2] "SEATTLE" "BOSTON" "NEW YORK CITY"

I have a csv file I created that has a list of all the misspelled city names and what the correct name should be.

> head(city)
           city    city_incorrect
1 BOSTON                    BOSTN
2 LOS ANGELES         LOS ANGELOS
3 NEW YORK CITY               NYC
4 CHICAGO                CHICAGOO

Ideally I would write code that replaces values in df$city based on the "city.csv" file.

Note: I originally posted this question and someone suggested I use merge. I don't think this is the most efficient way to solve my problem, because I would also have to include the 600 correctly spelled cities in my "city.csv" file, or I'd need an additional step that combines the two columns from the merged dataframe. So I think it's probably easier to just REPLACE values in df$city based on "city.csv".

EDIT: Here's a more detailed look at my dataframe:

> df[1:5]
id   owner   city            state
1    AAAAA   BOSTN              MA
2    BBBBB   LOS ANGELOS        CA
3    CCCCC   NYC                NY
4    DDDDD   CHICAGOO           IL
5    EEEEE   BOSTON             MA
6    FFFFF   SEATTLE            WA
7    GGGGG   NEW YORK CITY      NY
8    HHHHH   LOS ANGELES        CA

If I use merge or cbind, won't it just create another column at the end of my dataframe, like this:

> merge()
id   owner   city            state     city_correct
1    AAAAA   BOSTN              MA           BOSTON
2    BBBBB   LOS ANGELOS        CA      LOS ANGELES
3    CCCCC   NYC                NY    NEW YORK CITY
4    DDDDD   CHICAGOO           IL          CHICAGO
5    EEEEE   BOSTON             MA
6    FFFFF   SEATTLE            WA
7    GGGGG   NEW YORK CITY      NY
8    HHHHH   LOS ANGELES        CA

So the misspelled cities will be corrected, but the cities that are already spelled correctly will be left blank in the new column. What I want in the end is one column that has all the corrected city names.

One approach with base::merge() is to include rows in the lookup table that have the correct value of city, and merge that table with the original data. We'll call the "correct" city names correctedCity, and merge as follows:

cityText <- "id,owner,city,state
1,AAAAA,BOSTN,MA
2,BBBBB,LOS ANGELOS,CA
3,CCCCC,NYC,NY
4,DDDDD,CHICAGOO,IL
5,EEEEE,BOSTON,MA
6,FFFFF,SEATTLE,WA
7,GGGGG,NEW YORK CITY,NY
8,HHHHH,LOS ANGELES,CA"

cities <- read.csv(text = cityText, header = TRUE, stringsAsFactors = FALSE)

# first, find all the distinct versions of city
library(sqldf)
distinctCities <- sqldf("select city, count(*) as count from cities group by city")

# create lookup table, and include rows for items that are already correct 
tableText <- "city,correctedCity
BOSTN,BOSTON
BOSTON,BOSTON
CHICAGOO,CHICAGO
LOS ANGELES,LOS ANGELES
LOS ANGELOS,LOS ANGELES
NEW YORK CITY,NEW YORK CITY
NYC,NEW YORK CITY
SEATTLE,SEATTLE"

cityTable <- read.csv(text = tableText,header = TRUE,stringsAsFactors = FALSE)
corrected <- merge(cities,cityTable,by = "city")
corrected

...and the output:

> corrected
           city id owner state correctedCity
1         BOSTN  1 AAAAA    MA        BOSTON
2        BOSTON  5 EEEEE    MA        BOSTON
3      CHICAGOO  4 DDDDD    IL       CHICAGO
4   LOS ANGELES  8 HHHHH    CA   LOS ANGELES
5   LOS ANGELOS  2 BBBBB    CA   LOS ANGELES
6 NEW YORK CITY  7 GGGGG    NY NEW YORK CITY
7           NYC  3 CCCCC    NY NEW YORK CITY
8       SEATTLE  6 FFFFF    WA       SEATTLE
>
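One caveat with this approach: the default merge() is an inner join, so any spelling that is missing from the lookup table drops out of the result without a warning. A minimal sketch of a coverage check, run before merging, follows. The "DENVER" row is a hypothetical value added only to trigger the case, and missingFromLookup is an illustrative name.

```r
# Cut-down stand-in for `cities`; "DENVER" is a hypothetical city
# that was never added to the lookup table
cities <- data.frame(
  id   = 1:3,
  city = c("BOSTN", "BOSTON", "DENVER"),
  stringsAsFactors = FALSE
)

cityTable <- data.frame(
  city          = c("BOSTN", "BOSTON"),
  correctedCity = c("BOSTON", "BOSTON"),
  stringsAsFactors = FALSE
)

# Spellings present in the data but absent from the lookup table
missingFromLookup <- setdiff(unique(cities$city), cityTable$city)
missingFromLookup

# The default (inner) merge keeps only rows whose city is in the lookup
merged <- merge(cities, cityTable, by = "city")
nrow(merged)
```

If missingFromLookup is not empty, either extend the lookup table or use all.x = TRUE in merge().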

At this point one can drop the original values and keep the corrected version.

# rename & keep corrected version
library(dplyr)
corrected %>% select(-city) %>% rename(city = correctedCity) 
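Because merge() sorts its result by the key column by default, the rows no longer follow the original id order. The following is a minimal base R sketch of the same cleanup that also restores the original row and column order. It assumes id uniquely identifies each row, as in the sample data, and uses a cut-down copy of the tables.

```r
# Cut-down stand-in for the question's data
cities <- data.frame(
  id    = 1:4,
  owner = c("AAAAA", "BBBBB", "EEEEE", "FFFFF"),
  city  = c("BOSTN", "LOS ANGELOS", "BOSTON", "SEATTLE"),
  state = c("MA", "CA", "MA", "WA"),
  stringsAsFactors = FALSE
)

# Lookup table that also lists the names that are already correct
cityTable <- data.frame(
  city          = c("BOSTN", "BOSTON", "LOS ANGELOS", "SEATTLE"),
  correctedCity = c("BOSTON", "BOSTON", "LOS ANGELES", "SEATTLE"),
  stringsAsFactors = FALSE
)

corrected <- merge(cities, cityTable, by = "city")

# Overwrite city with the corrected value and drop the helper column
corrected$city <- corrected$correctedCity
corrected$correctedCity <- NULL

# merge() sorted the rows by city; put them back in id order and
# restore the original column order
corrected <- corrected[order(corrected$id), names(cities)]
rownames(corrected) <- NULL
corrected
```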

An alternative, as noted in the comments to the OP, would be to create a lookup table that contains rows only for the misspelled city names. In this case we would use the argument all.x = TRUE in merge() to keep all rows from the main data frame, and assign the non-missing values of correctedCity to city.

tableText <- "city,correctedCity
BOSTN,BOSTON
CHICAGOO,CHICAGO
LOS ANGELOS,LOS ANGELES
NYC,NEW YORK CITY"

cityTable <- read.csv(text = tableText,header = TRUE,stringsAsFactors = FALSE)
corrected <- merge(cities,cityTable,by = "city",all.x = TRUE)
corrected$city[!is.na(corrected$correctedCity)] <- corrected$correctedCity[!is.na(corrected$correctedCity)]
corrected

...and the output:

> corrected
           city id owner state correctedCity
1        BOSTON  1 AAAAA    MA        BOSTON
2        BOSTON  5 EEEEE    MA          <NA>
3       CHICAGO  4 DDDDD    IL       CHICAGO
4   LOS ANGELES  8 HHHHH    CA          <NA>
5   LOS ANGELES  2 BBBBB    CA   LOS ANGELES
6 NEW YORK CITY  7 GGGGG    NY          <NA>
7 NEW YORK CITY  3 CCCCC    NY NEW YORK CITY
8       SEATTLE  6 FFFFF    WA          <NA>
>

At this point, correctedCity can be dropped from the data frame.
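A small sketch of that last step on a cut-down copy of the data. The hasFix helper and the stopifnot() row-count check are illustrative additions, not part of the approach above. The check assumes each misspelling appears only once in the lookup table.

```r
cities <- data.frame(
  id   = 1:4,
  city = c("BOSTN", "BOSTON", "NYC", "SEATTLE"),
  stringsAsFactors = FALSE
)

# Lookup table holding only the misspelled names
cityTable <- data.frame(
  city          = c("BOSTN", "NYC"),
  correctedCity = c("BOSTON", "NEW YORK CITY"),
  stringsAsFactors = FALSE
)

corrected <- merge(cities, cityTable, by = "city", all.x = TRUE)

# Replace only where a correction exists
hasFix <- !is.na(corrected$correctedCity)
corrected$city[hasFix] <- corrected$correctedCity[hasFix]

# Drop the helper column
corrected$correctedCity <- NULL

# With unique keys in the lookup, a left merge keeps the row count
stopifnot(nrow(corrected) == nrow(cities))
corrected
```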

It appears to me that what you're trying to do is match and replace incorrect city names in one dataframe by correct city names in another dataframe. If this is correct then this dplyr solution should work.

Data:

Dataframe with pairs of correct and incorrect city names:

city <- data.frame(
  city_correct = c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO"),
  city_incorrect = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO"), stringsAsFactors = FALSE)

Dataframe with mix of correct and incorrect city names:

set.seed(123)
df <- data.frame(town = sample(c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO","BOSTN", 
                                 "LOS ANGELOS", "NYC", "CHICAGOO"), 20, replace = TRUE), stringsAsFactors = FALSE)

Solution:

library(dplyr)
df <- left_join(df, city, by = c("town" = "city_incorrect"))
df$town_correct <- ifelse(is.na(df$city_correct), df$town, df$city_correct)
df$city_correct <- NULL
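The same three steps can also be written as a single pipeline. This is a sketch under the assumption that dplyr is installed. It uses dplyr's coalesce(), which takes the first non-missing value at each position, in place of the ifelse() call. The data here is a cut-down copy, with "SEATTLE" standing in for a town that has no entry in the lookup.

```r
library(dplyr)

city <- data.frame(
  city_correct   = c("BOSTON", "NEW YORK CITY"),
  city_incorrect = c("BOSTN", "NYC"),
  stringsAsFactors = FALSE
)

df <- data.frame(
  town = c("BOSTN", "BOSTON", "NYC", "SEATTLE"),
  stringsAsFactors = FALSE
)

df <- df %>%
  left_join(city, by = c("town" = "city_incorrect")) %>%
  # use the correction when there is one, otherwise keep the original town
  mutate(town_correct = coalesce(city_correct, town)) %>%
  select(-city_correct)
df
```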

EDIT:

Another solution, in base R, is this:

# use the matched correction where the town is a known misspelling,
# otherwise keep the town as it is
df$town_correct <- ifelse(df$town %in% city$city_incorrect, 
                          city$city_correct[match(df$town, city$city_incorrect)], 
                          df$town)

Result:

df
            town  town_correct
1  NEW YORK CITY NEW YORK CITY
2            NYC NEW YORK CITY
3        CHICAGO       CHICAGO
4       CHICAGOO       CHICAGO
5       CHICAGOO       CHICAGO
6         BOSTON        BOSTON
7          BOSTN        BOSTON
8       CHICAGOO       CHICAGO
9          BOSTN        BOSTON
10       CHICAGO       CHICAGO
11      CHICAGOO       CHICAGO
12       CHICAGO       CHICAGO
13   LOS ANGELOS   LOS ANGELES
14         BOSTN        BOSTON
15        BOSTON        BOSTON
16      CHICAGOO       CHICAGO
17   LOS ANGELES   LOS ANGELES
18        BOSTON        BOSTON
19 NEW YORK CITY NEW YORK CITY
20      CHICAGOO       CHICAGO
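A further base R variant, sketched here as one possible alternative, builds a named vector from the lookup table and indexes it by town. The names lookup and fixed are illustrative, and "SEATTLE" stands in for a town that appears in neither column of the lookup.

```r
city <- data.frame(
  city_correct   = c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO"),
  city_incorrect = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO"),
  stringsAsFactors = FALSE
)

df <- data.frame(
  town = c("BOSTN", "BOSTON", "NYC", "SEATTLE", "CHICAGOO"),
  stringsAsFactors = FALSE
)

# Named vector: names are the misspellings, values are the corrections
lookup <- setNames(city$city_correct, city$city_incorrect)

# Indexing by name returns NA for towns that are not in the lookup
fixed <- unname(lookup[df$town])

# Keep the original spelling wherever no correction was found
df$town_correct <- ifelse(is.na(fixed), df$town, fixed)
df
```

Towns without an entry in the lookup pass through unchanged, so the lookup only needs to list the misspellings.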
