简体   繁体   中英

r string data cleaning

I have dataset with some weird mix of strings like this

   ID   State
   1    NA
   2    IL
   3    IL,IL,IL
   4    OH,IL
   5    NM,NM,AL,AL
   6    FL,FL,FL

I like to

  • replace State values with NA if they are two different states and
  • replace State values with unique values if they are same but repeated.

Expected dataset

   ID   State
   1    NA
   2    IL
   3    IL
   4    NA
   5    NA
   6    FL

I tried paste(unique(df$State), collapse=",") but that did not work. Any suggestions regarding this is much appreciated. Thanks.

Split the State values on comma ( , ) and replace it with NA values if there are two different values in them. Use distinct to keep only the unique rows for each ID .

library(dplyr)
library(tidyr)

df %>%
  separate_rows(State, sep = ',\\s*') %>%
  group_by(ID) %>%
  mutate(State = replace(State, n_distinct(State) > 1, NA)) %>%
  distinct() %>%
  ungroup()

#     ID State
#  <int> <chr>
#1     1 NA   
#2     2 IL   
#3     3 IL   
#4     4 NA   
#5     5 NA   
#6     6 FL   

Base R solution:

with(
  df,
  vapply(
    State,
    function(x){
      y <- toString(
        unique(
          unlist(
            strsplit(
              x,
              ","
              )
            )
          )
        )
      ifelse(
        grepl(
          ",|NA",
          y),
        NA_character_,
        y
      )
    },
    character(1),
    USE.NAMES = FALSE
  )
)

Tidyverse solution:

library(tidyverse)
str_split(df$State, ",") %>% 
  map(function(x) str_c(unique(x), collapse = ", ")) %>% 
  map_chr(function(y) if_else(str_detect(y, ","), NA_character_, y))

Data:

df <- structure(list(ID = 1:6, State = c(NA, "IL", "IL,IL,IL", "OH,IL", 
"NM,NM,AL,AL", "FL,FL,FL")), class = "data.frame", row.names = c(NA, 
-6L))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM