R: Passing multiple dataframe columns to dplyr::case_when() as condition while using column title as replacement

I want to pass all values in a dataframe as conditions to dplyr::case_when() with stringr::str_detect(), while using the respective column title as the replacement value.

I have these two data frames:

> print(city_stack)
# A tibble: 11 × 1
   city                    
   <chr>                   
 1 Britz                   
 2 Berlin-Reinickendorf    
 3 Berlin-Kladow           
 4 Berlin-Spindlersfeld    
 5 Berlin-Mahlsdorf        
 6 Berlin-Lichterfelde     
 7 Berlin-Spandau          
 8 Berlin-Biesdorf         
 9 Berlin-Niederschöneweide
10 Rüdersdorf bei Berlin   
11 Berlin-Nordend    

> print(districts_stack)
# A tibble: 10 × 2
   Berlin         Köln               
   <chr>          <chr>              
 1 Adlershof      Rodenkirchen       
 2 Altglienicke   Chorweiler         
 3 Baumschulenweg Ehrenfeld          
 4 Biesdorf       Kalk               
 5 Blankenburg    Lindenthal         
 6 Blankenfelde   Mülheim            
 7 Bohnsdorf      Nippes             
 8 Britz          Porz               
 9 Buch           Kölner Zoo         
10 Buckow         Universität zu Köln

I tried using a nested for loop:

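library(dplyr)
library(stringr)

for (i in colnames(districts_stack)) {
  for (j in districts_stack[[i]]) {
    # If a city contains district j, replace it with the column title i
    city_stack <- mutate(city_stack, city = case_when(
      str_detect(city, j) ~ i,
      TRUE ~ city
    ))
  }
}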

While that works, it is extremely inefficient and becomes a problem with the huge dataframe I am actually working with. I feel like there should be a more efficient solution using purrr::map(), but I wasn't able to come up with anything that works.

dput() of the dataframes:

> dput(city_stack[1:11,])
structure(list(city = c("Britz", "Berlin-Reinickendorf", "Berlin-Kladow", 
"Berlin-Spindlersfeld", "Berlin-Mahlsdorf", "Berlin-Lichterfelde", 
"Berlin-Spandau", "Berlin-Biesdorf", "Berlin-Niederschöneweide", 
"Rüdersdorf bei Berlin", "Berlin-Nordend")), row.names = c(NA, 
-11L), class = c("tbl_df", "tbl", "data.frame"))

> dput(districts_stack[1:10,1:2])
structure(list(Berlin = c("Adlershof", "Altglienicke", "Baumschulenweg", 
"Biesdorf", "Blankenburg", "Blankenfelde", "Bohnsdorf", "Britz", 
"Buch", "Buckow"), Köln = c("Rodenkirchen", "Chorweiler", "Ehrenfeld", 
"Kalk", "Lindenthal", "Mülheim", "Nippes", "Porz", "Kölner Zoo", 
"Universität zu Köln")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

I'm not 100% sure what output you're looking for. However, I believe this is a step in the right direction. Rather than looping over the district values and checking for matches, I propose melting the districts_stack data and joining that new df to the city names using a fuzzy string match.

That is what I understand the loop to be doing. You then have a dataframe in which you can replace the city value more easily using if_else.

I drew inspiration from this thread: dplyr: inner_join with a partial string match

library(tidyverse)
library(fuzzyjoin) # to join the data based on fuzzy matches to get results in one dataframe for easier manipulation

city_stack <- structure(list(city = c("Britz", "Berlin-Reinickendorf", "Berlin-Kladow",
  "Berlin-Spindlersfeld", "Berlin-Mahlsdorf", "Berlin-Lichterfelde",
  "Berlin-Spandau", "Berlin-Biesdorf", "Berlin-Niederschöneweide",
  "Rüdersdorf bei Berlin", "Berlin-Nordend")),
  row.names = c(NA, -11L), class = c("tbl_df", "tbl", "data.frame"))

# Melt the districts into long format: one row per (city, district) pair
districts_stack <- structure(list(Berlin = c("Adlershof", "Altglienicke", "Baumschulenweg",
  "Biesdorf", "Blankenburg", "Blankenfelde", "Bohnsdorf", "Britz",
  "Buch", "Buckow"), Köln = c("Rodenkirchen", "Chorweiler", "Ehrenfeld",
  "Kalk", "Lindenthal", "Mülheim", "Nippes", "Porz", "Kölner Zoo",
  "Universität zu Köln")),
  row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame")) %>%
  pivot_longer(cols = everything(), names_to = "city", values_to = "district") %>%
  arrange(city)

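# Left join keeps every city; rows without a matching district get NA in city.y and district
city_stack %>%
  regex_left_join(districts_stack, by = c(city = "district")) %>%
  # Replace matched cities with the district; use city.y here instead to get the column title (e.g. "Berlin")
  mutate(city.x = if_else(!is.na(city.y), district, city.x))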
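If the goal is the column title rather than the district, and you would rather not add a join, here is a minimal base R sketch of another option. It is not the approach above but a swapped-in technique: collapse each column of districts into a single alternation pattern, so the loop runs once per column instead of once per district. The helper name replace_with_title is made up for illustration, the data is a small subset of the question's, and I am assuming the first matching column should win.

```r
# Subset of the question's data, as plain data frames (assumption: the real
# tibbles behave the same way for these base R calls)
city_stack <- data.frame(
  city = c("Britz", "Berlin-Reinickendorf", "Berlin-Kladow", "Berlin-Biesdorf"),
  stringsAsFactors = FALSE
)
districts_stack <- data.frame(
  Berlin = c("Adlershof", "Biesdorf", "Britz", "Buch"),
  "Köln" = c("Kalk", "Porz", "Nippes", "Ehrenfeld"),
  stringsAsFactors = FALSE, check.names = FALSE
)

# One pattern per column. \Q...\E makes PCRE treat each district literally,
# so characters such as "." or "(" in a name are not read as regex syntax.
patterns <- vapply(
  districts_stack,
  function(d) paste0("\\Q", d, "\\E", collapse = "|"),
  character(1)
)

# Hypothetical helper: replace every city that contains a district with the
# title of the column that district came from; unmatched cities stay as-is.
replace_with_title <- function(city, patterns) {
  out <- city
  matched <- rep(FALSE, length(city))
  for (title in names(patterns)) {
    # grepl() is vectorised over all cities, so this is one pass per column
    hit <- !matched & grepl(patterns[[title]], city, perl = TRUE)
    out[hit] <- title
    matched <- matched | hit
  }
  out
}

city_stack$city <- replace_with_title(city_stack$city, patterns)
```

On a very large district list the combined pattern can get long, so it may be worth splitting a column into chunks. I have not benchmarked this against the fuzzyjoin version.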
