Functional programming problems: map_df & regex

I am trying to combine multiple spreadsheets (about 20) using a functional programming approach. Each spreadsheet contains an individual year of data. They are messy: some columns are unnamed, and the name of the same column changes from one spreadsheet to the next.

I originally cleaned up each spreadsheet individually, but I want to learn how to do it with functional programming to make it more reproducible.

My approach was to build a regex that matches all the different names of the specified column, then rename the column using a custom function. I thought I could then use map_dfr to apply this function to all the spreadsheets and produce a final data frame to work with.

However, I have encountered two problems:

  1. The regex engine in R seems to have the global parameter on, with no way to switch it off. I want to try the alternatives in the regex in sequence and stop at the first match, rather than returning all matches. For example, after I import the spreadsheets there are sometimes multiple unnamed columns, which are given names such as ...1, and I only want to match the first instance. I cannot work out whether it is possible to disable the global parameter, or whether there is a cleverer way of writing the regex so that it stops after the first match. Is there another, perhaps better, way of approaching this?

  2. When I pass my custom function, which seems to work well enough on individual data frames, to map_df, I get an error and I am not sure why.

I have produced a minimal reprex below, which I think highlights the issues.

All thoughts gratefully received, including alternative approaches, as this must be a very common problem. Thanks.

library(tidyverse)

year_1 <- tribble(
  ~`...1`, ~admissions,
  "Hospital 1", 10,
  "Hospital 2", 100,
  "hospital 3", 200
)

year_2 <- tribble(
  ~provider_code, ~`...2`, ~admissions,
  "H1", "Hospital 1", 20,
  "H2", "Hospital 2", 400,
  "H3", "hospital 3", 500
)

year_3 <- tribble(
  ~"Hospital provider code", ~"Commissioning region/Provider", ~admissions,
  "H1", "Hospital 1", 350,
  "H2", "Hospital 2", 350,
  "H3", "hospital 3", 550
)


clean_up_area_column_name <- function(x){
  rename({{x}}, area = matches("\\.{3}[0-9]|commissioning region|hospital provider", ignore.case = TRUE))
  }

clean_up_area_column_name(year_1)
#> # A tibble: 3 × 2
#>   area       admissions
#>   <chr>           <dbl>
#> 1 Hospital 1         10
#> 2 Hospital 2        100
#> 3 hospital 3        200

clean_up_area_column_name(year_2)
#> # A tibble: 3 × 3
#>   provider_code area       admissions
#>   <chr>         <chr>           <dbl>
#> 1 H1            Hospital 1         20
#> 2 H2            Hospital 2        400
#> 3 H3            hospital 3        500

clean_up_area_column_name(year_3)
#> # A tibble: 3 × 3
#>   area1 area2      admissions
#>   <chr> <chr>           <dbl>
#> 1 H1    Hospital 1        350
#> 2 H2    Hospital 2        350
#> 3 H3    hospital 3        550

test_df <- map_dfr(c(year_1, year_2, year_3), clean_up_area_column_name)
#> Error in UseMethod("rename"): no applicable method for 'rename' applied to an object of class "character"

Created on 2022-08-08 by the reprex package (v2.0.1)

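Passing multiple data frames to map requires a list. c(year_1, year_2, year_3) concatenates the data frames into one flat list of their columns, so map_dfr calls your function on each column (a character or numeric vector) rather than on each data frame, which is why rename reports an object of class "character".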

test_df <- map_dfr(list(year_1, year_2, year_3), clean_up_area_column_name)

# A tibble: 9 x 5
  area       admissions provider_code area1 area2     
  <chr>           <dbl> <chr>         <chr> <chr>     
1 Hospital 1         10 NA            NA    NA        
2 Hospital 2        100 NA            NA    NA        
3 hospital 3        200 NA            NA    NA        
4 Hospital 1         20 H1            NA    NA        
5 Hospital 2        400 H2            NA    NA        
6 hospital 3        500 H3            NA    NA        
7 NA                350 NA            H1    Hospital 1
8 NA                350 NA            H2    Hospital 2
9 NA                550 NA            H3    hospital 3
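
To see the difference between c() and list() in isolation, here is a minimal sketch in base R. The two small data frames (year_a, year_b) are made up for illustration and are not part of the original reprex.

```r
# Hypothetical toy data frames, only to illustrate c() versus list()
year_a <- data.frame(x = 1:2, y = c("a", "b"))
year_b <- data.frame(x = 3:4)

# c() concatenates the columns of both data frames into one flat list
flat <- c(year_a, year_b)

# list() keeps each data frame as a single element
nested <- list(year_a, year_b)

length(flat)    # one element per column across both data frames
length(nested)  # one element per data frame

# None of the elements of `flat` is a data frame, so a function that
# expects a data frame (such as rename) fails on the first element
vapply(flat, is.data.frame, logical(1))
vapply(nested, is.data.frame, logical(1))
```

Mapping over nested hands each data frame to the function as a whole, which is what clean_up_area_column_name expects.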

If you only want to grab the first instance, as you say, then the following tweak to your function should work. Rename any "area1" to "area", then deselect the remaining "area" columns with trailing digits (area2, area3, etc.).

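clean_up_area_column_name <- function(x) {
  x %>%
    # Several matches are renamed area1, area2, ...; a single match becomes area
    rename(area = matches("\\.{3}[0-9]|commissioning region|hospital provider")) %>%
    # Keep the first match as area
    rename(area = matches("^area1$")) %>%
    # Drop any remaining numbered matches (area2, area3, ...)
    select(-matches("^area\\d+$"))
}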

I'm not sure what you expect year_3 to return, as your regex matches the provider code column ("Hospital provider code") as well as the area column ("Commissioning region/Provider"), so the provider codes end up in area:

map_dfr(list(year_1, year_2, year_3), clean_up_area_column_name)

# A tibble: 9 × 3
  area       admissions provider_code
  <chr>           <dbl> <chr>        
1 Hospital 1         10 NA           
2 Hospital 2        100 NA           
3 hospital 3        200 NA           
4 Hospital 1         20 H1           
5 Hospital 2        400 H2           
6 hospital 3        500 H3           
7 H1                350 NA           
8 H2                350 NA           
9 H3                550 NA  
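
On the "global" question: as far as I know there is no such flag to switch off here, because matches() is a column selector that returns every column whose name matches the pattern, not a search within a single string. One alternative, sketched below under some assumptions, is to find the matching column positions with base grep() and rename only the first hit. The helper rename_first_match, the tightened pattern (which drops the "hospital provider" alternative so the provider code column is left alone), and the trimmed-down sample data are all hypothetical; adjust them to your real column names.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: rename only the first column whose name matches `pattern`
rename_first_match <- function(df, pattern, new_name = "area") {
  hits <- grep(pattern, names(df), ignore.case = TRUE)
  if (length(hits) > 0) {
    names(df)[hits[1]] <- new_name  # later matches keep their original names
  }
  df
}

# Assumed pattern: an auto-generated name such as ...1, or the commissioning region column
area_pattern <- "^\\.{3}[0-9]+$|commissioning region"

# Trimmed-down, made-up versions of the yearly sheets
sheet_a <- tribble(
  ~`...1`,      ~`...3`, ~admissions,
  "Hospital 1", "note",  10,
  "Hospital 2", "note",  100
)
sheet_b <- tribble(
  ~`Hospital provider code`, ~`Commissioning region/Provider`, ~admissions,
  "H1",                      "Hospital 1",                     350,
  "H2",                      "Hospital 2",                     350
)

# Apply the helper to each data frame in a list, then stack the results
combined <- bind_rows(
  lapply(list(sheet_a, sheet_b), rename_first_match, pattern = area_pattern)
)
names(combined)
```

In sheet_a only ...1 is renamed and ...3 is left as it is; in sheet_b the provider code column keeps its name and the commissioning region column becomes area. The same helper can be passed to map_dfr(list(...), rename_first_match, pattern = area_pattern) if you prefer to stay with purrr.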
