How to separate and categorize based on a regular expression in an R time series data frame

Question

Hello, I have a time series data frame made up of a list of products and their different tax rates. I need to split each rate into two categories: the leading percentage number (AV) and the remaining text (SPEC). The two parts are separated by the first plus sign in the character vector:

#note there are many more years
product <- c("01","02")
yr1<-c("0%","11.5% + 190 GBP/100kg")
yr2<-c("0%","15% + 190 GBP/100kg + MAX 8.5%/100kg")
yearnum =2

sched <- data.frame(product,yr1,yr2)

#where yearnum is the number of years
schedule<-c(paste0("yr",1:yearnum))
#categorize av and specific DUTY rates
for(j in 1:yearnum){
  for(i in schedule){
  sched <- sched %>% separate(i, c(paste0("av.yr",j), paste0("spec.yr",j)), " \\+ ", remove=F, extra = "merge")}}

I'm trying to separate them into the result below, but there is something wrong with how I wrote my for loop. Could anyone please help?

#and the output should be
product <- c("01","02")
yr1<-c("0%","11.5% + 190 GBP/100kg")
yr2<-c("0%","15% + 190 GBP/100kg + MAX 8.5%/100kg") 
av.yr1<- c("0%","11.5%")
av.yr2 <-c("0%","15%")
spec.yr1 <-c("","190 GBP/100kg")
spec.yr2 <-c("","190 GBP/100kg + MAX 8.5%/100kg")

sched<-data.frame(product,yr1,yr2,av.yr1,av.yr2,spec.yr1,spec.yr2)

Answer 1

If you have lots of years, I think the best thing to do is to pivot your data into long format, use separate or mutate with regular expressions, and then pivot back to wide format with pivot_wider.

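library(tidyr)  # pivot_longer(), separate(), pivot_wider(); also re-exports %>%

# fill = "right" pads rows that contain no " + " with NA instead of warning
pivot_longer(sched, -product) %>%
  separate(value, into = c("av", "spec"), sep = " [+] ", extra = "merge", fill = "right") %>%
  pivot_wider(names_from = name, values_from = av:spec, names_sep = ".")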

Output:

  product av.yr1 av.yr2 spec.yr1      spec.yr2                      
  <chr>   <chr>  <chr>  <chr>         <chr>                         
1 01      0%     0%     NA            NA                            
2 02      11.5%  15%    190 GBP/100kg 190 GBP/100kg + MAX 8.5%/100kg
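
To make the pivot step easier to follow, here is a small sketch of the intermediate long format, before pivot_wider is applied. It rebuilds the question's sample data; the names long and long_split are only illustrative, and fill = "right" is an assumption added to avoid the warning for rows without a plus sign.

```r
library(tidyr)

# Sample data from the question
sched <- data.frame(
  product = c("01", "02"),
  yr1 = c("0%", "11.5% + 190 GBP/100kg"),
  yr2 = c("0%", "15% + 190 GBP/100kg + MAX 8.5%/100kg"),
  stringsAsFactors = FALSE
)

# One row per product/year pair; default column names are "name" and "value"
long <- pivot_longer(sched, -product)

# Split at the first " + " only; extra = "merge" keeps later plus signs in spec
long_split <- separate(long, value, into = c("av", "spec"),
                       sep = " [+] ", extra = "merge", fill = "right")
print(long_split)
```

Each year column becomes its own row here, so a single separate call handles every year no matter how many there are.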

Here is an option using mutate, which retains the original columns as well:

library(tidyr)
library(dplyr)    # mutate()
library(stringr)  # str_extract(), str_remove()

pivot_longer(sched, -product, values_to = "yr", names_prefix = "yr") %>%
  mutate(av.yr = str_extract(yr, "^\\d*[.]?\\d*%"),
         spec.yr = str_remove(yr, "^\\d*[.]?\\d*%( [+] )?")) %>%
  pivot_wider(names_from = name, values_from = yr:spec.yr, names_sep = "")

Output:

  product yr1                   yr2                                  av.yr1 av.yr2 spec.yr1        spec.yr2                        
  <chr>   <chr>                 <chr>                                <chr>  <chr>  <chr>           <chr>                           
1 01      0%                    0%                                   0%     0%     ""              ""                              
2 02      11.5% + 190 GBP/100kg 15% + 190 GBP/100kg + MAX 8.5%/100kg 11.5%  15%    "190 GBP/100kg" "190 GBP/100kg + MAX 8.5%/100kg"
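
The two regular expressions can also be tried on their own, outside the pivot pipeline. The sketch below applies them to the question's sample strings plus one hypothetical rate, "190 GBP/100kg", which has no leading percentage and does not appear in the original data. The vector name rates is only illustrative.

```r
library(stringr)

rates <- c(
  "0%",
  "11.5% + 190 GBP/100kg",
  "15% + 190 GBP/100kg + MAX 8.5%/100kg",
  "190 GBP/100kg"  # hypothetical: specific duty only, no leading percentage
)

# Leading percentage: optional digits, optional dot, optional digits, then "%"
av <- str_extract(rates, "^\\d*[.]?\\d*%")

# Remove that leading percentage and, if present, the " + " that follows it
spec <- str_remove(rates, "^\\d*[.]?\\d*%( [+] )?")

print(data.frame(rates, av, spec))
```

Because the pattern is anchored with ^, the "8.5%" inside "MAX 8.5%/100kg" stays in the spec part. For the hypothetical specific-only rate, str_extract returns NA and str_remove leaves the string unchanged.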

Answer 2

You only need to iterate over one index:

library(tidyr)
#note there are many more years
product <- c("01","02")
yr1<-c("0%","11.5% + 190 GBP/100kg")
yr2<-c("0%","15% + 190 GBP/100kg + MAX 8.5%/100kg")
yearnum =2

sched <- data.frame(product,yr1,yr2)

#where yearnum is the number of years
schedule<-c(paste0("yr",1:yearnum))
#categorize av and specific DUTY rates
for(j in 1:yearnum){
  i <- schedule[j]
  sched <- sched %>% separate(i, c(paste0("av.yr",j), paste0("spec.yr",j)), 
                              " \\+ ", remove=F, extra = "merge", fill = "right")
}
sched
#>   product                   yr1 av.yr1      spec.yr1
#> 1      01                    0%     0%          <NA>
#> 2      02 11.5% + 190 GBP/100kg  11.5% 190 GBP/100kg
#>                                    yr2 av.yr2                       spec.yr2
#> 1                                   0%     0%                           <NA>
#> 2 15% + 190 GBP/100kg + MAX 8.5%/100kg    15% 190 GBP/100kg + MAX 8.5%/100kg

Created on 2022-05-25 by the reprex package (v2.0.1)
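
If empty strings are wanted instead of NA, as in the question's expected output, the same single-index idea can be sketched in base R with sub. This swaps in a different technique from the separate call above and is not part of either answer. The helper split_first_plus and the variable year_cols are hypothetical names introduced here, and the sketch assumes the year columns are all named yr followed by digits.

```r
# Sample data from the question
sched <- data.frame(
  product = c("01", "02"),
  yr1 = c("0%", "11.5% + 190 GBP/100kg"),
  yr2 = c("0%", "15% + 190 GBP/100kg + MAX 8.5%/100kg"),
  stringsAsFactors = FALSE
)

# Assumption: year columns follow the pattern yr1, yr2, ...
year_cols <- grep("^yr[0-9]+$", names(sched), value = TRUE)

# Hypothetical helper: split each string at the first " + "
split_first_plus <- function(x) {
  has_plus <- grepl(" + ", x, fixed = TRUE)
  list(
    # drop everything from the first " + " onward
    av = sub(" \\+ .*$", "", x),
    # drop everything up to and including the first " + "; "" when there is none
    spec = ifelse(has_plus, sub("^.*? \\+ ", "", x, perl = TRUE), "")
  )
}

# Two passes so the av columns come before the spec columns
for (col in year_cols) sched[[paste0("av.", col)]] <- split_first_plus(sched[[col]])$av
for (col in year_cols) sched[[paste0("spec.", col)]] <- split_first_plus(sched[[col]])$spec

print(sched)
```

Like the separate-based versions, this splits purely on the first plus sign and does not check that the first part is actually a percentage.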

Answer 1
2, accepted, 2022-05-25 13:08:26

Answer 2
1, 2022-05-25 13:02:57
