Is there an R function to clean messy salaries in character format?

I have a column of messy salary data. I am wondering if there is a package with a function made specifically for cleaning this type of data. My data looks like:

data.frame(salary = c("40,000-60,000", "40-80K", "$100,000", 
                  "$70/hr", "Between $65-80/hour", "$100k",
                  "50-60,000 a year", "90"))
#>                salary
#> 1       40,000-60,000
#> 2              40-80K
#> 3            $100,000
#> 4              $70/hr
#> 5 Between $65-80/hour
#> 6               $100k
#> 7    50-60,000 a year
#> 8                  90

Created on 2020-12-16 by the reprex package (v0.3.0)

I would like the clean column to be numeric at the annual level. I know how to clean this column manually; I'm just wondering if there are any packages that can help (other than readr::parse_number()).

The expected output would look like:

#>   output
#> 1  50000
#> 2  60000
#> 3 100000
#> 4 145600
#> 5 150800
#> 6 100000
#> 7  55000
#> 8  90000

Here are some first steps you can try. I define two functions: rem_k replaces a k or K that follows a digit with ,000. add_zero appends ,000 to the first number of a range when only the second number is written out in thousands (for example, 50-60,000 becomes 50,000-60,000).

rem_k <- function(x) {
  sub("(\\d)[kK]", "\\1,000", x)
}

add_zero <- function(x) {
  ifelse(grepl("[1-9]0\\-\\d[0,]{2,}", x), sub("([1-9]0)(\\-\\d[0,]{2,})", "\\1,000\\2", x), x)
}
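Note that the `[1-9]0` in add_zero only matches a first number that ends in 0, so a range such as 45-60,000 would pass through unchanged. A minimal sketch of a more general variant follows; the name `add_thousands` and its pattern are assumptions for illustration, not part of the original answer.

```r
# Hypothetical generalisation of add_zero (not from the original answer).
# If a range starts with a bare 1-3 digit number and ends with a number
# written out in thousands (e.g. "45-60,000"), append ",000" to the first one.
add_thousands <- function(x) {
  sub("(^|[^0-9,])(\\d{1,3})(-\\d{1,3},\\d{3})", "\\1\\2,000\\3", x, perl = TRUE)
}

# "45-60,000" becomes "45,000-60,000"; the other inputs are left as they are
add_thousands(c("45-60,000", "40,000-60,000", "$65-80/hour", "90"))
```

Like add_zero, this leaves ranges such as $65-80/hour alone, because the second number is not followed by a comma and three digits.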

Finally, I remove all non-essential characters:

library(dplyr)
# df is the data frame from the question
df %>% 
  mutate(salary2 = gsub("[^0-9,\\-]", "", add_zero(rem_k(salary))))

               salary       salary2
1       40,000-60,000 40,000-60,000
2              40-80K 40,000-80,000
3            $100,000       100,000
4              $70/hr            70
5 Between $65-80/hour         65-80
6               $100k       100,000
7    50-60,000 a year 50,000-60,000
8                  90            90
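The cleaned strings above are still character ranges. One possible next step, sketched here under the assumption that a range should be summarised by its midpoint (as in the question's expected output), is to strip the commas, split at the -, and average the bounds. The helper name `range_midpoint` is hypothetical.

```r
# Hypothetical helper (not from the original answer): turn a cleaned string
# such as "40,000-60,000" or "100,000" into the numeric midpoint of its bounds.
range_midpoint <- function(x) {
  # drop thousands separators, then split each string at the range dash
  parts <- strsplit(gsub(",", "", x), "-", fixed = TRUE)
  # average the one or two bounds of every element
  vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
}

# values: 50000, 100000, 72.5, 90
range_midpoint(c("40,000-60,000", "100,000", "65-80", "90"))
```

Hourly rates (rows 4 and 5) and the bare 90 (row 8) would still need to be scaled to annual figures separately.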

One option is to create a column 'salary1' containing only the digits and the -, then separate it into two columns at the -. Next, mutate the values of those columns with case_when, based on substring matches in the original column: multiply by 1000 for K|k, or by the number of working hours in a year for hr|hour (here 40 * 4 * 12 = 1920, so the hourly rows come out lower than the question's expected output, which corresponds to 40 * 52 = 2080). Values with fewer than five characters are padded with zeros on the right. Finally, take the rowMeans of the two columns.

library(dplyr)
library(stringr)
library(tidyr)
df1 %>% 
   mutate(salary1 = str_remove_all(salary, '[^0-9-]+')) %>% 
   separate(salary1, into = c('salary1', 'salary2'), 
           convert = TRUE, extra = 'drop') %>%
   mutate(across(c(salary1, salary2),
    ~ case_when(str_detect(salary, "[Kk]") ~ . * 1000, 
               str_detect(salary, 'hr|hour') ~ . * 40 * 4 * 12, 
               nchar(.) < 5 ~ as.numeric(str_pad(., pad = '0', 
                   side = 'right', width = 5)),
             TRUE ~ as.numeric(.)))) %>% 
    transmute(output = rowMeans(select(., salary1, salary2), na.rm = TRUE))

Output:

#  output
#1  50000
#2  60000
#3 100000
#4 134400
#5 139200
#6 100000
#7  55000
#8  90000
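Rows 4 and 5 differ from the question's expected output because the hourly multiplier above is 40 * 4 * 12 = 1920 rather than 40 * 52 = 2080. A minimal base R sketch of the same idea with 2080 hours follows. The function name `annual_salary`, the `hours_per_year` parameter, and the rule that a non-hourly figure below 1000 is meant in thousands are all assumptions for illustration, not part of the original answer.

```r
# Hypothetical helper (base R only). Assumes a 40-hour week and 52 weeks per year.
annual_salary <- function(x, hours_per_year = 40 * 52) {
  has_k  <- grepl("[Kk]", x)      # figures written like "80K" or "100k"
  hourly <- grepl("hr|hour", x)   # hourly rates
  # keep digits and the range dash only, then split into lower/upper bounds
  bounds <- strsplit(gsub("[^0-9-]", "", x), "-", fixed = TRUE)
  vapply(seq_along(x), function(i) {
    v <- as.numeric(bounds[[i]])
    if (has_k[i]) {
      v <- v * 1000
    } else if (hourly[i]) {
      v <- v * hours_per_year
    } else {
      # assumption: a non-hourly figure below 1000 is written in thousands
      v <- ifelse(v < 1000, v * 1000, v)
    }
    mean(v)  # midpoint of the range, or the single value
  }, numeric(1))
}

salary <- c("40,000-60,000", "40-80K", "$100,000", "$70/hr",
            "Between $65-80/hour", "$100k", "50-60,000 a year", "90")
annual_salary(salary)
```

On the question's eight values this gives 50000, 60000, 100000, 145600, 150800, 100000, 55000 and 90000. As with the answer above, the plain substring tests are fragile: any text containing a k (for example "week") or the letters "hr" would be misclassified.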
