
text cleaning: removing erroneous characters

I have a column called obligations that holds financial values for each fiscal year, such as "(Project Grants) FY 17$XX.XX; FY 18 est $XX.XX; FY 19 est $XX.XX; FY 16$XX.XX;". Ultimately I want to pull out each value and place it in a new column for the correct FY, but to begin with I am trying to use some tools (i.e. stringr) to remove the noise around the information I want. Not every entry in the column begins with (Project Grants); there are a number of different labels, so I was going to keep adding else if branches to my if statement for the different types. My issue is that the code below did not remove the (Project Grants) from the text.

I think it may be better to create a function for this process, but I am new to the language and am not sure where to start or how to go about writing one, hence my choice to remove the characters first and then eventually use extract() to create the columns I need.

data %>%
  select(Obligations..122.)%>%
  if(starts_with(Obligations..122.) = "(Project Grants)"){
    str_sub(data$Obligations..122., start = 16)
  }

head(data$Obligations..122.)
[1] "(Project Grants) FY 17$45,381,885.00; FY 18 est $35,000,000.00; FY 19 
est $35,000,000.00; FY 16$45,381,885.00; - "                                                                                                                                                                                                                                                      
[2] "(Salaries and Expenses) FY 17$243,631,584.00; FY 18 est 
$256,467,514.00; FY 19 est $193,289,258.00; FY 16$239,406,515.00; - APHIS 
has a difference between budget authority and obligations because there is 
carryover funding available from no year funding.\n" 

The desired output keeps the original column Obligations..122. and adds columns FY16, FY17, ... and so on, each holding the corresponding value from the output above.

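The select step keeps only a single column, so the other columns are no longer available to the steps after it. Also, starts_with() is a column-selection helper that matches column names, not the values inside a column, so it cannot serve as the if condition. Instead, the column can be modified in place with mutate_at: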

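library(dplyr)
library(stringr)
library(tidyr)
data %>%
    # apply str_sub() to every column whose name starts with "Obligations..122."
    # note: character 16 of "(Project Grants) ..." is the closing ")"; use start = 18 to begin at "FY"
    mutate_at(vars(starts_with("Obligations..122.")), ~ str_sub(., start = 16))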
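Because the question notes that not every entry begins with (Project Grants), a fixed start position only suits one label length. As a rough sketch of an alternative, the leading parenthesised label could be stripped with a regular expression via str_remove(). The toy data frame and the name cleaned below are made up for illustration, and the pattern assumes each entry has at most one leading "(...)" label with no nested parentheses.

```r
library(dplyr)
library(stringr)

# Toy data standing in for the real data frame (strings shortened for illustration)
data <- data.frame(
  Obligations..122. = c(
    "(Project Grants) FY 17$45,381,885.00; FY 18 est $35,000,000.00;",
    "(Salaries and Expenses) FY 17$243,631,584.00; FY 18 est $256,467,514.00;"
  ),
  stringsAsFactors = FALSE
)

cleaned <- data %>%
  # "^\\([^)]*\\)\\s*" matches one leading "(...)" label plus any whitespace after it
  mutate(Obligations..122. = str_remove(Obligations..122., "^\\([^)]*\\)\\s*"))

cleaned$Obligations..122.
```

This avoids an if / else if chain per label type, since the pattern does not depend on what the label says. Entries without a leading label are left unchanged.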

If it is only a single column, it can be referenced directly. Wrapping the name in backticks is safe here and becomes necessary when a column name contains characters that are not valid in an R name:

data %>%
    group_by(newColumn = str_sub(`Obligations..122.`, start = 16)) %>%
    mutate(ind = row_number(), i1 = 1) %>%
    spread(newColumn, i1, fill = 0)
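For the stated end goal (one column per fiscal year next to the original column), one possible approach is to match every "FY nn ... $amount" pair with str_match_all() and reshape the matches with pivot_wider(), the newer tidyr counterpart to spread(). This is only a sketch: the toy data, the helper names fy_pattern, long, wide and result are invented for illustration, and the pattern assumes each amount follows a two-digit year, an optional "est", and a dollar sign, as in the sample output in the question.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Toy data modelled on the head() output in the question (second string truncated)
data <- data.frame(
  Obligations..122. = c(
    "(Project Grants) FY 17$45,381,885.00; FY 18 est $35,000,000.00; FY 19 est $35,000,000.00; FY 16$45,381,885.00; - ",
    "(Salaries and Expenses) FY 17$243,631,584.00; FY 18 est $256,467,514.00; FY 19 est $193,289,258.00; FY 16$239,406,515.00; - APHIS has a difference"
  ),
  stringsAsFactors = FALSE
)

# Assumed pattern: "FY", two-digit year, optional "est", "$", then the amount
fy_pattern <- "FY\\s*(\\d{2})\\s*(?:est\\s*)?\\$([0-9,.]+)"

# One character matrix per row: full match, year, amount
m <- str_match_all(data$Obligations..122., fy_pattern)

# Stack the matches into a long table keyed by the original row number
long <- bind_rows(lapply(seq_along(m), function(i) {
  if (nrow(m[[i]]) == 0) return(NULL)  # skip rows with no FY values
  data.frame(
    row_id = i,
    fy     = paste0("FY", m[[i]][, 2]),
    value  = as.numeric(str_remove_all(m[[i]][, 3], ",")),
    stringsAsFactors = FALSE
  )
}))

# One column per fiscal year, then join back onto the original column
wide <- pivot_wider(long, names_from = fy, values_from = value)

result <- data %>%
  mutate(row_id = row_number()) %>%
  left_join(wide, by = "row_id") %>%
  select(-row_id)
```

Rows without any match keep NA in the FY columns after the join. If the same fiscal year could appear twice in one entry, pivot_wider() would need a values_fn to decide how to combine the duplicates.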
