R: move everything after a word to a new column and then only keep the last four digits in the new column

Question

My data frame has a column called "State" that contains the state name, the HB/HF number, and the date the law went into effect. I want the State column to contain only the state name and a second column to contain just the year. How would I do this?

library(dplyr)  # provides %>%, used below to reset row numbers

Mintz = read.csv('https://github.com/bandcar/mintz/raw/main/State%20Legislation%20on%20Biosimilars2.csv')
mintz = Mintz

# delete rows if col 2 has a blank value. 
mintz = mintz[mintz$Substitution.Requirements != "", ]

# removes entire row if column 1 has the word State
mintz=mintz[mintz$State != "State", ]

#reset row numbers
mintz= mintz %>% data.frame(row.names = 1:nrow(.)) 

# delete PR
mintz = mintz[-34,]

#reset row numbers
mintz= mintz %>% data.frame(row.names = 1:nrow(.))

I'm almost certain I'll need to use strsplit(gsub()), but I'm not sure how to do this since there's no specific pattern.

EDIT

I still need help keeping only the state name in column 1.

As for moving the year to a new column, I found the code below. It works, but I don't know why. From my understanding, \d means that "\d" is the literal text it searches for, the "." means to match any one character, and I have no idea what the \1 means. Another strange thing is that Minnesota (row 20) did not have a year, so the new column ended up containing letters for that row. Isn't \d only supposed to match digits? Could someone explain?

mintz2 = mintz
mintz2$Year = sub('.*(\\d{4}).*', '\\1', mintz2$State)

Answer 1

One way could be:

For demonstration purposes, select only the State column.
Then use str_extract() with the pattern \\d{4}$ to extract the four digits at the end of the string; this gives the Year column.
Finally, build a pattern from the built-in state.name vector (a character vector, not a function), use it with str_extract() to keep only the state name, and remove the rows that contain NA.

library(dplyr)
library(stringr)

mintz %>% 
  select(State) %>% 
  mutate(Year = str_extract(State, '\\d{4}$'), .after=State,
         State = str_extract(State, paste(state.name, collapse='|'))
         ) %>% 
  na.omit()

            State Year
2         Arizona 2016
3      California 2016
7     Connecticut 2018
12        Florida 2013
13        Georgia 2015
16         Hawaii 2016
21       Illinois 2016
24        Indiana 2014
28           Iowa 2017
32         Kansas 2017
33       Kentucky 2016
34      Louisiana 2015
39       Maryland 2017
42       Michigan 2018
46       Missouri 2016
47        Montana 2017
50       Nebraska 2018
51         Nevada 2018
54  New Hampshire 2018
55     New Jersey 2016
59       New York 2017
62 North Carolina 2015
63   North Dakota 2013
66           Ohio 2017
67         Oregon 2016
70   Pennsylvania 2016
74   Rhode Island 2016
75 South Carolina 2017
78   South Dakota 2019
79      Tennessee 2015
82          Texas 2015
85           Utah 2015
88        Vermont 2018
89       Virginia 2013
92     Washington 2015
93  West Virginia 2018
96      Wisconsin 2019
97        Wyoming 2018
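
On the EDIT in the question, here is a minimal base R sketch of why the sub() call behaves the way it does. The two sample strings are made up for illustration and are not taken from the dataset. In the pattern, \\d matches any single digit (the doubled backslash is only R's string escaping), {4} repeats it four times, the parentheses capture those four digits, and \\1 in the replacement stands for whatever the parentheses captured. Because the .* on either side swallow the rest of the string, the whole string is replaced by the captured digits. When a string contains no run of four digits, the pattern does not match at all and sub() returns the input unchanged, which would explain the Minnesota row.

```r
# Hypothetical sample strings (not from the Mintz data)
x <- c("Arizona HB 2310 (effective 2016)",
       "Minnesota, no year listed")

# The first .* is greedy, so the capture group ends up holding the
# LAST run of four digits; a non-matching string is returned as-is.
res <- sub('.*(\\d{4}).*', '\\1', x)
print(res)
# "2016" "Minnesota, no year listed"

# One possible way to get NA instead of the original text when
# there is no four-digit run: only substitute where grepl() finds one.
year <- ifelse(grepl('\\d{4}', x),
               sub('.*(\\d{4}).*', '\\1', x),
               NA_character_)
print(year)
# "2016" NA
```

str_extract() in the answer above sidesteps this, because it returns NA when the pattern is not found, and na.omit() then drops those rows.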

solution1
1 2022-06-06 05:37:02
