Break string into several columns using tidyr::extract regex

Question

I'm trying to break a string vector into several variables using regular expressions in R, preferably in a dplyr-tidyr way using the tidyr::extract command. For insctance in the vector bellow:

sasdic <- data.frame(a=c(
  '@1   ANO_CENSO   5.  /*Ano do Censo*/',
  '@71  TP_SEXO $Char1. /*Sexo*/',
  '@72  TP_COR_RACA $Char1. /*Cor/raça*/',
  '@74  FK_COD_PAIS_ORIGEM  4.  /*Código País de origem*/' ))

I would like for the:

first number ([0-9]+) to go to variable "int_pos"
the variable name connected by undersline ([a-zA-Z_]+) to go to variable "var_name"
The second number or the term $Char1 (could be $Char2, etc) to go to var "x". I figured ([0-9]+|$Char[0-9]+) could select this?
Lastly, whatever comes in between "/* ... /" to go to variable "label" (don´t know the regex for this). All other intermidiate caracters (blank spaces, ".", "/", " " should be disconsidered)

This would be the result

d <- data.frame(int_pos=c(1,72,72,74),
                var_name=c('ANO_CENSO','TP_SEXO','TP_COR_RACA','FK_COD_PAIS_ORIGEM'),
                x=c('5','Chart1','$Char1','4'),
                label=c('Ano do Censo','Sexo','Cor/raça','Código País de origem') )

I tryed to construct a regular expression for this. This is what I got so far:

sasdic %>% extract(a, c('int_pos','var_name','x','label'), 
                   "([0-9]+)([a-zA-Z_]+)([0-9]+|$Char[0-9]+)(something to get the label") 
             -> d

above the regular expression is incomplete. Also, I don't know hot to make explicit in the extract command syntax, what are the parts to be recovered and what are the parts to leave out.

Answer 1

In the regex used, we are matchng one more more punctuation characters ( [[:punct:]]+ ) ie @ followed by capturing the numeric part ( (\\\\d+) - this will be our first column of interest), followed by one or more white-space ( \\\\s+ ), followed by the second capture group ( \\\\S+ - one or more non white-space character ie "ANO_CENSO" for the first row), followed by space ( \\\\s+ ), then we capture the third group ( ([[:alum:]$]+) - ie one or more characters that include the alpha numeric along with $ so as to match $Char1 ), next we match one or more characters that are not a letter ( [^A-Za-z]+ - this should get rid of the space and * ) and the last part we capture one or more characters that are not * ( ([^*]+) .

sasdic %>% 
      extract(a, into=c('int_pos', 'var_name', 'x', 'label'),
   "[[:punct:]](\\d+)\\s+(\\S+)\\s+([[:alnum:]$]+)[^A-Za-z]+([^*]+)")

#  int_pos           var_name      x                 label
#1       1          ANO_CENSO      5          Ano do Censo
#2      71            TP_SEXO $Char1                  Sexo
#3      72        TP_COR_RACA $Char1              Cor/raça
#4      74 FK_COD_PAIS_ORIGEM      4 Código País de origem

Answer 2

This is another option, though it uses the data.table package instead of tidyr:

library(data.table)
setDT(sasdic)

# split label
sasdic[, c("V1","label") := tstrsplit(a, "/\\*|\\*/")]                   
# remove leading "@", split remaining parts
sasdic[, c("int_pos","var_name","x") := tstrsplit(gsub("^@","",V1)," +")]
# remove unneeded columns
sasdic[, c("a","V1") := NULL]                                            

sasdic

#                    label int_pos           var_name       x
# 1:          Ano do Censo       1          ANO_CENSO      5.
# 2:                  Sexo      71            TP_SEXO $Char1.
# 3:              Cor/raça      72        TP_COR_RACA $Char1.
# 4: Código País de origem      74 FK_COD_PAIS_ORIGEM      4.

This assumes that the "remaining parts" (aside from the label) are space-separated.

This could also be done in one block (which is what I would do):

sasdic[, c("a","label","int_pos","var_name","x") := {
  x   = tstrsplit(a, "/\\*|\\*/")
  x1s = tstrsplit(gsub("^@","",x[[1]])," +")
  c(list(NULL), x1s, x[2])
}]

Break string into several columns using tidyr::extract regex

Question

2 answers

solution1
7 ACCPTED 2016-01-15 16:12:54

solution2
7 2016-01-15 16:19:31

Break string into several columns using tidyr::extract regex

Question

2 answers

solution1 7 ACCPTED 2016-01-15 16:12:54

solution2 7 2016-01-15 16:19:31

solution1
7 ACCPTED 2016-01-15 16:12:54

solution2
7 2016-01-15 16:19:31