简体   繁体   中英

Break string into several columns using tidyr::extract regex

I'm trying to break a string vector into several variables using regular expressions in R, preferably in a dplyr-tidyr way using the tidyr::extract command. For insctance in the vector bellow:

sasdic <- data.frame(a=c(
  '@1   ANO_CENSO   5.  /*Ano do Censo*/',
  '@71  TP_SEXO $Char1. /*Sexo*/',
  '@72  TP_COR_RACA $Char1. /*Cor/raça*/',
  '@74  FK_COD_PAIS_ORIGEM  4.  /*Código País de origem*/' )) 

I would like for the:

  • first number ([0-9]+) to go to variable "int_pos"
  • the variable name connected by undersline ([a-zA-Z_]+) to go to variable "var_name"
  • The second number or the term $Char1 (could be $Char2, etc) to go to var "x". I figured ([0-9]+|$Char[0-9]+) could select this?
  • Lastly, whatever comes in between "/* ... /" to go to variable "label" (don´t know the regex for this). All other intermidiate caracters (blank spaces, ".", "/", " " should be disconsidered)

This would be the result

d <- data.frame(int_pos=c(1,72,72,74),
                var_name=c('ANO_CENSO','TP_SEXO','TP_COR_RACA','FK_COD_PAIS_ORIGEM'),
                x=c('5','Chart1','$Char1','4'),
                label=c('Ano do Censo','Sexo','Cor/raça','Código País de origem') )

I tryed to construct a regular expression for this. This is what I got so far:

sasdic %>% extract(a, c('int_pos','var_name','x','label'), 
                   "([0-9]+)([a-zA-Z_]+)([0-9]+|$Char[0-9]+)(something to get the label") 
             -> d

above the regular expression is incomplete. Also, I don't know hot to make explicit in the extract command syntax, what are the parts to be recovered and what are the parts to leave out.

In the regex used, we are matchng one more more punctuation characters ( [[:punct:]]+ ) ie @ followed by capturing the numeric part ( (\\\\d+) - this will be our first column of interest), followed by one or more white-space ( \\\\s+ ), followed by the second capture group ( \\\\S+ - one or more non white-space character ie "ANO_CENSO" for the first row), followed by space ( \\\\s+ ), then we capture the third group ( ([[:alum:]$]+) - ie one or more characters that include the alpha numeric along with $ so as to match $Char1 ), next we match one or more characters that are not a letter ( [^A-Za-z]+ - this should get rid of the space and * ) and the last part we capture one or more characters that are not * ( ([^*]+) .

sasdic %>% 
      extract(a, into=c('int_pos', 'var_name', 'x', 'label'),
   "[[:punct:]](\\d+)\\s+(\\S+)\\s+([[:alnum:]$]+)[^A-Za-z]+([^*]+)")

#  int_pos           var_name      x                 label
#1       1          ANO_CENSO      5          Ano do Censo
#2      71            TP_SEXO $Char1                  Sexo
#3      72        TP_COR_RACA $Char1              Cor/raça
#4      74 FK_COD_PAIS_ORIGEM      4 Código País de origem

This is another option, though it uses the data.table package instead of tidyr:

library(data.table)
setDT(sasdic)

# split label
sasdic[, c("V1","label") := tstrsplit(a, "/\\*|\\*/")]                   
# remove leading "@", split remaining parts
sasdic[, c("int_pos","var_name","x") := tstrsplit(gsub("^@","",V1)," +")]
# remove unneeded columns
sasdic[, c("a","V1") := NULL]                                            

sasdic

#                    label int_pos           var_name       x
# 1:          Ano do Censo       1          ANO_CENSO      5.
# 2:                  Sexo      71            TP_SEXO $Char1.
# 3:              Cor/raça      72        TP_COR_RACA $Char1.
# 4: Código País de origem      74 FK_COD_PAIS_ORIGEM      4.

This assumes that the "remaining parts" (aside from the label) are space-separated.


This could also be done in one block (which is what I would do):

sasdic[, c("a","label","int_pos","var_name","x") := {
  x   = tstrsplit(a, "/\\*|\\*/")
  x1s = tstrsplit(gsub("^@","",x[[1]])," +")
  c(list(NULL), x1s, x[2])
}]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM