
gsub extracting string

My sample data is:

    c("2\tNO  PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
      "1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
      "5\tUNABLE TO WORK  PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
      "2\tNO  PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433")

From each line, I want to extract the variable name:

Line 1: "PEMJNUM"
Line 2: "PRFAMTYP"
Line 3: "PUBUS1"
Line 4: "PEIO1COW"

My initial plan was to use gsub to remove the characters to the left and right of each variable name, leaving just the name. I could match everything to the right of the variable name, but not the characters to its left (as shown here: https://regexr.com/67r6j).

Not sure if there's a better way to do this!

You can use sub in the following way:

    x <- c("2\tNO  PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
           "1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
           "5\tUNABLE TO WORK  PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
           "2\tNO  PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433")
    sub("^(?:.*\\b)?(\\w+)\\s*\\b2\\b.*", "\\1", x, perl = TRUE)
    # => [1] "PEMJNUM"  "PRFAMTYP" "PUBUS1"   "PEIO1COW"


Details:

  • ^ - start of string
  • (?:.*\\b)? - an optional non-capturing group that matches any zero or more chars, as many as possible, and then a word boundary position. With perl=TRUE, the dot does not match line break chars; if you need to match line breaks too, add (?s) at the pattern start
  • (\\w+) - Group 1 ( \\1 ): one or more word chars
  • \\s* - zero or more whitespaces
  • \\b - a word boundary
  • 2 - the digit 2
  • \\b - a word boundary
  • .* - the rest of the line/string.
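
One thing to keep in mind: sub returns a string unchanged when the pattern does not match it, so a line without a standalone 2 would come back whole rather than as a variable name. A minimal sketch of a stricter extraction, using base R's regexec and regmatches to pull out Group 1 and return NA on no match, could look like this. The fifth input string and the names pattern, replaced, matches and var_names are made up for illustration.

```r
# Sample data from the question, plus one hypothetical line with no standalone 2.
x <- c("2\tNO  PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
       "1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
       "5\tUNABLE TO WORK  PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
       "2\tNO  PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433",
       "1\tYES\t216 - 217")  # hypothetical non-matching input

pattern <- "^(?:.*\\b)?(\\w+)\\s*\\b2\\b.*"

# sub() leaves the non-matching fifth string as it was.
replaced <- sub(pattern, "\\1", x, perl = TRUE)

# regexec() records the full match and each capture group;
# regmatches() turns those positions into substrings (character(0) on no match).
matches <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Element 1 is the full match, element 2 is Group 1; use NA when there is no match.
var_names <- vapply(
  matches,
  function(parts) if (length(parts) >= 2) parts[2] else NA_character_,
  character(1)
)

print(var_names)
```

Whether NA or the untouched string is the better result for a non-matching line depends on what the rest of your pipeline expects.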

If there is always whitespace before the 2, the regex can be written as "^(?:.*\\b)?(\\w+)\\s+2\\b.*".
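
As a quick check, the two patterns can be run side by side on the sample data from the question. This is only a sketch: the names with_boundary and with_space are invented here, and agreement on these four strings says nothing about other inputs.

```r
x <- c("2\tNO  PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
       "1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
       "5\tUNABLE TO WORK  PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
       "2\tNO  PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433")

# Original pattern: optional whitespace, then a word boundary before the 2.
with_boundary <- sub("^(?:.*\\b)?(\\w+)\\s*\\b2\\b.*", "\\1", x, perl = TRUE)

# Variant: at least one whitespace char is required before the 2.
with_space <- sub("^(?:.*\\b)?(\\w+)\\s+2\\b.*", "\\1", x, perl = TRUE)

# TRUE when both patterns extract the same names from the sample.
print(identical(with_boundary, with_space))
```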
