
Extract string between prefix and suffix

I have these columns:

                 text.NANA text.22 text.32
1    Female RNDM_MXN95.tif      No      NA
12     Male RNDM_QOS38.tif      No      NA
13  Female  RNDM_WQW90.tif      No      NA
14    Male  RNDM_BKD94.tif      No      NA
15    Male  RNDM_LGD67.tif      No      NA
16   Female RNDM_AFP45.tif      No      NA

I want to create a column that contains only the barcode, which starts with RNDM_ and ends with .tif, excluding the .tif itself. The tricky part is getting rid of the gender information that sits in the same column. There is a varying number of spaces between the gender information and the RNDM_ prefix:

                 text.NANA text.22 text.32    BARCODE
1    Female RNDM_MXN95.tif      No      NA RNDM_MXN95
12     Male RNDM_QOS38.tif      No      NA RNDM_QOS38
13  Female  RNDM_WQW90.tif      No      NA RNDM_WQW90
14    Male  RNDM_BKD94.tif      No      NA RNDM_BKD94
15    Male  RNDM_LGD67.tif      No      NA RNDM_LGD67
16   Female RNDM_AFP45.tif      No      NA RNDM_AFP45

I made a very poor attempt with this, but it didn't work:

dfrm$BARCODE <- regexpr("RNDM_", dfrm$text.NANA)
# [1] 8 6 9 7 7 8 9 9 8 8 9 9 6 6 7 8 9 8
# attr(,"match.length")
# [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# attr(,"useBytes")
# [1] TRUE

Please help. Thanks!

So you just want to remove the file extension? Use file_path_sans_ext() from the {tools} package, which ships with R:

dfrm$BARCODE = tools::file_path_sans_ext(dfrm$text.NANA)
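As a quick check of what that call returns, here is a minimal sketch on two values copied from the question's table. The variable names sample_text and no_ext are made up for illustration:

```r
# Two values from the question's text.NANA column (names are illustrative).
sample_text <- c("Female RNDM_MXN95.tif", "Male RNDM_QOS38.tif")

# file_path_sans_ext() lives in the {tools} package bundled with R.
# It strips only the trailing extension and leaves everything before it alone.
no_ext <- tools::file_path_sans_ext(sample_text)
print(no_ext)
```

The extension is gone, but the gender prefix is still there, so this alone is not enough for the data in the question.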

If there's more stuff in front, you can use the following regular expression to extract just the barcode:

dfrm$BARCODE = stringr::str_match(dfrm$text.NANA, '(RNDM_.*)\\.tif')[, 2]

Note that I'm using the {stringr} package here because the base R functions for extracting regex matches (regexpr() combined with regmatches()) are awkward to use by comparison.
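If installing {stringr} is not an option, the same extraction can be sketched in base R. This is one possible approach, not the method from the answer above. The sample vector text_col and the result names are assumptions for illustration:

```r
# Sample values copied from the question (vector name is illustrative).
text_col <- c("Female RNDM_MXN95.tif",
              "Male RNDM_QOS38.tif",
              "Female  RNDM_WQW90.tif")

# Option 1: sub() with a capture group. Elements that do not match the
# pattern are returned unchanged, so the result keeps the input length.
barcode_sub <- sub("^.*(RNDM_.*)\\.tif$", "\\1", text_col)

# Option 2: regexpr() + regmatches() with a Perl-style lookahead so that
# ".tif" is required but not included in the match. Note that regmatches()
# silently drops elements without a match, which can shorten the result.
barcode_rm <- regmatches(
  text_col,
  regexpr("RNDM_.*(?=\\.tif)", text_col, perl = TRUE)
)

print(barcode_sub)
print(barcode_rm)
```

For assigning into a data frame column, Option 1 is probably the safer of the two, because its output always has one element per input row.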

I strongly recommend against using strsplit here because it's underspecified: from reading the code it's absolutely not clear what the purpose of that code is. Write code that is self-explanatory, not code that requires explanation in a comment.

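You can do this easily with sapply() and strsplit(). My first attempt split on "_" and took the first piece, but that returns the text before the underscore (for example "Female RNDM") rather than the barcode:

sapply(strsplit(dfrm$text.NANA, "_"), "[", 1)

Edit: split on runs of spaces or dots instead and take the second piece, which is the barcode:

sapply(strsplit(dfrm$text.NANA, "[ .]+"), "[", 2)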
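To see why the second piece is the barcode, here is a small sketch on sample values. It assumes the column may be a factor and may carry leading whitespace, neither of which is stated in the question, so as.character() and trimws() are added defensively. The names x, pieces and barcode are illustrative:

```r
# Sample values; the second one has leading whitespace on purpose.
x <- c("Female RNDM_MXN95.tif", "  Male  RNDM_BKD94.tif")

# strsplit() needs a character vector, and a leading delimiter would
# produce an empty first piece, so normalise the input first.
clean <- trimws(as.character(x))

# Split on runs of spaces or dots: gender, barcode, extension.
pieces <- strsplit(clean, "[ .]+")
print(pieces[[1]])

# Take the second piece of every element.
barcode <- sapply(pieces, "[", 2)
print(barcode)
```

This relies on every value having exactly the shape "gender, spaces, barcode, dot, extension". A value with an extra word or dot in front would shift the index.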
