
R - extracting multiple patterns from string using gregexpr

I am working with a dataset that has a column describing different products. Each product description also contains the weight of the product, which is what I'd like to extract. My problem is that some products come in dual-packs, meaning that the description starts with '2x', while the actual weight is at the end of the description. For example:

x = '2x pet food brand 12kg'

What I'd like to do is to shorten this to just 2x12kg. I'm not great at using regexp in R and was hoping that someone here could help me.

I have tried doing this using gregexpr in the following way:

m <- gregexpr("(^[0-9]+x [0-9]+kg)", x)

Unfortunately this only gives me '12kg', not including the '2x'.

I would appreciate any help at all with this.

EDIT ----

After sorting out my initial problem, I found a few instances in the data with a different format, which I would also like to extract:

x = 'Pet food brand 15x85g'
# Should be:
x = '15x85g'

I have tried to play around with OR statements in gsub, like:

m <- gsub('^([0-9]+x)?[^0-9]*([0-9.]+kg)|([0-9]+x)?[^0-9]*([0-9.]+g)', '\\1\\2', x)
# And
m <- gsub('^([0-9]+x)?[^0-9]*([0-9.]+(kg|g))', '\\1\\2', x)

While this still extracts the kilos, it only removes the instances with grams and leaves the rest of the string, like:

x = 'Pet food brand    '

Or running gsub a second time using:

m <- gsub('([0-9]+x[0-9]+g)', '\\1', x)

The latter option does not extract the product weights at all, and just leaves the string intact.

Sorry for not noticing that the strings were formatted differently earlier. Again, any help would be appreciated.

You could use this regular expression:

m = gregexpr("([0-9]+x|[0-9.]+kg)", string, ignore.case = TRUE)
result = regmatches(string, m)
r = paste0(unlist(result),collapse = "")

For string = "2x pet food brand 12kg" you get "2x12kg"

This also works if kilograms have decimals:

For string = "23x pet food 23.5Kg" you get "23x23.5Kg"
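Since the weights live in a data column, the same idea can be applied to a whole character vector at once. The following is only a sketch: the helper name `extract_weight` and the sample vector `products` are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper (name chosen for illustration only).
extract_weight <- function(descriptions) {
  # Match either a pack multiplier ("2x") or a kilogram weight ("12kg", "23.5Kg").
  m <- gregexpr("[0-9]+x|[0-9.]+kg", descriptions, ignore.case = TRUE)
  # regmatches() returns a list: one character vector of matches per input string.
  pieces <- regmatches(descriptions, m)
  # Glue the matches of each string together; a string without matches yields "".
  vapply(pieces, paste0, character(1), collapse = "")
}

# Made-up sample data based on the examples in this thread.
products <- c("2x pet food brand 12kg", "23x pet food 23.5Kg", "pet food brand 12kg")
extract_weight(products)
## "2x12kg" "23x23.5Kg" "12kg"
```

Note that `[0-9]+x` matches any digits followed by "x" anywhere in the string, so descriptions containing something like "3xl" would need a stricter pattern.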

(edited to correct mistake pointed out by @R. Schifini)

You can use regex like this:

x <- '2x pet food brand 12kg'

gsub('^([0-9]+x)?[^0-9]*([0-9]+kg)', '\\1\\2', x)

## "2x12kg"

This would get you the weight even if there is no "2x" in the beginning of the string:

x <- 'pet food brand 12kg'

gsub('^([0-9]+x)?[^0-9]*([0-9]+kg)', '\\1\\2', x)

## "12kg"
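For the second format from the edit ('Pet food brand 15x85g'), the attempt with `|` in the question fails because the gram alternative captures into groups 3 and 4, while the replacement only refers to `\\1\\2`, so the match is replaced by an empty string. One possible way around this, sketched here under the assumption that the weight is the first number after the product name, is a single pattern with an optional multiplier on either side of the name. The helper name `shorten_weight` is hypothetical.

```r
# Hypothetical helper (not from the original answers).
shorten_weight <- function(x) {
  # Group 1: optional multiplier at the start ("2x")
  # Group 2: optional multiplier directly before the weight ("15x")
  # Group 3: the weight in kg or g ("12kg", "85g")
  # The trailing .*$ drops anything after the weight.
  pattern <- "^([0-9]+x)?[^0-9]*([0-9]+x)?([0-9.]+k?g).*$"
  # Groups that did not take part in the match are replaced by "".
  sub(pattern, "\\1\\2\\3", x, ignore.case = TRUE)
}

shorten_weight(c("2x pet food brand 12kg", "Pet food brand 15x85g", "pet food brand 12kg"))
## "2x12kg" "15x85g" "12kg"
```

Strings that do not fit the pattern, for example a product name that itself contains a digit before the weight, are returned unchanged, so it is worth checking the result for leftovers.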
