RegEx to Capture Two Parts of String

I'm scraping some data. One of the data points is tournament prize pools. There are many different currencies in the data. I'd like to extract the amount and currency from each value, so that I can use Google to convert these to a base currency. However, it's been a while since I've used regular expressions, so I'm rusty to say the least. Possible formats of the data are as follows:

$534
$22,136.20
3,200,000 Ft HUF
12,500 kr DKK
50,000 kr SEK
$3,800 AUD
$10,000 NZD
€4,500 EUR
¥100,000 CNY
₹7,000,000 INR
R$39,000 BRL

Below is the first regular expression I came up with.

[0-9,.]+(.+)[A-Z]{3}

But that obviously doesn't capture the amount and currency, so I changed it.

([0-9,.]+).+([A-Z]{3})

However, there are issues with this regular expression that I can't figure out.

  1. ([0-9,.]+) by itself works fine to capture just the amount.

  2. When I add .+ to that expression, for some reason it stops capturing the trailing 4 and 0 in the first and second test cases respectively. Why?

  3. Then when I add ([A-Z]{3}), it seems to work perfectly for all of the test cases, but obviously selects nothing in the first two.

  4. So I changed it to ([A-Z]{0,3}), which seems to break everything.

What's happening? How can I change the expression so that it works?

This is where I'm at: ([0-9,.]+)((?:.+)([A-Z]{3}))?

This should work:

([0-9,.]+).*?([A-Z]{3})?$

A few changes I made:

  • I changed the .+ to .*? because there isn't always something after the number (like the first two cases). I used lazy matching here because otherwise it would match everything till the end.

  • I made group 2 optional with a ? because there isn't always a currency (first 2 cases)

  • I added an end of line anchor $ to make the lazy .*? match something instead of nothing.
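As a rough check of the points above, here is a minimal Python sketch that runs the pattern over the sample values from the question. The helper name `split_prize` and the use of Python's `re` module are assumptions made for illustration; the question does not say which regex engine is in use. The last lines reproduce the symptom from the question: a mandatory `.+` needs at least one character, so the greedy amount group gives back its final digit.

```python
import re

# Pattern from the answer: amount, lazy filler, optional 3-letter code, end anchor.
PRIZE_RE = re.compile(r"([0-9,.]+).*?([A-Z]{3})?$")

samples = [
    "$534",
    "$22,136.20",
    "3,200,000 Ft HUF",
    "12,500 kr DKK",
    "$3,800 AUD",
    "€4,500 EUR",
    "R$39,000 BRL",
]


def split_prize(text):
    """Return (amount, currency_code); currency_code is None when absent.

    Hypothetical helper, not part of the original answer.
    """
    match = PRIZE_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


for sample in samples:
    print(sample, "->", split_prize(sample))

# Why the last digit went missing in the question: `.+` must consume at least
# one character, so the amount group backtracks and hands over its last digit.
print(re.search(r"([0-9,.]+).+", "$534").group(1))        # 53
print(re.search(r"([0-9,.]+).+", "$22,136.20").group(1))  # 22,136.2
```

Values without a currency code, such as `$534`, come back with `None` in the second position, so the caller still has to decide which currency a bare `$` stands for.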

If you don't know what "lazy" means in this context, see this post.
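To make the difference concrete, the following sketch compares three variants of the pattern on one sample value, assuming Python's `re` module (other engines with the same backtracking semantics should behave alike). The variable names are illustrative only.

```python
import re

text = "3,200,000 Ft HUF"

# Greedy `.*` swallows the rest of the line, leaving nothing for the
# optional currency group.
greedy = re.search(r"([0-9,.]+).*([A-Z]{3})?$", text)

# Lazy `.*?` without `$` is satisfied by matching nothing at all, and the
# optional group is skipped because " Ft" does not start with three capitals.
lazy_unanchored = re.search(r"([0-9,.]+).*?([A-Z]{3})?", text)

# Lazy `.*?` with `$` is forced to grow one character at a time until the
# currency group and the end of the line both match.
lazy_anchored = re.search(r"([0-9,.]+).*?([A-Z]{3})?$", text)

print(greedy.groups())           # ('3,200,000', None)
print(lazy_unanchored.groups())  # ('3,200,000', None)
print(lazy_anchored.groups())    # ('3,200,000', 'HUF')
```

Only the lazy, anchored variant captures the currency code, which is why both changes are needed together.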

For the example data, you could use an optional non capturing group to match the space and the characters before the currency:

([0-9,.]+)(?:(?: [A-Za-z]+)? ([A-Z]{3}))?

That will match

  • ( Capture group 1
    • [0-9,.]+ Match 1+ times any character listed in the character class
  • ) Close capture group 1
  • (?: Non capturing group
    • (?: [A-Za-z]+)? Optional group to match a space followed by 1+ times a-zA-Z
    • A space followed by ([A-Z]{3}) Match the space, then capture 3 uppercase chars in group 2
  • )? Close non capturing group and make it optional
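The breakdown above can be tried out with a short Python sketch. The helper `parse_prize` is hypothetical, and the conversion step assumes that commas are always thousands separators and the period is the decimal mark, as in the sample data; values formatted differently would need extra handling.

```python
import re
from decimal import Decimal

# Pattern from this answer: amount, then an optional " <symbol> <CODE>" tail.
PRIZE_RE = re.compile(r"([0-9,.]+)(?:(?: [A-Za-z]+)? ([A-Z]{3}))?")


def parse_prize(text):
    """Return (Decimal amount, currency code or None).

    Hypothetical helper; assumes "," separates thousands and "." is the
    decimal mark, which holds for the sample data only.
    """
    match = PRIZE_RE.search(text)
    if match is None:
        return None
    amount = Decimal(match.group(1).replace(",", ""))
    return amount, match.group(2)


for sample in ["$22,136.20", "12,500 kr DKK", "$10,000 NZD", "₹7,000,000 INR"]:
    print(sample, "->", parse_prize(sample))
```

Using `Decimal` rather than `float` keeps amounts such as 22136.20 exact, which matters once the values are multiplied by an exchange rate.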
