How can I remove commas while using regex.findall?

Say I have the following string: txt = "Balance: 47,124, age, ... Balance: 1,234..."

(Ellipses denote other text).

I want to use regex to find the list of balances, i.e. re.findall(r'Balance: (.*)', txt)

But I want to return just 47124 and 1234 instead of 47,124 and 1,234. Obviously I could strip the commas afterwards, but that seems like iterating through the string twice, making this run twice as long.

I'd like to be able to output comma-less results while doing re.findall.

Try using the following regex pattern:

Balance: (\d{1,3}(?:,\d{3})*)

This will match only a comma-separated balance amount, and will not pick up on anything else. Sample script:

import re

txt = "Balance: 47,124, age, ... Balance: 1,234, age ... Balance: 123, age"
amounts = re.findall(r'Balance: (\d{1,3}(?:,\d{3})*)', txt)
# Strip the thousands separators from each captured amount.
amounts = [a.replace(',', '') for a in amounts]
print(amounts)

['47124', '1234', '123']

Here is how the regex pattern works:

\d{1,3}      match an initial 1 to 3 digits
(?:,\d{3})*  followed by `(,ddd)` zero or more times

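So the pattern matches one to three leading digits, followed by zero or more comma-separated groups of exactly three digits. It therefore accepts 123, 1,234, and 47,124, and it stops before a comma that is not followed by three digits.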
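To see where the pattern starts and stops, here is a small sketch that runs it against a few sample strings. The helper name first_balance and the sample inputs are made up for illustration; only the pattern itself comes from the answer above.

```python
import re

# Same pattern as in the answer above, precompiled for reuse.
BALANCE_RE = re.compile(r'Balance: (\d{1,3}(?:,\d{3})*)')

def first_balance(text):
    """Return the first captured balance in text, or None if there is none."""
    match = BALANCE_RE.search(text)
    return match.group(1) if match else None

# Hypothetical sample inputs, chosen to probe the edges of the pattern.
samples = [
    "Balance: 7",
    "Balance: 47,124, age",
    "Balance: 1,234,567 ...",
    "Balance: 12,34",
    "Balance: 1234",
    "Balance: n/a",
]

for s in samples:
    print(repr(s), '->', first_balance(s))
```

The trailing comma in "47,124, age" is left out because it is not followed by three digits, and "12,34" yields only "12" for the same reason. Note that an ungrouped "1234" yields just "123", so this pattern assumes balances are always written with thousands separators.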

Here's a way to do the replacements as you process each match, which might be slightly more efficient than collecting all the matches and then doing the replacements:

import re

txt = "Balance: 47,124, age, ... Balance: 1,234 ..."
# Remove the commas from each match as it is produced by the iterator.
balances = [bal.group(1).replace(',', '') for bal in re.finditer(r'Balance: ([\d,]+)', txt)]
print(balances)

Output:

['47124', '1234']
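If you want numbers rather than strings, the same idea extends to converting each match as you go. This is only a sketch: extract_balances is a hypothetical helper name, and it uses the stricter pattern from the first answer instead of [\d,]+, on the assumption that you would rather skip malformed input (a lone comma matches [\d,]+ and would make int() raise a ValueError).

```python
import re

# Stricter pattern: 1 to 3 digits, then zero or more ",ddd" groups.
BALANCE_RE = re.compile(r'Balance: (\d{1,3}(?:,\d{3})*)')

def extract_balances(text):
    """Return every balance in text as an int, with thousands separators removed."""
    return [int(m.group(1).replace(',', '')) for m in BALANCE_RE.finditer(text)]

txt = "Balance: 47,124, age, ... Balance: 1,234 ..."
print(extract_balances(txt))  # [47124, 1234]
```

Either way, replace(',', '') only runs over the short captured substrings, not over the whole input, so the text is not scanned twice in full.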
