Python re capture one match per word

Question

I need to find prices in a text document. My code looks like this:

import re

sentence = "This is test text $25,000 $25,000$20,000 $30"
# Non-capturing group, so findall returns whole matches rather than the group
pattern = re.compile(ur'[$€£]?\d+(?:[.,]\d+)?', re.UNICODE | re.MULTILINE | re.DOTALL)
print pattern.findall(sentence)

The desired result is:

['$25,000', '$30']

I don't want $25,000$20,000 in the result because it is not a valid price for my task. I only need full-word matches.

But I get this result:

['$25,000', '$25,000', '$20,000', '$30']

How can I rewrite my regex to match only prices separated by whitespace or punctuation?

Answer 1

This is as close as I can get it (although there are many people with more regex skills than I have):

pattern = re.compile(ur'(?:^|\s)[$€£]?\d+(?:[.,]\d+)?(?=\s|$)', re.UNICODE | re.MULTILINE | re.DOTALL)
print pattern.findall(sentence) # [' $25,000', ' $30']
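
The leading space in each result comes from the (?:^|\s) prefix, which consumes the whitespace it matches. One possible way to drop it, sketched here in Python 3 syntax rather than the Python 2 used above, is to wrap the price itself in a single capturing group: findall returns the group's text when a pattern contains exactly one group. The name PRICE_RE is only illustrative.

```python
import re

# Same pattern as above, with the price wrapped in one capturing group so
# that findall() returns the group's text without the leading whitespace.
PRICE_RE = re.compile(r'(?:^|\s)([$€£]?\d+(?:[.,]\d+)?)(?=\s|$)')

sentence = "This is test text $25,000 $25,000$20,000 $30"
prices = PRICE_RE.findall(sentence)
print(prices)  # ['$25,000', '$30']
```

Calling .strip() on each result of the original pattern would work too; the capturing group just avoids the extra pass.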

Answer 2

Try the following:

ur'(?<!\S)[€£$]?\d+(?:[.,]\d+)?(?!\S)'

I added the negative assertions (?<!\S) and (?!\S), which mean "fail to match if preceded by a non-space" and "fail to match if followed by a non-space" respectively.

Tested:

>>> sentence = "$1234 $56$78.90 This is test text $25,000 $25,000$20,000 $30"
>>> pattern = re.compile(ur'(?<!\S)[€£$]?\d+(?:[.,]\d+)?(?!\S)', re.UNICODE | re.MULTILINE | re.DOTALL)
>>> print pattern.findall(sentence)
['$1234', '$25,000', '$30']
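
For readers on Python 3, the same idea might look like the sketch below. It assumes Python 3, where the ur prefix no longer exists and str patterns are Unicode-aware by default; the MULTILINE and DOTALL flags are left out because this pattern contains no ^, $ or . for them to affect. The names PRICE_RE and find_prices are illustrative, not part of the answer.

```python
import re

# (?<!\S): the match must not be preceded by a non-space character.
# (?!\S):  the match must not be followed by a non-space character.
# Both are zero-width, so no surrounding whitespace ends up in the match.
PRICE_RE = re.compile(r'(?<!\S)[€£$]?\d+(?:[.,]\d+)?(?!\S)')

def find_prices(text):
    """Return every price that stands alone as a whitespace-delimited token."""
    return PRICE_RE.findall(text)

sentence = "$1234 $56$78.90 This is test text $25,000 $25,000$20,000 $30"
print(find_prices(sentence))  # ['$1234', '$25,000', '$30']
```

Because the currency symbol is optional, a bare number such as 30 surrounded by whitespace would also be returned.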

If you want to allow certain non-space characters before or after the match, replace \S by [^\s<chars>], where <chars> are the characters you want to allow. Example:

ur'(?<![^\s:])[€£$]?\d+(?:[.,]\d+)?(?![^\s,.])'

allows the pattern to be preceded by a : and followed by , or . :

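>>> sentence = "$1234 $56$78.90 This is test text:$25,000. $45. $25,000$20,000 $30"
>>> pattern = re.compile(ur'(?<![^\s:])[€£$]?\d+(?:[.,]\d+)?(?![^\s,.])', re.UNICODE | re.MULTILINE | re.DOTALL)
>>> print pattern.findall(sentence)
['$1234', '$25,000', '$45', '$25', '$30']

Note that once , is allowed after a match, the $25 in $25,000$20,000 is returned as well, because the regex backtracks to $25 followed by a comma.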
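
If the set of allowed neighbouring characters changes often, the pattern can be generated rather than edited by hand. The following is a minimal Python 3 sketch under that assumption; build_price_pattern and its parameters are hypothetical names introduced for illustration. re.escape is applied so that characters with a special meaning inside a character class (such as ], ^ or -) are treated literally.

```python
import re

def build_price_pattern(allowed_before="", allowed_after=""):
    """Compile the price regex, additionally allowing the given characters
    directly before or after a match (whitespace is always allowed)."""
    before = re.escape(allowed_before)  # escape ], ^, - and similar
    after = re.escape(allowed_after)
    return re.compile(
        r'(?<![^\s%s])[€£$]?\d+(?:[.,]\d+)?(?![^\s%s])' % (before, after)
    )

# Allow ':' directly before a price and ',' or '.' directly after it.
pattern = build_price_pattern(allowed_before=":", allowed_after=",.")
text = "Total:$25,000. Deposit $45, balance $30."
print(pattern.findall(text))  # ['$25,000', '$45', '$30']
```

With both arguments left empty, the generated pattern behaves like the (?<!\S) and (?!\S) version.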

solution1
1 2012-09-25 02:22:59

solution2
1 ACCEPTED 2012-09-25 02:49:51
