I need to find prices in a text document. My code looks like this:
import re
sentence = "This is test text $25,000 $25,000$20,000 $30"
pattern = re.compile(ur'[$€£]?\d+([.,]\d+)?', re.UNICODE | re.MULTILINE | re.DOTALL)
print pattern.findall(sentence)
The desired result is:
['$25,000', '$30']
I don't want $25,000$20,000 in the result because it is not a valid price for my task. I need only whole-word matches.
But I get this result:
['$25,000', '$25,000', '$20,000', '$30']
How can I rewrite my regex to match only prices that are delimited by whitespace or punctuation?
This is as close as I can get it (although there are many people with more regex skills than I have):
pattern = re.compile(ur'(?:^|\s)[$€£]?\d+(?:[.,]\d+)?(?=\s|$)', re.UNICODE | re.MULTILINE | re.DOTALL)
print pattern.findall(sentence) # [' $25,000', ' $30']
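The leading space in that result appears because (?:^|\s) consumes the whitespace as part of the match. Below is a minimal sketch of one way around it. It assumes Python 3 (plain r'' strings instead of ur'') and wraps the price in a capturing group so that findall returns only the group. The name PRICE_RE is made up for illustration.

```python
import re

sentence = "This is test text $25,000 $25,000$20,000 $30"

# Same regex as above, but the price itself sits in a capturing group.
# When a pattern has exactly one group, findall returns the group contents,
# so the whitespace consumed by (?:^|\s) is left out of the result.
PRICE_RE = re.compile(r'(?:^|\s)([$€£]?\d+(?:[.,]\d+)?)(?=\s|$)', re.MULTILINE)

prices = PRICE_RE.findall(sentence)
print(prices)
```

The trailing delimiter is checked with a lookahead and is not consumed, so two prices separated by a single space should both still be found.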
Try the following:
ur'(?<!\S)[€£$]?\d+(?:[.,]\d+)?(?!\S)'
I added the negative assertions (?<!\S) and (?!\S), which mean "fail to match if preceded by a non-space" and "fail to match if followed by a non-space", respectively.
Tested:
>>> sentence = "$1234 $56$78.90 This is test text $25,000 $25,000$20,000 $30"
>>> pattern = re.compile(ur'(?<!\S)[€£$]?\d+(?:[.,]\d+)?(?!\S)', re.UNICODE | re.MULTILINE | re.DOTALL)
>>> print pattern.findall(sentence)
['$1234', '$25,000', '$30']
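For readers on Python 3, where the ur'' prefix is a syntax error, the same idea can be sketched as below. The regex is the one from this answer. The helper name find_prices is hypothetical, and the flags are dropped on the assumption that str patterns are Unicode-aware by default and that the pattern contains no ^, $ or dot for MULTILINE and DOTALL to affect.

```python
import re

# (?<!\S) : not preceded by a non-whitespace character
# (?!\S)  : not followed by a non-whitespace character
PRICE_RE = re.compile(r'(?<!\S)[€£$]?\d+(?:[.,]\d+)?(?!\S)')

def find_prices(text):
    """Return all whitespace-delimited prices found in text."""
    return PRICE_RE.findall(text)

sentence = "$1234 $56$78.90 This is test text $25,000 $25,000$20,000 $30"
print(find_prices(sentence))
```

Because the currency symbol is optional in the regex, a bare number such as 30 surrounded by whitespace would be returned as well.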
If you want to allow certain non-space characters before or after the match, replace \S by [^\s<chars>], where <chars> are the characters you want to allow. Example:
ur'(?<![^\s:])[€£$]?\d+(?:[.,]\d+)?(?![^\s,.])'
allows the pattern to be preceded by ":" and followed by "," or ".":
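>>> sentence = "$1234 $56$78.90 This is test text:$25,000. $45. $25,000$20,000 $30"
>>> pattern = re.compile(ur'(?<![^\s:])[€£$]?\d+(?:[.,]\d+)?(?![^\s,.])', re.UNICODE | re.MULTILINE | re.DOTALL)
>>> print pattern.findall(sentence)
['$1234', '$25,000', '$45', '$25', '$30']
Note that '$25' is also returned here: because "," is now allowed after a match, the regex can stop after $25 in $25,000$20,000.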
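One caveat with allowing "," or "." after the match: those characters also occur inside prices, so the regex can stop early, for example after $25 in $25,000$20,000. The sketch below (Python 3 syntax) compares the pattern above with a stricter variant. The variant is an assumption on my part and not part of the original answer: it accepts a trailing "," or "." only when that character is itself followed by whitespace or the end of the string.

```python
import re

sentence = "$1234 $56$78.90 This is test text:$25,000. $45. $25,000$20,000 $30"

# Pattern from the example above: a ',' or '.' may follow the match.
loose = re.compile(r'(?<![^\s:])[€£$]?\d+(?:[.,]\d+)?(?![^\s,.])')

# Assumed variant: an optional ',' or '.' after the price must itself be
# followed by whitespace or the end of the string.
strict = re.compile(r'(?<![^\s:])[€£$]?\d+(?:[.,]\d+)?(?=[,.]?(?:\s|$))')

loose_prices = loose.findall(sentence)
strict_prices = strict.findall(sentence)
print(loose_prices)   # includes the partial match '$25'
print(strict_prices)  # no partial match
```

The stricter lookahead trades some flexibility for safety: a price directly followed by punctuation and then more text without a space, such as "$5,then", would no longer match.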