Python regex lookahead non-ASCII character

I have most of this regex down, but I'm having trouble with a lookahead. I want to split a string into a postcode followed by two values, each of which is either text or a number. The numbers can be of the form:

1
1.5
1.55
11.55

The text for the middle bit can be "No minimum" and the text for the third bit can only be "Free".

For example:

"YO1£ 10Free" ==> YO1; 10; Free

or

"yo1£ 8£ 0.5" ==> yo1; 8; 0.5

or

"yo1No minimum£ 0.75" ==> yo1; No minimum; 0.75

I have the first bit done with this:

import re

string = "YO1£ 10Free"
patternPostCode = re.compile("[a-zA-Z]{1,2}[0-9][a-zA-Z0-9]?")
postCode = patternPostCode.findall(string)  # call findall on the compiled pattern

The figures in the string are found by:

patternCost = re.compile(r"""(?<=\xa3[ ])(
[0-9]?[0-9]?\.[0-9][0-9]|   # 1.55, 11.55 (longest alternatives first)
[0-9]?[0-9]?\.[0-9]|        # 1.5
[0-9][0-9]|                 # 11
[0-9])                      # 1
""", re.VERBOSE)

I have difficulty adding the 'or text equals "No minimum"' alternative to the patternCost search. I also can't manage to handle the Â character (byte \xc2) with a lookaround. Adding this at the end doesn't work:

(?<=\xc2)

Any help would be appreciated.

I came up with this on Python 2.7:

# -*- coding: utf-8 -*-
import re

raw_string = "YO1Â£ 10.01Free"
string = raw_string.decode('utf-8')
patternPostCode = re.compile(u"^(\w{3}.*)\s+(\d+\.?\d*)(\w+)$",flags=re.UNICODE)
postCode = patternPostCode.findall(string)

print postCode
print u'; '.join(postCode[0])

This returns:

[(u'YO1\xc2\xa3', u'10.01', u'Free')]
YO1Â£; 10.01; Free

First, the raw string I copied from SO appeared to be a bytestring, so I had to decode it to unicode (see "byte string vs. unicode string. Python"). I think you may be having unicode encoding errors in general: the Â symbol is a classic telltale of that.
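To make the Â telltale concrete, here is a minimal sketch of how it typically appears. It assumes Python 3 (not the 2.7 used above) and assumes the bytes were misread as Latin-1, which is a common cause but not necessarily what happened to your data:

```python
# Python 3 sketch: "£" becomes "Â£" when its UTF-8 bytes are read as Latin-1.
pound = "\u00a3"                           # the pound sign
utf8_bytes = pound.encode("utf-8")         # two bytes: 0xC2 0xA3
mojibake = utf8_bytes.decode("latin-1")    # each byte becomes one character

print(utf8_bytes)   # b'\xc2\xa3'
print(mojibake)     # Â£

# Undoing the wrong decode recovers the original text.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)     # £
```

That is why \xc2 shows up right before \xa3: together they are the UTF-8 encoding of a single £.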

I then made your regex unicode-friendly with the re.UNICODE flag. This means you can use \w to mean "alphanumeric" and \d to mean "digits" in a unicode-friendly way.

http://docs.python.org/2/library/re.html#module-re
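A quick way to see what the flag changes, sketched in Python 3 rather than 2.7. In Python 3, str patterns are Unicode-aware by default, so the sketch uses re.ASCII to show the narrower behaviour. The sample text is made up for illustration:

```python
import re

text = "caf\u00e9 42"   # "café 42"

# Python 3: str patterns are Unicode-aware by default, so \w matches "é".
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], roughly what Python 2.7 does
# for patterns compiled without re.UNICODE.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['café', '42']
print(ascii_words)    # ['caf', '42']
```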

Since regexes are often mistaken for line noise, lemme unpack for you:

u"^(\w{3}.*)\s+(\d+\.?\d*)(\w+)$"
  • ^ = start of line
  • (\w{3}.*) = match exactly three alphanumeric chars (\w{3}), followed by anything (.*), and grouped (that's the parentheses around the whole thing). I don't like the .* in general, but it was necessary to grab the £ junk. If you don't want it, move it outside the parentheses.
  • \s+ = at least one whitespace character. We'll throw this away.
  • (\d+\.?\d*) = match one or more digits, followed by an optional period, followed by zero or more digits. This'll match 10, 10., 10.0, 10.0000 and so on.
  • (\w+) = one or more alphanumeric chars
  • $ = match end of line
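The same breakdown can be kept next to the pattern itself with re.VERBOSE. This is a sketch in Python 3 syntax (an assumption, since the code above is 2.7), and the input here is a clean "£" string without the stray Â:

```python
import re

# The pattern from above, spelled out with inline comments via re.VERBOSE.
pattern = re.compile(r"""
    ^(\w{3}.*)      # group 1: three word characters, then anything (greedy)
    \s+             # at least one whitespace character, discarded
    (\d+\.?\d*)     # group 2: digits, optional period, optional digits
    (\w+)           # group 3: one or more word characters
    $               # end of the string
""", re.VERBOSE)

match = pattern.match("YO1\u00a3 10.01Free")   # "YO1£ 10.01Free"
print(match.groups())  # ('YO1£', '10.01', 'Free')
```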

It's certainly not the prettiest regex I've ever written, but hopefully it's enough to get you started.
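If you also need the "No minimum" and "Free" cases from the question, one possible approach is plain alternation instead of a lookaround. This is only a sketch under a few assumptions: Python 3, amounts of at most two digits before and after the decimal point (as in the question's examples), and an optional stray Â before each £. The names AMOUNT, ROW and split_row are made up for this example.

```python
import re

# Hypothetical helper names; not part of the original question or answer.
# Optional stray "Â", then "£", optional whitespace, then the number.
AMOUNT = r"\u00c2?\u00a3\s*(\d{1,2}(?:\.\d{1,2})?)"

ROW = re.compile(
    r"^([A-Za-z]{1,2}[0-9][A-Za-z0-9]?)"       # postcode, e.g. YO1
    + r"(?:" + AMOUNT + r"|(No minimum))"      # minimum: number or "No minimum"
    + r"(?:" + AMOUNT + r"|(Free))$"           # charge: number or "Free"
)

def split_row(text):
    """Return (postcode, minimum, charge), or None if the text does not fit."""
    match = ROW.match(text)
    if match is None:
        return None
    postcode, min_amount, min_text, fee_amount, fee_text = match.groups()
    # Exactly one of each amount/text pair is set, so "or" picks it.
    return postcode, min_amount or min_text, fee_amount or fee_text

samples = ("YO1\u00a3 10Free", "yo1\u00a3 8\u00a3 0.5", "yo1No minimum\u00a3 0.75")
for sample in samples:
    print("; ".join(split_row(sample)))
```

On the three examples from the question this prints "YO1; 10; Free", "yo1; 8; 0.5" and "yo1; No minimum; 0.75". Consuming the £ inside a non-capturing group keeps it out of the results without needing a lookbehind.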


 