
Fuzzy match with Regular Expression

This is my code so far:

for element in address1:
    z = re.match("^\d+$", element)

    if z:
        get_best_fuzzy("1 DEEPALI", address1)

In the above code, I am trying to find matching addresses in a text file. I would like an exact match on the house number and an approximate match (say 80%) on the rest of the address. However, the above code produces neither output nor an error.

Below is the sample for my addresses:

002 TOWER NO. 7 UNIWORLD GARDEN SEC. 47 SOWA ROAD GURGAON Haryana 122001 India
002 TOWER NO. 7 UNIWORLD GARDEN SECTOR-47 SONA ROAD GURGAON Haryana 122001 India
09;SHIVALIK BUNGLAOW; ANANDNAGAR CROSS ROAD; NEAR MADHUR HALL;SATELLITE; AHMEDABAD Gujarat 380015 India
1 DEEPALI; PITAMPURA DELHI Delhi 110034 India
10; BRIGHTON TOWERS; CROSS ROAD NO.2; LOKHANDWALA COMPLEX; ANDHERI WEST MUMBAI Maharashtra 400053 India
100 Vaishali; Pitampura Delhi Delhi 110034 India
100 Vaishali; Pitampura; DELHI Delhi 110034 India

Please be explanatory as I am new to this.

^ : asserts position at the start of a line

\d : matches a digit

+ : matches the preceding token one or more times

$ : asserts position at the end of a line

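So your regex string ^\d+$ only matches strings made up entirely of digits, such as 1 or 100, with no other characters before or after. A full address line such as 1 DEEPALI; PITAMPURA DELHI Delhi 110034 India has more text after the number, so it does not match.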
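As a quick check of that behaviour, here is a minimal sketch that runs the anchored pattern over two lines from the question's sample data plus a bare number. The list name samples and the extra "100" entry are assumptions added for illustration.

```python
import re

# Two address lines from the question, plus a bare number for contrast.
samples = [
    "1 DEEPALI; PITAMPURA DELHI Delhi 110034 India",
    "100 Vaishali; Pitampura Delhi Delhi 110034 India",
    "100",
]

# "^\d+$" requires the whole string to consist of digits,
# so the full address lines are rejected.
digits_only = [s for s in samples if re.match(r"^\d+$", s)]
print(digits_only)  # ['100']
```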

To match the house number at the start of an address, drop the $ and try ^\d+ instead:

>>> import re
>>> element = "1 DEEPALI"
>>> z = re.match(r'^\d+', element)
>>> z
<_sre.SRE_Match object; span=(0, 1), match='1'>
>>> z.group(0)
'1'
>>> if z:
...     print('A match is found!')
... 
A match is found!
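Building on this, one possible way to separate the house number from the rest of the address is to capture both parts in a single pattern. This is only a sketch: the helper name split_house_number and the choice to skip separators with \W* are assumptions, not something from the question.

```python
import re

# Leading digits, then any separators (spaces, ";"), then the remainder.
HOUSE_RE = re.compile(r"^(\d+)\W*(.*)$")

def split_house_number(address):
    """Return (house_number, remainder), or None if there are no leading digits."""
    m = HOUSE_RE.match(address.strip())
    if m is None:
        return None
    return m.group(1), m.group(2)

print(split_house_number("1 DEEPALI; PITAMPURA DELHI Delhi 110034 India"))
# ('1', 'DEEPALI; PITAMPURA DELHI Delhi 110034 India')
print(split_house_number("09;SHIVALIK BUNGLAOW; ANANDNAGAR CROSS ROAD"))
# ('09', 'SHIVALIK BUNGLAOW; ANANDNAGAR CROSS ROAD')
print(split_house_number("AHMEDABAD Gujarat 380015 India"))
# None
```

The house number is kept as a string, so "002" and "2" are treated as different; convert with int() if they should compare equal.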

You can test your regex with an online regex tester such as https://regex101.com/

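I'm not sure what your function get_best_fuzzy does, so the problem could also be arising from there. Note too that your loop never prints or stores its return value, so you would see no output even when the regex matches.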
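Since get_best_fuzzy is not shown, here is a rough sketch of what "exact house number, roughly 80% similar remainder" could look like using only the standard library. It does not reproduce get_best_fuzzy; the helper names split_house_number and similar_addresses, the 0.8 threshold, and the case-insensitive comparison are all assumptions, and difflib.SequenceMatcher is just one of several possible similarity measures.

```python
import re
from difflib import SequenceMatcher

HOUSE_RE = re.compile(r"^(\d+)\W*(.*)$")

def split_house_number(address):
    """Return (house_number, remainder), or None if there are no leading digits."""
    m = HOUSE_RE.match(address.strip())
    return (m.group(1), m.group(2)) if m else None

def similar_addresses(query, candidates, threshold=0.8):
    """Hypothetical helper: keep candidates whose house number equals the
    query's and whose remaining text has a similarity ratio >= threshold."""
    parsed = split_house_number(query)
    if parsed is None:
        return []
    number, rest = parsed
    results = []
    for candidate in candidates:
        cand = split_house_number(candidate)
        if cand is None or cand[0] != number:
            continue  # house number must match exactly
        score = SequenceMatcher(None, rest.lower(), cand[1].lower()).ratio()
        if score >= threshold:
            results.append((candidate, score))
    # Best matches first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

addresses = [
    "002 TOWER NO. 7 UNIWORLD GARDEN SEC. 47 SOWA ROAD GURGAON Haryana 122001 India",
    "1 DEEPALI; PITAMPURA DELHI Delhi 110034 India",
    "100 Vaishali; Pitampura Delhi Delhi 110034 India",
]
query = "002 TOWER NO. 7 UNIWORLD GARDEN SECTOR-47 SONA ROAD GURGAON Haryana 122001 India"

matches = similar_addresses(query, addresses)
for address, score in matches:
    print(f"{score:.2f}  {address}")
```

The ratio compares whole strings, so a short query such as "1 DEEPALI" scores low against a full address line; compare like with like, or lower the threshold, if the queries are partial.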
