How to split a string by spaces, split the resulting elements by special characters and numbers, and then join them again?

I want to replace certain words in a string with their corresponding values from a dictionary and leave the rest of the string unchanged. For example, given the input:

standarisationn("well-2-34 2   @$%23beach bend com")

I want output as:

"well-2-34 2 @$%23bch bnd com"

The code I am using is:

import re

def standarisationn(addr):
    a = re.sub(',', ' ', addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                  "arcade":"arc",
                  "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                  "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                  "beach":"bch",
                  "bend":"bnd",
                  "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                  "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                  "bottm":"bot","bottom":"bot",
                  "branch":"br","brnch":"br",
                  "brdge":"brg","bridge":"brg",
                  "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                  "camp":"cmp",
                  "canyn":"cny","canyon":"cny","cnyn":"cny",
                  "southwest":"sw" ,"northwest":"nw"}

    temp = re.findall(r"[A-Za-z0-9]+|\S", a)
    print(temp)
    res = []
    for wrd in temp:
        res.append(lookp_dict.get(wrd, wrd))
    res = ' '.join(res)
    return str(res)

but it gives the wrong output:

'well - 2 - 34 2 @ $ % 23beach bnd com'

That is, it inserts too many spaces and does not even convert "beach" to "bch". My idea is to first split the string by spaces, then split the resulting elements by special characters and numbers, apply the dictionary, join the pieces that were separated by special characters back together without spaces, and finally join the whole list with spaces. Can anyone suggest how to go about this, or a better method?
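For reference, the split-then-join idea described above could be sketched roughly as follows. This is only an illustration under a few assumptions: the function name standardise_by_splitting and the shortened LOOKUP table are hypothetical, and "special characters and numbers" is taken to mean any run of non-letter characters.

```python
import re

# Small subset of the lookup table from the question, for illustration only.
LOOKUP = {"beach": "bch", "bend": "bnd", "avenue": "ave"}

def standardise_by_splitting(addr):
    # Treat commas as separators, then split on runs of whitespace.
    tokens = addr.replace(",", " ").split()
    out = []
    for token in tokens:
        # The capturing group keeps the non-letter runs in the result,
        # so "23beach" becomes ["", "23", "beach"].
        parts = re.split(r"([^A-Za-z]+)", token)
        # Look up only exact letter runs; everything else passes through,
        # and the pieces are rejoined without any separator.
        out.append("".join(LOOKUP.get(p, p) for p in parts))
    # Tokens are rejoined with a single space.
    return " ".join(out)

print(standardise_by_splitting("well-2-34 2   @$%23beach bend com"))
```

Because the pieces of each token are rejoined with an empty string, no extra spaces appear around "-", "@" or "%", which is the problem with joining every regex match by a space.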

You can build your regular expression from the keys of your dictionary, ensuring that they are not part of a longer word (i.e. neither directly preceded nor directly followed by a letter):

import re
def standarisationn(addr):
    addr = re.sub(r'(,|\s+)', " ", addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                "arcade":"arc",
                "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                "beach":"bch",
                "bend":"bnd",
                "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                "bottm":"bot","bottom":"bot",
                "branch":"br","brnch":"br",
                "brdge":"brg","bridge":"brg",
                "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                "camp":"cmp",
                "canyn":"cny","canyon":"cny","cnyn":"cny",
                "southwest":"sw" ,"northwest":"nw"}

    for wrd in lookp_dict:
        addr = re.sub(rf'(?<=[^a-zA-Z]){wrd}(?=[^a-zA-Z])', lookp_dict[wrd], addr)
    return addr

print(standarisationn("well-2-34 2   @$%23beach bend com"))

The expression is built in three parts:

  • (?<=[^a-zA-Z]) is a lookbehind (i.e. a zero-width assertion that consumes no characters), checking that the preceding character is not a letter
  • {wrd} is the key of your dictionary
  • (?=[^a-zA-Z]) is a lookahead (also zero-width), checking that the following character is not a letter

Output:

well-2-34 2 @$%23bch bnd com
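One caveat with the lookarounds described above: they require a non-letter character to actually be present on each side, so a key at the very start or very end of the string is not replaced. A possible variation, not part of the original answer, is to use negative lookarounds, which only require that no letter is adjacent. The short comparison below uses a made-up input string:

```python
import re

text = "beach road bend"

# Pattern style used above: a non-letter character must exist on each side.
positive = r"(?<=[^a-zA-Z])beach(?=[^a-zA-Z])"
# Negative lookarounds: only require that no letter is adjacent,
# so the start and end of the string are accepted too.
negative = r"(?<![a-zA-Z])beach(?![a-zA-Z])"

result_positive = re.sub(positive, "bch", text)
result_negative = re.sub(negative, "bch", text)

print(result_positive)  # beach road bend  (unchanged)
print(result_negative)  # bch road bend
```

For the example input in the question both forms give the same result, since "beach" and "bend" are each surrounded by non-letter characters.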

Edit: you can compile a whole expression and use re.sub only once if you replace the loop with:

repl_pattern = re.compile(rf"(?<=[^a-zA-Z])({'|'.join(lookp_dict.keys())})(?=[^a-zA-Z])")
addr = re.sub(repl_pattern, lambda x: lookp_dict[x.group(1)], addr)

This should be much faster as your dictionary grows, because the string is scanned once with a single expression containing all your dictionary keys instead of once per key:

  • ({'|'.join(lookp_dict.keys())}) is interpreted as (allee|alley|...
  • a lambda function passed to re.sub as the replacement is called for each match and returns the corresponding value in lookp_dict
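Put together, the single-pass version might look like the sketch below. It rests on some assumptions that are not in the original answer: the names LOOKUP, PATTERN and standardise are hypothetical, the table is shortened for brevity, the keys are passed through re.escape in case one ever contains a regex metacharacter, and negative lookarounds are swapped in so that keys at the start or end of the string are matched as well.

```python
import re

# Subset of the lookup table from the answer, for illustration only.
LOOKUP = {"av": "ave", "avenue": "ave", "beach": "bch", "bend": "bnd"}

# Compiled once at import time rather than on every call.
# re.escape guards against keys that contain regex metacharacters.
PATTERN = re.compile(
    r"(?<![a-zA-Z])(" + "|".join(map(re.escape, LOOKUP)) + r")(?![a-zA-Z])"
)

def standardise(addr):
    # Same normalisation as the answer: commas and whitespace runs become a space.
    addr = re.sub(r"(,|\s+)", " ", addr)
    # The callable receives each match object and returns its replacement.
    return PATTERN.sub(lambda m: LOOKUP[m.group(1)], addr)

print(standardise("well-2-34 2   @$%23beach bend com"))
```

The order of the alternatives does not affect correctness here: if a short key such as "av" matches only the beginning of "avenue", the lookahead fails and the regex engine backtracks to try the remaining alternatives.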

