
Regular expressions in Python: how to retain special characters

Below is my unclean text string:

text = 'this/r/n/r/nis a non-U.S disclosures/n/n/r/r analysis agreements disclaimer./r/n/n/nPlease keep it confidential' 

Below is the regexp I'm using:

 ' '.join(re.findall(r'\b(\w+)\b', text))

My output is:

'this is a non US disclosures analysis agreements disclaimer. Please keep it confidential'

My expected output is:

 'this is a non-U.S disclosures analysis agreements disclaimer. Please keep it confidential'

I need to retain the special characters, and there should be exactly one space between words. Can anyone help me alter my regexp?

Hope this works for you!

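import re

text = 'this/r/n/r/nis a non-US disclosures/n/n/r/r analysis agreements disclaimer./r/n/n/nPlease keep it confidential'

val = re.sub(r'/.?', ' ', text)   # replace each "/" and the character after it with a space
val1 = re.sub(r'\s+', ' ', val)   # collapse runs of whitespace into a single space
print(val1)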
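A slightly narrower variant of the same idea, as a minimal sketch: it assumes the markers in the string really are the literal two-character sequences "/r" and "/n" (as the sample shows), and it only removes those rather than any slash followed by any character. The function name clean_literal_markers is made up for illustration.

```python
import re

def clean_literal_markers(text):
    # Replace each run of literal "/r" and "/n" markers with one space.
    # Assumption: the markers are a forward slash plus the letter r or n,
    # not real carriage-return or newline characters.
    spaced = re.sub(r'(?:/[rn])+', ' ', text)
    # Collapse any remaining run of whitespace to one space and trim the ends.
    return re.sub(r'\s+', ' ', spaced).strip()

text = 'this/r/n/r/nis a non-U.S disclosures/n/n/r/r analysis agreements disclaimer./r/n/n/nPlease keep it confidential'
print(clean_literal_markers(text))
# this is a non-U.S disclosures analysis agreements disclaimer. Please keep it confidential
```

Note that a word that legitimately contains "/r" or "/n" would still be split by this pattern, so it is only safe when the markers cannot occur inside real words.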

Use a more specific word boundary than \b. Because $ (which marks the end of the string) can't be placed inside square brackets, you have to write the alternation out explicitly as $|\n|\r| followed by a space; the (?=...) is a non-consuming lookahead, much like \b. It is also safer here to use a non-greedy, non-empty accumulator (the + sign makes it non-empty and the question mark makes it non-greedy). Note that this pattern matches real carriage-return and newline characters, not the literal text /r and /n:

re.findall(r'[^\n\r ]+?(?=$|\n|\r| )', text)

['this', 'is', 'a', 'non-U.S', 'disclosures', 'analysis', 'agreements', 'disclaimer.', 'Please', 'keep', 'it', 'confidential']
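If the string contains real whitespace characters, the same result can probably be reached without a regular expression. The sketch below is an alternative to the findall approach, not the answerer's method: it relies on str.split() with no arguments, which splits on any run of whitespace, and assumes the sample text uses actual \r and \n characters. The name normalize_whitespace is hypothetical.

```python
def normalize_whitespace(text):
    # str.split() with no arguments splits on runs of any whitespace
    # (spaces, \r, \n, tabs) and drops empty strings, so punctuation
    # inside words such as "non-U.S" is left untouched.
    return ' '.join(text.split())

# Assumption: the separators are real carriage returns and newlines.
text = 'this\r\n\r\nis a non-U.S disclosures\n\n\r\r analysis agreements disclaimer.\r\n\n\nPlease keep it confidential'
print(normalize_whitespace(text))
# this is a non-U.S disclosures analysis agreements disclaimer. Please keep it confidential
```

Joining the list returned by the findall call above with ' '.join(...) should give the same single-spaced string.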
