
Regular expressions in Python need to retain special characters

Below is my unclean text string:

text = 'this/r/n/r/nis a non-U.S disclosures/n/n/r/r analysis agreements disclaimer./r/n/n/nPlease keep it confidential' 

Below is the regexp I'm using:

 ' '.join(re.findall(r'\b(\w+)\b', text))

My output is:

'this is a non US disclosures analysis agreements disclaimer. Please keep it confidential'

My expected output is:

 'this is a non-U.S disclosures analysis agreements disclaimer. Please keep it confidential'

I need to retain the special characters, and there should be exactly one space between the words. Can anyone help me alter my regexp?

Hope this works for you!

import re

text = 'this/r/n/r/nis a non-US disclosures/n/n/r/r analysis agreements disclaimer./r/n/n/nPlease keep it confidential'

# Replace every slash (plus the character after it, if any) with a space
val = re.sub(r'(/.?)', " ", text)
# Collapse each run of whitespace into a single space
val1 = re.sub(r'\s+', " ", val)
print(val1)
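Note that the pattern `/.?` removes any slash together with the character after it, so a word such as `either/or` would also be altered. A narrower variant, sketched below, removes only the literal `/r` and `/n` markers. It assumes the input really contains those two-character sequences rather than actual carriage returns and newlines; the helper name `collapse_markers` is made up for illustration.

```python
import re

def collapse_markers(text):
    # Hypothetical helper: treat each run of literal "/r" or "/n" markers
    # and ordinary whitespace as one separator, replaced by a single space.
    return re.sub(r'(?:/[rn]|\s)+', ' ', text).strip()

text = 'this/r/n/r/nis a non-U.S disclosures/n/n/r/r analysis agreements disclaimer./r/n/n/nPlease keep it confidential'
cleaned = collapse_markers(text)
print(cleaned)
```

Because `-` and `.` are never part of the separator pattern, `non-U.S` and `disclaimer.` come through unchanged. A slash followed by a word that starts with `r` or `n` would still be misread as a marker, so this remains a heuristic.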

Use a more specific word boundary than \b. ($, which marks the end of the string, can't be placed inside square brackets, so you have to make the alternation explicit in $|\n|\r| , and ?= is a non-consuming lookahead, much like \b.) It is also safer here to use a non-greedy, non-empty accumulator (the + sign makes it non-empty and the question mark makes it non-greedy):

re.findall(r'[^\n\r ]+?(?=$|\n|\r| )', text)

['this', 'is', 'a', 'non-U.S', 'disclosures', 'analysis', 'agreements', 'disclaimer.', 'Please', 'keep', 'it', 'confidential']
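To get the single-spaced string the question asks for, the token list can be joined with spaces. The sketch below assumes the separators in the input are real carriage-return and newline characters (`\r`, `\n`), which is what this pattern matches; it will not split on the literal `/r` and `/n` text shown in the question.

```python
import re

# Assumption: the separators are actual "\r" and "\n" characters.
text = ('this\r\n\r\nis a non-U.S disclosures\n\n\r\r analysis agreements '
        'disclaimer.\r\n\n\nPlease keep it confidential')

# Each token is a run of non-separator characters that is followed by
# a separator or by the end of the string.
tokens = re.findall(r'[^\n\r ]+?(?=$|\n|\r| )', text)

# Join with exactly one space between tokens.
result = ' '.join(tokens)
print(result)
```

For this input, the plain string method `' '.join(text.split())` gives the same result, since `str.split()` with no arguments splits on any run of whitespace.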


 