
python regex replace unicode

In the first test string, I'm trying to replace the Unicode right-arrows character (») in the middle of the text with a space, but it doesn't seem to work.

In general, I'm trying to remove every run of one or more Unicode "non-word" characters, while keeping words that are a mixture of a-z0-9 and Unicode characters, or that consist only of \w characters.

# -*- coding: utf-8 -*-
import re
str = 'hi… » Test'
str = 're of… » Pr'
str = 're of… » Pr | removepipeaswell'
print str
str = re.sub(r' [^a-z0-9]+ ', ' ', str , re.UNICODE|re.MULTILINE)
# str = re.sub(r' [^\p{Alpha}] ', ' ', str, re.UNICODE)
print str
're of… Pr removepipeaswell' #expected output

str_nbsp = 'afds » asf'

Edit: added another test string. I don't want to remove the "of..." (Unicode dots); I only want to remove runs of Unicode (non-word) characters.

Edit: using this works for the test case, but not on the full HTML: it appears to replace only the matches in the first half of the string and then ignores the rest.

str = re.sub(r' [^a-z0-9]+ ', ' ', str , re.UNICODE|re.MULTILINE)

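Edit: it had to be something stupid, like not reading the argument list properly. The fourth positional argument of re.sub is count, not flags, so the flag value was being used as a cap on the number of replacements: http://bytes.com/topic/python/answers/689341-sub-does-not-replace-all-occurences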

[Whoever just deleted their response: thank you for your help.]

str = re.sub(r' [^a-z0-9]+ ', ' ', str)
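To show why only part of a long string was being replaced, here is a minimal sketch. It assumes Python 3 rather than the Python 2 used above, and the input text and the names flag_value, limited and full are made up for illustration. It compares using the flag value as count with passing the flags by keyword.

```python
import re

# Hypothetical input: 50 words joined by 49 ' » ' separators.
text = ' » '.join(['word'] * 50)
pattern = r' [^a-z0-9]+ '

# Signature: re.sub(pattern, repl, string, count=0, flags=0).
# A fourth positional argument is therefore taken as count.
# re.UNICODE | re.MULTILINE has the integer value 40, so using it
# as count caps the number of substitutions at 40.
flag_value = int(re.UNICODE | re.MULTILINE)
limited = re.sub(pattern, ' ', text, count=flag_value)

# Passing the flags by keyword leaves count at its default of 0,
# which means "replace every match".
full = re.sub(pattern, ' ', text, flags=re.UNICODE | re.MULTILINE)

print(flag_value, limited.count('»'), full.count('»'))
```

With count capped at 40, the last 9 separators survive, which matches the symptom of the second half of a long string being ignored.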

The final test string, str_nbsp, did not match the regex above because one of its space characters is actually a non-breaking space (U+00A0). I used www.regexr.com and hovered over each character to figure this out.

str = re.sub(r' [^a-z0-9]+ ', ' ', str)
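One possible way to handle the non-breaking space is sketched below. It assumes Python 3. Which of the two gaps in str_nbsp holds the U+00A0 is also an assumption, since the post does not say, and the names plain and nbsp_aware are made up for illustration. The idea is to accept either kind of space on both sides of the run.

```python
import re

# Assumption: the gap before '»' is a non-breaking space (U+00A0),
# and the gap after it is an ordinary space.
str_nbsp = 'afds\u00a0» asf'

# The original pattern needs an ordinary space on both sides,
# so it finds nothing here and returns the string unchanged.
plain = re.sub(r' [^a-z0-9]+ ', ' ', str_nbsp)

# Allow either an ordinary space or U+00A0 on each side of the run.
nbsp_aware = re.sub('[ \u00a0][^a-z0-9]+[ \u00a0]', ' ', str_nbsp)

print(repr(plain))
print(repr(nbsp_aware))
```

The second call collapses the whole run, non-breaking space included, into one ordinary space.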
