
Python regex replacing \u2022

This is my string:

raw_list = u'Software Engineer with a huge passion for new and innovative products. Experienced gained from working in both big and fast-growing start-ups.  Specialties \u2022 Languages and Frameworks: JavaScript (Nodejs, React), Android, Ruby on Rails 4, iOS (Swift) \u2022 Databases: Mongodb, Postgresql, MySQL, Redis \u2022 Testing Frameworks: Mocha, Rspec xxxx Others: Sphinx, MemCached, Chef.'

I'm trying to replace the \u2022 with just a space.

x=re.sub(r'\u2022', ' ', raw_list)

But it's not working. What am I doing wrong?

You're using a raw string, marked by the r prefix. That tells Python to take the string literally instead of processing escape sequences (such as \n).

>>> r'\u2022'
'\\u2022'

The doubled backslash in the repr shows that the string holds a literal backslash followed by u2022, not the bullet character. Use u'\u2022' instead and it will work.

Note that since you're doing a simple replacement you can just use the str.replace method:

x = raw_list.replace(u'\u2022', ' ')

You only need a regex replace for complicated pattern matching.
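To make the difference between the raw pattern and the real bullet character concrete, here is a minimal sketch. It assumes Python 3.3 or later, where the u prefix is accepted but optional; the text in sample is made up for illustration.

```python
import re

# Hypothetical sample text containing two real bullet characters (U+2022).
sample = u'Specialties \u2022 Languages \u2022 Databases'

# A raw literal keeps the backslash: six characters, not one bullet.
raw_pattern = r'\u2022'
bullet = u'\u2022'
print(len(raw_pattern), len(bullet))  # 6 1

# Plain replacement, no regex needed.
with_replace = sample.replace(bullet, ' ')

# Regex replacement using the real bullet character as the pattern.
with_regex = re.sub(bullet, ' ', sample)

print(with_replace == with_regex)  # True
```

Both calls give the same result here, which is why str.replace is usually the simpler choice for a fixed character.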

In Python 2, unless you use a Unicode string literal, the \uhhhh escape sequence has no meaning: not to Python, and not to the re module. Add the u prefix:

re.sub(ur'\u2022', ' ', raw_list)

Note the ur there: that's a Python 2 raw unicode string literal, which still interprets \uhhhh unicode escape sequences (but is otherwise identical to the standard raw string literal mode). The Python 2 re module doesn't support such escape sequences itself (but it does support most other Python string escape sequences).

Not that you need a regular expression here; a simple unicode.replace() would suffice:

raw_list.replace(u'\u2022', u' ')

or you can use unicode.translate() :

raw_list.translate({0x2022: u' '})
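translate() becomes more useful once several characters need to be mapped in one pass. The sketch below assumes Python 3, where str.translate() accepts a dict keyed by code point. The two extra bullet-like characters and the sample text are assumptions added for illustration, not part of the question.

```python
# Map each unwanted code point to its replacement (None would delete it).
bullet_map = {
    0x2022: u' ',   # BULLET
    0x2023: u' ',   # TRIANGULAR BULLET
    0x00B7: u' ',   # MIDDLE DOT
}

# Hypothetical sample text mixing the three bullet-like characters.
sample = u'Mocha\u2022Rspec\u2023Sphinx\u00b7Chef'

cleaned = sample.translate(bullet_map)
print(cleaned)  # Mocha Rspec Sphinx Chef
```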

My approach is to change the regex pattern so that it strips every non-ASCII character, not just the bullet. You might try:

re.sub(r'[^\x00-\x7F]+','',raw_list)

Out[1]: u'Software Engineer with a huge passion for new and innovative products. Experienced gained from working in both big and fast-growing start-ups. Specialties Languages and Frameworks: JavaScript (Nodejs, React), Android, Ruby on Rails 4, iOS (Swift) Databases: Mongodb, Postgresql, MySQL, Redis Testing Frameworks: Mocha, Rspec xxxx Others: Sphinx, MemCached, Chef.'
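One thing to keep in mind with this pattern is that it removes every non-ASCII character, including accented letters, and substitutes nothing in their place. The following sketch (assuming Python 3, with a made-up sample string) shows that side effect next to a variant that substitutes a space and then collapses repeated whitespace.

```python
import re

# Hypothetical sample: bullets plus an accented letter, which is also non-ASCII.
sample = u'Caf\u00e9 skills \u2022 Redis \u2022 MySQL'

# The pattern from the answer: drop every run of non-ASCII characters.
stripped = re.sub(r'[^\x00-\x7F]+', '', sample)

# Variant: substitute a space, then collapse repeated whitespace.
spaced = re.sub(r'[^\x00-\x7F]+', ' ', sample)
collapsed = re.sub(r'\s+', ' ', spaced).strip()

print(repr(stripped))
print(repr(collapsed))
```

If only the bullet should go, replacing U+2022 explicitly is the safer option.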

The key is to add the unicode u prefix in front of the unicode character that you're trying to find, in this case \u2022, which is the unicode character for a bullet. If your text contains unicode characters, then it is actually unicode text as opposed to a string (you can confirm by printing out its repr and looking for the u at the beginning). See the example below, where I search for a unicode bullet character using regular expressions (RegEx) on both a string and unicode text:

Import the regular expressions package:
import re

Unicode text:
my_unicode = u"""\u2022 Here\'s a string of data. \n<br/>\u2022There are new line characters \n, HTML line break tags <br/>, and bullets \u2002 together in a sequence.\n<br/>\u2022 Our goal is to use RegEx to identify the sequences."""
type(my_unicode)    # unicode

String:
my_string = """\u2022 Here\'s a string of data. \n<br/>\u2022There are new line characters \n, HTML line break tags <br/>, and bullets \u2002 together in a sequence.\n<br/>\u2022 Our goal is to use RegEx to identify the sequences."""
type(my_string)     # str

We successfully find the first piece of text that we're looking for, which doesn't yet contain the unicode characters:
re.findall('\n<br/>', my_unicode)
re.findall('\n<br/>', my_string)

With the addition of the unicode character, neither substring can be found:
re.findall('\n<br/>\u2022', my_unicode)
re.findall('\n<br/>\u2022', my_string)

Adding four backslashes works for the string, but it does not work for the unicode text:
re.findall('\n<br/>\\\\u', my_unicode)
re.findall('\n<br/>\\\\u', my_string)

Solution: include the unicode u prefix in front of the unicode character:
re.findall('\n<br/>' u'\u2022', my_unicode)
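For comparison, here is a rough sketch of the same two situations in Python 3, where every str is Unicode and the unicode/str split above turns into "real bullet character" versus "literal backslash-u text". The variable names and sample strings are made up for illustration.

```python
import re

# Text with a real bullet character (U+2022) after each "\n<br/>".
real_bullets = '\u2022 First.\n<br/>\u2022 Second.\n<br/>\u2022 Third.'

# Text where "\u2022" is six literal characters, as in unparsed escaped data.
literal_escapes = r'\u2022 First.' + '\n<br/>' + r'\u2022 Second.'

# Match a newline, the <br/> tag, then a real bullet.
found_real = re.findall('\n<br/>\u2022', real_bullets)

# Match a newline, the <br/> tag, then a literal backslash followed by "u2022".
found_literal = re.findall(r'\n<br/>\\u2022', literal_escapes)

print(len(found_real), len(found_literal))  # 2 1
```

Each pattern only matches the kind of text it was written for, so it helps to check which form the data is in (for example with repr()) before picking one.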


 