
Python regex replacing \u2022

This is my string:

raw_list = u'Software Engineer with a huge passion for new and innovative products. Experienced gained from working in both big and fast-growing start-ups.  Specialties \u2022 Languages and Frameworks: JavaScript (Nodejs, React), Android, Ruby on Rails 4, iOS (Swift) \u2022 Databases: Mongodb, Postgresql, MySQL, Redis \u2022 Testing Frameworks: Mocha, Rspec xxxx Others: Sphinx, MemCached, Chef.'

I'm trying to replace the \u2022 with just a space.

x=re.sub(r'\u2022', ' ', raw_list)

But it's not working. What am I doing wrong?

You're using a raw string, marked by the r prefix. That tells Python to take the string literally instead of processing escape sequences (such as \n).

>>> r'\u2022'
'\\u2022'

The repr shows a doubled backslash, which means the string holds a literal backslash followed by u2022 rather than the bullet character. Use u'\u2022' instead and it will work.
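To make the difference concrete, here is a minimal sketch that compares the two literals. It assumes Python 3, where every str is Unicode and the u prefix is optional; the variable names are only illustrative.

```python
# Assumes Python 3, where plain string literals are already Unicode.
raw = r'\u2022'     # raw literal: backslash, 'u', '2', '0', '2', '2'
cooked = '\u2022'   # normal literal: the single BULLET character

print(len(raw))               # 6
print(len(cooked))            # 1
print(raw == '\\u2022')       # True
print(ord(cooked) == 0x2022)  # True
```

The raw literal never contains the bullet, so any exact-text comparison against it fails.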

Note that since you're doing a simple replacement, you can just use the str.replace method:

x = raw_list.replace(u'\u2022', ' ')

You only need a regex replace for complicated pattern matching.
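As a rough illustration of where a pattern starts to pay off, the sketch below also absorbs the whitespace around each bullet. It assumes Python 3, and the sample text is made up for the example.

```python
import re

# Hypothetical sample with uneven spacing around the bullets.
text = 'Specialties \u2022 Languages \u2022Databases\u2022   Testing'

# Plain replacement swaps each bullet for a space and keeps the
# whitespace that already surrounded it.
simple = text.replace('\u2022', ' ')

# A pattern can swallow the surrounding whitespace too. The literal is
# not raw, so Python inserts the bullet itself and the backslashes in
# front of "s" are doubled.
collapsed = re.sub('\\s*\u2022\\s*', ' ', text)

print(collapsed)  # Specialties Languages Databases Testing
```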

Unless you use a Unicode string literal, the \uhhhh escape sequence has no meaning. Not to Python, and not to the re module. Add the u prefix:

re.sub(ur'\u2022', ' ', raw_list)

Note the ur there; that's a raw unicode string literal. It still interprets \uhhhh unicode escape sequences, but is otherwise identical to the standard raw string literal mode. In Python 2, the re module doesn't support such escape sequences itself (but it does support most other Python string escape sequences).
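For readers on Python 3, the picture is a little different: the ur prefix no longer exists, and since Python 3.3 the re module understands \uhhhh in patterns on its own. The following is a sketch under that assumption, not part of the original Python 2 answer.

```python
import re

text = 'Redis \u2022 Testing'

# Normal literal: Python turns \u2022 into the bullet before re sees it.
a = re.sub('\u2022', ' ', text)

# Raw literal: re receives backslash-u-2-0-2-2 and, on Python 3.3+,
# interprets the escape itself.
b = re.sub(r'\u2022', ' ', text)

print(a == b)  # True
```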

Not that you need a regular expression here; a simple unicode.replace() would suffice:

raw_list.replace(u'\u2022', u' ')

or you can use unicode.translate():

raw_list.translate({0x2022: u' '})
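The translate() approach scales naturally when several characters need the same treatment. This sketch assumes Python 3, where str.translate accepts a dict keyed by code point; the two extra code points and the sample text are illustrative additions, not part of the question.

```python
# Code points to blank out. Only U+2022 comes from the question;
# the other two entries are illustrative assumptions.
table = {
    0x2022: ' ',  # BULLET
    0x25E6: ' ',  # WHITE BULLET
    0x00A0: ' ',  # NO-BREAK SPACE
}

text = 'Mongodb \u2022 Redis \u25e6 MySQL\u00a0Chef'
cleaned = text.translate(table)

print(cleaned)  # Mongodb   Redis   MySQL Chef
```

Each mapped character is replaced in a single pass, with no pattern compilation involved.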

This is my approach: change the regex pattern so that it removes every run of non-ASCII characters. You might try:

re.sub(r'[^\x00-\x7F]+','',raw_list)

Out[1]: u'Software Engineer with a huge passion for new and innovative products. Experienced gained from working in both big and fast-growing start-ups.  Specialties  Languages and Frameworks: JavaScript (Nodejs, React), Android, Ruby on Rails 4, iOS (Swift)  Databases: Mongodb, Postgresql, MySQL, Redis  Testing Frameworks: Mocha, Rspec xxxx Others: Sphinx, MemCached, Chef.'
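One thing worth keeping in mind with this pattern is that it drops every non-ASCII character, not only bullets. A small sketch, assuming Python 3 and a made-up sample string:

```python
import re

# Hypothetical sample: accented letters plus one bullet.
text = 'caf\u00e9 \u2022 na\u00efve'

# Everything outside the ASCII range is deleted, accents included.
stripped = re.sub(r'[^\x00-\x7F]+', '', text)

print(stripped)  # caf  nave
```

If the text may contain legitimate non-ASCII letters, targeting U+2022 specifically is the safer choice.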

The key is to add the unicode u prefix to the literal that holds the unicode character you're trying to find, in this case \u2022, which is the unicode character for a bullet. If your text contains unicode characters, then it is actually unicode text as opposed to a byte string (you can confirm this by displaying its repr and looking for the u at the beginning). See the example below, where I search for a unicode bullet character using regular expressions (RegEx) on both a string and unicode text:

Import the regular expressions package:

import re

Unicode text:

my_unicode = u"""\u2022 Here\'s a string of data. \n<br/>\u2022There are new line characters \n, HTML line break tags <br/>, and bullets \u2022 together in a sequence.\n<br/>\u2022 Our goal is to use RegEx to identify the sequences."""
type(my_unicode)    # unicode

String:

my_string = """\u2022 Here\'s a string of data. \n<br/>\u2022There are new line characters \n, HTML line break tags <br/>, and bullets \u2022 together in a sequence.\n<br/>\u2022 Our goal is to use RegEx to identify the sequences."""
type(my_string)     # str

We successfully find the first piece of text that we're looking for, which doesn't yet contain the unicode character:

re.findall('\n<br/>', my_unicode)
re.findall('\n<br/>', my_string)

With the addition of the unicode character, neither substring can be found:

re.findall('\n<br/>\u2022', my_unicode)
re.findall('\n<br/>\u2022', my_string)

Adding four backslashes works for the string, but it does not work for the unicode text:

re.findall('\n<br/>\\\\u2022', my_unicode)
re.findall('\n<br/>\\\\u2022', my_string)

Solution: include the unicode u prefix on the part of the pattern that holds the unicode character:

re.findall('\n<br/>' u'\u2022', my_unicode)
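The same distinction can be reproduced on Python 3, where the question becomes whether the text holds a real bullet or the six literal characters \u2022. This is a sketch under that assumption; real_text, escaped_text, and the two pattern names are made up for the example.

```python
import re

# Text containing an actual BULLET character.
real_text = 'intro\n<br/>\u2022 item'
# Text containing a literal backslash followed by "u2022".
escaped_text = 'intro\n<br/>\\u2022 item'

# Pattern holding the bullet itself.
bullet_pattern = '\n<br/>\u2022'
# Four backslashes in the source yield two in the pattern, which the
# regex engine reads as one literal backslash.
escaped_pattern = '\n<br/>\\\\u2022'

print(len(re.findall(bullet_pattern, real_text)))      # 1
print(len(re.findall(bullet_pattern, escaped_text)))   # 0
print(len(re.findall(escaped_pattern, real_text)))     # 0
print(len(re.findall(escaped_pattern, escaped_text)))  # 1
```

Each pattern only matches the kind of text it was written for, so it helps to check which form the data actually contains before choosing one.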
