Splitting a Unicode string into words

I am trying to split a Unicode string into words (simplistically), like this:

import re
print re.findall(r'(?u)\w+', "раз два три")

What I expect to see is:

['раз','два','три']

But what I actually get is:

['\xd1', '\xd0', '\xd0', '\xd0', '\xd0\xb2\xd0', '\xd1', '\xd1', '\xd0']

What am I doing wrong?

Edit:

If I put a u prefix in front of the string:

print re.findall(r'(?u)\w+', u"раз два три")

I get:

[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']

Edit 2:

Aaaaand it seems like I should have read the docs first:

print re.findall(r'(?u)\w+', u"раз два три")[0].encode('utf-8')

This gives me:

раз

Just to make sure though, does that sound like a proper way of approaching it?

You're actually getting what you expect in the unicode case. You only think you are not because of the escaping: you're looking at the reprs of the strings rather than printing their unescaped values. (This is just how lists are displayed.)

>>> words = [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438'] 
>>> for w in words:
...     print w # This uses the terminal encoding -- rely on it _only_ interactively
... 
раз
два
три
>>> u'раз' == u'\u0440\u0430\u0437'
True

Don't miss my remark about printing these unicode strings. Normally, if you are going to send them to the screen, a file, over the wire, etc., you need to encode them manually into the correct encoding. When you use print, Python tries to use your terminal's encoding, but it can only do that if there is a terminal. Because you generally don't know whether there is one, you should rely on this only in the interactive interpreter and otherwise always encode explicitly.
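To make the "encode explicitly" advice concrete, here is a minimal sketch. It assumes UTF-8 is the encoding you want, and the temporary file path is made up purely for illustration. It uses io.open so that it behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

words = [u"раз", u"два", u"три"]

# Hypothetical output location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "words.txt")

# io.open encodes the text on write, so the terminal encoding is never involved.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"\n".join(words))

# Reading the file back in binary mode shows the bytes that were stored.
with io.open(path, "rb") as f:
    raw = f.read()
decoded = raw.decode("utf-8")

# For a socket or any other byte-oriented sink, encode by hand instead.
payload = u" ".join(words).encode("utf-8")
```

Either way, the choice of encoding is stated in the code rather than left to whatever environment the script happens to run in.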

For this simple splitting-on-whitespace approach, you might not want to use a regex at all and can simply use the unicode.split method.

>>> u"раз два три".split()
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
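One caveat worth checking before swapping one for the other: split() and \w+ only agree when the words are separated by nothing but whitespace. A small sketch of the difference, using a sample string with punctuation that I made up for illustration:

```python
# -*- coding: utf-8 -*-
import re

# Made-up sample with punctuation attached to the words.
text = u"раз, два; три!"

# split() cuts on whitespace only, so the punctuation stays attached.
by_whitespace = text.split()

# \w+ with re.UNICODE keeps only runs of word characters.
by_regex = re.findall(r"\w+", text, re.UNICODE)
```

Here by_whitespace keeps the comma, semicolon and exclamation mark on the words, while by_regex drops them, so which one is "right" depends on what you count as a word.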

Your first (bytestring) example does not work because re treats a bytestring as a sequence of single bytes and knows nothing about its encoding, so the two-byte UTF-8 sequences for the Cyrillic letters get torn apart. Using unicode strings gives you the right semantics for your alphabet and locale. As much as possible, textual data should be represented using unicode rather than str.
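If the data reaches you as bytes in the first place, the usual pattern is to decode once at the boundary and work with text from then on. A rough sketch, assuming the bytes are UTF-8 (the raw value below just stands in for something read from a file or socket):

```python
# -*- coding: utf-8 -*-
import re

# Stand-in for bytes that arrived from outside; UTF-8 is an assumption.
raw = u"раз два три".encode("utf-8")

# Matching the bytes directly: without flags, \w only covers ASCII word
# characters, and none of these bytes (apart from the spaces) are ASCII.
byte_matches = re.findall(br"\w+", raw)

# Decode first, then match on text.
text = raw.decode("utf-8")
words = re.findall(r"\w+", text, re.UNICODE)
```

The byte-level match finds nothing useful, while the decoded text yields the three words as expected.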
