Python regex has trouble finding a special Unicode character

Question

I am currently parsing some old exams to determine how often each question appears (because many questions resurface in this year's exam). I am using pyperclip to get the input for re.findall.

This is the regex I am using: pattern = re.compile(ur'\d.[a-zA-Z .,\’]+\?', re.UNICODE), and this is an example question from an older exam (the pattern I am trying to find): 9. In Wycherley’s The Country Wife, what does Mr. Pinchwife threaten to inscribe on Mrs. Pinchwife’s face with his penknife? The apostrophe is not one I can find on my keyboard, and trying to execute the code results in this error:

 File "examAnalyzer.py", line 7
    pattern = re.compile(ur'\d.[a-zA-Z .,\Æ]+\?', re.UNICODE)
SyntaxError: (unicode error) 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte

I am using Python 2.7.11 and Anaconda 4.0, and the Python file is edited with Vim.

Answer 1

You can use the \u escape for the apostrophe, which is \u2019.

Also, the dot should be escaped so that it matches a literal dot.

Use

ur'\d\.[a-zA-Z .,\'\u2019]+\?'
     ^^            ^^^^^^
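
As a rough sketch of how this pattern behaves, the snippet below runs it against the sample question. The variable names and the surrounding sample text are made up for illustration. The pattern is written as an ordinary Unicode string with doubled backslashes rather than ur'...' only so that the same lines also run on Python 3, where the ur prefix is a syntax error; the regex itself is meant to be the same.

```python
import re

# Same regex as above, written without the ur prefix so it also runs on Python 3.
# The source stays pure ASCII: the apostrophe is spelled as the escape \u2019.
pattern = re.compile(u"\\d\\.[a-zA-Z .,'\u2019]+\\?", re.UNICODE)

# Illustrative input: the example question with some unrelated text around it.
sample = (u"Section B\n"
          u"9. In Wycherley\u2019s The Country Wife, what does Mr. Pinchwife "
          u"threaten to inscribe on Mrs. Pinchwife\u2019s face with his penknife?\n"
          u"Answer all parts.")

matches = pattern.findall(sample)
print(len(matches))
```

Because the source file then contains only ASCII characters, it no longer matters which encoding the editor uses to save it.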

When in doubt about a symbol's hex representation, you can check it at r12a >> apps >> Unicode code converter.
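
The code point can also be looked up locally. The following is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original answer.

```python
import unicodedata

ch = u"\u2019"  # the typographic apostrophe from the exam text

# Format the code point in the usual U+XXXX notation and look up its Unicode name.
code_point = u"U+%04X" % ord(ch)
name = unicodedata.name(ch)

print("%s %s" % (code_point, name))
```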

Answer 2

Your Python file declares a file encoding of UTF-8, but the file itself is saved in another encoding.

You should give the correct encoding in the first line:

# -*- coding: <correct encoding> -*-
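
One plausible reading of the error, offered as an assumption since the question does not say how the file was saved: byte 0x92 is how Windows-1252 (cp1252) encodes the typographic apostrophe U+2019, and it is not valid as the first byte of a UTF-8 sequence. The sketch below illustrates both facts; the variable names are illustrative.

```python
raw = b"\x92"  # the byte reported in the SyntaxError

# 0x92 cannot start a UTF-8 sequence, so decoding it as UTF-8 fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Under cp1252 the same byte decodes to the typographic apostrophe.
as_cp1252 = raw.decode("cp1252")

# In UTF-8 that character would instead be stored as three bytes.
as_utf8_bytes = u"\u2019".encode("utf-8")

print(utf8_ok)
print(as_cp1252 == u"\u2019")
```

If the file really is saved as Windows-1252, the declaration would read # -*- coding: cp1252 -*-; alternatively, the file could be re-saved as UTF-8 so that it matches the existing declaration.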

Answer 1
1 2016-05-27 20:18:16

Answer 2
0 2016-05-27 20:15:09
