Need to escape a character after special characters in Python's regular expression?

Question

I have the following Python code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
line = 'div><div class="fieldRow jr_name"><div class="fieldLabel">name<'
regex0 = re.compile('(.+?)\v class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
regex1 = re.compile('(.+?)v class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
regex2 = re.compile('(.+?) class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)

m0 = regex0.match(line)
m1 = regex1.match(line)
m2 = regex2.match(line)

if m0:
    print 'regex0 is good'
else:
    print 'regex0 is no good'

if m1:
    print 'regex1 is good'
else:
    print 'regex1 is no good'

if m2:
    print 'regex2 is good'
else:
    print 'regex2 is no good'

The output is:

regex0 is good
regex1 is no good
regex2 is good

I don't quite understand why I need to escape the character 'v' after "(.+?)" in regex0. If I don't escape it, as in regex1, the match fails. However, for the space right after "(.+?)" in regex2, I don't have to escape anything.

Any idea?

Thanks in advance.

Answer 1

So, there are some issues with your approach. The ones that contribute to your specific complaint are:

You do not mark the regexp string as raw (r' prefix) - that makes the Python compiler change some "\"-prefixed characters inside the string before they even reach the re.compile call.
"\v" happens to be one such sequence - it is replaced by a single vertical tab character ("\x0b").
You use the "re.VERBOSE" flag - that tells the regexp engine to ignore any unescaped whitespace character in the pattern. "\v", being a vertical tab, is one character in this class and is ignored.

So, there is your match for regex0: the letter "v" is never seen as such.
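
The effect described above can be checked with a short sketch. It reuses the line from the question; the variable names are made up for illustration, and the code is written so that it should run under Python 3 as well as Python 2. It also shows why regex1 fails: re.VERBOSE drops the space after the literal "v" too.

```python
import re

line = 'div><div class="fieldRow jr_name"><div class="fieldLabel">name<'

# Without the r prefix, Python turns "\v" into a vertical tab before re sees it.
assert '\v' == '\x0b'

# re.VERBOSE drops unescaped whitespace from the pattern, so the vertical tab
# and the space that follows it both disappear.
verbose_v_tab = re.compile('(.+?)\v class="fieldLabel">name.+?', re.VERBOSE)
no_whitespace = re.compile('(.+?)class="fieldLabel">name.+?')

# Here the letter "v" survives but the space after it is dropped, so the
# pattern effectively asks for 'vclass', which the line does not contain.
verbose_v = re.compile('(.+?)v class="fieldLabel">name.+?', re.VERBOSE)

print(bool(verbose_v_tab.match(line)))  # True
print(bool(no_whitespace.match(line)))  # True
print(bool(verbose_v.match(line)))      # False
```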

Now, for the possible fixes to your approach, in the order that you should try them:

1) Don't use regular expressions to parse HTML. Really. There are a lot of packages that do a good job of parsing HTML, and failing those you can use the stdlib's own HTMLParser (html.parser in Python 3);
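
As a rough illustration of the stdlib route, here is a minimal Python 3 sketch. The FieldLabelParser class and the well-formed HTML fragment are assumptions made for this example, not part of the question, and nested divs inside a label are not handled.

```python
from html.parser import HTMLParser  # Python 3 module name


class FieldLabelParser(HTMLParser):
    """Hypothetical helper: collect the text of each <div class="fieldLabel">."""

    def __init__(self):
        super().__init__()
        self.in_label = False
        self.labels = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'div' and 'fieldLabel' in classes:
            self.in_label = True
            self.labels.append('')

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_label = False

    def handle_data(self, data):
        if self.in_label:
            self.labels[-1] += data


parser = FieldLabelParser()
parser.feed('<div class="fieldRow jr_name"><div class="fieldLabel">name</div></div>')
parser.close()
print(parser.labels)  # ['name']
```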

2) If possible, use Python 3 instead of Python 2 - you will be bitten by the first non-ASCII character inside your HTML body if you go on with the naive approach of treating Python 2 strings as "real life" text. Python 3 handles encoding automatically (and lets you set it explicitly when it cannot).

Since you are probably not changing anyway, try to use regex.findall instead of regex.match - this returns a list of matching strings and could retrieve the attributes you are looking for at once, without searching from the beginning of the file, or depending on line breaks inside the HTML.
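
A small sketch of the difference between findall and match, assuming a two-line HTML fragment and a simplified pattern that are both invented for this example:

```python
import re

# Hypothetical input: two rows, separated by a line break.
html = ('<div class="fieldRow jr_name"><div class="fieldLabel">name</div></div>\n'
        '<div class="fieldRow jr_city"><div class="fieldLabel">city</div></div>')

# Raw string, no re.VERBOSE: the pattern is taken exactly as written.
label = re.compile(r'class="fieldLabel">([^<]+)<')

# findall scans the whole string and returns every captured group.
print(label.findall(html))  # ['name', 'city']

# match only tries at position 0, where the string starts with '<div'.
print(label.match(html))    # None
```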

Answer 2

There is a special symbol in Python regex, \v, about which you can read here: https://docs.python.org/2/library/re.html

Python regexes are usually written as r'your regex', where "r" means raw string. (https://docs.python.org/3/reference/lexical_analysis.html)
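
To make the raw-string point concrete, the following sketch compares '\v' with r'\v'. The test strings are made up for illustration; the behaviour shown is that of the standard re module as far as I can tell.

```python
import re

# In a normal string literal, "\v" is one character (a vertical tab).
assert len('\v') == 1
# In a raw string it stays as two characters: a backslash and the letter "v".
assert len(r'\v') == 2

# Both forms match a vertical tab, because the regex engine also
# understands the two-character escape \v.
print(bool(re.match('\v', '\x0b')))    # True
print(bool(re.match(r'\v', '\x0b')))   # True

# Under re.VERBOSE they differ: a real vertical tab in the pattern is
# ignored as whitespace, while the escape \v still has to match one.
print(bool(re.match('a\vb', 'ab', re.VERBOSE)))       # True
print(bool(re.match(r'a\vb', 'ab', re.VERBOSE)))      # False
print(bool(re.match(r'a\vb', 'a\x0bb', re.VERBOSE)))  # True
```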

In a regex, special characters must be escaped with a backslash to be understood as normal characters, and some normal letters become special when escaped. E.g. if you write s, it matches the letter "s", while \s matches whitespace. Use raw strings so that the backslashes reach the regex engine unchanged. The line below is the solution you need, I believe; the space after "v" is escaped as well, because re.VERBOSE ignores unescaped whitespace in the pattern.

regex1 = re.compile(r'(.+?)v\ class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)

Answer 1: score 3, accepted, 2017-06-06 14:43:10
Answer 2: score 0, 2017-06-06 14:22:41
