Why does this regular expression not work?

I have a function that parses HTML code so that it is easy to read and write. To do this I must split the string on multiple delimiters; as you can see, I have used re.split(), and I cannot find a better solution. However, when I submit some HTML such as this, it has absolutely no effect. This has led me to believe that my regular expression is written incorrectly. What should be there instead?

import re

def parsed(data):
    """Removes junk from the data so it can be easily processed."""
    data = str(data)
    # This checks for cruft (a leading b') and removes it if it exists.
    if re.search("b'", data):
        data = data[2:-1]
    lines = re.split(r'\r|\n', data)  # This clarifies the lines for writing.
    return lines

This isn't a duplicate, even if you find a similar question; I've been searching around for ages and it still doesn't work.

You are converting a bytes value to string:

data = str(data)
# This checks for a cruft and removes it if it exists.
if re.search("b'", data):
    data = data[2:-1]

which means that all line delimiters have been converted to their Python escape codes:

>>> str(b'\n')
"b'\\n'"

That is a literal b, a literal quote, a literal \ backslash, a literal n, and a literal quote. You would have to split on r'(\\n|\\r)' instead, but most of all, you shouldn't turn bytes values into their string representations here. Python produced the representation of the bytes value as a literal string that you can paste back into your Python interpreter, which is not the same thing as the value contained in the object.
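A minimal sketch of the difference, using a made-up two-line bytes value (the name `raw` and its contents are hypothetical, not taken from the question):

```python
import re

raw = b"<p>one</p>\r\n<p>two</p>"  # hypothetical sample input

as_repr = str(raw)            # the representation: b'...' with backslash escape codes
as_text = raw.decode("utf8")  # the actual text, with real CR and LF characters

# The representation contains no real newline characters, so the pattern
# from the question finds nothing to split on and returns a single item.
repr_lines = re.split(r"\r|\n", as_repr)

# The decoded text does contain CR and LF. This pattern splits on each
# character separately, so a CR LF pair leaves an empty string in between.
text_lines = re.split(r"\r|\n", as_text)
```

The empty string in `text_lines` is one reason a regular expression is an awkward tool for this job.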

You want to decode to string instead:

if isinstance(data, bytes):
    data = data.decode('utf8')

where I am assuming that the data is encoded with UTF-8. If this is data from a web request, the response headers quite often include the character set used to encode the data in the Content-Type header; look for the charset= parameter.

A response produced by the urllib.request module has an .info() method, and the character set can be extracted (if provided) with:

charset = response.info().get_param('charset')

where the return value is None if no character set was provided.
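Putting the two pieces together might look like the sketch below. The helper name `decode_body` and the UTF-8 fallback are assumptions for illustration. To keep the example runnable without a network request, it builds an `email.message.Message` by hand; the object returned by `response.info()` is a subclass of that class, so `get_param()` behaves the same way.

```python
from email.message import Message

def decode_body(body, headers, default="utf8"):
    """Decode bytes with the charset named in the headers, else a default.

    `default` is an assumed fallback, not something the server guarantees.
    """
    charset = headers.get_param("charset")  # None when no charset is given
    return body.decode(charset or default)

# Stand-in for response.info(); the header value here is made up.
headers = Message()
headers["Content-Type"] = "text/html; charset=latin-1"

text = decode_body("caf\u00e9".encode("latin-1"), headers)
```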

You don't need to use a regular expression to split lines; the str type has a dedicated method, str.splitlines():

Return a list of the lines in the string, breaking at line boundaries. This method uses the universal newlines approach to splitting lines. Line breaks are not included in the resulting list unless keepends is given and true.

For example, 'ab c\n\nde fg\rkl\r\n'.splitlines() returns ['ab c', '', 'de fg', 'kl'], while the same call with splitlines(True) returns ['ab c\n', '\n', 'de fg\r', 'kl\r\n'].
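With both changes applied, the function from the question could be reduced to something like this sketch. The `encoding` parameter and its UTF-8 default are assumptions; pass the character set from the response headers if you have one.

```python
def parsed(data, encoding="utf8"):
    """Return the lines of data, decoding it first if it is a bytes value."""
    if isinstance(data, bytes):
        data = data.decode(encoding)
    # splitlines() treats \r, \n and \r\n each as a single line boundary.
    return data.splitlines()

# Hypothetical input mixing all three line-ending styles.
lines = parsed(b"<html>\r\n<body>\n</body>\r</html>")
```

No regular expression and no slicing off of the b' prefix is needed, because the bytes value is never turned into its representation in the first place.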
