Python regex matching: findall or search do not return a value

I'm trying to find a token in a string and return it. I use the same method on other strings and it works fine, but this one does not seem to return any result, with neither findall nor search.

pattern = re.compile(r'<input class="token"  value="(.+?)" name="csrftoken_reply">')
matches = pattern.findall(htmlstring)
for match in matches:
    print(match)

There is only one value in each response string, yet the loop never prints a match.

I also tried using re.search, but the same thing happens: it returns None.

MORE INFO:

This is part of the HTML I'm parsing:

<form id="threadReplyForm" class="clearfix" method="post" action="/go/messages/private/threadID=0551796">
<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">
<input type="hidden" value="reply" name="action">
<div class="editorWrapper">
<div id="premiumSmiliesNotAllowed" class="warning" style="display: none;">
<div id="editor_13" class="clearfix editor" mode="full">
<ul id="editorToolbar_13" class="editorToolbar clearfix">
<textarea id="messageInput" class="autogrow" cols="20" rows="8" name="message"></textarea>
<div id="previewDiv" class="previewArea" style="display: none;"></div>
</div>
<script>
</div>
<script>
<span class="loadingIndicator right loadingIndicatorMessage">
<p class="clearfix">
</form>

I'm parsing it with this:

pattern = re.compile(r'<input class="csrftoken" type="hidden" value="(.+?)" name="csrftoken_reply">')
matches = pattern.findall(str(response.read()))
for match in matches:
    print(match)

I'm trying to get a7b161b7 as output.

You'll have to give an example of the string you are trying to parse, because this works for me:

import re

htmlstring = """
<input class="token"  value="foo" name="csrftoken_reply">
"""

pattern = re.compile(r'<input class="token"  value="(.+?)" name="csrftoken_reply">')
matches = pattern.findall(htmlstring)
for match in matches:
    print(match)

Beyond that, have you considered using a library designed for something like this? Regexes can be a bit fragile when it comes to parsing HTML. Beautiful Soup seems to be a popular tool for this job.
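As a rough illustration of the parser-based approach, here is a minimal sketch. It does not use Beautiful Soup itself (which may not be installed); it swaps in the standard library's html.parser as a stand-in, and it assumes Python 3. The class name TokenFinder and the trimmed-down form markup are made up for this example.

```python
from html.parser import HTMLParser


class TokenFinder(HTMLParser):
    """Hypothetical helper: remembers the value of the csrftoken_reply input."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, in any order.
        attr_map = dict(attrs)
        if tag == "input" and attr_map.get("name") == "csrftoken_reply":
            self.token = attr_map.get("value")


# Trimmed-down version of the form from the question.
html = '''
<form id="threadReplyForm" class="clearfix" method="post">
<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">
<input type="hidden" value="reply" name="action">
</form>
'''

finder = TokenFinder()
finder.feed(html)
token = finder.token
print(token)
```

Because the parser hands over attributes as name/value pairs, attribute order and spacing inside the tag no longer matter.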

Update

Your original pattern has the wrong class value and an extra space, and it leaves out the type="hidden" attribute. Here's something closer, though I would still discourage using regex for this:

r'<input class="csrftoken" type="hidden" value="(.+?)" name="csrftoken_reply">'

This works as well (I'm assuming there's only one 'csrftoken_reply' element):

r'value="(.+?)" name="csrftoken_reply">'

Both of these work for me and return your desired value.
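To make the difference concrete, the following sketch runs the original pattern and both suggested patterns against the input line from the question. The variable names are chosen for this example only.

```python
import re

# The relevant line of the HTML posted in the question.
html = '<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">'

# Original pattern: class "token", two spaces, no type attribute.
original = re.compile(r'<input class="token"  value="(.+?)" name="csrftoken_reply">')
# Corrected pattern: mirrors the tag character for character.
corrected = re.compile(r'<input class="csrftoken" type="hidden" value="(.+?)" name="csrftoken_reply">')
# Shorter pattern: anchors only on the value/name pair.
short = re.compile(r'value="(.+?)" name="csrftoken_reply">')

original_matches = original.findall(html)
corrected_matches = corrected.findall(html)
short_matches = short.findall(html)

print(original_matches)   # []
print(corrected_matches)  # ['a7b161b7']
print(short_matches)      # ['a7b161b7']
```

The original pattern fails because a literal regex has to match the tag exactly, including every attribute and every space between them.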

I'm not a Python person, and I don't recommend regex for parsing HTML, but it might be possible to get unordered attribute-value data this way. Just put in whichever pairs are needed to qualify the tag. It doesn't have to be all of them, and they can appear in any order.

Modifiers: expanded, single-line (dot matches newline), global.
The value capture group is $5.

Edit
Changed (?= (?:".*?"|\'.*?\'|[^>]*?)+ to (?= (?:[^>"\']|(?>".*?"|\'.*?\'))*? because the lazy quantifier in the original form will be forced to overrun markup boundaries to satisfy the lookahead. The new sub-expression handles embedded markup such as attr="so< m >e" without overruns.

<input 
  (?=\s) 
  (?= (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) class \s*=\s* ([\'"]) \s* csrftoken \s*\1 )
  (?= (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) name  \s*=\s* ([\'"]) \s* csrftoken_reply \s*\2 )
  (?= (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) type  \s*=\s* ([\'"]) \s* hidden \s*\3 )
  (?= (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) value \s*=\s* ([\'"]) \s* (.*?)  \s*\4 )
  \s+ (?:".*?"|\'.*?\'|[^>]*?)+ (?<!/)
>

All the usual caveats apply: the tag could be hidden in embedded code, could be inside a comment, etc.
Extra regex logic is needed for that.
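For Python readers, here is a simplified sketch of the same lookahead idea. It is not a translation of the expression above: it drops the atomic groups (Python's re module only supports (?>...) from version 3.11) and therefore does not handle a ">" embedded inside a quoted attribute value. The names TOKEN_RE and find_token are made up for this example, and the value is exposed as a named group rather than $5.

```python
import re

# Each lookahead checks one attribute independently, so order does not matter.
TOKEN_RE = re.compile(r'''
    <input
    (?=\s)
    (?=[^>]*?\sclass\s*=\s*(["'])\s*csrftoken\s*\1)
    (?=[^>]*?\sname\s*=\s*(["'])\s*csrftoken_reply\s*\2)
    (?=[^>]*?\svalue\s*=\s*(["'])\s*(?P<value>.*?)\s*\3)
    [^>]*>
''', re.VERBOSE | re.DOTALL)


def find_token(html):
    """Return the token value, or None when no qualifying tag is found."""
    match = TOKEN_RE.search(html)
    return match.group("value") if match else None


samples = [
    # Attribute order as in the question.
    '<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">',
    # Reordered attributes with single quotes.
    "<input name='csrftoken_reply' value='a7b161b7' type='hidden' class='csrftoken'>",
    # A different input that should not qualify.
    '<input type="hidden" value="reply" name="action">',
]

results = [find_token(sample) for sample in samples]
print(results)
```

The first two samples should yield the token and the third should yield None, since it carries neither the class nor the name being looked for.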

Sorry, but parsing HTML with regex in 2011 is borderline insanity :) The number of libraries optimized for this task is quite large, the best ones being the above-mentioned BeautifulSoup and lxml. I can understand not wanting to deal with lxml because of its list of dependencies and messy installation, but BeautifulSoup is one file and would make your code so much more robust.

TL;DR: you're reinventing the wheel.
