Python regex matching: findall or search returns no value

Question

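I am trying to find a token in a string and return it. I use the same approach on other strings and it works fine, but here it returns no results, neither with findall nor with search.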

import re

pattern = re.compile(r'<input class="token"  value="(.+?)" name="csrftoken_reply">')
matches = pattern.findall(htmlstring)
for match in matches:
    print(match)

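Each response string contains exactly one such value, yet the print statement in the loop never produces any output.

I also tried re.search, but the same thing happened: it returned None (a NoneType object).

More information:

This is part of the HTML I am parsing: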

<form id="threadReplyForm" class="clearfix" method="post" action="/go/messages/private/threadID=0551796">
<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">
<input type="hidden" value="reply" name="action">
<div class="editorWrapper">
<div id="premiumSmiliesNotAllowed" class="warning" style="display: none;">
<div id="editor_13" class="clearfix editor" mode="full">
<ul id="editorToolbar_13" class="editorToolbar clearfix">
<textarea id="messageInput" class="autogrow" cols="20" rows="8" name="message"></textarea>
<div id="previewDiv" class="previewArea" style="display: none;"></div>
</div>
<script>
</div>
<script>
<span class="loadingIndicator right loadingIndicatorMessage">
<p class="clearfix">
</form>

I parse it with this:

pattern = re.compile(r'<input class="csrftoken" type="hidden" value="(.+?)" name="csrftoken_reply">')
matches = pattern.findall(str(response.read()))
for match in matches:
    print(match)

I am trying to get a7b161b7 as the output.

Answer 1

You will have to give an example of the string you are parsing, because this works for me.

import re

htmlstring = """
<input class="token"  value="foo" name="csrftoken_reply">
"""

pattern = re.compile(r'<input class="token"  value="(.+?)" name="csrftoken_reply">')
matches = pattern.findall(htmlstring)
for match in matches:
    print(match)

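Beyond that, have you considered using a library designed for this kind of task? Regular expressions can be very brittle when parsing HTML. Beautiful Soup seems to be a popular tool for the job.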
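To illustrate the parser-based approach without requiring a third-party install, here is a minimal sketch that uses the standard-library html.parser module (Python 3) in place of Beautiful Soup. The class name TokenFinder and the trimmed sample HTML are assumptions made for the example; the idea is the same: let a parser split the tag into attributes, then look the value up by name, so attribute order and spacing stop mattering.

```python
from html.parser import HTMLParser


class TokenFinder(HTMLParser):
    """Collects the value of every <input> whose name is csrftoken_reply.

    TokenFinder is a hypothetical helper written for this example.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs in source order.
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("name") == "csrftoken_reply":
            self.tokens.append(attributes.get("value"))


# Trimmed copy of the HTML from the question.
htmlstring = """
<form id="threadReplyForm" class="clearfix" method="post" action="/go/messages/private/threadID=0551796">
<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">
<input type="hidden" value="reply" name="action">
</form>
"""

finder = TokenFinder()
finder.feed(htmlstring)
print(finder.tokens)  # ['a7b161b7']
```

Because the lookup goes through the parsed attribute list, the same code keeps working if the site reorders the attributes or changes the class name.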

Update

Your pattern had the wrong class value and an extra space, and it left out type="hidden". I still do not recommend using a regex for this, but the following is closer:

r'<input class="csrftoken" type="hidden" value="(.+?)" name="csrftoken_reply">'

This also works (I am assuming there is only one 'csrftoken_reply' element):

r'value="(.+?)" name="csrftoken_reply">'

Both of these return the desired value for me.
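The difference between the patterns can be checked directly against the HTML from the question. The sketch below is illustrative only: the variable names are made up for the example, and the HTML is a trimmed copy of the posted fragment. It shows that the original pattern finds nothing because the literal text it describes never occurs in the page, while both corrected patterns capture the token.

```python
import re

# Trimmed copy of the HTML from the question.
html = """<form id="threadReplyForm" class="clearfix" method="post" action="/go/messages/private/threadID=0551796">
<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">
<input type="hidden" value="reply" name="action">
</form>"""

# Pattern from the question: wrong class, two spaces, no type attribute.
original = re.compile(r'<input class="token"  value="(.+?)" name="csrftoken_reply">')
# The two corrected patterns from this answer.
full = re.compile(r'<input class="csrftoken" type="hidden" value="(.+?)" name="csrftoken_reply">')
short = re.compile(r'value="(.+?)" name="csrftoken_reply">')

original_matches = original.findall(html)
full_matches = full.findall(html)
short_matches = short.findall(html)

print(original_matches)  # []
print(full_matches)      # ['a7b161b7']
print(short_matches)     # ['a7b161b7']

# re.search returns None when nothing matches, so guard before calling group().
found = short.search(html)
token = found.group(1) if found else None
print(token)             # a7b161b7
```

A regex over raw HTML matches characters literally, so any change in attribute order, spacing, or quoting on the site will make it return an empty list again.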

Answer 2

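I am not a Python person, and I do not recommend using a regex to parse HTML, but it is possible to pick up attribute/value data in any order this way. Just put in the attribute/value pairs needed to qualify the tag. You do not need all of them, and they do not have to be in any particular order.

Modifiers: expanded (free-spacing), single-line (dot matches newline), global.
The value is in capture group $5.

Edit
Changed (?= (?:".*?"|\'.*?\'|[^>]*?)+ to (?= (?:[^>"\']|(?>".*?"|\'.*?\'))*? because lazy quantifiers in the former form can be forced past the tag boundary in order to satisfy the lookahead. The new subexpression handles embedded markup such as attr="so< m >e" without overrunning.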

<input 
  (?=\s) 
  (?= (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) class \s*=\s* ([\'"]) \s* csrftoken \s*\1 )
  (?= (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) name  \s*=\s* ([\'"]) \s* csrftoken_reply \s*\2 )
  (?= (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) type  \s*=\s* ([\'"]) \s* hidden \s*\3 )
  (?= (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) value \s*=\s* ([\'"]) \s* (.*?)  \s*\4 )
  \s+ (?:".*?"|\'.*?\'|[^>]*?)+ (?<!/)
>

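All the usual caveats apply: the tag could be hidden inside embedded code, inside comments, and so on.
Handling those cases requires additional regex logic.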
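As a rough Python adaptation of the lookahead technique above, the sketch below uses re.VERBOSE for the expanded layout and re.DOTALL for single-line mode. It is not a verbatim port: the atomic groups (?>...) are only accepted by Python's re module from version 3.11 on, so they are swapped here for plain alternatives with negated character classes, and the trailing part of the tag is simplified. The name TOKEN_INPUT and the sample strings are assumptions made for the example.

```python
import re

# One lookahead per required attribute. Each lookahead skips over attribute
# text (quoted strings are consumed whole, a bare ">" is never consumed),
# then requires the named attribute with the expected value.
# Groups 1-4 capture the quote characters; group 5 captures the value.
TOKEN_INPUT = re.compile(r'''
    <input
    (?=\s)
    (?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s) class \s*=\s* (['"]) \s* csrftoken       \s*\1 )
    (?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s) name  \s*=\s* (['"]) \s* csrftoken_reply \s*\2 )
    (?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s) type  \s*=\s* (['"]) \s* hidden          \s*\3 )
    (?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s) value \s*=\s* (['"]) \s* (.*?)           \s*\4 )
    (?:[^>"']|"[^"]*"|'[^']*')*
    >
''', re.VERBOSE | re.DOTALL)

# The tag as posted in the question.
as_posted = '<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">'
# Same attributes in a different order, with mixed quote styles.
reordered = """<input name='csrftoken_reply' value="a7b161b7" type="hidden" class="csrftoken">"""
# A different input that must not match.
other = '<input type="hidden" value="reply" name="action">'

print(TOKEN_INPUT.search(as_posted).group(5))  # a7b161b7
print(TOKEN_INPUT.search(reordered).group(5))  # a7b161b7
print(TOKEN_INPUT.search(other))               # None
```

Since every lookahead starts from the same position right after <input, the attributes can appear in any order; the price is a pattern that is considerably harder to read and maintain than a parser-based lookup.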

Answer 3

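Sorry, but parsing HTML with a regex in 2011 is crazy :) There are plenty of libraries optimized for this task; the best are the aforementioned BeautifulSoup and lxml. I can understand not wanting to deal with lxml because of its dependency list and messy installation, but BeautifulSoup is a single file, and it will make your code more robust.

TL;DR: you are reinventing the wheel.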


3 answers

Answer 1
1 2012-01-07 15:32:04

Answer 2
0 accepted

Answer 3
0 2012-01-08 20:15:46
