Regular expression to match a block in multi-line HTML text

Question

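I have some HTML files whose markup follows two different patterns; the only constant between them is name="horizon". I need to get the values of the attribute named "value". Here are the sample files:
File 1: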

<tag1> data
</tag1>
<select size="1" name="horizon">
    <option value="Admin">Admin Users</option>
    <option value="Remote Admin">Remote Admin</option>
</select>

File 2:

<othertag some_att="asfa"> data
</othertag>
<select id="realm_17" size="1" name="horizon">
    <option id="option_LoginPage_1" value="Admin Users">Admin Users</option>
    <option id="option_LoginPage_1" value="Global-User">Global-User</option>
</select>

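Since the files will contain other tags and attributes as well, I tried to write a regular expression, following a reference pattern, to filter the required content out of the file.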

regex='^(?:.*?)(<(?P<TAG>\w+).+name\=\"horizon\"(?:.*[\n|\r\n?]*)+?<\/(?P=TAG>)'

I have tried this with re.MULTILINE and re.DOTALL, but I could not get the required text.
Once I have the required text, I suppose I can get the required names as a list with re.findall('value\\=\\"(.*)\\"', text).
Please suggest whether there is an elegant way to handle this situation.
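For reference, the two-step idea described in the question (first isolate the block that carries name="horizon", then pull the value attributes out of it) could be sketched roughly as below. This is only a sketch under assumptions: the names BLOCK_RE, VALUE_RE and horizon_values are made up for illustration, attributes are assumed to use double quotes, and the element is assumed to have a closing tag and no nested element of the same name.

```python
import re

FILE_2 = '''<othertag some_att="asfa"> data
</othertag>
<select id="realm_17" size="1" name="horizon">
    <option id="option_LoginPage_1" value="Admin Users">Admin Users</option>
    <option id="option_LoginPage_1" value="Global-User">Global-User</option>
</select>
'''

# Step 1: isolate the element whose opening tag carries name="horizon".
# [^>]* keeps the match inside one opening tag; the lazy .*? (with DOTALL)
# stops at the first closing tag of the same name, so nesting is not handled.
BLOCK_RE = re.compile(
    r'<(?P<tag>\w+)\b[^>]*\bname="horizon"[^>]*>.*?</(?P=tag)>',
    re.DOTALL,
)

# Step 2: [^"]* cannot run past the closing quote, unlike a greedy (.*).
VALUE_RE = re.compile(r'\bvalue="([^"]*)"')


def horizon_values(text):
    """Return the value attributes found inside the name="horizon" block."""
    block = BLOCK_RE.search(text)
    return VALUE_RE.findall(block.group(0)) if block else []


print(horizon_values(FILE_2))  # ['Admin Users', 'Global-User']
```

Note that this pattern finds nothing for an element without a closing tag, such as a bare input element, which is one reason a real parser tends to be more robust here.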

Answer 1

I completely agree with @ZiTAL that parsing the file as XML would be faster and better.

A few simple lines of code can solve your problem:

import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

# If you prefer to parse the text directly do root = ET.fromstring('<root>example</root>')

values = [el.attrib['value'] for el in root.findall('.//option')]

print(values)
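The snippet above collects every option in the file. If only the options under the element with name="horizon" are wanted, a predicate in the path expression might do it. The sketch below assumes the input is well-formed XML; the wrapper element root and the function name horizon_values_xml are made up for illustration, and the wrapper is only there because the sample has more than one top-level element.

```python
import xml.etree.ElementTree as ET

FILE_1 = '''<tag1> data
</tag1>
<select size="1" name="horizon">
    <option value="Admin">Admin Users</option>
    <option value="Remote Admin">Remote Admin</option>
</select>
'''


def horizon_values_xml(fragment):
    """Return option values under <select name="horizon"> in an XML fragment."""
    # Wrap the fragment in a synthetic root (hypothetical name), because
    # XML allows only one top-level element.
    root = ET.fromstring('<root>' + fragment + '</root>')
    # The [@name='horizon'] predicate restricts the search to that element.
    options = root.findall(".//select[@name='horizon']/option")
    return [el.get('value') for el in options if el.get('value') is not None]


print(horizon_values_xml(FILE_1))  # ['Admin', 'Remote Admin']
```

This only works while the file really is well-formed; ordinary HTML often is not.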

Answer 2

Try this regex!

value="(.*)">

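This is a simple regular expression for extracting the values from an HTML file. It captures everything inside the double quotes, that is, whatever comes after value=" and before ">.

I have also attached a screenshot of the output!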
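One caveat worth checking: (.*) is greedy, so the pattern depends on value being the last attribute of the tag and on each tag sitting on its own line. The sample files happen to satisfy both, but the pattern also matches value attributes anywhere in the file, not only under name="horizon". The sketch below uses a made-up line (not from the question) to show the difference between the greedy pattern and a quote-bounded one.

```python
import re

# Hypothetical input: an attribute follows value, and two tags share one line.
line = ('<option value="Admin" id="a">Admin</option>'
        '<option value="Guest" id="b">Guest</option>')

# Greedy: .* runs to the LAST '">' on the line, so one oversized match comes back.
greedy = re.findall(r'value="(.*)">', line)

# Bounded: [^"]* stops at the first closing quote, giving one match per attribute.
bounded = re.findall(r'value="([^"]*)"', line)

print(len(greedy), greedy[0].startswith('Admin" id='))  # 1 True
print(bounded)  # ['Admin', 'Guest']
```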

Answer 3

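I tried the xml.etree.ElementTree module as explained above, but it gave me a "mismatched tag" error, and I found that this happens in most cases. Then I found the BeautifulSoup module and used it, and it gave the desired result. The following code covers one more file pattern in addition to the file patterns from the question.
File 3: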

<input id="realm_90" type="hidden" name="horizon" value="RADIUS">

Code:

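from bs4 import BeautifulSoup  # third-party module for parsing XML/HTML files

def get_realms(html_text):
    """Return the value attributes of the element named "horizon"."""
    realms = []
    # The 'lxml' parser requires the lxml package to be installed.
    soup = BeautifulSoup(html_text, 'lxml')
    in_tag = soup.find(attrs={"name": "horizon"})
    if in_tag is None:
        # No element with name="horizon" in this document.
        return realms
    if in_tag.name == 'select':
        # Only <option> descendants that actually carry a value attribute.
        for tag in in_tag.find_all('option'):
            if tag.has_attr('value'):
                realms.append(tag['value'])
    elif in_tag.name == 'input' and in_tag.has_attr('value'):
        realms.append(in_tag['value'])
    return realms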
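A likely reason for the ElementTree failure is that HTML allows elements such as input to stay unclosed, which XML does not. The check below is a small sketch of that; the helper name parses_as_xml and the wrapper element are made up for illustration, and the exact error text may differ between Python versions.

```python
import xml.etree.ElementTree as ET

FILE_3 = '<input id="realm_90" type="hidden" name="horizon" value="RADIUS">'


def parses_as_xml(fragment):
    """Return True if the fragment, wrapped in a root element, is well-formed XML."""
    try:
        ET.fromstring('<root>' + fragment + '</root>')
        return True
    except ET.ParseError as err:
        # The unclosed <input> makes the closing </root> tag unexpected.
        print('not well-formed XML:', err)
        return False


print(parses_as_xml(FILE_3))               # unclosed, as in ordinary HTML
print(parses_as_xml(FILE_3[:-1] + '/>'))   # self-closed, valid XML
```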

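I agree with @ZiTAL: do not use regular expressions when parsing XML/HTML files, because it gets too complicated and there are already many libraries for the job.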
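If installing a third-party package is not an option, the standard library's html.parser might be enough for this case, since it does not insist on well-formed markup. This is a different technique from the BeautifulSoup code above, not a drop-in equivalent; the class HorizonParser and the function horizon_values_html are names made up for this sketch.

```python
from html.parser import HTMLParser

FILE_1 = '''<tag1> data
</tag1>
<select size="1" name="horizon">
    <option value="Admin">Admin Users</option>
    <option value="Remote Admin">Remote Admin</option>
</select>
'''
FILE_3 = '<input id="realm_90" type="hidden" name="horizon" value="RADIUS">'


class HorizonParser(HTMLParser):
    """Collects value attributes belonging to the element named "horizon"."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._in_select = False  # True while inside <select name="horizon">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get('name') == 'horizon':
            if tag == 'select':
                self._in_select = True
            elif tag == 'input' and attrs.get('value') is not None:
                self.values.append(attrs['value'])
        elif tag == 'option' and self._in_select and attrs.get('value') is not None:
            self.values.append(attrs['value'])

    def handle_endtag(self, tag):
        if tag == 'select':
            self._in_select = False


def horizon_values_html(html_text):
    parser = HorizonParser()
    parser.feed(html_text)
    parser.close()
    return parser.values


print(horizon_values_html(FILE_1))  # ['Admin', 'Remote Admin']
print(horizon_values_html(FILE_3))  # ['RADIUS']
```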

3 solutions

Solution 1
2 2018-01-11 11:49:46

Solution 2
0 2018-01-11 11:57:18

Solution 3
0 accepted 2018-01-16 03:09:06
