Regular expressions: match a block from multiline HTML text

I have a few HTML files that contain a piece of code in one of two different patterns, where only name="horizon" is constant. I need to get the value of the attribute named "value". Below are the sample files:
File1:

<tag1> data
</tag1>
<select size="1" name="horizon">
    <option value="Admin">Admin Users</option>
    <option value="Remote Admin">Remote Admin</option>
</select>

File2:

<othertag some_att="asfa"> data
</othertag>
<select id="realm_17" size="1" name="horizon">
    <option id="option_LoginPage_1" value="Admin Users">Admin Users</option>
    <option id="option_LoginPage_1" value="Global-User">Global-User</option>
</select>

Since the files contain other tags and attributes, I tried writing a regular expression, by following a reference, to filter the required content out of the files:

regex='^(?:.*?)(<(?P<TAG>\w+).+name\=\"horizon\"(?:.*[\n|\r\n?]*)+?<\/(?P=TAG>)'

I have tried this with re.MULTILINE and re.DOTALL but could not get the desired text.
I suppose that once I have the required text, I could get the required names as a list by using re.findall('value\\=\\"(.*)\\"', text).
Please suggest whether there is a more elegant way to handle the situation.

I completely agree with @ZiTAL that parsing the files as XML would be much faster and nicer.

A few simple lines of code would solve your problem:

import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

# If you prefer to parse the text directly do root = ET.fromstring('<root>example</root>')

values = [el.attrib['value'] for el in root.findall('.//option')]

print(values)
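
The snippet above collects every option element in the file. As a minimal sketch, the search can be narrowed to the element named "horizon" with an attribute predicate. This assumes the input is well-formed XML: the sample below is File1 from the question wrapped in a made-up <root> element (an XML document needs exactly one root), and real-world HTML such as an unclosed <input> tag would still raise a parse error.

```python
import xml.etree.ElementTree as ET

# File1 from the question, wrapped in a hypothetical <root> element
# because an XML document must have exactly one root.
sample = """<root>
<tag1> data
</tag1>
<select size="1" name="horizon">
    <option value="Admin">Admin Users</option>
    <option value="Remote Admin">Remote Admin</option>
</select>
</root>"""

root = ET.fromstring(sample)

# ElementTree's limited XPath supports [@attr='value'] predicates, so only
# options directly under the select named "horizon" are visited.
values = [el.get('value')
          for el in root.findall(".//select[@name='horizon']/option")]

print(values)
```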

Try this regex!

value="(.*)">

This is a simple regex for extracting the value from your HTML files. It captures everything between the double quotes that comes after value= and before the closing > of the tag.

I have also attached a screenshot of the output.

[Screenshot: output]
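
To make the behaviour of that pattern concrete, here is a small sketch that runs it over File2 from the question. The second pattern is a variant that is not part of the answer: it stops at the first closing quote, so it should keep working when other attributes follow value= on the same tag. Note that both patterns pick up every value attribute in the text, not only those inside the name="horizon" element.

```python
import re

# File2 from the question, used as sample input.
text = '''<othertag some_att="asfa"> data
</othertag>
<select id="realm_17" size="1" name="horizon">
    <option id="option_LoginPage_1" value="Admin Users">Admin Users</option>
    <option id="option_LoginPage_1" value="Global-User">Global-User</option>
</select>'''

# Pattern from the answer: greedy, and it requires '>' directly after the
# closing quote. Without re.DOTALL, '.' never crosses a line break.
greedy = re.findall(r'value="(.*)">', text)

# Variant (an assumption, not from the answer): match up to the first
# closing quote, regardless of what follows it.
strict = re.findall(r'value="([^"]*)"', text)

print(greedy)
print(strict)
```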

I tried the xml.etree.ElementTree module as explained by @kazbeel, but it gave me a "mismatched tag" error, which I found happens in most cases when it is used on these files. Then I found the BeautifulSoup module and used it, and it gave the desired results. The following code covers one more file pattern in addition to the ones from the question.
File3:

<input id="realm_90" type="hidden" name="horizon" value="RADIUS">

Code:

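from bs4 import BeautifulSoup  # module for parsing XML/HTML files

def get_realms(html_text):
    """Return the value attributes that belong to the element named "horizon"."""
    realms = []
    soup = BeautifulSoup(html_text, 'lxml')
    in_tag = soup.find(attrs={"name": "horizon"})
    if in_tag is None:
        # No element with name="horizon" in this file
        return realms
    if in_tag.name == 'select':
        # Collect the value of every <option> inside the <select>
        for tag in in_tag.find_all('option'):
            realms.append(tag.attrs['value'])
    elif in_tag.name == 'input':
        # A hidden <input> carries the value directly
        realms.append(in_tag.attrs['value'])
    return realms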

I agree with @ZiTAL that regular expressions should not be used for parsing XML/HTML files, because it gets too complicated and there are a number of libraries available for the job.
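
One such library ships with Python itself. The following is a rough sketch of the same extraction using the standard-library html.parser module instead of BeautifulSoup; it is a swapped-in alternative, not the code from the answer above. The class name HorizonValues and the function get_realms_stdlib are made up for illustration.

```python
from html.parser import HTMLParser

class HorizonValues(HTMLParser):
    """Collect value attributes from the element named "horizon" (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.values = []
        self.in_select = False  # True while inside <select name="horizon">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'select' and attrs.get('name') == 'horizon':
            self.in_select = True
        elif tag == 'option' and self.in_select and 'value' in attrs:
            self.values.append(attrs['value'])
        elif tag == 'input' and attrs.get('name') == 'horizon' and 'value' in attrs:
            self.values.append(attrs['value'])

    def handle_endtag(self, tag):
        if tag == 'select':
            self.in_select = False

def get_realms_stdlib(html_text):
    # Feed the whole document to the parser and return what it collected.
    parser = HorizonValues()
    parser.feed(html_text)
    parser.close()
    return parser.values

# File3 from the answer above, used as sample input.
print(get_realms_stdlib('<input id="realm_90" type="hidden" name="horizon" value="RADIUS">'))
```

Unlike xml.etree.ElementTree, this parser does not require well-formed input, so the unclosed <input> tag in File3 should not be a problem.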
