lxml - using regex in findall() to find tags by attribute values

I'm trying to use lxml to get an array of comments that are formatted as:

<div id="comment-1">
  TEXT
</div>

<div id="comment-2">
  TEXT
</div>

<div id="comment-3">
  TEXT
</div>
...

I tried using

html.findall(".//div[@id='comment-*']")

but this searches for a literal asterisk.

What would be the right syntax for what I'm trying to do?

EDIT: I finally got it working by doing

import lxml.html

doc = lxml.html.parse(url).getroot()
comment_array = doc.xpath('.//div[starts-with(@id, "comment-")]')

You can use regular XPath functions to find the comments, as you suggested:

comments = doc.xpath('.//div[starts-with(@id, "comment-")]')

For more complex matching, however, you can use regular expressions: in lxml, XPath supports regular expressions through the EXSLT namespace. See "Regular expressions in XPath" in the official lxml documentation.

Here is a demo:

from lxml import etree

content = """\
<body>
<div id="comment-1">
  TEXT
</div>

<div id="comment-2">
  TEXT
</div>

<div id="comment-3">
  TEXT
</div>

<div id="note-4">
  not matched
</div>
</body>
"""

doc = etree.XML(content)

# You must give the namespace to use EXSLT RegEx
REGEX_NS = "http://exslt.org/regular-expressions"

comments = doc.xpath(r'.//div[re:test(@id, "^comment-\d+$")]',
                     namespaces={'re': REGEX_NS})

To see the result, you can "dump" the matched nodes:

for comment in comments:
    print("---")
    etree.dump(comment)

You get:

---
<div id="comment-1">
      TEXT
    </div>


---
<div id="comment-2">
      TEXT
    </div>


---
<div id="comment-3">
      TEXT
    </div>
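
If the same expression is applied to many documents, it can be compiled once with etree.XPath and reused. The following is only a sketch of that idea: the sample markup and the names find_comments and matched_ids are made up for illustration, and the "i" flag (case-insensitive matching) is used just to show that re:test accepts a third argument.

```python
from lxml import etree

REGEX_NS = "http://exslt.org/regular-expressions"

# Compile the expression once; the "i" flag makes the match case-insensitive.
find_comments = etree.XPath(
    r'.//div[re:test(@id, "^comment-\d+$", "i")]',
    namespaces={"re": REGEX_NS},
)

# Hypothetical sample document, not taken from the question.
sample = etree.XML(
    '<body>'
    '<div id="comment-1">a</div>'
    '<div id="Comment-2">b</div>'
    '<div id="note-3">c</div>'
    '</body>'
)

# Calling the compiled expression on an element evaluates it from there.
matched_ids = [div.get("id") for div in find_comments(sample)]
print(matched_ids)
```

The compiled object is callable, so find_comments(other_doc) can be reused on any other parsed tree without repeating the namespace mapping.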

The path argument of html.findall only accepts a subset of XPath as its expression; it does not support regular expressions by default.

To do that, you'll have to use the EXSLT extension, or you can use the XPath core functions.
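
If you would rather stay with findall(), one possible workaround is to let findall() select every div and do the pattern matching in Python with the standard re module. This is a sketch under that assumption, not something from the lxml documentation; the sample markup and the names pattern, comments, and comment_ids are illustrative.

```python
import re
from lxml import etree

# Hypothetical sample document for illustration.
sample = etree.XML(
    '<body>'
    '<div id="comment-1">a</div>'
    '<div id="comment-2">b</div>'
    '<div id="note-3">c</div>'
    '<div>no id at all</div>'
    '</body>'
)

pattern = re.compile(r"^comment-\d+$")

# findall() only narrows the search to div elements;
# the regular expression is applied afterwards in plain Python.
comments = [
    div for div in sample.findall(".//div")
    if pattern.match(div.get("id", ""))
]

comment_ids = [div.get("id") for div in comments]
print(comment_ids)
```

Using div.get("id", "") rather than div.attrib["id"] keeps elements without an id attribute from raising KeyError.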

I had a similar need and did something that, while I'm not terribly proud of it, got the job done.

def node_checker(node):
    # True when the id attribute contains the substring; a missing id counts as no match
    return 'hurf-durf' in node.get('id', '')


# r is the parsed root element
for node in filter(node_checker, r.iterdescendants(tag='sometag')):
    print(node.tag)

Not my finest work, but it got me close enough to getElementById, with some added flexibility, that I was able to move on to another problem.
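
For comparison, the same substring test can probably be expressed directly in XPath with the core function contains(), which avoids the helper function. A minimal sketch, assuming a made-up document; the names root, nodes, and found are illustrative only.

```python
from lxml import etree

# Hypothetical sample document for illustration.
root = etree.XML(
    '<root>'
    '<sometag id="x-hurf-durf-1"/>'
    '<sometag id="other"/>'
    '<sometag/>'
    '<othertag id="hurf-durf"/>'
    '</root>'
)

# contains() is an XPath 1.0 core function, so no extra namespace is needed.
# Elements without an id attribute simply do not match.
nodes = root.xpath('.//sometag[contains(@id, "hurf-durf")]')

found = [node.get("id") for node in nodes]
print(found)
```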
