
Locate element using lxml.html vs BeautifulSoup

I'm scraping an HTML document using lxml.html; there's one thing I can do in BeautifulSoup, but I don't manage to do with lxml.html. Here it is:

from bs4 import BeautifulSoup  # BeautifulSoup 4; the original used the BeautifulSoup 3 import
import re

doc = ['<html>',
'<h2> some text </h2>',
'<p> some more text </p>',
'<table> <tr> <td> A table</td> </tr> </table>',
'<h2> some special text </h2>',
'<p> some more text </p>',
'<table> <tr> <td> The table I want </td> </tr> </table>',
'</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
print(soup.find(text=re.compile("special")).find_next('table'))

I tried this with cssselect, but with no success. Any ideas on how I could locate this using the methods in lxml.html?
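As an aside, cssselect alone cannot express this, because standard CSS selectors have no way to match on text content. Plain XPath 1.0, however, can already do it with contains(), without any regular-expression extension. A minimal sketch, assuming the same markup as in the question (the doc string below is just the question's fragments joined together):

```python
import lxml.html

# the question's document, joined into one string
doc = ('<html><h2> some text </h2><p> some more text </p>'
       '<table><tr><td> A table</td></tr></table>'
       '<h2> some special text </h2><p> some more text </p>'
       '<table><tr><td> The table I want </td></tr></table></html>')

tree = lxml.html.fromstring(doc)

# contains() is plain XPath 1.0 -- no EXSLT needed for a fixed substring
tables = tree.xpath('//h2[contains(text(), "special")]/following-sibling::table')
print(lxml.html.tostring(tables[0]))
```

This only works when a fixed substring is enough to identify the node; for a real pattern match you need the EXSLT approach shown in the answer below.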

Many thanks, D

You can use a regular expression in an lxml XPath expression by using the EXSLT syntax. For example, given your document, this selects the sibling nodes that follow the node whose text matches the regexp spe.*al:

import lxml.html

NS = 'http://exslt.org/regular-expressions'
# reuse the ''.join(doc) markup from the question
tree = lxml.html.fromstring(''.join(doc))

# select sibling table nodes after the matching node
path = "//*[re:test(text(), 'spe.*al')]/following-sibling::table"
print(tree.xpath(path, namespaces={'re': NS}))

# select all sibling nodes after the matching node
path = "//*[re:test(text(), 'spe.*al')]/following-sibling::*"
print(tree.xpath(path, namespaces={'re': NS}))

Output:

[<Element table at 7fe21acd3f58>]
[<Element p at 7f76ac2c3f58>, <Element table at 7f76ac2e6050>]
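Note that the following-sibling axis returns every later sibling, so the first query above yields all matching tables, not just the nearest one. To reproduce the behavior of findNext('table') exactly (the first following table only), you can add a positional predicate. A minimal sketch under the same assumptions as above (doc is the question's markup joined into one string):

```python
import lxml.html

# the question's document, joined into one string
doc = ('<html><h2> some text </h2><p> some more text </p>'
       '<table><tr><td> A table</td></tr></table>'
       '<h2> some special text </h2><p> some more text </p>'
       '<table><tr><td> The table I want </td></tr></table></html>')

tree = lxml.html.fromstring(doc)
NS = 'http://exslt.org/regular-expressions'

# [1] keeps only the nearest following table sibling
path = "//*[re:test(text(), 'spe.*al')]/following-sibling::table[1]"
first = tree.xpath(path, namespaces={'re': NS})
print(first[0].text_content())
```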
