Parsing with lxml.html using XPath and variables

Question

I have this HTML snippet:

<div id="dw__toc">
<h3 class="toggle">Table of Contents</h3>
<div>

<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>

Now I want to parse it with lxml.html. In the end I want a function that takes a search term (e.g. "one") and returns:

One
#link1

For now I'm trying to get a variable into the XPath expression.

Works:

import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()='One'")

print test

Now with a variable. I want to replace the hardcoded 'One' with a variable that I can later pass in as a function argument.

Doesn't work:

import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")

desiredvars = ['One']
myresultset=((var, html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='%s']"%(var))[0]) for var in desiredvars)

for each in myresultset: 
        print each

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <genexpr>
IndexError: list index out of range

This is based on this answer: https://stackoverflow.com/a/10688235/2320453. Any idea why it doesn't work? Is this the "right way" to do something like this?

EDIT: To sum things up: I want to search within the a tags and get the text and attributes of the matching tag. I don't want a complete list; I want to be able to search with a variable. Pseudo-code:

import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")

searchterm = 'one'

test=html.xpath("...a/text()=searchterm")

print test

Expected result:

One
#link1

Answer 1

Your first example works, but probably not how you think it should:

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()='One'")

What this returns is a boolean, which is true if the condition ...='One' holds for any of the nodes selected by the left-hand side of the XPath expression. Indexing that result fails, because True[0] is not valid. Your second example already uses a predicate and so returns a list; the IndexError there means the list was empty, i.e. no a element had exactly that text.
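The difference between the comparison form and the predicate form can be checked on a small document. This is a minimal sketch: the markup is a trimmed copy of the snippet from the question with closing tags added so it parses on its own, and the names SNIPPET, base, as_comparison, as_predicate and no_match are made up for illustration.

```python
import lxml.html

# Trimmed copy of the question's snippet; closing tags are an assumption.
SNIPPET = """
<div id="dw__toc">
<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
</ul>
</li>
</ul>
</div>
"""

doc = lxml.html.fromstring(SNIPPET)
base = "//ul[@class='toc']/li[@class='level2']/div[@class='li']/a"

# Comparison at the end of the path: the whole expression evaluates to a bool.
as_comparison = doc.xpath(base + "/text()='One'")

# Comparison inside a predicate: the expression selects matching elements.
as_predicate = doc.xpath(base + "[text()='One']")

# String comparison in XPath is case-sensitive, so 'one' selects nothing.
no_match = doc.xpath(base + "[text()='one']")

print(as_comparison)                    # True
print([a.text for a in as_predicate])   # ['One']
print(no_match)                         # []
```

An empty list such as no_match is what makes a trailing [0] raise IndexError, so it is worth checking the list before indexing it.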

You probably want all nodes matching the expression that have 'One' as their text. The corresponding expression would be:

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']")

This returns a node-set (in lxml, a list of elements). If you just need the URL as a string:

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']/@href")
# returns: ['#link1']
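Putting this together, the function asked for in the question could look roughly like the sketch below. It is an illustration, not the answerer's code. The function name find_link, the closed-off markup and the ASCII-only case folding via translate() are all assumptions. The search term is passed as an XPath variable (a keyword argument to xpath()) instead of being formatted into the expression string.

```python
import string

import lxml.html

# Trimmed copy of the question's snippet; closing tags are an assumption.
SNIPPET = """
<div id="dw__toc">
<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
</ul>
</li>
</ul>
</div>
"""


def find_link(doc, term):
    """Return (text, href) of the first level-2 TOC link whose text equals
    term, ignoring ASCII case and surrounding whitespace; None if absent."""
    matches = doc.xpath(
        "//ul[@class='toc']/li[@class='level2']/div[@class='li']"
        "/a[translate(normalize-space(.), $upper, $lower) = $term]",
        upper=string.ascii_uppercase,   # XPath 1.0 has no lower-case();
        lower=string.ascii_lowercase,   # translate() maps A-Z to a-z instead
        term=term.strip().lower(),
    )
    if not matches:
        return None
    link = matches[0]
    return link.text_content().strip(), link.get("href")


doc = lxml.html.fromstring(SNIPPET)
found = find_link(doc, "one")
missing = find_link(doc, "four")
print(found)     # ('One', '#link1')
print(missing)   # None
```

Returning None for a missing term avoids the IndexError from the question. Raising a custom exception would be an equally reasonable choice.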

Answer 2

I tried mata's answer, but it didn't work for me:

div_name = 'foo'
my_div = x.xpath(".//div[@id=%s]" %div_name)[0]

For those who might have the same problem, I found this in the lxml documentation at http://lxml.de/xpathxslt.html#the-xpath-method:

div_name = 'foo'
my_div = x.xpath(".//div[@id=$name]", name=div_name)[0]
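The sketch below compares the three variants on a made-up two-div document (the markup and all variable names are assumptions for illustration). Without quotes around %s, the expression becomes @id=foo, which XPath reads as a comparison against a child element named foo rather than against the string 'foo', so nothing matches. Quoting the placeholder works for simple values. The $name variable form also works when the value itself contains a quote.

```python
import lxml.html
from lxml import etree

# Made-up document: one plain id and one id containing a single quote.
doc = lxml.html.fromstring("""
<div>
  <div id="foo">first</div>
  <div id="it's">second</div>
</div>
""")

div_name = "foo"

# Unquoted: expands to .//div[@id=foo]; 'foo' is read as a child element name.
unquoted = doc.xpath(".//div[@id=%s]" % div_name)

# Quoted: expands to .//div[@id='foo'], a string comparison.
quoted = doc.xpath(".//div[@id='%s']" % div_name)

# XPath variable: lxml passes the value in, so no quoting is needed.
variable = doc.xpath(".//div[@id=$name]", name=div_name)

# A value containing a quote breaks string formatting but not the variable form.
tricky_name = "it's"
try:
    doc.xpath(".//div[@id='%s']" % tricky_name)
    formatting_failed = False
except etree.XPathError:
    formatting_failed = True
tricky = doc.xpath(".//div[@id=$name]", name=tricky_name)

print(unquoted)                   # []
print([d.text for d in quoted])   # ['first']
print([d.text for d in variable]) # ['first']
print(formatting_failed)          # True
print([d.text for d in tricky])   # ['second']
```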


Answer 1: 5, accepted, 2013-04-29 19:22:12

Answer 2: 4, 2013-07-19 16:04:06
