lxml: XPath works in Chrome but not in lxml
I'm trying to scrape information from this episode wiki page on Fandom, specifically the episode title in Japanese, 謀略Ⅳ:ドライバーを奪還せよ!:
Conspiracy IV: Recapture the Driver, (謀略Ⅳ:ドライバーを奪還せよ!: Bōryaku Fō: Doraibā o Dakkan seyo!)
I wrote this XPath, which selects the text in Chrome:
//div[@class='mw-parser-output']/span/span[@class='t_nihongo_kanji']/text()
but it does not work in lxml when I do this:
import requests
from lxml import html
getPageContent = lambda url : html.fromstring(requests.get(url).content)
content = getPageContent("https://kamenrider.fandom.com/wiki/Conspiracy_IV:_Recapture_the_Driver!")
JapaneseTitle = content.xpath("//div[@class='mw-parser-output']/span/span[@class='t_nihongo_kanji']/text()")
print(JapaneseTitle)
I had already written these XPaths to scrape other parts of the page, which are working:

//h2[@data-source='name']/center/text()
, the episode title in English.
//div[@data-source='airdate']/div/text()
, the air date.
//div[@data-source='writer']/div/a
, the episode writer a element.
//div[@data-source='director']/div/a
, the episode director a element.
//p[preceding-sibling::h2[contains(span,'Synopsis')] and following-sibling::h2[contains(span,'Plot')]]
, all the p elements under the Synopsis section.

As with all questions of this sort, start by breaking down your XPath into smaller expressions.
Let's start with the first expression...
>>> content.xpath("//div[@class='mw-parser-output']")
[<Element div at 0x7fbf905d5400>]
Great, that works. But if we add the next component from your expression...
>>> content.xpath("//div[@class='mw-parser-output']/span")
[]
...we don't get any results. It looks like the <div> element matched by the first component of your expression doesn't have any immediate descendants that are <span> elements.
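This failure mode can be reproduced on a hypothetical miniature of the page's structure (the real page has far more markup; the class names and nesting below are just an illustrative guess at the relevant part): a direct-child step (/span) misses an element that sits one level deeper, while the descendant axis (//) still finds it.

```python
from lxml import html

# Hypothetical miniature: the kanji <span> sits inside a <p>,
# not directly under the mw-parser-output <div>.
doc = html.fromstring(
    '<div class="mw-parser-output">'
    '<p><span><span class="t_nihongo_kanji">謀略</span></span></p>'
    '</div>'
)

# The direct-child step misses the intermediate <p>, so nothing matches.
print(doc.xpath("//div[@class='mw-parser-output']/span"))  # []

# The descendant axis (//) searches all levels below the <div>.
print(doc.xpath(
    "//div[@class='mw-parser-output']"
    "//span[@class='t_nihongo_kanji']/text()"
))  # ['謀略']
```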
If we select the relevant element in Chrome, choose "inspect element", and then "copy full xpath", we get:
/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/span/span[1]
And that looks like it should match. But if we match it (or at least a similar element) using lxml, we see a different path:
>>> res=content.xpath('//span[@class="t_nihongo_kanji"]')[0]
>>> tree = content.getroottree()
>>> tree.getpath(res)
'/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/p[1]/span/span[1]'
The difference is here:
/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/p[1] <-- extra <p> element
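You can confirm an intermediate element like this directly by listing the <div>'s immediate children in the tree lxml actually built, rather than trusting the path Chrome reports. Again on a hypothetical miniature of the structure (the element names are assumptions for illustration):

```python
from lxml import html

# Hypothetical miniature of what lxml parses: the target <span>
# hangs off a <p> that the path copied from Chrome did not show.
doc = html.fromstring(
    '<div class="mw-parser-output">'
    '<p><span><span class="t_nihongo_kanji">謀略</span></span></p>'
    '</div>'
)

# Iterating an element yields its direct children; the extra <p>
# becomes visible immediately.
div = doc.xpath("//div[@class='mw-parser-output']")[0]
print([child.tag for child in div])  # ['p']
```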
One solution is simply to ignore the difference in structure by sticking a // in the middle of the expression, so that we have something like:
>>> content.xpath("(//div[@class='mw-parser-output']//span[@class='t_nihongo_kanji'])[1]/text()")
['謀略Ⅳ:ドライバーを奪還せよ!']
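Alternatively, if you prefer to keep strict direct-child steps, you can add the missing <p> step to the expression instead of falling back to //. This is a sketch on the same hypothetical miniature structure, and it assumes the title is always inside a paragraph directly under the <div>, which the // form does not require:

```python
from lxml import html

# Same hypothetical miniature structure as above.
doc = html.fromstring(
    '<div class="mw-parser-output">'
    '<p><span><span class="t_nihongo_kanji">謀略</span></span></p>'
    '</div>'
)

# Direct-child steps work once the intermediate <p> is included.
print(doc.xpath(
    "//div[@class='mw-parser-output']/p/span"
    "/span[@class='t_nihongo_kanji']/text()"
))  # ['謀略']
```

The // version is more robust to small layout changes on the wiki; the strict version fails loudly (returns nothing) if the structure shifts, which can also be useful for catching template changes.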