
lxml: XPath works in Chrome but not in lxml

I'm trying to scrape information from this episode wiki page on Fandom, specifically the episode title in Japanese, 謀略Ⅳ:ドライバーを奪還せよ!:

Conspiracy IV: Recapture the Driver (謀略Ⅳ:ドライバーを奪還せよ!, Bōryaku Fō: Doraibā o Dakkan seyo!)

I wrote this XPath, which selects the text in Chrome: //div[@class='mw-parser-output']/span/span[@class='t_nihongo_kanji']/text(), but it does not work in lxml when I do this:

import requests
from lxml import html

# Fetch the page and parse the response body into an lxml HTML tree.
getPageContent = lambda url: html.fromstring(requests.get(url).content)
content = getPageContent("https://kamenrider.fandom.com/wiki/Conspiracy_IV:_Recapture_the_Driver!")

# The same expression that matches in Chrome returns an empty list here.
JapaneseTitle = content.xpath("//div[@class='mw-parser-output']/span/span[@class='t_nihongo_kanji']/text()")
print(JapaneseTitle)

I had already written these XPaths to scrape other parts of the page, and they all work:

  • //h2[@data-source='name']/center/text(), the episode title in English.
  • //div[@data-source='airdate']/div/text(), the air date.
  • //div[@data-source='writer']/div/a, the episode writer's a element.
  • //div[@data-source='director']/div/a, the episode director's a element.
  • //p[preceding-sibling::h2[contains(span,'Synopsis')] and following-sibling::h2[contains(span,'Plot')]], all the p elements under the Synopsis section.
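A point worth noting about the list above: expressions ending in /text() return plain strings, while expressions ending in /a return Element objects whose text has to be read separately. A minimal offline sketch of the difference, using an invented fragment that mirrors the infobox markup (the attribute names come from the question; the values are made up):

```python
from lxml import html

# Invented fragment mirroring the infobox markup the xpaths above target.
doc = html.fromstring("""
<div>
  <div data-source="airdate"><div>Some air date</div></div>
  <div data-source="writer"><div><a href="/wiki/W">Some Writer</a></div></div>
</div>
""")

# An expression ending in /text() yields plain strings...
airdate = doc.xpath("//div[@data-source='airdate']/div/text()")

# ...while one ending in /a yields Element objects, whose text must be
# extracted separately, e.g. with .text_content().
writers = doc.xpath("//div[@data-source='writer']/div/a")
print(airdate, [a.text_content() for a in writers])
```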

As with all questions of this sort, start by breaking your XPath down into smaller expressions:

Let's start with the first expression...

>>> content.xpath("//div[@class='mw-parser-output']")
[<Element div at 0x7fbf905d5400>]

Great, that works. But if we add the next component of your expression...

>>> content.xpath("//div[@class='mw-parser-output']/span")
[]

...we don't get any results. It looks like the <div> element matched by the first component of your expression doesn't have any <span> elements as immediate children.
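This failure mode is easy to reproduce offline: the child axis (/span) only matches a <span> that is an immediate child, so any intervening element breaks the match, while the descendant axis (//span) still finds it. A minimal invented fragment demonstrates this (the <p> wrapper here is made up for illustration):

```python
from lxml import html

# Invented fragment: an intervening <p> sits between the <div> and the <span>.
doc = html.fromstring(
    "<div class='mw-parser-output'><p><span>text</span></p></div>"
)

child = doc.xpath("//div[@class='mw-parser-output']/span")   # child axis: no match
desc = doc.xpath("//div[@class='mw-parser-output']//span")   # descendant axis: matches
print(len(child), len(desc))
```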

If we select the relevant element in Chrome, choose "Inspect element", and then "Copy full XPath", we get:

/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/span/span[1]

And that looks like it should match. But if we locate the same element (or at least a similar one) using lxml, we see a different path:

>>> res=content.xpath('//span[@class="t_nihongo_kanji"]')[0]
>>> tree = content.getroottree()
>>> tree.getpath(res)
'/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/p[1]/span/span[1]'

The difference is here:

/html/body/div[4]/div[3]/div[2]/main/div[3]/div[2]/div/p[1] <-- extra <p> element
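The getroottree()/getpath() step used above is a generally useful debugging technique: it reports the absolute path lxml actually sees for an element, which is how the extra <p> was discovered. A self-contained sketch of the same technique on an invented fragment with the same shape:

```python
from lxml import html

# Invented fragment with the same div -> p -> span shape as the Fandom page.
doc = html.fromstring("<div><p><span class='t_nihongo_kanji'>kanji</span></p></div>")
span = doc.xpath("//span[@class='t_nihongo_kanji']")[0]

# getpath() returns the absolute path to the element as lxml parsed it,
# revealing any elements Chrome's DOM view might have led you to overlook.
print(doc.getroottree().getpath(span))
```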

One solution is simply to ignore the difference in structure by putting a // in the middle of the expression, so that we have something like:

>>> content.xpath("(//div[@class='mw-parser-output']//span[@class='t_nihongo_kanji'])[1]/text()")
['謀略Ⅳ:ドライバーを奪還せよ!']
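To see why the corrected expression works, it can be run offline against an invented fragment that reproduces the structure getpath() revealed (the mw-parser-output div, then a p, then the nested spans):

```python
from lxml import html

# Invented fragment reproducing the structure found above:
# div.mw-parser-output -> p -> span -> span.t_nihongo_kanji
doc = html.fromstring(
    '<div class="mw-parser-output">'
    '<p><span><span class="t_nihongo_kanji">謀略Ⅳ:ドライバーを奪還せよ!</span></span></p>'
    '</div>'
)

# The // descendant axis skips over the intervening <p>; the [1] keeps
# only the first match in case the class appears more than once.
fixed = "(//div[@class='mw-parser-output']//span[@class='t_nihongo_kanji'])[1]/text()"
print(doc.xpath(fixed))
```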
