
Problem Scraping Element & Child Text with lxml & etree

I'm trying to scrape lists in a specific format from Wikipedia pages (e.g. a page like this: https://de.wikipedia.org/wiki/Liste_der_Bisch%C3%B6fe_von_Sk%C3%A1lholt ). I'm having trouble matching up the 'li' and 'a href' elements.

For example, on the page above, the ninth bullet contains the text:

1238–1268: Sigvarður Þéttmarsson (Norweger)

With the HTML:

 <li>1238–1268: <a href="/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson" title="Sigvarður Þéttmarsson">Sigvarður Þéttmarsson</a> (Norweger)</li>

I'd like to get it into a dictionary:

'1238–1268: Sigvarður Þéttmarsson (Norweger)': '/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson'

[full text of both parts of the 'li' and its 'a' child]: [href of the 'a' child]

I know I can do this with lxml/etree, but I'm not completely sure how. Some recombination of the below?

from lxml import html

# Parse with lxml.html so text_content() and cssselect() are available
tree = html.fromstring(page_source)

# Full text of each <li>, including the text of its children
bishops = tree.cssselect('li')
texts = [li.text_content() for li in bishops]

# hrefs of the <a> children
links = tree.cssselect('li a')
hrefs = [bishop.get('href') for bishop in links]
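For reference, the two selections above can be combined into the target dictionary by iterating over the li elements themselves. A minimal sketch, with some assumptions: the sample markup below stands in for the real page source, and XPath is used instead of cssselect so no extra package is needed.

```python
from lxml import html

# Sample markup standing in for the Wikipedia page source
# (in the real script this would come from the browser/driver).
page_source = """
<ul>
  <li>1238–1268: <a href="/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson"
      title="Sigvarður Þéttmarsson">Sigvarður Þéttmarsson</a> (Norweger)</li>
</ul>
"""

tree = html.fromstring(page_source)

bishops = {}
for li in tree.xpath('//li'):
    a = li.find('.//a')  # first <a> descendant, or None if the li has no link
    # text_content() concatenates the li's own text with all child text
    bishops[li.text_content()] = a.get('href') if a is not None else ''

print(bishops)
# {'1238–1268: Sigvarður Þéttmarsson (Norweger)': '/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson'}
```

Keying on `li.text_content()` gives exactly the "full text of both parts" wanted above, since it merges the li's leading text, the anchor text, and the trailing tail text in document order.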

Update: I've figured it out with BeautifulSoup, as follows:

from bs4 import BeautifulSoup

def scrape_bishops(html):
    soup = BeautifulSoup(html, 'html.parser')
    bishops_with_links = {}
    bishops = soup.select('li')

    for bishop in bishops:
        if bishop.find('a'):
            # Prefix the relative href with the site root
            bishops_with_links[bishop.text] = 'https://de.wikipedia.org' + bishop.a.get('href')
        else:
            bishops_with_links[bishop.text] = ''
    return bishops_with_links

html = driver.page_source  # page HTML from the Selenium WebDriver
bishops_with_links = scrape_bishops(html)


