
Using lxml to parse namespaced HTML?

This is driving me totally nuts; I've been struggling with it for many hours. Any help would be much appreciated.

I'm using PyQuery 1.2.9 (which is built on top of lxml) to scrape this URL. I just want to get a list of all the links in the .linkoutlist section.

This is my request in full:

import requests
from pyquery import PyQuery as pq

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
doc = pq(response.content)
links = doc('#maincontent .linkoutlist a')
print links

But that returns an empty array. If I use this query instead:

links = doc('#maincontent .linkoutlist')

Then I get this HTML back:

<div xmlns="http://www.w3.org/1999/xhtml" xmlns:xi="http://www.w3.org/2001/XInclude" class="linkoutlist">
   <h4>Full Text Sources</h4>
   <ul>
      <li><a title="Full text at publisher's site" href="http://meta.wkhealth.com/pt/pt-core/template-journal/lwwgateway/media/landingpage.htm?issn=0268-1315&amp;volume=19&amp;issue=3&amp;spage=125" ref="itool=Abstract&amp;PrId=3159&amp;uid=15107654&amp;db=pubmed&amp;log$=linkoutlink&amp;nlmid=8609061" target="_blank">Lippincott Williams &amp; Wilkins</a></li>
      <li><a href="http://ovidsp.ovid.com/ovidweb.cgi?T=JS&amp;PAGE=linkout&amp;SEARCH=15107654.ui" ref="itool=Abstract&amp;PrId=3682&amp;uid=15107654&amp;db=pubmed&amp;log$=linkoutlink&amp;nlmid=8609061" target="_blank">Ovid Technologies, Inc.</a></li>
   </ul>
   <h4>Other Literature Sources</h4>
   ...
</div>

So the parent selectors do return HTML with lots of <a> tags. This also appears to be valid HTML.

More experimenting reveals that lxml does not like the xmlns attribute on the opening div, for some reason.

How can I ignore this in lxml, and just parse it like regular HTML?

UPDATE: Trying ns_clean, still failing:

    from StringIO import StringIO

    from lxml import etree
    from lxml.cssselect import CSSSelector

    parser = etree.XMLParser(ns_clean=True)
    tree = etree.parse(StringIO(response.content), parser)
    sel = CSSSelector('#maincontent .rprt_all a')
    print sel(tree)
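(It appears ns_clean=True only drops redundant namespace declarations; the elements themselves still end up in the XHTML namespace, which is presumably why the unprefixed selector matches nothing. A minimal sketch demonstrating this, using an inline XHTML fragment rather than the live page:)

from io import BytesIO
from lxml import etree

# ns_clean=True removes redundant namespace *declarations*, but the
# parsed elements still carry the XHTML namespace in their tags.
xhtml = b'<div xmlns="http://www.w3.org/1999/xhtml"><a href="#">link</a></div>'
tree = etree.parse(BytesIO(xhtml), etree.XMLParser(ns_clean=True))
root = tree.getroot()
print root.tag                                         # {http://www.w3.org/1999/xhtml}div
print root.findall('a')                                # [] - unprefixed name does not match
print root.findall('{http://www.w3.org/1999/xhtml}a')  # qualified name does match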

You need to handle namespaces, including an empty one.

Working solution:

from pyquery import PyQuery as pq
import requests


response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')

namespaces = {'xi': 'http://www.w3.org/2001/XInclude', 'test': 'http://www.w3.org/1999/xhtml'}
links = pq('#maincontent .linkoutlist test|a', response.content, namespaces=namespaces)
for link in links:
    print link.attrib.get("title", "No title")

Prints titles of all links matching the selector:

Full text at publisher's site
No title
Free resource
Free resource
Free resource
Free resource

Or, just set the parser to "html" and forget about namespaces:

links = pq('#maincontent .linkoutlist a', response.content, parser="html")
for link in links:
    print link.attrib.get("title", "No title")
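For reference, the same idea works without PyQuery by going through lxml.html directly. This is a sketch, assuming the cssselect package is installed; lxml's HTML parser does not put elements into a namespace, so plain CSS selectors match:

import requests
from lxml import html

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')

# Parse as HTML: no namespaced tags, so no prefixes are needed.
root = html.fromstring(response.content)
for link in root.cssselect('#maincontent .linkoutlist a'):
    print link.get('title', 'No title')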

Good luck getting a standard XML/DOM parse to work on most HTML. Your best bet would be to use BeautifulSoup (pip install beautifulsoup4 or easy_install beautifulsoup4), which has a lot of handling for incorrectly built structures. Maybe something like this instead?

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
bs = BeautifulSoup(response.content, 'html.parser')
div = bs.find('div', class_='linkoutlist')
links = [ a['href'] for a in div.find_all('a') ]

>>> links
['http://meta.wkhealth.com/pt/pt-core/template-journal/lwwgateway/media/landingpage.htm?issn=0268-1315&volume=19&issue=3&spage=125', 'http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=linkout&SEARCH=15107654.ui', 'https://www.researchgate.net/publication/e/pm/15107654?ln_t=p&ln_o=linkout', 'http://www.diseaseinfosearch.org/result/2199', 'http://www.nlm.nih.gov/medlineplus/antidepressants.html', 'http://toxnet.nlm.nih.gov/cgi-bin/sis/search/r?dbs+hsdb:@term+@rn+24219-97-4']

I know it's not the library you were looking to use, but I have historically slammed my head into walls on many occasions when it comes to DOM. The creators of BeautifulSoup have circumvented many edge cases that tend to happen in the wild.

If I remember correctly from having a similar problem myself a while ago, you can "ignore" the namespace by mapping it to None, like this:

sel = CSSSelector('#maincontent .rprt_all a', namespaces={None: "http://www.w3.org/1999/xhtml"})
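If your lxml version rejects the empty prefix (lxml's XPath layer does not accept it in some releases), binding the XHTML namespace to an explicit prefix in the CSS selector works the same way. A self-contained sketch, where the prefix name x is an arbitrary choice:

from io import BytesIO

import requests
from lxml import etree
from lxml.cssselect import CSSSelector

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
tree = etree.parse(BytesIO(response.content), etree.XMLParser(ns_clean=True))

# 'x' is an arbitrary prefix bound to the XHTML namespace; the x|
# syntax in the selector qualifies the <a> elements with it.
sel = CSSSelector('#maincontent .linkoutlist x|a',
                  namespaces={'x': 'http://www.w3.org/1999/xhtml'})
for link in sel(tree):
    print link.get('title', 'No title')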
