
Why is BeautifulSoup unable to correctly read/parse this RSS (XML) document?

YCombinator is nice enough to provide an RSS feed and a big RSS feed containing the top items on Hacker News. I am trying to write a Python script that accesses the RSS feed document and then parses out certain pieces of information using BeautifulSoup. However, I am getting some strange behavior when BeautifulSoup tries to get the content of each item.

Here are a few sample lines of the RSS feed:

<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link><description>Links for the intellectually curious, ranked by readers.</description>
<item>
    <title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title>
    <link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch</link>
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>
</item>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4943361">Comments</a>]]></description>
</item>
...
</channel>
</rss>

Here is the code I have written (in Python) to access this feed and print out the title, link, and comments for each item:

import sys
import requests
from bs4 import BeautifulSoup

request = requests.get('http://news.ycombinator.com/rss')
soup = BeautifulSoup(request.text)
items = soup.find_all('item')
for item in items:
    title = item.find('title').text
    link = item.find('link').text
    comments = item.find('comments').text
    print title + ' - ' + link + ' - ' + comments

However, this script is giving output that looks like this:

EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39; -  - http://news.ycombinator.com/item?id=4944322
Two Billion Pixel Photo of Mount Everest (can you find the climbers?) -  - http://news.ycombinator.com/item?id=4943361
...

As you can see, the middle item, link, is somehow being omitted. That is, the resulting value of link is somehow an empty string. So why is that?

As I dig into what is in soup, I realize that it is somehow choking when it parses the XML. This can be seen by looking at the first item in items:

>>> print items[0]
<item><title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title></link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch<comments>http://news.ycombinator.com/item?id=4944322</comments><description>...</description></item>

You'll notice that something screwy is happening with just the link tag. It just gets the close tag and then the text for that tag after it. This is some very strange behavior, especially in contrast to title and comments, which are parsed without a problem.

This seems to be a problem with BeautifulSoup, because what is actually read in by requests doesn't have any problems with it. I don't think it is limited to BeautifulSoup, though, because I tried using the xml.etree.ElementTree API as well and the same problem arose (is BeautifulSoup built on this API?).

Does anyone know why this would be happening, or how I can still use BeautifulSoup without getting this error?

Note: I was finally able to get what I wanted with xml.dom.minidom, but this doesn't seem like a highly recommended library. I would like to continue using BeautifulSoup if possible.
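For reference, a minidom approach along the lines described above might look like the following sketch. The question doesn't show the exact code used, and the feed excerpt here is a made-up stand-in so the example runs offline:

```python
from xml.dom import minidom

# Stand-in excerpt of the feed; a real script would use the fetched text.
rss_text = """<rss version="2.0"><channel>
<item>
  <title>Example story</title>
  <link>https://example.com/story</link>
  <comments>http://news.ycombinator.com/item?id=1</comments>
</item>
</channel></rss>"""

dom = minidom.parseString(rss_text)
for item in dom.getElementsByTagName('item'):
    # firstChild is the text node inside each element
    title = item.getElementsByTagName('title')[0].firstChild.data
    link = item.getElementsByTagName('link')[0].firstChild.data
    comments = item.getElementsByTagName('comments')[0].firstChild.data
    print(title + ' - ' + link + ' - ' + comments)
```

Being a pure XML parser, minidom has no notion of HTML void elements, so the link element keeps its text.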

Update: I am on a Mac with OS X 10.8 using Python 2.7.2 and BS4 4.1.3.

Update 2: I have lxml, and it was installed with pip. It is version 3.0.2. As far as libxml, I checked in /usr/lib and the one that shows up is libxml2.2.dylib. Not sure when or how that was installed.

Wow, great question. This strikes me as a bug in BeautifulSoup. The reason that you can't access the link using soup.find('item').link is that when you first load the HTML into BeautifulSoup, it does something odd to the HTML:

>>> from bs4 import BeautifulSoup as BS
>>> BS(html)
<html><body><rss version="2.0">
<channel>
<title>Hacker News</title><link/>http://news.ycombinator.com/<description>Links
for the intellectually curious, ranked by readers.</description>
<item>
<title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'No
tch'</title>
<link/>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-d
ollar-boost-mark-cuban-and-notch
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
<description>Comments]]&gt;</description>
</item>
<item>
<title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</ti
tle>
<link/>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_
050112_8bit_FLAT.html
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
<description>Comments]]&gt;</description>
</item>
...
</channel>
</rss></body></html>

Look carefully--it has actually changed the first <link> tag to <link/> and then removed the </link> tag. I'm not sure why it would do this, but without fixing the problem in the BeautifulSoup.BeautifulSoup class initialization, you're not going to be able to use it for now.
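For what it's worth, this is most likely the HTML tree builder treating <link> as a void element (in HTML, <link> never has content, like <br> or <img>), so it closes the tag immediately and the URL becomes a sibling text node. A tiny snippet (with made-up markup) reproduces it with the pure-Python builder:

```python
from bs4 import BeautifulSoup

# An HTML tree builder closes the void element <link> immediately,
# so the URL text ends up outside the tag.
soup = BeautifulSoup('<item><link>http://example.com/</link></item>',
                     'html.parser')
print(soup)  # the <link> tag comes out empty, with the URL as a sibling
```

This is correct behavior for HTML, which is exactly why an HTML parser is the wrong tool for an RSS document.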

Update:

I think your best (albeit hack-y) bet for now is to use the following for link:

>>> soup.find('item').link.next_sibling
u'http://news.ycombinator.com/'

Actually, the problem seems to be related to the parser you are using. By default, an HTML one is used. Try using soup = BeautifulSoup(request.text, 'xml') after installing the lxml module.

It will then use an XML parser instead of an HTML one, and everything should be OK.

See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for more info.
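Applied to the script from the question, the fix is a one-argument change. A sketch, using an inlined made-up excerpt in place of the live request so it runs offline (the 'xml' feature requires lxml to be installed):

```python
from bs4 import BeautifulSoup

# Stand-in for request.text; a real script would fetch the live feed.
rss_text = """<rss version="2.0"><channel>
<item>
  <title>Example story</title>
  <link>https://example.com/story</link>
  <comments>http://news.ycombinator.com/item?id=1</comments>
</item>
</channel></rss>"""

# 'xml' selects lxml's XML parser, which knows nothing about HTML void
# elements, so <link>...</link> keeps its text content.
soup = BeautifulSoup(rss_text, 'xml')
for item in soup.find_all('item'):
    print(item.find('title').text + ' - ' +
          item.find('link').text + ' - ' +
          item.find('comments').text)
```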

@Yan Hudon is right. I solved the problem with soup = BeautifulSoup(request.text, 'xml').

I don't think there's a bug in BeautifulSoup here.

I installed a clean copy of BS4 4.1.3 on Apple's stock 2.7.2 from OS X 10.8.2, and everything worked as expected. It doesn't mis-parse the <link> as </link>, and therefore it doesn't have the problem with item.find('link').

I also tried using the stock xml.etree.ElementTree and xml.etree.cElementTree in 2.7.2, and xml.etree.ElementTree in python.org 3.3.0, to parse the same thing, and it again worked fine. Here's the code:

import xml.etree.ElementTree as ET

# x holds the RSS text shown in the question
rss = ET.fromstring(x)
for channel in rss.findall('channel'):
    for item in channel.findall('item'):
        title = item.find('title').text
        link = item.find('link').text
        comments = item.find('comments').text
        print(title)
        print(link)
        print(comments)

I then installed lxml 3.0.2 (I believe BS uses lxml if available), using Apple's built-in /usr/lib/libxml2.2.dylib (which, according to xml2-config --version, is 2.7.8), did the same tests using its etree and using BS, and again, everything worked.

In addition to screwing up the <link>, jdotjdot's output shows that BS4 is screwing up the <description> in an odd way. The original is this:

<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>

His output is:

<description>Comments]]&gt;</description>

My output from running his exact same code is:

<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>

So, it seems like there's a much bigger problem going on here. The odd thing is that it's happening to two different people, when it isn't happening on a clean install of the latest version of anything.

That implies either that it's a bug that's since been fixed and I just have a newer version of whatever had the bug, or that there's something weird about the way they both installed something.

BS4 itself can be ruled out, since at least Treebranch has 4.1.3, just like me. Although, without knowing how he installed it, it could still be a problem with the installation.

Python and its built-in etree can be ruled out, since at least Treebranch has the same stock Apple 2.7.2 from OS X 10.8 as me.

It could very well be a bug with lxml or the underlying libxml, or the way they were installed. I know jdotjdot has lxml 2.3.6, so this could be a bug that's been fixed somewhere between 2.3.6 and 3.0.2. In fact, given that, according to the lxml website and the change notes for any version after 2.3.5, there is no 2.3.6, whatever he has may be some kind of buggy release from very early on a canceled branch or something… I don't know his libxml version, or how either was installed, or what platform he's on, so it's hard to guess, but at least this is something that can be investigated.
