
Beautiful Soup is getting XML data incomplete

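I'm using Python 3.4 and Beautiful Soup 4 to pull some data from an RSS XML feed. Everything mostly works, but sometimes it does not behave as expected: for at least one item in the list, it does not return all of the data inside the <description> tag.
For example, this is the item that is giving me trouble: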

<item>
    <title>Google&#8217;s first DeepMind AI health project is missing something</title>
    <link>http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/</link>
    <comments>http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/#respond</comments>
    <pubDate>Thu, 25 Feb 2016 11:36:56 +0000</pubDate>
    <dc:creator><![CDATA[Kirsty Styles]]></dc:creator>
            <category><![CDATA[Google]]></category>
    <category><![CDATA[Insider]]></category>
    <category><![CDATA[Deepmind]]></category>
    <category><![CDATA[doctor]]></category>
    <category><![CDATA[healthcare]]></category>
    <category><![CDATA[NHS]]></category>
    <category><![CDATA[UK]]></category>

    <guid isPermaLink="false">http://thenextweb.com/?p=957096</guid>
    <description><![CDATA[<img width="520" height="245" src="http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2014/04/doctor-crop-520x245.jpg" alt="Doctors Seek Higher Fees From Health Insurers" title="Google&#039;s first DeepMind AI health project is missing something" data-id="750745" /><br />Having been down at Google’s DeepMind office earlier this week its man vs AI machine gaming competition preview, I was tipped off that a potentially-more-serious healthcare announcement would follow soon. That it has, but contrary to what the company’s remit might suggest, this project doesn’t actually contain any artificial intelligence at launch. “To date, no machine learning has been involved in these projects,” the company said. “While there is obvious potential in applying machine learning to these kinds of complex challenges, any decision to do so will led by clinicians.” DeepMind has announced an acquisition in the shape of an Imperial College London&#8230; <br><br><a href="http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/?utm_source=social&#038;utm_medium=feed&#038;utm_campaign=profeed">This story continues</a> at The Next Web]]></description>
    <wfw:commentRss>http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/feed/</wfw:commentRss>
    <slash:comments>0</slash:comments>
<enclosure url="http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2014/04/doctor-crop-520x245.jpg" type="image/jpeg" length="0" />
</item>

I'm using the following code to parse the data:

from bs4 import BeautifulSoup
import urllib.request

req = urllib.request.urlopen('http://thenextweb.com/feed/')

xml = BeautifulSoup(req, 'xml')

for item in xml.findAll('item'):
    string = item.description.string
    #new_string = string.split('/>', 1)
    #print(new_string[0]+'/><p>')
    print(string)

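When I run the script everything works fine, except that this particular item fails. The commented-out lines in the code are meant to split off the <img> tag and add a <p> tag to organize the content.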

The result I get from that item is:

’s DeepMind office earlier this week its man vs AI machine gaming competition preview, I was tipped off that a potentially-more-serious healthcare announcement would follow soon. That it has, but contrary to what the company’s remit might suggest, this project doesn’t actually contain any artificial intelligence at launch. “To date, no machine learning has been involved in these projects,” the company said. “While there is obvious potential in applying machine learning to these kinds of complex challenges, any decision to do so will led by clinicians.” DeepMind has announced an acquisition in the shape of an Imperial College London&#8230; <br><br><a href="http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/?utm_source=social&#038;utm_medium=feed&#038;utm_campaign=profeed">This story continues</a> at The Next Web

I don't know what is going on. If anyone can help me, or point me to a way of extracting the exact <img> tag, I would be very grateful.

Why not just search for the description tag inside the for loop, like this:

for item in xml.findAll('item'):
    # find() returns the whole <description> element rather than only its .string
    s = item.find('description')
    print(s)

Output for one item:

<description>&lt;img width="520" height="245" src="http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2016/02/shutterstock_366588536-520x245.jpg" alt="Fintech" title="5 British companies for FinTech Week" data-id="956789" /&gt;&lt;br /&gt;FinTech, financial technology, is about disrupting the stale financial sector with technology and innovation. Have you accepted the status quo of a bank-led dominance? The people in the flourishing FinTech field have rejected it. Last year, Eileen Burbidge, the UK government’s special envoy for FinTech stated: “London and the UK will lead the FinTech sector.” That’s not hard to believe. With a well-established financial sector, a cultivated tech scene and wide access to capital and talent, London is primed for FinTech. The industry generated over $9 billion in revenue last year. As the UK celebrates #FinTechWeek, we look at five British&amp;#8230; &lt;br&gt;&lt;br&gt;&lt;a href="http://thenextweb.com/insider/2016/02/25/5-british-companies-for-fintech-week/?utm_source=social&amp;#038;utm_medium=feed&amp;#038;utm_campaign=profeed"&gt;This story continues&lt;/a&gt; at The Next Web</description>
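
Once the full description text is available, the <img> tag still has to be pulled out of it, since the CDATA content is HTML stored as plain text. Below is a minimal sketch of one way to do that, using only the standard library's html.parser so it runs without extra dependencies. The names FirstImageFinder and first_image and the shortened sample string are made up for illustration and are not part of the question or of Beautiful Soup. A second pass such as BeautifulSoup(text, 'html.parser').find('img') should work just as well.

```python
from html.parser import HTMLParser


class FirstImageFinder(HTMLParser):
    """Records the attributes of the first <img> tag seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.attrs = None  # stays None when no <img> is found

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />
        if tag == "img" and self.attrs is None:
            self.attrs = dict(attrs)


def first_image(description_html):
    """Return the attributes of the first <img> in an HTML fragment, or None."""
    finder = FirstImageFinder()
    finder.feed(description_html)
    finder.close()
    return finder.attrs


# Shortened copy of the description text from the question (assumed input)
sample = (
    '<img width="520" height="245" '
    'src="http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2014/04/doctor-crop-520x245.jpg" '
    'alt="Doctors Seek Higher Fees From Health Insurers" '
    'title="Google&#039;s first DeepMind AI health project is missing something" '
    'data-id="750745" /><br />Having been down at Google’s DeepMind office earlier this week'
)

image = first_image(sample)
print(image["src"])
print(image["title"])
```

In the loop above, the string to pass in would be the text of the description element, for example item.find('description').get_text(). The attribute values come back as an ordinary dict, so rebuilding the tag or wrapping the remaining text in <p> is then a matter of string formatting.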

