
Python - Can't access some tags using LXML.HTML

Another day, another question; sorry for all the posts. Yesterday user "JF Sebastian" gave me a great tip: use lxml.html instead of plain lxml.

Today I am using it on another feed, http://feeds.bbc.co.uk/iplayer/search/tv/?q=news, but I cannot access a couple of tags inside the content element.

Here is a sample of the feed data:

    <title type="text">BBC News at Six: 06/03/2013</title>
    <content type="html">
      &lt;p&gt;
        &lt;a href=&quot;http://www.bbc.co.uk/iplayer/episode/b01r27mt/BBC_News_at_Six_06_03_2013/&quot;&gt;
          &lt;img src=&quot;http://ichef.bbci.co.uk/programmeimages/episode/b01r27mt_150_84.jpg&quot; alt=&quot;BBC News at Six: 06/03/2013&quot; /&gt;
        &lt;/a&gt;
      &lt;/p&gt;
      &lt;p&gt;
        National and international news stories from the BBC News team, followed by weather.
      &lt;/p&gt;
    </content>
    <category term="News" />
    <category term="TV" />
    <link rel="alternate" href="http://www.bbc.co.uk/iplayer/episode/b01r27mt/BBC_News_at_Six_06_03_2013/" type="text/html" title="BBC News at Six: 06/03/2013" />
    <media:thumbnail url="http://ichef.bbci.co.uk/programmeimages/episode/b01r27mt_150_84.jpg" width="150" height="84" />
    <link rel="self" href="http://feeds.bbc.co.uk/iplayer/episode/b01r27mt" type="application/atom+xml" title="06/03/2013" />
    <link rel="related" href="http://www.bbc.co.uk/programmes/b007mpkn/microsite" type="text/html" title="BBC News at Six" />

It appears that the tags inside the content element are escaped text rather than real elements, so the parser does not turn them into nodes I can select. Here is my code:

    from lxml import html

    tree = html.parse("http://feeds.bbc.co.uk/iplayer/search/tv/?q=news")
    for show in tree.xpath('//entry'):
        # Helper: first element inside this entry that matches a CSS selector
        select = lambda expr: show.cssselect(expr)[0]
        # icon_url, name, stream, date and content are read from the entry
        # via select(...); those assignments are not shown here
        print "icon_url: ", icon_url
        print "name: ", name
        print "stream: ", stream
        print "date: ", date
        print "content: ", content
        #links = (re.compile ('\n      &lt;p&gt;\n        &lt;a href=&quot;.+?&quot;&gt;\n          &lt;img src=&quot;(.+?)&quot; alt=&quot;.+?&quot; /&gt;\n        &lt;/a&gt;\n      &lt;/p&gt;\n      &lt;p&gt;\n     ').findall(content))
        #print "links: ", links
        #print "short: ", short

I want to get the second p tag, which holds the programme description, into the short variable above. I can't select this tag with lxml, and I can't get a regex to match the line I want either.

Any ideas?

You'll need to unescape that text to get real HTML, and then parse it again as a separate document.

From here

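    from xml.sax import saxutils as su
    from lxml import html

    # unescape() handles only &amp;, &lt; and &gt; by default; map &quot; explicitly
    unquoted_html = su.unescape(content, {"&quot;": '"'})
    # Parse the recovered markup as a separate HTML document
    new_element = html.document_fromstring(unquoted_html)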
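
To make the two steps concrete, here is a minimal end-to-end sketch written for Python 3. The `content` string is an assumption: it is the escaped markup of one entry, laid out the way the commented-out regex in the question suggests, not something fetched from the live feed. The names `paragraphs`, `icon_url` and `short` are only illustrative.

```python
from xml.sax import saxutils
from lxml import html

# Assumed sample: the escaped markup of one <content> element, with the
# two-paragraph layout implied by the regex in the question.
content = """
  &lt;p&gt;
    &lt;a href=&quot;http://www.bbc.co.uk/iplayer/episode/b01r27mt/BBC_News_at_Six_06_03_2013/&quot;&gt;
      &lt;img src=&quot;http://ichef.bbci.co.uk/programmeimages/episode/b01r27mt_150_84.jpg&quot; alt=&quot;BBC News at Six: 06/03/2013&quot; /&gt;
    &lt;/a&gt;
  &lt;/p&gt;
  &lt;p&gt;
    National and international news stories from the BBC News team, followed by weather.
  &lt;/p&gt;
"""

# Step 1: turn the escaped text back into markup. &quot; is not covered by
# unescape()'s defaults, so it is passed in the entities mapping.
unescaped = saxutils.unescape(content, {"&quot;": '"'})

# Step 2: parse the recovered markup as its own HTML document.
doc = html.document_fromstring(unescaped)

# The first <p> wraps the thumbnail link, the second holds the description.
paragraphs = doc.xpath("//p")
icon_url = paragraphs[0].xpath(".//img/@src")[0]
short = paragraphs[1].text_content().strip()

print("icon_url:", icon_url)
print("short:", short)
```

Selecting by position (`paragraphs[1]`) relies on every entry having the same two-paragraph layout. If that does not hold for the whole feed, checking `len(paragraphs)` first would be safer.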
