
Python BeautifulSoup: parse <p></p>

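I'm learning how to parse with BeautifulSoup. Can someone explain how to parse the <p></p> elements inside div class="article-content"? After the script runs, I only want to see the article content. Here is what I want: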

[image: enter image description here]

I can parse div class="article-content", but I can't get just the information inside its <p></p> elements. My code looks like this:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.engadget.com/2014/10/17/local-multiplayer-is-coming-to-android-games/')
parsed_html = BeautifulSoup(html)
print parsed_html.body.find('div', attrs={'class':'article-content'}).text

But I also get a lot of junk:

$ python engadget_parser.py


Ever wish that you could just whip out your Android device and harass a passer-by to play games with you? It's the sort of thing that Nintendo DS users, for example, have been using thanks to that company's StreetPass feature, but, until now, hasn't been available on Google's smartphones. Now, however, the company has an added an update to its games infrastructure that enables "ambient, real-time" games with more than one user - so long that the game relies upon Google's home-grown multiplayer backend. Still, maybe don't sprint into the street and start challenging people to a dual, because they might get the wrong idea.





        onBreak({
            0: function(){
                (function() {
                        var a = {
                                mobilePlacementID: "348-14-15-135b",
                        width: "320",
                        height: "115"
                        };
                    madserver.requestAd(a);
                })();
            },
            768: function(){}
        });






Source: Android Developers (G+)



Tags: android, AndroidGames, gaming, google, googleplaygames, mobile, mobilepostcross





 Hide Comments
0Comments










            _when_.eng("eng.livefyre.init", {
                articleId: 20979699 ,
                domain: "engadget.fyre.co" ,
                siteId: "296092" ,
                el: "livefyre_20979699",
                initialNumVisible: 2
            })



_when_.eng("eng.perm.init");



lab.scriptBs('gravity.js')




onBreak({
    0: function(){},
    320: function(){},
    768: function(){}
});

Thanks!

In this case I like BeautifulSoup's select method. Replace this:

print parsed_html.body.find('div', attrs={'class':'article-content'}).text

with this:

for p in parsed_html.select('div.article-content p'):
    print p.text
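
To see why the selector helps, here is a minimal sketch on made-up markup (the sample HTML and the "meta" class are assumptions, not Engadget's real page). It is written for Python 3 and names the parser explicitly. `div.article-content p` matches only <p> elements that are descendants of the div, whereas `.text` on the div returns the text of everything inside it:

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real article page.
sample_html = """
<div class="article-content">
  <p>First paragraph of the article.</p>
  <p>Second paragraph of the article.</p>
  <div class="meta">Tags: android, gaming</div>
</div>
<p>Paragraph outside the article.</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# .text on the container gathers the text of every descendant, including the tags line.
div_text = soup.find("div", attrs={"class": "article-content"}).text

# The CSS selector keeps only <p> elements nested inside div.article-content.
paragraphs = [p.text for p in soup.select("div.article-content p")]

for text in paragraphs:
    print(text)
```

The paragraph outside the div is skipped as well, because the selector requires the div as an ancestor.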

Maybe this is very bad code, but I'll show it anyway. Go easy on me, I'm just a Python beginner:

import urllib2
from bs4 import BeautifulSoup

url = "http://www.engadget.com/2014/10/17/castar-augmented-reality/"

html = urllib2.urlopen(url)
parsed_html = BeautifulSoup(html)


def news_parser(soup):
    # Collect the text of every <p> inside div.article-content.
    return [p.text for p in soup.select('div.article-content p')]


def longest_text_position(paragraphs):
    # Sometimes the article is not at index 1, so look for the longest element instead.
    longest_text = max(paragraphs, key=len)
    return paragraphs.index(longest_text)


def print_news(soup, paragraphs, position):
    print "-" * 80
    print soup.title.string
    print "-" * 80
    print paragraphs[position]
    print "-" * 80
    print " "


# Avoid naming a variable "list": it would shadow the built-in type.
paragraphs = news_parser(parsed_html)
position = longest_text_position(paragraphs)
print_news(parsed_html, paragraphs, position)

The result is:

$ python engadget_parser_new.py 
--------------------------------------------------------------------------------
castAR bets big on its augmented reality hardware with move to Silicon Valley
--------------------------------------------------------------------------------
And they certainly were. From just a brief hands-on with the new hardware, I could tell the  make out ....ating that I could look around objects by just walking around the table. Henkel-Wallace mentioned a potential for a holodeck application by blanketing a room with that retroreflective material, and I could certainly see a use case for that.
--------------------------------------------------------------------------------
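
The script above targets Python 2, where `urllib2` and the `print` statement exist. A rough Python 3 sketch of the same idea follows, assuming `urllib.request` for the download and an explicit "html.parser". The helper names `fetch`, `article_paragraphs` and `longest_paragraph` and the sample markup are made up for illustration; the sample stands in for a downloaded page so the example runs without network access:

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch(url):
    # Hypothetical helper: download the page and return its raw bytes.
    with urlopen(url) as response:
        return response.read()


def article_paragraphs(html):
    # Text of every <p> nested inside div.article-content.
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text() for p in soup.select("div.article-content p")]


def longest_paragraph(paragraphs):
    # Same heuristic as the script above: treat the longest paragraph as the article body.
    return max(paragraphs, key=len) if paragraphs else ""


# Made-up markup standing in for a downloaded page.
sample_html = """
<html><head><title>Sample title</title></head><body>
<div class="article-content">
  <p>Short intro.</p>
  <p>This is the much longer paragraph that plays the role of the article body.</p>
  <div class="tags"><p>Tags: a, b</p></div>
</div>
</body></html>
"""

paragraphs = article_paragraphs(sample_html)
body = longest_paragraph(paragraphs)
print(body)
```

For a live page, `article_paragraphs(fetch(url))` would replace the sample string. Passing the HTML in as an argument keeps the parsing testable without a network connection.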

Thanks, @Vincent Beltman.


