Scraping with Beautiful Soup

Question

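I am scraping an article with BeautifulSoup. I want to grab all of the p tags in the article body except those in one particular section. Can anyone give me a hint about what I am doing wrong? I don't get an error; the output just doesn't change. Currently it picks up the word "Print" from the unwanted section and prints it along with the other p tags.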

The section I want to ignore: soup.find("div", {'class': 'add-this'})

    url = "http://www.un.org/apps/news/story.asp?NewsID=47549&Cr=burundi&Cr1=#.U0vmB8fTYig"

    # Parse HTML of article, aka making soup
    soup = BeautifulSoup(urllib2.urlopen(url).read())

    # Retrieve all of the paragraphs
    tags = soup.find("div", {'id': 'fullstory'}).find_all('p')
    for tag in tags:
        ptags = soup.find("div", {'class': 'add-this'})
        for tag in ptags:
            txt.write(tag.nextSibling.text.encode('utf-8') + '\n' + '\n')
        else:
            txt.write(tag.text.encode('utf-8') + '\n' + '\n')

Answer 1

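One option is simply to pass recursive=False, so that the search does not descend into other elements nested inside the fullstory div when looking for p tags: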

    # Only direct children of the fullstory div are considered
    tags = soup.find("div", {'id': 'fullstory'}).find_all('p', recursive=False)
    for tag in tags:
        print(tag.text)

This retrieves only the top-level paragraphs from the div and prints the complete article:

10 April 2014  The United Nations today called on the Government...
...
...follow up with the Government on these concerns.
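
The effect of recursive=False can be checked offline on a small inline document. The snippet below is a minimal sketch, not the real page: SAMPLE_HTML is a hypothetical stand-in that assumes the "Print" paragraph sits inside the add-this div, which in turn sits inside the fullstory div. It also assumes the bs4 package is installed and uses the standard-library "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article markup; the real page may differ.
SAMPLE_HTML = """
<div id="fullstory">
  <p>First paragraph.</p>
  <div class="add-this"><p>Print</p></div>
  <p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
story = soup.find("div", {"id": "fullstory"})

# The default search descends into nested elements, so "Print" is included.
all_paragraphs = [p.get_text() for p in story.find_all("p")]

# recursive=False only looks at direct children of the fullstory div.
top_level = [p.get_text() for p in story.find_all("p", recursive=False)]

print(all_paragraphs)  # ['First paragraph.', 'Print', 'Second paragraph.']
print(top_level)       # ['First paragraph.', 'Second paragraph.']
```

This only works as long as every wanted paragraph is a direct child of the fullstory div; a paragraph wrapped in any other element would be skipped as well.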

Answer 1: accepted 2014-04-14 15:07:54
