
Python - Easiest way to scrape text from list of URLs using BeautifulSoup

What's the easiest way to scrape just the text from a handful of webpages (using a list of URLs) using BeautifulSoup? Is it even possible?

Best, Georgina

import re
import urllib.request

from bs4 import BeautifulSoup

Newlines = re.compile(r'[\r\n]\s+')

def getPageText(url):
    # given a url, get page content
    data = urllib.request.urlopen(url).read()
    # parse as an html structured document
    # (the html.parser backend decodes html entities by default)
    bs = BeautifulSoup(data, 'html.parser')
    # kill javascript content
    for s in bs.find_all('script'):
        s.decompose()
    # find body and extract text
    txt = bs.find('body').get_text('\n')
    # remove multiple linebreaks and whitespace
    return Newlines.sub('\n', txt)

def main():
    urls = [
        'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
        'http://stackoverflow.com/questions/5330248/how-to-rewrite-a-recursive-function-to-use-a-loop-instead'
    ]
    txt = [getPageText(url) for url in urls]

if __name__ == "__main__":
    main()

It now removes JavaScript and decodes HTML entities.
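The same script-stripping and whitespace-collapsing idea can also be sketched with only the standard library's html.parser, which likewise decodes HTML entities by default. This is a minimal illustration on an inline HTML string; the TextExtractor class and page_text helper are hypothetical names, not part of the answer above:

```python
import re
from html.parser import HTMLParser

Newlines = re.compile(r'[\r\n]\s+')

class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <script> tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True: entities are decoded
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script:
            self.chunks.append(data)

def page_text(html):
    # parse, drop script content, then collapse runs of linebreaks/whitespace
    parser = TextExtractor()
    parser.feed(html)
    return Newlines.sub('\n', ''.join(parser.chunks))
```

For example, `page_text('<p>Hello</p><script>var x=1;</script>\n   <p>World</p>')` yields `'Hello\nWorld'`: the script body is dropped and the linebreak-plus-indent run collapses to a single newline.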

It is perfectly possible. The easiest way is to iterate through the list of URLs, load each page's content, find the URLs in it, and add them to the main list. Stop iterating when enough pages have been found.

Just some tips:

  • urllib.request.urlopen for fetching content
  • BeautifulSoup: find_all('a') for finding URLs
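The loop described above can be sketched roughly as follows, using only the standard library so it runs without a network connection. The fetch callable stands in for downloading a URL (e.g. urllib.request.urlopen(url).read().decode()), and LinkCollector plays the role BeautifulSoup's find_all('a') would play; all names here are illustrative, not from the answer:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (what find_all('a') would give you)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(seed_urls, fetch, limit=10):
    """Breadth-first: load each page, queue its links, stop after `limit` pages."""
    queue = list(seed_urls)
    seen = []
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.append(url)
        collector = LinkCollector()
        collector.feed(fetch(url))
        queue.extend(collector.links)
    return seen
```

In real use, fetch would hit the network; for a quick check you can pass a dict-backed stub such as `lambda u: pages.get(u, '')` and verify the crawl stops once `limit` pages have been visited.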

I know this is not an answer to your exact question (about BeautifulSoup), but a good idea is to have a look at Scrapy, which seems to fit your needs.

