
Python - Easiest way to scrape text from list of URLs using BeautifulSoup

What's the easiest way to scrape just the text from a handful of webpages (using a list of URLs) using BeautifulSoup? Is it even possible?

Best, Georgina

import re
import urllib.request

from bs4 import BeautifulSoup

Newlines = re.compile(r'[\r\n]\s+')

def getPageText(url):
    # given a url, get page content
    data = urllib.request.urlopen(url).read()
    # parse as an html structured document
    # (the html.parser backend decodes html entities by default)
    bs = BeautifulSoup(data, 'html.parser')
    # kill javascript content
    for s in bs.find_all('script'):
        s.decompose()
    # find body and extract text
    txt = bs.find('body').get_text('\n')
    # remove multiple linebreaks and whitespace
    return Newlines.sub('\n', txt)

def main():
    urls = [
        'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
        'http://stackoverflow.com/questions/5330248/how-to-rewrite-a-recursive-function-to-use-a-loop-instead'
    ]
    txt = [getPageText(url) for url in urls]

if __name__ == "__main__":
    main()

It now removes JavaScript and decodes HTML entities.
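The same script-stripping and whitespace-collapsing idea can also be sketched with only the standard library's html.parser, which likewise decodes HTML entities by default. This is a minimal illustration on an inline HTML string; the TextExtractor class and page_text helper are hypothetical names, not part of the answer above:

```python
import re
from html.parser import HTMLParser

Newlines = re.compile(r'[\r\n]\s+')

class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <script> tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True: entities are decoded
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script:
            self.chunks.append(data)

def page_text(html):
    # parse, drop script content, then collapse runs of linebreaks/whitespace
    parser = TextExtractor()
    parser.feed(html)
    return Newlines.sub('\n', ''.join(parser.chunks))
```

For example, `page_text('<p>Hello</p><script>var x=1;</script>\n   <p>World</p>')` yields `'Hello\nWorld'`: the script body is dropped and the linebreak-plus-indent run collapses to a single newline.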

It is perfectly possible. The easiest way is to iterate through the list of URLs, load each page's content, find the URLs in it, and add them to the main list. Stop iterating when enough pages have been found.

Just some tips:

  • urllib.request.urlopen for fetching content
  • BeautifulSoup: find_all('a') for finding URLs
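The loop described above can be sketched roughly as follows, using only the standard library so it runs without a network connection. The fetch callable stands in for downloading a URL (e.g. urllib.request.urlopen(url).read().decode()), and LinkCollector plays the role BeautifulSoup's find_all('a') would play; all names here are illustrative, not from the answer:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (what find_all('a') would give you)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(seed_urls, fetch, limit=10):
    """Breadth-first: load each page, queue its links, stop after `limit` pages."""
    queue = list(seed_urls)
    seen = []
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.append(url)
        collector = LinkCollector()
        collector.feed(fetch(url))
        queue.extend(collector.links)
    return seen
```

In real use, fetch would hit the network; for a quick check you can pass a dict-backed stub such as `lambda u: pages.get(u, '')` and verify the crawl stops once `limit` pages have been visited.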

I know this is not an answer to your exact question (about BeautifulSoup), but a good idea is to have a look at Scrapy, which seems to fit your needs.

