Webscrape without Beautiful Soup

Question

I am new to web scraping and to Python in general, and I am stuck on how to correct my function. My task is to scrape a site for words starting with a specific letter and return a list of the ones that match, preferably using regex. Thank you for your time; my code so far is below.

import urllib
import re

def webscraping(website):
    fhand = urllib.urlopen(website).read()
    for line in fhand:
        line = fhand.strip()
        if line.startswith('h'):
            print line
webscraping("https://en.wikipedia.org/wiki/Web_scraping")

Answer 1

Going to go ahead and say this:

and return a list of the ones that match, preferably using regex.

No. You absolutely shouldn't use regex to parse HTML. That is exactly why we have HTML parsers.

Use BeautifulSoup; it has everything built in, and something like this is relatively easy to do (not tested):

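import urllib
from bs4 import BeautifulSoup

def webscraping(website):
    # Python 2: fetch the raw HTML of the page
    fhand = urllib.urlopen(website).read()
    soup = BeautifulSoup(fhand, "html.parser")
    # Return every text node whose content starts with 'h'
    return soup.find_all(text=lambda x: x.startswith('h'))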
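
Since the question title asks about doing this without Beautiful Soup, here is a rough sketch of the same idea using only the standard library, assuming Python 3. The parser extracts the text, and the regex is then applied to that plain text only, never to the markup. The names TextExtractor and words_starting_with are hypothetical, and the sample HTML is made up for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def words_starting_with(html, letter):
    """Return all words in the page text that start with `letter` (case-insensitive)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # The regex runs on extracted text, not on the HTML itself.
    pattern = r"\b" + re.escape(letter) + r"[A-Za-z]*"
    return re.findall(pattern, text, flags=re.IGNORECASE)


# Made-up sample page; a real page could be fetched with urllib.request.urlopen.
sample = ("<html><body><h1>Hello</h1><p>the happy hat</p>"
          "<script>var h = 1;</script></body></html>")
print(words_starting_with(sample, "h"))  # prints ['Hello', 'happy', 'hat']
```

Note that "the" is not matched even though it contains an "h", because the word boundary requires the letter to open the word, and the script contents are ignored entirely.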

Answer 2

Never use regex to parse HTML. You can use Beautiful Soup; here is an example:

import urllib
from BeautifulSoup import *

todo = list()
visited = list()
url = raw_input('Enter - ')
todo.append(url)

while len(todo) > 0 :
   print "====== Todo list count is ",len(todo)
   url = todo.pop()

   if ( not url.startswith('http') ) : 
       print "Skipping", url
       continue

   if ( url.find('facebook') > 0 ) :
       continue

   if ( url in visited ) :
       print "Visited", url
       continue

   print "===== Retrieving ", url

   html = urllib.urlopen(url).read()
   soup = BeautifulSoup(html)
   visited.append(url)

   # Retrieve all of the anchor tags
   tags = soup('a')
   for tag in tags:
       newurl = tag.get('href', None)
       if ( newurl != None ) : 
           todo.append(newurl)
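
The todo/visited loop above can also be sketched in Python 3 with only the standard library. This is an illustration under stated assumptions, not a replacement for the code above: LinkCollector, crawl, and the in-memory PAGES dictionary are hypothetical names, the fetch function is passed in so the sketch runs without network access, and resolving relative links with urljoin is an addition the original loop does not make.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def crawl(start_url, fetch):
    """Visit pages reachable from start_url; fetch(url) returns HTML or None."""
    todo = [start_url]
    visited = set()
    order = []
    while todo:
        url = todo.pop()
        # Skip non-HTTP links and pages that were already handled.
        if not url.startswith("http") or url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links against the current page.
            todo.append(urljoin(url, href))
    return order


# Hypothetical in-memory "site" standing in for real HTTP requests.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="mailto:x@example.test">mail</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="b">B</a>',
    "http://example.test/b": "<p>no links</p>",
}

print(crawl("http://example.test/", PAGES.get))
```

Using a set for visited keeps the membership check cheap as the crawl grows, whereas the list in the original is scanned linearly on every lookup.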

2 answers

Answer 1
1 2016-12-03 02:59:30

Answer 2
0 2016-12-03 06:02:32
