Webscrape w/o beautiful soup

I am new to web scraping and Python in general, and I am a bit stuck on how to correct my function. My task is to scrape the site for words starting with a specific letter and return a list of the ones that match, preferably using regex. Thank you for your time; my code so far is below.

import urllib
import re

def webscraping(website):
    fhand = urllib.urlopen(website).read()
    for line in fhand:
        line = fhand.strip()
        if line.startswith('h'):
            print line
webscraping("https://en.wikipedia.org/wiki/Web_scraping")

Going to go ahead and say this:

and return a list of the ones that match, preferably using regex. 

No. You absolutely shouldn't use regex to parse HTML. That's exactly why we have HTML parsers for that job.

Use BeautifulSoup; it has everything built in, and it's relatively easy to do something like this (not tested):

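import urllib
from bs4 import BeautifulSoup

def webscraping(website):
    # Download the page and parse it with the built-in HTML parser
    fhand = urllib.urlopen(website).read()
    soup = BeautifulSoup(fhand, "html.parser")
    # Return every text node that starts with 'h'
    return soup.find_all(text=lambda x: x.startswith('h'))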
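
Since the question's title asks for a solution without Beautiful Soup, here is a rough sketch of the same idea using only the standard library. It is written for Python 3 (the code in this thread is Python 2), and the names `TextCollector` and `words_starting_with` are made up for illustration. The HTML parser extracts the visible text first, and the regex is applied only to that plain text, never to the markup itself.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element
        if self._skip_depth == 0:
            self.chunks.append(data)


def words_starting_with(html, letter):
    """Return words in the page text that start with `letter` (case-insensitive)."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join(collector.chunks)
    # \b anchors the match at the start of a word
    pattern = re.compile(r"\b" + re.escape(letter) + r"\w*", re.IGNORECASE)
    return pattern.findall(text)
```

To run it against a live page, the HTML string could come from something like `urllib.request.urlopen(url).read().decode("utf-8")`, assuming the page is UTF-8 encoded.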

Never use regex to parse HTML. You can use Beautiful Soup; here is an example:

import urllib
from BeautifulSoup import *

todo = list()
visited = list()
url = raw_input('Enter - ')
todo.append(url)

while len(todo) > 0 :
   print "====== Todo list count is ",len(todo)
   url = todo.pop()

   if ( not url.startswith('http') ) : 
       print "Skipping", url
       continue

   if ( url.find('facebook') > 0 ) :
       continue

   if ( url in visited ) :
       print "Visited", url
       continue

   print "===== Retrieving ", url

   html = urllib.urlopen(url).read()
   soup = BeautifulSoup(html)
   visited.append(url)

   # Retrieve all of the anchor tags
   tags = soup('a')
   for tag in tags:
       newurl = tag.get('href', None)
       if ( newurl != None ) : 
           todo.append(newurl)
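
Note that the loop above skips every `href` that does not start with `http`, so relative links such as `/wiki/Data` are never followed. As a hedged sketch of one way to handle that, the snippet below collects anchor links with the standard library only (Python 3) and resolves them against the page URL with `urljoin`. `LinkCollector` and `extract_links` are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # ignore anchors without an href
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return absolute URLs for all anchor tags found in `html`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

The returned list could be appended to `todo` in place of the raw `href` values; the `visited` check in the loop would still be needed to avoid fetching the same page twice.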
