Web scraping without Beautiful Soup
I am new to web scraping and to Python in general, and I am stuck on how to correct my function. My task is to scrape a site for words starting with a specific letter and return a list of the ones that match, preferably using regex. Thank you for your time; here is my code so far.
import urllib
import re

def webscraping(website):
    fhand = urllib.urlopen(website).read()
    for line in fhand:
        line = fhand.strip()
        if line.startswith('h'):
            print line

webscraping("https://en.wikipedia.org/wiki/Web_scraping")
Going to go ahead and say this:
"and return a list of the ones that match, preferably using regex."
No. You absolutely shouldn't use regex to parse HTML. That's why we have HTML parsers: they exist exactly for that job.
Use BeautifulSoup. It has everything built in, and it's relatively easy to do something like this (not tested):
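from urllib.request import urlopen  # Python 3; on Python 2 this was urllib.urlopen
from bs4 import BeautifulSoup

def webscraping(website):
    fhand = urlopen(website).read()
    soup = BeautifulSoup(fhand, "html.parser")
    # Return every text node whose content starts with 'h'
    return soup.find_all(string=lambda s: s and s.startswith('h'))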
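Since the question asks for individual words and for a solution without Beautiful Soup, here is a minimal sketch that uses only the standard library. It lets `html.parser` handle the markup and applies the regex to the extracted text only, never to the HTML itself. The names `TextExtractor` and `words_starting_with` are made up for this illustration, and treating a "word" as a run of `\w` characters is an assumption.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects page text, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def words_starting_with(html, letter):
    """Return the words in the page text that start with `letter` (case-insensitive)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # The regex runs on plain text, so it only has to recognise words
    pattern = r"\b" + re.escape(letter) + r"\w*"
    return re.findall(pattern, text, flags=re.IGNORECASE)


sample = ("<html><body><h1>Hello</h1><p>The house has a <b>high</b> roof.</p>"
          "<script>var hidden = 1;</script></body></html>")
print(words_starting_with(sample, "h"))  # ['Hello', 'house', 'has', 'high']
```

To run it against a live page, the HTML string could come from something like `urllib.request.urlopen(url).read().decode("utf-8")`. Note that the text chunks are joined with spaces, so a word split across inline tags would be counted as two words.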
Never use regex to parse HTML. You can use Beautiful Soup; here is an example:
from urllib.request import urlopen  # Python 3; the original used Python 2's urllib.urlopen
from bs4 import BeautifulSoup       # Beautiful Soup 4; the original imported Beautiful Soup 3

todo = []
visited = []
url = input('Enter - ')
todo.append(url)

while len(todo) > 0:
    print("====== Todo list count is", len(todo))
    url = todo.pop()

    # Only absolute http(s) links are followed
    if not url.startswith('http'):
        print("Skipping", url)
        continue

    # Ignore links that point to Facebook
    if url.find('facebook') > 0:
        continue

    if url in visited:
        print("Visited", url)
        continue

    print("===== Retrieving", url)
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    visited.append(url)

    # Retrieve all of the anchor tags and queue their targets
    tags = soup('a')
    for tag in tags:
        newurl = tag.get('href', None)
        if newurl is not None:
            todo.append(newurl)
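The crawler above only follows hrefs that already start with "http", so relative links such as "/wiki/Data_scraping" are skipped. One possible refinement, sketched here with the standard library only, is to resolve each href against the page URL before queueing it. `LinkCollector` and `extract_links` are hypothetical names for this illustration, and dropping fragments and non-http schemes is an assumption about what the crawler should follow.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return the absolute http(s) links in `html`, resolved against `base_url`."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    links = []
    for href in collector.hrefs:
        # Resolve relative hrefs, then drop any "#fragment" part
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links


page = ('<a href="/wiki/Data_scraping">a</a> <a href="#History">b</a> '
        '<a href="mailto:x@example.org">c</a> <a href="https://example.org/">d</a>')
print(extract_links(page, "https://en.wikipedia.org/wiki/Web_scraping"))
```

In the crawl loop, the result of `extract_links(html, url)` could be appended to `todo` in place of the raw `href` values. Keeping `visited` as a `set` rather than a list would also make the `url in visited` check faster on large crawls.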
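Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you republish one, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.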