
Web crawler page iteration

I have written this code that goes to WebMD and, so far, extracts all the links from each sub-category in the message boards. What I want to do next is to make the program go through all the pages of each sub-category link. I have tried many things, but I always run into a problem. Any ideas?

import bs4 as bs
import urllib.request
import pandas as pd


# Fetch the message-board index and parse it
source = urllib.request.urlopen('https://messageboards.webmd.com/').read()
soup = bs.BeautifulSoup(source, 'lxml')

# Collect the link of every sub-category listed on the index page
df = pd.DataFrame(columns=['link'],
                  data=[url.a.get('href') for url in soup.find_all('div', class_="link")])

# Visit each sub-category page (this is where I get stuck: how do I walk all of its pages?)
for i in range(len(df)):
    link = df.link.iloc[i]
    source1 = urllib.request.urlopen(link).read()
    soup1 = bs.BeautifulSoup(source1, 'lxml')
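Roughly, what I am trying to get to inside that loop is something like the sketch below. It is only a guess at the structure: the '?pi=' page parameter and the empty-page stopping test are assumptions, not the real WebMD pagination.

for i in range(len(df)):
    link = df.link.iloc[i]
    page = 1
    while True:
        paged_url = '{}?pi={}'.format(link, page)     # hypothetical page parameter
        source1 = urllib.request.urlopen(paged_url).read()
        soup1 = bs.BeautifulSoup(source1, 'lxml')
        posts = soup1.find_all('div', class_="link")  # reuse the selector from above
        if not posts:                                 # assume an empty page means we ran out
            break
        page += 1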

I've used Python and Wget to do a similar task in the past. See the Wget documentation here. You can look into its source to get an idea of how it works.

Basically you can do the following; the sketch below shows the idea in Python.

import urllib.request
import bs4 as bs

alreadyDownloadedUrls = []


def pageDownloader(url):
    # Download the given URL, append it to the alreadyDownloadedUrls list,
    # and return the page contents
    page = urllib.request.urlopen(url).read()
    alreadyDownloadedUrls.append(url)
    return page


def urlFinder(inputPage):
    # Find and return all the URLs of the input page in a list
    soup = bs.BeautifulSoup(inputPage, 'lxml')
    return [a.get('href') for a in soup.find_all('a', href=True)]


def urlFilter(inputUrls):
    # Keep only the URLs that are not already in the alreadyDownloadedUrls list
    return [url for url in inputUrls if url not in alreadyDownloadedUrls]


def controlFunction(firstPage):
    # Download the page, collect the URLs it contains,
    # then recursively crawl every URL that has not been downloaded yet
    firstPageDownload = pageDownloader(firstPage)
    foundUrls = urlFinder(firstPageDownload)
    validUrls = urlFilter(foundUrls)
    for url in validUrls:
        controlFunction(url)

However, calling it recursively like this will make you download the whole internet. So you have to validate each URL and check whether it comes from the parent domain or one of its subdomains. You can do that in the urlFilter function.
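For example, the domain check could look something like this. It is only a sketch; the allowed domain below is just an illustration.

from urllib.parse import urlparse

def isAllowedDomain(url, allowedDomain='messageboards.webmd.com'):
    # Accept the URL only if its host is the allowed domain or one of its subdomains
    host = urlparse(url).netloc
    return host == allowedDomain or host.endswith('.' + allowedDomain)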

You also have to add some more validation to check whether you are downloading the same link with a hash tag at the end of the URL. Otherwise your program will think that two URLs that differ only in the fragment point to different pages.
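Stripping the fragment before comparing URLs takes care of that; for example (again just a sketch):

from urllib.parse import urldefrag

def normalizeUrl(url):
    # Drop the '#fragment' so that 'page.html' and 'page.html#reply' count as the same page
    return urldefrag(url).url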

You may also introduce a depth limit, as Wget does.
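A simple way to do that is to pass the current depth down the recursion, as in this variant of controlFunction (the limit of 3 is arbitrary):

MAX_DEPTH = 3  # arbitrary limit, tune as needed

def controlFunction(firstPage, depth=0):
    # Stop recursing once the depth limit is reached
    if depth > MAX_DEPTH:
        return
    firstPageDownload = pageDownloader(firstPage)
    foundUrls = urlFinder(firstPageDownload)
    validUrls = urlFilter(foundUrls)
    for url in validUrls:
        controlFunction(url, depth + 1)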

Hope this clears up the idea for you.
