
Python simple web crawler error (infinite loop crawling)

I wrote a simple crawler in Python. It seems to work and finds new links, but it keeps finding the same links over and over and does not download the newly found pages. It appears to crawl forever, even after it reaches the set crawling-depth limit. I am not getting any errors; it just runs forever. Here are the code and a sample run. I am using Python 2.7 on Windows 7 64-bit.

import sys
import time
from bs4 import *
import urllib2
import re
from urlparse import urljoin

def crawl(url):
    url = url.strip()
    page_file_name = str(hash(url))
    page_file_name = page_file_name + ".html" 
    fh_page = open(page_file_name, "w")
    fh_urls = open("urls.txt", "a")
    fh_urls.write(url + "\n")
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    html_text = str(soup)
    fh_page.write(url + "\n")
    fh_page.write(page_file_name + "\n")
    fh_page.write(html_text)
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    rs = []
    for link in links:
        try:
            #r = urllib2.urlparse.urljoin(url, link)
            r = urllib2.urlopen(link)
            r_str = str(r.geturl())
            fh_urls.write(r_str + "\n")
            #a = urllib2.urlopen(r)
            if r.headers['content-type'] == "html" and r.getcode() == 200:
                rs.append(r)
                print "Extracted link:"
                print link
                print "Extracted link final URL:"
                print r
        except urllib2.HTTPError as e:
            print "There is an error crawling links in this page:"
            print "Error Code:"
            print e.code
    return rs
    fh_page.close()
    fh_urls.close()

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "Usage: python crawl.py <seed_url> <crawling_depth>"
        print "e.g: python crawl.py https://www.yahoo.com/ 5"
        exit()
    url = sys.argv[1]
    depth = sys.argv[2]
    print "Entered URL:"
    print url
    html_page = urllib2.urlopen(url)
    print "Final URL:"
    print html_page.geturl()
    print "*******************"
    url_list = [url, ]
    current_depth = 0
    while current_depth < depth:
        for link in url_list:
            new_links = crawl(link)
            for new_link in new_links:
                if new_link not in url_list:
                    url_list.append(new_link)
            time.sleep(5)
            current_depth += 1
            print current_depth

Here is what I got when I ran it:

C:\Users\Hussam-Den\Desktop>python test.py https://www.yahoo.com/ 4
Entered URL:
https://www.yahoo.com/
Final URL:
https://www.yahoo.com/
*******************
1

And here is the output file that stores the crawled URLs:

https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account

Any idea what's wrong?

  1. You have an error here: depth = sys.argv[2]. sys.argv returns str, not int. You should write depth = int(sys.argv[2]).
  2. Because of point 1, the condition while current_depth < depth: always returns True.

Try to fix it by converting argv[2] to int. I think the error is there.
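The answer's two points can be checked with a small self-contained sketch (written in Python 3 syntax for illustration; the helper `crawl_levels` and the toy link graph are not from the original code). It shows the required int() conversion and, additionally, a visited set, which would also stop the same URLs from being re-crawled on every pass as seen in the output file above:

```python
def crawl_levels(seed, get_links, max_depth):
    """Breadth-first crawl: visit each URL once, stop after max_depth levels."""
    visited = set()
    frontier = [seed]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            if url in visited:
                continue          # never crawl the same URL twice
            visited.add(url)
            for link in get_links(url):
                if link not in visited:
                    next_frontier.append(link)
        frontier = next_frontier
    return visited

# A toy link graph standing in for the urllib2/BeautifulSoup fetching.
graph = {
    "https://a.example/": ["https://b.example/", "https://c.example/"],
    "https://b.example/": ["https://a.example/", "https://c.example/"],
    "https://c.example/": ["https://a.example/"],
}

# sys.argv values are always str; without int() the depth check never ends.
# (In Python 2, 0 < "4" is True because any int sorts before any str,
# so `current_depth < depth` with a str depth is always True.)
depth = int("4")
print(sorted(crawl_levels("https://a.example/", lambda u: graph.get(u, []), depth)))
# -> ['https://a.example/', 'https://b.example/', 'https://c.example/']
```

Note that depth is also advanced inside the inner for loop in the original code; counting one level per pass over the frontier, as above, is what makes the depth limit meaningful.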
