
How To Make A Web Crawler More Efficient?

Here is the code:

str_regex = r'(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?(work|about|contact)?/?([a-zA-Z-]+)*)?/?'

import urllib.request
from Stacks import Stack
import re
import functools
import operator as op
from nary_tree import *

url = 'http://www.activeingredients.com/'
s = set()
List = []
url_list = []

def f_go(List, s, url):
    try:
        if url in s:          # skip URLs that were already visited
            return
        s.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read()
            #print(url)
        h = html.decode("utf-8")

        # parse the page into a tree and flatten it back into one string
        lst0 = prepare_expression(list(h))
        ntr = buildNaryParseTree(lst0)
        lst2 = nary_tree_tolist(ntr)
        lst3 = functools.reduce(op.add, lst2, [])
        str2 = ''.join(lst3)
        List.append(str2)

        # extract every in-domain link from the raw HTML
        f1 = re.finditer(str_regex, h)
        l1 = []
        for tok in f1:
            ind1 = tok.span()
            l1.append(h[ind1[0]:ind1[1]])

        # recurse into each extracted link, skipping image files (.jpg / .png)
        for exp in l1:
            if exp.endswith('jpg') or exp.endswith('png'):
                continue
            f_go(List, s, exp)
    except Exception:
        return

Basically, using urllib.request.urlopen, it opens URLs recursively in a loop, staying within a given domain (in this case activeingredients.com); link extraction from a page is done with a regular expression. Having opened a page, it parses it and appends it to a list as a string. So what this is supposed to do is go through the given domain, extract the information (meaningful text, in this case), and add it to a list. The try/except block just returns in the case of any HTTP error (and every other error too, but this is tested and working).
It works, for example, for this small page, but for bigger ones it is extremely slow and eats memory.
The parsing and page preparation more or less do the right job, I believe.
The question is: is there an efficient way to do this? How do web search engines crawl the network so fast?
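
For reference, the same traversal can be written iteratively with a queue instead of recursion, which avoids deep call stacks and lost frames between pages. The following is only a minimal sketch, assuming the same str_regex as above and an added 10-second timeout; the tree-parsing step from the original code is omitted and the raw HTML is collected instead.

# Minimal iterative sketch (assumptions: str_regex as above, 10 s timeout,
# raw HTML collected in place of the original tree-parsing step).
import re
import urllib.request
from collections import deque

str_regex = r'(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?(work|about|contact)?/?([a-zA-Z-]+)*)?/?'

def crawl(start_url):
    seen = set()
    texts = []                      # collected page contents
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in seen or url.endswith(('.jpg', '.png')):
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                # skip pages that fail to load
        texts.append(html)          # the original tree-parsing step would go here
        queue.extend(m.group(0) for m in re.finditer(str_regex, html))
    return texts

Using a deque gives breadth-first order, so pages closest to the start URL are collected first, and the crawl can be capped simply by checking len(texts) inside the loop.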

First: I don't think Google's web crawler is running on one laptop or one PC, so don't worry if you can't get results like the big companies do.

Points to consider:

  1. You could start with a big list of words, which you can download from many websites. That rules out some useless URL combinations. After that you could crawl plain letter combinations as well, to get oddly named sites into your index.

  2. You could start with a list of all domains registered with DNS servers, i.e. something like this: http://www.registered-domains-list.com

  3. Use multiple threads (see the sketch at the end of this answer)

  4. Have plenty of bandwidth

  5. Consider buying Google's Data-Center

These points are just rough suggestions to give you a basic idea of how you could improve your crawler.
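
As a rough illustration of point 3, here is a minimal sketch of a threaded crawl using only the standard library; the fetch helper, the LINK_RE pattern, the timeout, and the max_pages/workers limits are illustrative assumptions, not part of the original code. Each round submits the whole frontier to a thread pool, so network waits overlap instead of happening one after another.

# Minimal threaded-crawl sketch (standard library only; fetch, LINK_RE,
# timeout, max_pages and workers are illustrative assumptions).
import re
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

LINK_RE = re.compile(r'href="(https?://[^"]+)"')

def fetch(url):
    """Download one page and return (url, html), or (url, None) on error."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return url, response.read().decode("utf-8", errors="replace")
    except Exception:
        return url, None

def crawl(start_url, max_pages=100, workers=8):
    seen = {start_url}
    frontier = [start_url]
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(pages) < max_pages:
            # fetch the whole frontier concurrently
            futures = [pool.submit(fetch, u) for u in frontier]
            frontier = []
            for fut in as_completed(futures):
                url, html = fut.result()
                if html is None:
                    continue
                pages[url] = html
                # domain filtering (as in str_regex above) is omitted for brevity
                for link in LINK_RE.findall(html):
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return pages

if __name__ == "__main__":
    result = crawl("http://www.activeingredients.com/", max_pages=20)
    print(len(result), "pages fetched")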
