
Python beautifulsoup performance, IDE, poolsize, multiprocessing requests.get for web-crawler

Fine fellows of Stack Overflow!

New to Python (second day) and Beautiful Soup, and sadly in a bit of a hurry.

I've built a crawler that takes street names from a file and feeds them into a search engine (merinfo_url). Companies that meet the right conditions are then scraped further and exported.

I'm in a "hurry" because, despite the code being a complete debug mess, everything is working! I'm itching to start a long debug test on a remote computer today. I stopped at 5000 hits.

But performance is slow. I understand I could change the parser to lxml and open my local file only once.
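(For reference, the parser change I mean is just swapping the parser argument when building the soup; a sketch, not my actual code, assuming lxml is installed:)

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.merinfo.se')   # any response works for the comparison
soup = BeautifulSoup(resp.content, 'lxml')      # instead of 'html.parser'; needs: pip install lxml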

I hope to implement that today.

Multiprocessing, however, confuses me. What's my best option: a pool, or opening several connections? Am I using two terms for the same thing?

How large should the pool be? Two per thread seems to be frequent advice, but I've seen a hundred on a local machine. Any general rule?

If I change nothing in my current code, where do I implement the pool, and how do you generally do it for the requests object?
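To make the question concrete, the pattern I picture is roughly the sketch below (my own guesswork, not tested: the thread pool from multiprocessing.dummy, the pool size of 8 and the example URLs are all placeholders):

from multiprocessing.dummy import Pool  # thread pool; the work is network I/O, not CPU
import requests

def fetch(url):
    # one search request per street name, like crawla_merinfo does today
    return requests.get(url, timeout=60)

# placeholder URLs; in the real script these come from the street-name file
urls = ['https://www.merinfo.se/search?where=Storgatan',
        'https://www.merinfo.se/search?where=Kungsgatan']

pool = Pool(8)  # pool size picked arbitrarily for the sketch
responses = pool.map(fetch, urls)
pool.close()
pool.join()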

Finally: in terms of performance, off the top of your heads, what is a well-performing IDE for debugging a crawler running on a local machine?

Many thanks for any feedback offered!

Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import json
import requests
import urllib
from bs4 import BeautifulSoup
from time import sleep
from sys import exit
from multiprocessing import Pool
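# NOTE: the configuration used below (antal_anstaellda, omsaettning, save_file,
# streetname_datfile, merinfobaseurl, searchurl, ajaxurl, proxies, merinfo404text
# and the numberOfRequestsCounter global) is not included in this excerpt.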


def main():
  if __debug__:
    print ("[debug] Instaellningar:")
    print ("[debug]    Antal anstaellda: "+(antal_anstaellda))
    print ("[debug]    Omsaettning: "+(omsaettning))

  with open(save_file, 'wb') as filedescriptor:
    filedescriptor.write('companyName,companySSN,companyAddressNo,companyZipCity,phoneNumber,phoneProvider,phoneNumberType\n')

  lines = [line.rstrip('\n') for line in open(streetname_datfile)]
  for adresssokparameter in lines:

    searchparams = { 'emp': antal_anstaellda, 'rev': omsaettning, 'd': 'c', 'who': '', 'where': adresssokparameter, 'bf': '1' }
    sokparametrar = urllib.urlencode(searchparams)


    merinfo_url = merinfobaseurl+searchurl+sokparametrar

    if __debug__:
      print ("[debug] Antal requests gjorda till merinfo.se: "+str(numberOfRequestsCounter))
    crawla_merinfo(merinfo_url)

# Crawler
def crawla_merinfo(url):

  if __debug__:
    print ("[debug] crawl url: "+url)
  global numberOfRequestsCounter
  numberOfRequestsCounter += 1
  merinfosearchresponse = requests.get(url, proxies=proxies)
  if merinfosearchresponse.status_code == 429:
    print ("[!] For manga sokningar, avslutar")
    exit(1)
  merinfosoup = BeautifulSoup(merinfosearchresponse.content, 'html.parser')
  notfound = merinfosoup.find(string=merinfo404text)
  if notfound == u"Din sokning gav tyvaerr ingen traeff. Prova att formulera om din sokning.":
    if __debug__:
      print ("[debug] [!] " + merinfo404text)
    return
  for merinfocompanycontent in merinfosoup.find_all('div', attrs={'class': 'result-company'}):
    phonelink = merinfocompanycontent.find('a', attrs={'class': 'phone'})
    if phonelink is None:
      # No phone number for this company; skip it and check the next result
      if __debug__:
        print ("[!] Inget telefonnummer for foretaget")
      continue
    else:
      companywithphonenolink = merinfobaseurl+phonelink['href']
      thiscompanyphonenodict = crawla_merinfo_telefonnummer(companywithphonenolink)
      companyName = merinfocompanycontent.find('h2', attrs={'class': 'name'}).find('a').string
      companySSN = merinfocompanycontent.find('p', attrs={'class': 'ssn'}).string
      companyAddress = merinfocompanycontent.find('p', attrs={'class': 'address'}).text
      splitAddress = companyAddress.splitlines()
      addressStreetNo = splitAddress[0]
      addressZipCity = splitAddress[1]

      if __debug__:
        print ("[debug] [*] Foretaget '"+companyName.encode('utf-8')+("' har telefonnummer..."))
      for companyPhoneNumber in thiscompanyphonenodict.iterkeys():
        companyRow = companyName+","+companySSN+","+addressStreetNo+","+addressZipCity+","+thiscompanyphonenodict[companyPhoneNumber]
        if __debug__:
          print ("[debug] ::: "+thiscompanyphonenodict[companyPhoneNumber])
        with open(save_file, 'a') as filedescriptor:
          filedescriptor.write(companyRow.encode('utf-8')+'\n')
  return

# Telephone crawl function
def crawla_merinfo_telefonnummer(url):
  global numberOfRequestsCounter
  numberOfRequestsCounter += 1
  if __debug__:
    print ("[debug] crawl telephone url: "+url)
  phonenoDict = {}
  s = requests.session()
  merinfophonenoresponse = s.get(url, timeout=60)
  merinfophonenosoup = BeautifulSoup(merinfophonenoresponse.content, 'html.parser')
  merinfotokeninfo = merinfophonenosoup.find('meta', attrs={'name': '_token'})
  headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063',
    'X-Requested-With': 'XMLHttpRequest',
    'Host': 'www.merinfo.se',
    'Referer': url
  }
  headers['X-CSRF-TOKEN'] = merinfotokeninfo['content']
  headers['Cookie'] = 'merinfo_session='+s.cookies['merinfo_session']+';'

  merinfophonetable = merinfophonenosoup.find('table', id='phonetable')
  i = 0
  for merinfophonenoentry in merinfophonetable.find_all('tr', id=True):
    i += 1
    phoneNumberID = merinfophonenoentry['id']
    phoneNumberPhoneNo = merinfophonenoentry['number']

    for phoneNumberColumn in merinfophonenoentry.find_all('td', attrs={'class':'col-xs-2'}):
      phoneNumberType = phoneNumberColumn.next_element.string.replace(",",";")
      phoneNumberType = phoneNumberType.rstrip('\n').lstrip('\n')

    payload = {
      'id': phoneNumberID,
      'phonenumber': phoneNumberPhoneNo
    }
    r = s.post(ajaxurl, data=payload, headers=headers)
    numberOfRequestsCounter += 1
    if r.status_code != 200:
      print ("[!] Error, response not HTTP 200 while querying AJAX carrier info.")
      exit(1)
    else:
      carrierResponseDict = json.loads(r.text)
      # print carrierResponseDict['operator']
      phoneNoString = phoneNumberPhoneNo+','+carrierResponseDict['operator']+','+phoneNumberType
      phonenoDict['companyPhoneNo'+str(i)] = phoneNoString
  return phonenoDict

# Start main program
main()

You should start to use Scrapy.

One of the main advantages of Scrapy: requests are scheduled and processed asynchronously.

This means that Scrapy doesn't need to wait for a request to be finished and processed; it can send another request or do other things in the meantime. It also means that other requests can keep going even if one request fails or an error occurs while handling it.
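A minimal spider sketch to show the shape of this (the file name, URL pattern and CSS selectors are illustrative guesses based on the question's code, not a drop-in replacement):

import scrapy


class MerinfoSpider(scrapy.Spider):
    name = 'merinfo'
    # how many requests Scrapy keeps in flight at once (the default is 16)
    custom_settings = {'CONCURRENT_REQUESTS': 8}

    def start_requests(self):
        # hypothetical: one search URL per street name, as main() does in the question
        with open('streetnames.dat') as f:
            for line in f:
                url = 'https://www.merinfo.se/search?where=' + line.strip()
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # responses arrive as requests complete; no waiting on one request at a time
        for company in response.css('div.result-company'):
            yield {
                'name': company.css('h2.name a::text').get(),
                'ssn': company.css('p.ssn::text').get(),
            }

Run it with scrapy runspider merinfo_spider.py -o companies.csv and tune CONCURRENT_REQUESTS and DOWNLOAD_DELAY in the settings instead of managing a pool of requests.get calls yourself.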
