
Why does my web scraper only work half the time?

My goal is to get the product name and price of every Amazon page detected on any website I feed to my program.

My input is a text file containing five websites. On each of those websites, between 5 and 15 Amazon links can be found in total.
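For concreteness, the file is read one URL per line (see the read loop below), so it might look like this; the URLs here are made up purely for illustration:

https://www.example-deals-blog.com/best-kitchen-gear
https://www.example-review-site.com/gift-guide
https://www.example-magazine.com/home-upgrades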

My code is this:

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
import requests
import re
from bs4 import BeautifulSoup
from collections import OrderedDict
from time import sleep
import time
from lxml import html
import json
from urllib2 import Request, urlopen, HTTPError, URLError

def isdead(url):
    user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
    headers = { 'User-Agent':user_agent }
    req = Request(url, headers = headers)
    sleep(10)
    try:
        page_open = urlopen(req)
    except HTTPError, e:
        return e.code #404 if link is broken
    except URLError, e:
        return e.reason
    else:
        return False

def check(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    page = requests.get(url, headers = headers)

    doc = html.fromstring(page.content)
    XPATH_AVAILABILITY = '//div[@id ="availability"]//text()'
    RAw_AVAILABILITY = doc.xpath(XPATH_AVAILABILITY)
    AVAILABILITY = ''.join(RAw_AVAILABILITY).strip()
    #re.... is a list. if empty, available. if not, unavailable.
    #return re.findall(r'Available from',AVAILABILITY[:30], re.IGNORECASE)

    if len(re.findall(r'unavailable',AVAILABILITY[:30],re.IGNORECASE)) == 1:
        return "unavailable"
    else:
        return "available"


file_name = raw_input("Enter file name: ")
filepath = "%s"%(file_name)

with open(filepath) as f:
    listoflinks = [line.rstrip('\n') for line in f]

all_links = []
for i in listoflinks:
    htmls = req.get(i)
    doc = SimplifiedDoc(htmls)
    amazon_links = doc.getElements('a')
    amazon_links = amazon_links.containsOr(['https://www.amazon.com/','https://amzn.to/'],attr='href')
    for a in amazon_links:
        if a.href not in all_links:
            all_links.append(a.href)

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

all_links = [x for x in all_links if "amazon.com/gp/prime" not in x]
all_links = [y for y in all_links if "amazon.com/product-reviews" not in y]
for i in all_links:
    print "LINK:"
    print i
    response = requests.get(i, headers=headers)
    soup = BeautifulSoup(response.content, features="lxml")

    if isdead(i) == 404:
        print "DOES NOT EXIST"
        print "/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/"
        pass
    else:
        title = soup.select("#productTitle")[0].get_text().strip()
        if check(i) == "unavailable":
            price = "UNAVAILABLE"
        else:
            if (len(soup.select("#priceblock_ourprice")) == 0) and (len(soup.select("#priceblock_saleprice")) == 0):
                price = soup.select("#a-offscreen")
            elif len(soup.select("#priceblock_ourprice")) == 0:
                price = soup.select("#priceblock_saleprice")
            else:
                price = soup.select("#priceblock_ourprice")

        print "TITLE:%s"%(title)
        print "PRICE:%s"%(price)
        print "/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/"

print "..............................................."
print "FINALLY..."
print "# OF LINKS RETRIEVED:"
print len(all_links)

Whenever it works correctly, the output looks like this (please don't judge the PRICE output; I have spent a lot of time trying to fix it, but nothing worked, because I could not convert it to a string and get_text() doesn't work. This project is only for personal use, so it isn't that important, but if you have suggestions I would gladly take them):

LINK:
https://www.amazon.com/dp/B007Y6LLTM/ref=as_li_ss_tl?ie=UTF8&linkCode=ll1&tag=lunagtkf1-20&linkId=ee8c5299508af57c815ea6577ede4244
TITLE:Moen 7594ESRS Arbor Motionsense Two-Sensor Touchless One-Handle Pulldown Kitchen Faucet Featuring Power Clean, Spot Resist Stainless
PRICE:[<span class="a-size-medium a-color-price priceBlockBuyingPriceString" id="priceblock_ourprice">$359.99</span>]
/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/

... and so on. The error looks like this:

Traceback (most recent call last):
  File "name.py", line 75, in <module>
    title = soup.select("#productTitle")[0].get_text().strip()
IndexError: list index out of range

This is so strange: the same text file has been fed in many times, and sometimes all the websites are scraped just fine, but other times the error appears at the 10th Amazon product, and sometimes at the 1st...

I suspect this is a bot-detection problem, but I am sending a header. What is the problem?
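For what it's worth, the PRICE output above prints as a bracketed list because soup.select() returns a list of Tag objects; get_text() has to be called on one element of that list, not on the list itself. A minimal sketch, assuming the price block exists on the page:

price_tags = soup.select("#priceblock_ourprice")
if price_tags:
    # get_text() works on a single Tag, not on the list select() returns
    price = price_tags[0].get_text().strip()  # e.g. "$359.99"
else:
    price = "UNAVAILABLE"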

Your code is too messy. I have tidied it up for you; please check whether it works.

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
import requests

file_name = raw_input("Enter file name: ")
filepath = "%s"%(file_name)

with open(filepath) as f:
    listoflinks = [line.rstrip('\n') for line in f]

all_links = []
for i in listoflinks:
    htmls = req.get(i)
    doc = SimplifiedDoc(htmls)
    amazon_links = doc.getElements('a')
    amazon_links = amazon_links.containsOr(['https://www.amazon.com/','https://amzn.to/'],attr='href')
    amazon_links = amazon_links.notContains(['amazon.com/gp/prime','amazon.com/product-reviews'],attr='href')
    for a in amazon_links:
        if a.href not in all_links:
            all_links.append(a.href)

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

for i in all_links:
    print "LINK:"
    print i
    response = requests.get(i, headers=headers)
    if response.status_code == 404:
        print "DOES NOT EXIST"
        print "/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/"
        pass
    else:
        html = response.text
        doc = SimplifiedDoc(html)
        title = doc.getElementByID("productTitle").text
        if doc.getElementByID('availability') and doc.getElementByID('availability').text.find('unavailable')>0:
            price = "UNAVAILABLE"
        else:
            if doc.getElementByID("priceblock_ourprice"):
                price = doc.getElementByID("priceblock_ourprice").text
            elif doc.getElementByID("priceblock_saleprice"):
                price = doc.getElementByID("priceblock_saleprice").text
            else:
                price = doc.getElementByID("a-offscreen").text

        print "TITLE:%s"%(title)
        print "PRICE:%s"%(price)
        print "/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/"

print "..............................................."
print "FINALLY..."
print "# OF LINKS RETRIEVED:"
print len(all_links)
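One caveat with this version: doc.getElementByID("productTitle") can still come back empty when Amazon serves something other than the product page (for example a robot-check interstitial), which would reproduce the same intermittent failure. A defensive sketch to place inside the for loop before the title line, assuming getElementByID returns None for a missing id, as the availability check above already relies on:

title_el = doc.getElementByID("productTitle")
if not title_el:
    # Probably not a real product page (e.g. a captcha/robot check); skip it
    print "SKIPPED: no #productTitle on this page"
    print "/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/"
    continue
title = title_el.text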

You could take this further :) Here is an example that uses the spider framework. There are more simplified_scrapy examples here.

If you need any help, please let me know.

from simplified_scrapy.spider import Spider, SimplifiedDoc
class MySpider(Spider):
  name = 'amazon-product'
  # allowed_domains = ['example.com']
  start_urls = []
  refresh_urls = True # For debugging. If refresh_urls = True, start_urls will be crawled again.

  filepath='' # Your file path
  if filepath:
    with open(filepath) as f:
      start_urls = [line.rstrip('\n') for line in f]

  def extract(self, url, html, models, modelNames):
    doc = SimplifiedDoc(html)
    amazon_links=None
    data = None
    if url['url'].find('https://www.amazon.com')>=0 or url['url'].find('https://amzn.to')>=0:
      title = doc.getElementByID("productTitle").text
      if doc.getElementByID('availability') and doc.getElementByID('availability').text.find('unavailable')>0:
        price = "UNAVAILABLE"
      else:
        if doc.getElementByID("priceblock_ourprice"):
          price = doc.getElementByID("priceblock_ourprice").text
        elif doc.getElementByID("priceblock_saleprice"):
          price = doc.getElementByID("priceblock_saleprice").text
        else:
          price = doc.getElementByID("a-offscreen").text

      data = [{"title":title,'price':price}] # Get target data
      print "TITLE:%s"%(title)
      print "PRICE:%s"%(price)
      print "/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/"
    else:
      amazon_links = doc.getElements('a')
      amazon_links = amazon_links.containsOr(['https://www.amazon.com/','https://amzn.to/'],attr='href')
      amazon_links = amazon_links.notContains(['amazon.com/gp/prime','amazon.com/product-reviews'],attr='href')
    return {"Urls": amazon_links, "Data": data} # Return data to framework

from simplified_scrapy.simplified_main import SimplifiedMain
SimplifiedMain.startThread(MySpider()) # Start crawling
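To run this against the same input as before, point filepath at your text file of seed sites before starting the crawl; the file name below is hypothetical:

filepath = 'websites.txt' # one seed URL per line, as in the original script

With filepath set, start_urls is filled at class-definition time, the framework crawls the seed pages first, and the Amazon links that extract() returns under "Urls" are scheduled next.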

