
Unable to fetch element using Scrapy

I have written a spider to scrape a few elements from a website, but I am unable to fetch some of them while others work fine. Please point me in the right direction.

Here is my spider code:

from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ScrapyScraper.items import ScrapyscraperItem

class ScrapyscraperSpider(CrawlSpider) :
    name = "rs"
    allowed_domains = ["mega.pk"]
    start_urls = ["http://www.mega.pk/mobiles/"]

    rules = (
        Rule(SgmlLinkExtractor(allow = ("http://www\.mega\.pk/mobiles_products/[0-9]+\/[a-zA-Z-0-9.]+",)), callback = 'parse_item', follow = True),
    )

    def parse_item(self, response) :
        sel = Selector(response)
        item = ScrapyscraperItem()

        item['Heading'] = sel.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/div[1]/h2/span/text()').extract()
        item['Content'] = sel.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/p/text()').extract()
        item['Price'] = sel.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/div[2]/div[1]/div[2]/span/text()').extract()
        item['WiFi'] = sel.xpath('//*[@id="laptop_detail"]/tbody/tr/td[contains(. ,"Wireless")]/text()').extract()

        return item

Now I am able to get Heading, Content and Price, but WiFi returns nothing. What confuses me completely is that the same XPath works in Chrome but not in Python (Scrapy).

I'm still learning myself, though I think I may see your problem.

I would imagine you are looking for the WiFi status, in which case you need the text of the span in the next element (the cell that follows the "Wireless" label):

import urllib2
import lxml.html as LH 

url = 'http://www.mega.pk/laptop_products/13242/Apple-MacBook-Pro-with-Retina-Display-Z0RG0000V.html'
response = urllib2.urlopen(url)
html = response.read()
doc=LH.fromstring(html)
heading = doc.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/div[1]/h2/span/text()')
content = doc.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/p/text()')
price = doc.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/div[2]/div[1]/div[2]/span/text()')
wifi_location = doc.xpath('//*[@id="laptop_detail"]//tr/td[contains(. ,"Wireless")]')[0]
wifi_status = wifi_location.getnext().find('span').text

I only checked a single page, but hopefully this helps. I am unsure why the XPath does not work. I will do more reading, but I often find that including tbody in the path does not work in this setting. I typically skip straight to td via //.

Edit

Found the reason: it looks like Chrome inserts tbody when it is not included in the original HTML. Scrapy parses the original HTML, without that inserted element, so an XPath that contains tbody matches nothing.
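To make the tbody point concrete, here is a minimal sketch. It is not taken from the real mega.pk page: the markup below is a made-up, well-formed stand-in for a table whose source has no tbody, and it uses the standard library's xml.etree.ElementTree instead of lxml or Scrapy so that it runs without extra packages. The path syntax is the limited XPath subset that ElementTree supports.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw page source (assumption, not the real
# mega.pk markup). Note that the table has no <tbody>, as in the original HTML.
RAW_HTML = """
<html><body>
<table id="laptop_detail">
  <tr><td>Wireless</td><td><span>Yes</span></td></tr>
  <tr><td>Bluetooth</td><td><span>No</span></td></tr>
</table>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Path as copied from the browser's DOM view, which includes tbody.
# The raw source has no tbody, so nothing matches.
with_tbody = root.findall(".//table[@id='laptop_detail']/tbody/tr/td")

# Path that skips to the cells with //, so it does not care about tbody.
without_tbody = root.findall(".//table[@id='laptop_detail']//td")

print(len(with_tbody))     # number of cells found via the tbody path
print(len(without_tbody))  # number of cells found via the // path
print([td.text for td in without_tbody if td.text])  # label cells only
```

The same idea should carry over to the spider: drop tbody from the expression (or use // between the table and tr) so the XPath matches the HTML that Scrapy actually downloads rather than the DOM that Chrome builds.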

See also: Extracting lxml xpath for html table
