Scrapy Spider cannot Extract contents of web page using xpath

I have a Scrapy spider and I am using XPath selectors to extract the contents of the page. Kindly check where I am going wrong:

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.selector import HtmlXPathSelector
from medicalproject.items import MedicalprojectItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector 
from scrapy import Request  


class MySpider(CrawlSpider):
      name = "medical"
      allowed_domains = ["yananow.org"]
      start_urls = ["http://yananow.org/query_stories.php"]

rules = (
    Rule(SgmlLinkExtractor(allow=[r'display_story.php\?\id\=\d+']),callback='parse_page',follow=True),       
    )

def parse_items(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.xpath('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr/td')
    items = []
    for title in titles:
        item = MedicalprojectItem()
        item["patient_name"] = title.xpath("/html/body/div/table/tbody/tr[2]/td/table/tbody/tr/td/img[1]/text()").extract()
        item["stories"] = title.xpath("/html/body/div/table/tbody/tr[2]/td/table/tbody/tr/td/div/font/p/text()").extract()
        items.append(item)
    return(items) 

There are a lot of issues with your code, so here is a different approach.

I opted against a CrawlSpider to have more control over the scraping process, especially for grabbing the name from the query page and the story from a detail page.
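
For reference, if you do want to stay with a CrawlSpider, a minimal, untested sketch of that route could look like the following (it assumes the current scrapy.linkextractors module in place of the deprecated SgmlLinkExtractor; note that the callback string has to match the method name, and that the link regex should not escape id or =):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MedicalCrawlSpider(CrawlSpider):
    name = 'medical_crawl'
    allowed_domains = ['yananow.org']
    start_urls = ['http://yananow.org/query_stories.php']

    rules = (
        # the callback string must match the method name below,
        # and the regex should not escape 'id' or '='
        Rule(LinkExtractor(allow=r'display_story\.php\?id=\d+'),
             callback='parse_story', follow=False),
    )

    def parse_story(self, response):
        # only the story is easy to get here; the name sits on the query page,
        # which is the main reason for the two-step approach shown below
        yield {'story': ' '.join(response.xpath('//font[@size=3]//text()').extract())}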

I tried to simplify the XPath statements by not diving into the (nested) table structures but instead looking for patterns of content. So if you want to extract a story, there must be a link to that story.
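
A quick way to see why the absolute paths fail: the /tbody/ steps usually come from copying the XPath out of browser dev tools, but the browser inserts <tbody> elements that the raw HTML served to Scrapy may not contain. A small standalone illustration (with made-up markup, just to show the idea):

from scrapy.selector import Selector

# hypothetical, simplified markup standing in for the query page
html = '''
<table>
  <tr><td>
    <a href="display_story.php?id=1">Jane</a>
    <a href="display_story.php?id=2">John</a>
  </td></tr>
</table>
'''
sel = Selector(text=html)

# absolute path copied from dev tools: the <tbody> step matches nothing
print(sel.xpath('/html/body/table/tbody/tr/td/a/text()').extract())        # []

# content-based pattern: finds the links regardless of the nesting
print(sel.xpath('//a[contains(@href,"display_story")]/text()').extract())  # ['Jane', 'John']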

Here is the tested code (with comments):

# -*- coding: utf-8 -*-
import scrapy

class MyItem(scrapy.Item):
    name = scrapy.Field()
    story = scrapy.Field()

class MySpider(scrapy.Spider):

    name = 'medical'
    allowed_domains = ['yananow.org']
    start_urls = ['http://yananow.org/query_stories.php']

    def parse(self, response):

        rows = response.xpath('//a[contains(@href,"display_story")]')

        #loop over all links to stories
        for row in rows:
            myItem = MyItem() # Create a new item
            myItem['name'] = row.xpath('./text()').extract() # assign name from link
            story_url = response.urljoin(row.xpath('./@href').extract()[0]) # extract url from link
            request = scrapy.Request(url = story_url, callback = self.parse_detail) # create request for detail page with story
            request.meta['myItem'] = myItem # pass the item with the request
            yield request

    def parse_detail(self, response):
        myItem = response.meta['myItem'] # extract the item (with the name) from the response
        text_raw = response.xpath('//font[@size=3]//text()').extract() # extract the story (text)
        myItem['story'] = ' '.join(map(unicode.strip, text_raw)) # clean up the text and assign to item
        yield myItem # return the item
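
Note that the join relies on Python 2's unicode type; under Python 3 that name no longer exists, so parse_detail would need a small tweak, for example:

    def parse_detail(self, response):
        myItem = response.meta['myItem']
        text_raw = response.xpath('//font[@size=3]//text()').extract()
        # strip each text node explicitly instead of relying on unicode.strip
        myItem['story'] = ' '.join(part.strip() for part in text_raw)
        yield myItem

Either version can be run and exported in one step with: scrapy crawl medical -o stories.json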
