
Scraping HTML into CSV

I want to extract content such as side effects, warnings, and dosage from the site mentioned in the start URLs. The following is my code. The CSV file is getting created, but nothing is written to it. The output is:

before for
[] # it is displaying empty list
after for
This is my code:
from scrapy.selector import Selector
from medicinelist_sample.items import MedicinelistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MedSpider(CrawlSpider):
    name = "med"
    allowed_domains = ["medindia.net"]
    start_urls = ["http://www.medindia.net/doctors/drug_information/home.asp?alpha=z"]
    rules = [Rule(SgmlLinkExtractor(allow=('Zafirlukast.htm',)), callback="parse", follow=True),]

    global Selector

    def parse(self, response):
        hxs = Selector(response)
        fullDesc = hxs.xpath('//div[@class="report-content"]//b/text()')
        final = fullDesc.extract()
        print "before for"  # this is just to see if it was printing
        print final
        print "after for"   # this is just to see if it was printing

Your scrapy spider class's parse method should return item(s). With the current code, I do not see any items being returned. An example would be:

def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)

    sel = Selector(response)
    item = Item()  # assumes an Item subclass declaring id, name, description fields
    item['id'] = sel.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
    item['name'] = sel.xpath('//td[@id="item_name"]/text()').extract()
    item['description'] = sel.xpath('//td[@id="item_description"]/text()').extract()
    return item

For more information, take a look at the CrawlSpider example in the official scrapy docs.

Another problem in your code is that you are overriding CrawlSpider's parse method to implement callback logic. This must not be done with CrawlSpider, because the parse method is used internally by its crawling logic.

Ashish Nitin Patil has already noted this implicitly by naming his example function *parse_item*.

What the default implementation of CrawlSpider's parse method basically does is call the callbacks that you've specified in the rule definitions; so if you override it, your callbacks won't be called at all. See Scrapy Doc - crawling rules.
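To make the dispatch idea concrete, here is a simplified, scrapy-free toy model (the class and method names below are invented for illustration, not Scrapy's real internals): the base class's `parse` is what routes each response to the callbacks named in the rules, which is why overriding `parse` in a subclass silences them.

```python
class Rule:
    """Toy stand-in for a crawling rule: just records a callback name."""
    def __init__(self, callback):
        self.callback = callback

class MiniCrawlSpider:
    """Toy model of CrawlSpider: parse() dispatches responses to rule callbacks."""
    rules = ()

    def parse(self, response):
        results = []
        for rule in self.rules:
            callback = getattr(self, rule.callback)  # look up callback by name
            results.extend(callback(response))
        return results

class MySpider(MiniCrawlSpider):
    rules = (Rule(callback='parse_item'),)

    def parse_item(self, response):
        return ['item from %s' % response]

print(MySpider().parse('http://example.com/page'))
# ['item from http://example.com/page']
```

If `MySpider` defined its own `parse`, the dispatching base version would be shadowed and `parse_item` would never run, which mirrors the bug in the question.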

I have experimented a bit with the site that you are crawling. Since you would like to extract some data about the medicine (like the name, indications, contraindications, etc.) from the different pages on this domain: wouldn't the following or similar XPath expressions fit your needs? I think your current query would give you just the "headers", but the actual info on this site is in the text nodes that follow those bold-rendered headers.
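The header-vs-text-node distinction can be seen without Scrapy at all. This standard-library sketch (the HTML fragment is a made-up imitation of the page layout, not the real markup) shows that the `<b>` elements carry only labels, while the values live in the sibling text nodes after them:

```python
from html.parser import HTMLParser

# Hypothetical fragment mimicking the drug-info layout: each bold header
# is followed by a plain text node holding the actual value.
HTML = ('<div class="report-content">'
        '<b>Dosage :</b> 20 mg twice daily.'
        '<b>Contraindications :</b> Hypersensitivity.'
        '</div>')

class HeaderValueParser(HTMLParser):
    """Collect (header, following-text) pairs from <b> tags."""
    def __init__(self):
        super().__init__()
        self.in_b = False
        self.pairs = []  # list of [header, value]

    def handle_starttag(self, tag, attrs):
        if tag == 'b':
            self.in_b = True

    def handle_endtag(self, tag):
        if tag == 'b':
            self.in_b = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_b:
            self.pairs.append([text, ''])   # a new header; value still pending
        elif self.pairs:
            self.pairs[-1][1] = text        # the text node after the header

parser = HeaderValueParser()
parser.feed(HTML)
print(parser.pairs)
# [['Dosage :', '20 mg twice daily.'], ['Contraindications :', 'Hypersensitivity.']]
```

Selecting only `//b/text()` corresponds to keeping just the first element of each pair; the `following-sibling::text()[1]` steps in the spider below grab the second.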

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from Test.items import TestItem

from scrapy.item import Item, Field

class Medicine(Item):
    name = Field()
    dosage = Field()
    indications = Field()
    contraindications = Field()
    warnings = Field()

class TestmedSpider(CrawlSpider):
    name = 'testmed'
    allowed_domains = ['medindia.net']
    start_urls = ['http://www.medindia.net/doctors/drug_information/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'Zafirlukast.htm'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        drug_info = Medicine()

        selector = Selector(response)
        name = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Generic Name')]//..//following-sibling::text()[1])''')
        dosage = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Dosage')]//..//following-sibling::text()[1])''')
        indications = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Why it is prescribed (Indications)')]//..//following-sibling::text()[1])''')
        contraindications = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Contraindications')]//..//following-sibling::text()[1])''')
        warnings = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Warnings and Precautions')]//..//following-sibling::text()[1])''')

        drug_info['name'] = name.extract()
        drug_info['dosage'] = dosage.extract()
        drug_info['indications'] = indications.extract()
        drug_info['contraindications'] = contraindications.extract()
        drug_info['warnings'] = warnings.extract()

        return drug_info

This would give you the following info:

>scrapy parse --spider=testmed --verbose -d 2 -c parse_item --logfile C:\Python27\Scripts\Test\Test\spiders\test.log http://www.medindia.net/doctors/drug_information/Zafirlukast.htm
>>> DEPTH LEVEL: 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'contraindications': [u'Hypersensitivity.'],
  'dosage': [u'Adult- The recommended dose is 20 mg twice daily.'],
  'indications': [u'This medication is an oral leukotriene receptor antagonist (
LTRA), prescribed for asthma. \xa0It blocks the action of certain natural substa
nces that cause swelling and tightening of the airways.'],
  'name': [u'\xa0Zafirlukast'],
  'warnings': [u'Caution should be exercised in patients with history of liver d
isease, mental problems, suicidal thoughts, any allergy, elderly, during pregnan
cy and breastfeeding.']}]
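Since the original goal was a CSV file: once `parse_item` returns items, Scrapy's feed exports write them out directly (e.g. `scrapy crawl testmed -o drugs.csv`) — the empty file in the question came from `parse` returning nothing. The mapping from item dicts to CSV rows is the same as in this plain-Python sketch (item values copied from the scraped output above):

```python
import csv
import io

# Items as the spider would return them (values from the output above).
items = [{
    'name': u'\xa0Zafirlukast',
    'dosage': u'Adult- The recommended dose is 20 mg twice daily.',
    'contraindications': u'Hypersensitivity.',
}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['name', 'dosage', 'contraindications'])
writer.writeheader()          # one header row from the field names
writer.writerows(items)       # one row per item
print(buf.getvalue())
```

Each declared `Field` on the item becomes a CSV column, which is why items with no populated fields produce a file with nothing in it.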

