
Web scraping using Scrapy: adding extra elements during the scraping process

I'm scraping a website, looking for paragraphs in a specific place on each of a large number of URLs. What I would like to do is record, for each URL I visit, that URL next to the scraped paragraph in a CSV file.

First I build a list of all the pages I want to scrape, using the website's search syntax to search for books by ISBN number. What I am currently yielding is a list of scraped paragraphs, just like I wanted. However, the scrape occasionally fails for a URL, so I can't simply concatenate the scraped paragraphs with my list of ISBNs after the fact, because the two lists don't line up perfectly.

I tried putting some code inside the dictionary that I yield, to no avail. Any ideas, or other Scrapy suggestions?

# Build the list of search URLs, one per ISBN
starts = []
for isbn in data:
    starts.append('https://www.********.com/search?q=' + isbn)

import scrapy
from scrapy.crawler import CrawlerProcess

class ESSpider(scrapy.Spider):
    name = "ESS"
    start_urls = starts

    def parse(self, response):
        for article in response.xpath('//html'):
            yield {
                'text': article.xpath('body/div[@class="content"]/div[@class="mainContentContainer "]/div[@class="mainContent "]/div[@class="mainContentFloat "]/div[@class="leftContainer"]/div[@id="topcol"]/div[@id="metacol"]/div[@id="descriptionContainer"]//span/text()').extract(),
            }

process = CrawlerProcess({
    'FEED_FORMAT': 'csv',            # write each yielded dict as a CSV row
    'FEED_URI': 'blurbs2.csv',       # output file
    'LOG_ENABLED': False,
    'ROBOTSTXT_OBEY': True,          # respect the site's robots.txt
    'USER_AGENT': ********,
    'AUTOTHROTTLE_ENABLED': True,    # back off automatically if the site slows down
    'HTTPCACHE_ENABLED': True,       # cache responses locally
    'DOWNLOAD_DELAY': 1              # wait one second between requests
})

process.crawl(ESSpider)
process.start()

If you want the URL as well, add response.url to the item you yield:

def parse(self, response):
    for article in response.xpath('//html'):
        item = {
            'text': article.xpath('body/div[@class="content"]/div[@class="mainContentContainer "]/div[@class="mainContent "]/div[@class="mainContentFloat "]/div[@class="leftContainer"]/div[@id="topcol"]/div[@id="metacol"]/div[@id="descriptionContainer"]//span/text()').extract(),
            'url': response.url,  # the URL this response was fetched from
        }
        yield item
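
Note that response.url is the URL the response was actually fetched from, so after a redirect it can differ from the search URL you started with. If what you really need is the ISBN each row came from, you can carry it on the request itself instead of recovering it from the URL afterwards. Below is a minimal sketch, assuming Scrapy 1.7+ (for cb_kwargs) and the same data list and XPath as in the question:

import scrapy

# Same descriptionContainer XPath as in the question, split for readability
DESCRIPTION_XPATH = (
    'body/div[@class="content"]/div[@class="mainContentContainer "]'
    '/div[@class="mainContent "]/div[@class="mainContentFloat "]'
    '/div[@class="leftContainer"]/div[@id="topcol"]/div[@id="metacol"]'
    '/div[@id="descriptionContainer"]//span/text()'
)

class ESSpider(scrapy.Spider):
    name = "ESS"

    def start_requests(self):
        for isbn in data:  # 'data' is the ISBN list from the question
            url = 'https://www.********.com/search?q=' + isbn
            # cb_kwargs passes the ISBN straight to the callback, so every
            # yielded row records which ISBN produced it, even if the
            # request was redirected along the way
            yield scrapy.Request(url, callback=self.parse, cb_kwargs={'isbn': isbn})

    def parse(self, response, isbn):
        for article in response.xpath('//html'):
            yield {
                'isbn': isbn,
                'url': response.url,
                'text': article.xpath(DESCRIPTION_XPATH).extract(),
            }

On Scrapy versions older than 1.7 the same idea works with request.meta: pass meta={'isbn': isbn} when building the request and read response.meta['isbn'] in the callback.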
