
Web scraping stock details from Business Insider using Scrapy

I'm trying to pull the 'Name', 'Latest Price', and '%' fields for each stock from the following site: https://markets.businessinsider.com/index/components/s&p_500

However, no data is scraped, even though I've confirmed that my XPaths for those fields work in the Chrome console.

For reference, I've been using this guide: https://realpython.com/web-scraping-with-scrapy-and-mongodb/


from scrapy.item import Item, Field

class InvestmentItem(Item):
    ticker = Field()
    name = Field()
    px = Field()
    pct = Field()


from scrapy import Spider
from scrapy.selector import Selector
from investment.items import InvestmentItem

class InvestmentSpider(Spider):
    name = "investment"
    allowed_domains = ["markets.businessinsider.com"]
    start_urls = [
        "https://markets.businessinsider.com/index/components/s&p_500",
    ]

    def parse(self, response):
        stocks = Selector(response).xpath('//*[@id="index-list-container"]/div[2]/table/tbody/tr')

        for stock in stocks:
            item = InvestmentItem()
            item['name'] = stock.xpath('td[1]/a/text()').extract()[0]
            item['px'] = stock.xpath('td[2]/text()[1]').extract()[0]
            item['pct'] = stock.xpath('td[5]/span[2]').extract()[0]

            yield item

Output from console:

2020-05-26 00:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/robots.txt> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/index/components/s&p_500> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-26 00:08:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Spider closed (finished)

You are missing the "./" at the beginning of the XPath expressions. I have also simplified your XPaths:

def parse(self, response):
    stocks = response.xpath('//table[@class="table table-small"]/tr')

    for stock in stocks[1:]:
        item = InvestmentItem()
        item['name'] = stock.xpath('./td[1]/a/text()').get()
        item['px'] = stock.xpath('./td[2]/text()[1]').get().strip()
        item['pct'] = stock.xpath('./td[5]/span[2]/text()').get()

        yield item
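
A likely reason the original row XPath matched in Chrome but not in Scrapy is that browsers insert a tbody element when they build the DOM, while the raw HTML that Scrapy downloads may not contain one; the simplified XPath above has no tbody step. The sketch below reproduces that mismatch with only the standard library's xml.etree.ElementTree. The HTML fragment is made up for illustration and is not the real page markup, and ElementTree is used here only so the example runs without Scrapy installed.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating raw HTML that has no <tbody> element.
# The real page markup may differ; this only illustrates the mismatch.
RAW_HTML = """
<div id="index-list-container">
  <table class="table table-small">
    <tr><th>Name</th><th>Latest Price</th></tr>
    <tr><td><a>3M</a></td><td>146.44</td></tr>
    <tr><td><a>AO Smith</a></td><td>42.22</td></tr>
  </table>
</div>
"""

root = ET.fromstring(RAW_HTML.strip())

# A path copied from the browser often contains tbody, which is absent here,
# so nothing matches.
rows_with_tbody = root.findall(".//table/tbody/tr")

# Dropping the tbody step matches the rows that are really in the markup.
rows_without_tbody = root.findall(".//table/tr")

# Skip the header row, as the answer does with stocks[1:].
names = [row.find("td/a").text for row in rows_without_tbody[1:]]

print(len(rows_with_tbody), len(rows_without_tbody), names)
```

If a spider logs a 200 response but yields no items, checking the row selector against the raw response (for example in scrapy shell) rather than the browser DOM is a quick way to spot this.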

XPath version

    def parse(self, response):

        rows = response.xpath('//*[@id="index-list-container"]/div[2]/table/tr')
        for row in rows:
            yield {
                'name' : row.xpath('td[1]/a/text()').extract(),
            }

CSS version

    def parse(self, response):

        table = response.css('div#index-list-container table.table-small') 
        rows = table.css('tr') 

        for row in rows:
            name = row.css("a::text").get()
            high_low = row.css('td:nth-child(2)::text').get()
            date_time = row.css('td:nth-child(7) span:nth-child(2) ::text').get()

            yield {
                'name': name,
                'high_low': high_low,
                'date_time': date_time,
            }


{"high_low": "\r\n146.44", "name": "3M", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n42.22", "name": "AO Smith", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n91.47", "name": "Abbott Laboratories", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n92.10", "name": "AbbVie", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n193.71", "name": "Accenture", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n73.08", "name": "Activision Blizzard", "date_time": "05/25/2020 08:00:00 PM UTC-0400"},
{"high_low": "\r\n385.26", "name": "Adobe", "date_time": "05/25/2020 08:00:00 PM UTC-0400"},
{"high_low": "\r\n133.48", "name": "Advance Auto Parts", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
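
The high_low values in the output above still carry the leading "\r\n" from the page markup. A small helper along these lines could normalize them before yielding. This is only a sketch: clean_price is a hypothetical name, and it assumes prices are plain decimal strings that may contain thousands separators.

```python
from typing import Optional


def clean_price(raw: Optional[str]) -> Optional[float]:
    """Turn a scraped cell such as '\\r\\n146.44' into a float, or None."""
    if raw is None:
        # .get() returns None when the selector matched nothing.
        return None
    # Remove surrounding whitespace and (assumed) thousands separators.
    text = raw.strip().replace(",", "")
    if not text:
        return None
    try:
        return float(text)
    except ValueError:
        # Not a number (for example a header cell); leave it empty.
        return None


print(clean_price("\r\n146.44"))
```

In the CSS version this could be applied as clean_price(row.css('td:nth-child(2)::text').get()), which also avoids an error on rows where the cell is missing.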

