
Scrapy start_urls from a txt file

I have around 100K URLs to scrape, so I want to read them from a txt file. Here is the code:

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess

class ConadstoresSpider(scrapy.Spider):
    name = 'conadstores'
    headers = {'user_agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    allowed_domains = ['conad.it']
    #start_urls = ['http://www.conad.it/ricerca-negozi/negozio.002781.html','https://www.conad.it/ricerca-negozi/negozio.006804.html']
    #start_urls = [l.strip() for l in open("/Users/macbook/PycharmProjects/conad/conad/conadlinks.txt").readlines()]
    #f = open("/Users/macbook/PycharmProjects/conad/conad/conadlinks.txt")
    #start_urls = [url.strip() for url in f.readlines()]
    #f.close()

    with open('/Users/macbook/PycharmProjects/conad/conad/conadlinks.txt') as file:
        start_urls = [line.strip() for line in file]


    def start_request(self):
        request = Request(url = self.start_urls, callback=self.parse)
        yield request

    def parse(self, response):
        yield {
            'address' : response.css('.address-oswald::text').extract(),
            'phone' : response.css('span.phone::text').extract(),

        }

But I keep getting this error:

2021-12-08 13:27:48 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/Users/macbook/PycharmProjects/conad/venv/lib/python3.9/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/Users/macbook/PycharmProjects/conad/conad/conad/middlewares.py", line 52, in process_start_requests
    for r in start_requests:
  File "/Users/macbook/PycharmProjects/conad/venv/lib/python3.9/site-packages/scrapy/spiders/__init__.py", line 83, in start_requests
    yield Request(url, dont_filter=True)
  File "/Users/macbook/PycharmProjects/conad/venv/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/Users/macbook/PycharmProjects/conad/venv/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 62, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: %7B%5Crtf1%5Cansi%5Cansicpg1252%5Ccocoartf2580

Any ideas? Thanks!

You can replace the start_urls logic by overriding the spider's start_requests() method.

Here is a simple way to extract your data:

import scrapy


class ConadstoresSpider(scrapy.Spider):
    name = 'conadstores'

    def start_requests(self):
        # Read the URL list, one URL per line (any other file-reading logic works too).
        # The "with" block closes the file even if the crawl stops early.
        with open("/Users/macbook/PycharmProjects/conad/conad/conadlinks.txt") as url_file:
            for line in url_file:
                url = line.strip()
                # Skip blank lines; an empty string is not a valid request URL.
                if url:
                    # Send a request to each URL; the response goes to parse() by default.
                    yield scrapy.Request(url)

    def parse(self, response, **kwargs):
        yield {
            'address': response.css('.address-oswald::text').extract(),
            'phone': response.css('span.phone::text').extract(),
        }

You can use different file-reading logic, but make sure that it returns a list of URLs.
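
One more thing worth checking is the file itself. Percent-decoding the "URL" from the error message gives an RTF header rather than a link, which suggests conadlinks.txt was saved as rich text (the cocoartf marker usually comes from a macOS editor such as TextEdit) and needs to be re-saved as plain text. The sketch below shows the decoding and one possible way to load the list defensively; decode_bad_url and load_urls are hypothetical helper names, not part of Scrapy, and the http/https filter is an assumption about what the file should contain.

```python
from urllib.parse import unquote, urlparse


def decode_bad_url(fragment):
    """Percent-decode the text Scrapy printed in the ValueError."""
    return unquote(fragment)


def load_urls(path):
    """Return the http(s) URLs found in a plain-text file, one per line.

    Hypothetical helper: raises ValueError if the file looks like RTF,
    and silently skips blank lines and lines without an http(s) scheme.
    """
    urls = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            url = line.strip()
            if not url:
                continue  # blank line
            if url.startswith("{\\rtf"):
                raise ValueError(f"{path} looks like an RTF document, not plain text")
            if urlparse(url).scheme in ("http", "https"):
                urls.append(url)
    return urls


# The value from the traceback decodes to an RTF header, not a URL.
print(decode_bad_url("%7B%5Crtf1%5Cansi%5Cansicpg1252%5Ccocoartf2580"))
# prints: {\rtf1\ansi\ansicpg1252\cocoartf2580
```

Inside start_requests() you could then loop over load_urls(...) and yield a scrapy.Request for each entry, so a wrongly saved file fails with a clear message instead of a "Missing scheme" error.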
