
Pass Scrapy Spider a list of URLs to crawl via .txt file

I'm a little new to Python and very new to Scrapy.

I've set up a spider to crawl and extract all the information I need. However, I need to pass a .txt file of URLs to the start_urls variable.

For example:

class LinkChecker(BaseSpider):
    name = 'linkchecker'
    start_urls = [] # Here I want to populate the list of URLs to crawl from a text file I pass via the command line.

I've done a little bit of research and keep coming up empty-handed. I've seen this type of example ( How to pass a user defined argument in scrapy spider ), but I don't think that will work for passing a text file.

Run your spider with the -a option, like:

scrapy crawl myspider -a filename=text.txt

Then read the file in the __init__ method of the spider and define start_urls:

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, filename=None, *args, **kwargs):
        # forward any other arguments so Scrapy can still initialize the spider
        super(MySpider, self).__init__(*args, **kwargs)
        if filename:
            with open(filename, 'r') as f:
                # note: readlines() keeps the trailing newline on each URL
                self.start_urls = f.readlines()

Hope that helps.
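If you want the newline handling done in one place, a small helper keeps __init__ tidy. This is a minimal sketch; load_start_urls is a name chosen here for illustration, not a Scrapy API:

```python
def load_start_urls(path):
    """Read one URL per line, dropping surrounding whitespace and blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

Inside the spider you would then write `self.start_urls = load_start_urls(filename)`.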

You could simply read in the .txt file:

with open('your_file.txt') as f:
    start_urls = f.readlines()

If you end up with trailing newline characters, try:

with open('your_file.txt') as f:
    start_urls = [url.strip() for url in f.readlines()]

Hope this helps
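As an aside, str.splitlines drops the line terminators in one step, so it can replace the strip() comprehension when the lines carry no other stray whitespace (shown on an inline string for brevity):

```python
text = "http://example.com/a\nhttp://example.com/b\n"
# splitlines() removes the '\n' terminators but not other surrounding whitespace
start_urls = text.splitlines()
print(start_urls)  # → ['http://example.com/a', 'http://example.com/b']
```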

If your URLs are separated by newlines:

def get_urls(filename):
    # split() breaks on any whitespace, so one URL per line works fine
    with open(filename) as f:
        return f.read().split()

then these lines of code will give you the URLs.

class MySpider(scrapy.Spider):
    name = 'nameofspider'

    def __init__(self, filename=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if filename:
            # open the file passed on the command line, not a hard-coded path
            with open(filename) as f:
                self.start_urls = [url.strip() for url in f.readlines()]

This will be your code. It will pick up the URLs from the .txt file as long as there is one URL per line.
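A quick sanity check of the get_urls helper from the answer above, using a temporary file purely for illustration:

```python
import tempfile

def get_urls(filename):
    # same helper as above: whitespace-separated URLs, one per line
    with open(filename) as f:
        return f.read().split()

# write two URLs to a throwaway file and read them back
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("http://example.com/1\nhttp://example.com/2\n")

print(get_urls(tmp.name))  # → ['http://example.com/1', 'http://example.com/2']
```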

After this, run the command:

scrapy crawl nameofspider -a filename=filename.txt

Let's say your file is named 'file.txt'; then run:

scrapy crawl nameofspider -a filename=file.txt
