
Crawl URLs stored in a CSV with Scrapy

I am trying to implement a Scrapy spider that reads a CSV file. The CSV file contains two columns, like the following:

1,google.com
2,microsoft.com
3,netflix.com
...

The spider should store the full HTML code of each site in a specified directory, and also insert each crawled URL plus the path to its stored HTML file into a JSON array file.
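
For example, the resulting JSON array file could look like this (the field names are only illustrative; they match the item fields used in the code below):

[
    {"url": "https://www.google.com/", "file_path": "non-xss/html/<sha1-of-url>"},
    {"url": "https://www.microsoft.com/", "file_path": "non-xss/html/<sha1-of-url>"}
]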

So far I have found the following solution:

import csv
from hashlib import sha1

import scrapy

# UmbrellaItem is defined in the project's items.py


class RankingSpider(scrapy.Spider):
    name = 'non-xss'
    start_urls = []

    custom_settings = {
        'CLOSESPIDER_ITEMCOUNT': '50000',  # stop the crawler after x items
        'FILES_STORE': 'non-xss/html/',
        'METAREFRESH_ENABLED': False
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Read the first 10,000 rows (rank, domain) and build the URL list.
        with open('/home/marcel/Desktop/crawl/top-1m.csv', 'r') as f:
            reader = csv.reader(f)
            for n, row in enumerate(reader):
                if n >= 10000:
                    break
                self.start_urls.append('https://www.' + row[1] + '/')

    def parse(self, response):
        item = UmbrellaItem()
        # Name each file after the SHA-1 of its URL so paths are unique.
        filename = sha1(response.url.encode()).hexdigest()
        with open(self.custom_settings['FILES_STORE'] + filename, 'wb') as f:
            f.write(response.body)
        item['url'] = response.url
        item['file_path'] = self.custom_settings['FILES_STORE'] + filename
        return item
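
Since parse() returns an item per page, the JSON array file can also come from Scrapy's built-in feed exports rather than custom code, e.g. (the output filename is arbitrary):

scrapy crawl non-xss -o output.json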

The solution does what I want it to do, but it stops after a couple of seconds and then stalls. I am guessing that I run into issues because of too many connections. I have also tried setting the following in the project's settings.py:

RETRY_TIMES = 0
CONCURRENT_REQUESTS = 32
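
For reference, a hedged sketch of other stability-related settings Scrapy supports (the values below are only illustrative):

RETRY_ENABLED = False          # similar in effect to RETRY_TIMES = 0
CONCURRENT_REQUESTS = 16
DOWNLOAD_TIMEOUT = 15          # give up on slow hosts instead of hanging
DNS_TIMEOUT = 10
AUTOTHROTTLE_ENABLED = True    # adapt concurrency to observed latency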

Does anyone have a more stable solution?

Thank you for any help you can give.

One way is with Scrapy itself.

Scrapy gives you direct control over the HTTP requests it sends via its Request class. Here is the documentation for that: https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request
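
For example, here is a minimal sketch that yields requests from start_requests() instead of pre-building start_urls (the spider name and CSV path are assumptions, and each CSV row is assumed to have exactly two columns):

import csv

import scrapy


class CsvRequestSpider(scrapy.Spider):
    name = 'csv-requests'  # hypothetical name

    def start_requests(self):
        # Yielding requests lazily lets the scheduler decide how many are
        # in flight at once, instead of front-loading a huge start_urls list.
        with open('top-1m.csv', newline='') as f:
            for rank, domain in csv.reader(f):
                url = 'https://www.' + domain + '/'
                yield scrapy.Request(url, callback=self.parse,
                                     errback=self.on_error)

    def parse(self, response):
        yield {'url': response.url}

    def on_error(self, failure):
        # Log unreachable domains instead of letting them hang the run.
        self.logger.warning('request failed: %s', failure.request.url)

The errback keeps failed domains from piling up silently, which is one common cause of a crawl that appears to stall.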

Another way is to use the requests library in Python. The docs for that can be found here: https://requests.readthedocs.io/en/master/
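
A hedged sketch of that approach, assuming the same two-column CSV, an existing html/ directory, and illustrative file names:

import csv
import hashlib
import json

import requests

results = []
with open('top-1m.csv', newline='') as f:
    for rank, domain in csv.reader(f):
        url = 'https://www.' + domain + '/'
        try:
            # A per-request timeout keeps one slow host from stalling everything.
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable domains
        path = 'html/' + hashlib.sha1(url.encode()).hexdigest()
        with open(path, 'wb') as out:
            out.write(resp.content)
        results.append({'url': url, 'file_path': path})

with open('results.json', 'w') as out:
    json.dump(results, out, indent=2)

Note that requests fetches URLs one at a time, so this trades Scrapy's concurrency for simplicity.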
