
Crawl URLs stored in a CSV with Scrapy

I am trying to implement a Scrapy spider that reads a CSV file. The CSV file contains two columns, like the following:

1,google.com
2,microsoft.com
3,netflix.com
...

The spider should store the full HTML code of each site in a specified directory, and also insert each crawled URL plus the path to its stored HTML file into a JSON array file.
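
For example, the resulting JSON array file could look like this (the field names are only illustrative; they match the item fields used in the code below):

[
    {"url": "https://www.google.com/", "file_path": "non-xss/html/<sha1-of-url>"},
    {"url": "https://www.microsoft.com/", "file_path": "non-xss/html/<sha1-of-url>"}
]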

So far I have found the following solution:

import csv
from hashlib import sha1

import scrapy

# UmbrellaItem is defined in the project's items.py


class RankingSpider(scrapy.Spider):
    name = 'non-xss'
    start_urls = []

    custom_settings = {
        'CLOSESPIDER_ITEMCOUNT': '50000',  # stop the crawler after x items
        'FILES_STORE': 'non-xss/html/',
        'METAREFRESH_ENABLED': False
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Read the first 10,000 rows (rank, domain) and build the URL list.
        with open('/home/marcel/Desktop/crawl/top-1m.csv', 'r') as f:
            reader = csv.reader(f)
            for n, row in enumerate(reader):
                if n >= 10000:
                    break
                self.start_urls.append('https://www.' + row[1] + '/')

    def parse(self, response):
        item = UmbrellaItem()
        # Name each file after the SHA-1 of its URL so paths are unique.
        filename = sha1(response.url.encode()).hexdigest()
        with open(self.custom_settings['FILES_STORE'] + filename, 'wb') as f:
            f.write(response.body)
        item['url'] = response.url
        item['file_path'] = self.custom_settings['FILES_STORE'] + filename
        return item
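
Since parse() returns an item per page, the JSON array file can also come from Scrapy's built-in feed exports rather than custom code, e.g. (the output filename is arbitrary):

scrapy crawl non-xss -o output.json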

The solution does what I want it to do, but it stops after a couple of seconds and then stalls. I am guessing that I run into issues because of too many connections. I have also tried setting the following in the project's settings.py:

RETRY_TIMES = 0
CONCURRENT_REQUESTS = 32
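
For reference, a hedged sketch of other stability-related settings Scrapy supports (the values below are only illustrative):

RETRY_ENABLED = False          # similar in effect to RETRY_TIMES = 0
CONCURRENT_REQUESTS = 16
DOWNLOAD_TIMEOUT = 15          # give up on slow hosts instead of hanging
DNS_TIMEOUT = 10
AUTOTHROTTLE_ENABLED = True    # adapt concurrency to observed latency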

Does anyone have a more stable solution?

Thank you for any help you can give.

One way is with Scrapy itself.

Scrapy gives you direct control over the HTTP requests it sends via its Request class. Here is the documentation for that: https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request
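
For example, here is a minimal sketch that yields requests from start_requests() instead of pre-building start_urls (the spider name and CSV path are assumptions, and each CSV row is assumed to have exactly two columns):

import csv

import scrapy


class CsvRequestSpider(scrapy.Spider):
    name = 'csv-requests'  # hypothetical name

    def start_requests(self):
        # Yielding requests lazily lets the scheduler decide how many are
        # in flight at once, instead of front-loading a huge start_urls list.
        with open('top-1m.csv', newline='') as f:
            for rank, domain in csv.reader(f):
                url = 'https://www.' + domain + '/'
                yield scrapy.Request(url, callback=self.parse,
                                     errback=self.on_error)

    def parse(self, response):
        yield {'url': response.url}

    def on_error(self, failure):
        # Log unreachable domains instead of letting them hang the run.
        self.logger.warning('request failed: %s', failure.request.url)

The errback keeps failed domains from piling up silently, which is one common cause of a crawl that appears to stall.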

Another way is to use the requests library in Python. The docs for that can be found here: https://requests.readthedocs.io/en/master/
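
A hedged sketch of that approach, assuming the same two-column CSV, an existing html/ directory, and illustrative file names:

import csv
import hashlib
import json

import requests

results = []
with open('top-1m.csv', newline='') as f:
    for rank, domain in csv.reader(f):
        url = 'https://www.' + domain + '/'
        try:
            # A per-request timeout keeps one slow host from stalling everything.
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable domains
        path = 'html/' + hashlib.sha1(url.encode()).hexdigest()
        with open(path, 'wb') as out:
            out.write(resp.content)
        results.append({'url': url, 'file_path': path})

with open('results.json', 'w') as out:
    json.dump(results, out, indent=2)

Note that requests fetches URLs one at a time, so this trades Scrapy's concurrency for simplicity.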
