
How to read start_urls from csv file in scrapy?

I have two spiders, A and B. A scrapes a bunch of URLs and writes them to a CSV file, and B scrapes those URLs by reading the CSV file that A generated. However, B throws a FileNotFoundError before A has actually created the file. How can I make B wait until A has produced the URLs? Any other solution would also be helpful.

WriteToCsv.py file

import csv


def write_to_csv(item):
    # Append one URL per row; newline='' prevents extra blank lines on Windows.
    with open('urls.csv', 'a', newline='') as csvfile:
        fieldnames = ['url']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writerow({'url': item})


class WriteToCsv(object):
    def process_item(self, item, spider):
        # Only write items that carry a (relative) URL.
        if item['url']:
            write_to_csv("http://pypi.org" + item["url"])
        return item

Pipelines.py file

ITEM_PIPELINES = {
    'PyPi.WriteToCsv.WriteToCsv': 100,
    'PyPi.pipelines.PypiPipeline': 300,
}

read_csv method

import csv


def read_csv():
    # Each row holds a single URL; join its cells into one string and skip empty rows.
    with open('urls.csv', 'r', newline='') as csv_file:
        reader = csv.reader(csv_file)
        return [''.join(row) for row in reader if row]

start_urls in B spider file

start_urls = read_csv() #Error here
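
The error surfaces on this line because Python evaluates a class-level assignment when the class body runs, that is, as soon as the module defining spider B is imported, and not when the spider starts crawling. The following is a minimal standard-library sketch of that difference; the class names EagerB and LazyB, the temporary path, and the example URL are hypothetical, and read_csv takes a path argument here only to keep the demo self-contained.

```python
import csv
import os
import tempfile


def read_csv(path):
    # Read a one-column CSV of URLs into a list of strings.
    with open(path, newline="") as csv_file:
        return ["".join(row) for row in csv.reader(csv_file) if row]


# A path that does not exist yet, standing in for urls.csv before spider A has run.
missing = os.path.join(tempfile.mkdtemp(), "urls.csv")

# Eager: the class body executes immediately, so the read fails at definition time.
try:
    class EagerB:
        start_urls = read_csv(missing)
    eager_failed = False
except FileNotFoundError:
    eager_failed = True


# Lazy: nothing is read until the method is actually called.
class LazyB:
    def load_start_urls(self):
        return read_csv(missing)


# Simulate spider A finishing and writing the file.
with open(missing, "w", newline="") as csv_file:
    csv.writer(csv_file).writerow(["http://pypi.org/project/example/"])

lazy_urls = LazyB().load_start_urls()
```

Deferring the read only helps if B is started after A has finished; it does not by itself make B wait for A.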

I would consider using a single spider with two methods, parse and final_parse. As far as I can tell from the context you have provided, there is no need to write the URLs to disk.

parse should contain the logic for scraping the URLs that spider A currently writes to the CSV file, and it should return a new request whose callback is the final_parse method.

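def parse(self, response):
    # do_something is a placeholder for the URL-extraction logic from spider A.
    # response.text is the decoded body (the preferred replacement for body_as_unicode()).
    url = do_something(response.text)
    return scrapy.Request(url, callback=self.final_parse)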

final_parse should then contain the parsing logic that was previously in spider B.

def final_parse(self, response):
    # do_something_else is a placeholder for the item-parsing logic from spider B.
    item = do_something_else(response.text)
    return item

Note: If you need to pass any additional information from parse to final_parse, you can use the meta argument of scrapy.Request.

If you do need the URLs, you could add the URL as a field on your item. It can be accessed with response.url.

