
Using a loop to set values in "start_urls" from a CSV

I have a list of titles, stored in a CSV, that I want to search for on a website.

I'm extracting those values and then trying to append each one to the search link assigned to "start_urls".

However, when I run the script, it only uses the last value in the list. Is there any particular reason why this happens?

class MySpider(CrawlSpider):
    name = "test"
    allowed_domains = ["example.com"]

    import pandas as pd
    df = pd.read_csv('test.csv')
    saved_column = df.ProductName
    for a in saved_column:
        start_urls = ["http://www.example.com/search?noOfResults=20&keyword=" + str(a)]

    def parse(self, response):
        pass

There is a conceptual error in your code: the loop reassigns "start_urls" on every iteration instead of accumulating the URLs, so by the time the spider starts, only the last value of the loop remains and the parse function is called with that one URL.
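A minimal fix is to build the whole list once instead of reassigning it inside the loop. A sketch, using an in-memory DataFrame as a stand-in for the 'test.csv' with a ProductName column from the question:

```python
import pandas as pd

# stand-in for pd.read_csv('test.csv'); assumes a ProductName column
df = pd.DataFrame({"ProductName": ["widget", "gadget"]})

# build the full list in one step instead of reassigning start_urls per iteration
start_urls = [
    "http://www.example.com/search?noOfResults=20&keyword=" + str(a)
    for a in df.ProductName
]
print(start_urls)
```

With this, "start_urls" holds one search URL per product name rather than just the last one.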

A possible other approach is to override the spider's 'start_requests' method:

import pandas as pd
from scrapy import Request

def start_requests(self):
    df = pd.read_csv('test.csv')
    saved_column = df.ProductName
    for a in saved_column:
        # build the search URL from each product name and yield a request for it
        url = "http://www.example.com/search?noOfResults=20&keyword=" + str(a)
        yield Request(url, self.parse)

The idea comes from here: How to generate the start_urls dynamically in crawling?
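One extra caveat (my addition, not part of the original answer): product names may contain spaces or special characters, so it's safer to URL-encode the keyword with urllib.parse.quote_plus before building the search URL:

```python
from urllib.parse import quote_plus

# hypothetical product name containing characters that need escaping
name = "red shirt & tie"
url = "http://www.example.com/search?noOfResults=20&keyword=" + quote_plus(name)
print(url)
# spaces become '+' and '&' becomes '%26', keeping the query string intact
```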
