How do I stop my Scrapy crawler from aggregating results exponentially?

I'm pretty new to Python and I'm trying to build a Scrapy crawler to extract distributor data from a site, but I'm not getting the results I expected. This is my code:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "final_url"

    def start_requests(self):
        urls = [
            "https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/dealerslist/almagro/2675585174/?countrySelectorCode=AR"
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        urls_ = []
        for item in response.css('div.row.m-dealer_list__row'):
            half_urls_ = item.css('div.m-dealer_list__addr a.link.trackingElement::attr(href)').getall()

            for half in half_urls_:
                urls_.append('https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/' + half)

                with open('sub_urls.txt', 'a') as doc:
                    doc.write(str(urls_))

I expected a link (href) to each distributor (5 in this case), from which I can extract the name, address, email, phone and site. Instead I get this confusing result:

['https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00077/almagro/colombo-fernando-javier/?countrySelectorCode=AR']
['https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00077/almagro/colombo-fernando-javier/?countrySelectorCode=AR', 
'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00417/almagro/easy-rivadavia-%28e164%29-cencosud/?countrySelectorCode=AR']
['https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00077/almagro/colombo-fernando-javier/?countrySelectorCode=AR', 
'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00417/almagro/easy-rivadavia-%28e164%29-cencosud/?countrySelectorCode=AR', 
'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00506/almagro/g-y-p-new-tree-s.a/?countrySelectorCode=AR']
['https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00077/almagro/colombo-fernando-javier/?countrySelectorCode=AR', 
'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00417/almagro/easy-rivadavia-%28e164%29-cencosud/?countrySelectorCode=AR', 
'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00506/almagro/g-y-p-new-tree-s.a/?countrySelectorCode=AR', 
'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00303/almagro/medrano-construcciones-s./?countrySelectorCode=AR']
['https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00077/almagro/colombo-fernando-javier/?countrySelectorCode=AR', 
'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00417/almagro/easy-rivadavia-%28e164%29-cencosud/?countrySelectorCode=AR', 
'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00506/almagro/g-y-p-new-tree-s.a/?countrySelectorCode=AR', 
'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00303/almagro/medrano-construcciones-s./?countrySelectorCode=AR', 
'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00304/almagro/medrano-construcciones-s.a./?countrySelectorCode=AR']

I thought this might be due to the 'a' mode passed to open(), but if I use 'w' I just get the last link. And the URL I'm requesting is just one of over 700, so the initial .txt file created was quite large and useless.

Thanks in advance for any help you can provide. I feel this is some really dumb problem I'm just not seeing.

The code that writes to your file is inside the inner loop:

            for half in half_urls_:
                urls_.append('https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/' + half)

                with open('sub_urls.txt', 'a') as doc:
                    doc.write(str(urls_))

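Move it back a level of indentation. As written, the full list collected so far is appended to the file again every time a link is added, so the file ends up holding the list after one link, then after two, and so on.

Try it this way:

            for half in half_urls_:
                urls_.append('https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/' + half)

            with open('sub_urls.txt', 'a') as doc:
                doc.write(str(urls_))

If each row contains only one link, this still writes once per row. In that case, dedent the with block one more level so that it runs once, after the outer for loop has finished.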
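
To see why the placement of the write matters, here is a minimal stand-alone sketch that needs no Scrapy. The function names and the sample hrefs are made up for illustration; only the base URL comes from the question. It writes the same five links both ways and counts how many URLs end up in each file.

```python
import os
import tempfile

# Base URL taken from the question; everything else here is hypothetical.
BASE = "https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/"


def write_inside_loop(hrefs, path):
    """Reproduce the problem: dump the growing list after every append."""
    urls = []
    for href in hrefs:
        urls.append(BASE + href)
        with open(path, "a") as doc:
            doc.write(str(urls) + "\n")


def write_after_loop(hrefs, path):
    """Collect every URL first, then write each one once, one per line."""
    urls = [BASE + href for href in hrefs]
    with open(path, "a") as doc:
        for url in urls:
            doc.write(url + "\n")
    return urls


def count_urls(path):
    """Count how many URLs were written to the file."""
    with open(path) as doc:
        return doc.read().count("https://")


if __name__ == "__main__":
    hrefs = ["dealer-%d/" % i for i in range(1, 6)]  # made-up sample hrefs
    with tempfile.TemporaryDirectory() as tmp:
        bad = os.path.join(tmp, "bad.txt")
        good = os.path.join(tmp, "good.txt")
        write_inside_loop(hrefs, bad)
        write_after_loop(hrefs, good)
        print(count_urls(bad))   # 1 + 2 + 3 + 4 + 5 = 15
        print(count_urls(good))  # 5
```

Writing inside the loop stores 1 + 2 + 3 + 4 + 5 = 15 URLs for 5 links, so the growth is quadratic rather than exponential, but across the 700+ pages mentioned in the question it still adds up quickly. Writing one URL per line also makes the file easier to read back than a stringified list.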
