
How can I scrape multiple pages with scrapy in my python code?

So I am currently building a scraper for this website: https://www.datacenters.com/locations?page=1&per_page=40&query=&withProducts=false&showHidden=false&nearby=false&radius=0&bounds=&circleBounds=&polygonPath= . It goes through all the different data center locations and writes out a CSV (run from the VS Code terminal with scrapy crawl datacenters -o datacenters.csv). Maybe I should output a JSON file instead? I'm also contemplating using pandas. For some reason, no matter what I change, I can't get my code to scrape more than the first page. I would appreciate any help at all, thanks. I just need to know what else to edit or add so I can scrape most if not all pages. Should I make a loop?

import scrapy
import pandas as pd

class DatacentersSpider(scrapy.Spider):
    name = 'datacenters'
    allowed_domains = ['datacenters.com']
    start_urls = ['http://datacenters.com/locations']

    def parse(self, response):
        for link in response.css('div.LocationsSearch__location__J7LUu a::attr(href)'):
            yield response.follow('https://www.datacenters.com' + link.get(), callback=self.get_info)

    def get_info(self, response):
        yield {
            'Full Name': response.css('h1.LocationProviderDetail__locationNameXs__2UKtL::text').get(),
            'Number': response.xpath('//div[@class="LocationProviderDetail__phoneItemWrapper__3-SfO"]/div/span/text()').extract_first(),
            'SQFT': response.xpath('//div[@class="LocationProviderDetail__facilityDetails__M1ErX"]/div/span/text()').extract_first()
        }
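As for the "loop" the question asks about: one way to stay inside Scrapy is to follow the next listing page from parse until the site stops returning results. This is only a sketch, assuming the listing pages accept a page query parameter as the URL in the question suggests; next_page_url is a hypothetical helper, not part of Scrapy:

```python
# Sketch: increment the 'page' query parameter of a listing URL.
# Assumption: the site paginates via ?page=N, as in the URL above.
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def next_page_url(url):
    """Return the same URL with its 'page' query parameter bumped by 1
    (treating a missing 'page' parameter as page 1)."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    page = int(params.get("page", ["1"])[0])
    params["page"] = [str(page + 1)]
    return urlunsplit(parts._replace(query=urlencode(params, doseq=True)))

# Inside parse(), after yielding the per-location requests, the spider
# could then follow the next page as long as the current page still has
# location entries, e.g.:
#
#     if response.css('div.LocationsSearch__location__J7LUu'):
#         yield response.follow(next_page_url(response.url), callback=self.parse)
```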

If you view the page in a browser and log your network traffic while clicking through the result pages, you'll notice an XHR HTTP GET request being made to a REST API endpoint. The response is JSON and contains a lot of information about every data center location on a given page of 40 results. You can imitate that request, and even tailor the query-string parameters to get more than 40 results at a time: notice the per_page key-value pair in the params dictionary in get_locations. The default is "40"; I've set it to "3000" to capture all 2591 locations.

The API's response does not contain the phone numbers or square footage, however, but it does contain a relative slug URL for each location. You can navigate to each location URL and scrape the missing information from each of these location-specific pages (which, conveniently, is JSON placed inside a <script> tag):

def get_locations():
    import requests

    url = "https://www.datacenters.com/api/v1/locations"

    params = {
        "page": "1",
        "per_page": "3000",
        "query": "",
        "withProducts": "false",
        "showHidden": "false",
        "nearby": "false",
        "radius": "0",
        "bounds": "",
        "circleBounds": "",
        "polygonPath": ""
    }

    headers = {
        "Accept": "application/json",
        "Accept-Encoding": "gzip, deflate",
        "User-Agent": "Mozilla/5.0"
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    yield from response.json()["locations"]


def get_additional_info(location):
    import requests
    from bs4 import BeautifulSoup as Soup
    import json

    url = "https://www.datacenters.com" + location["url"]

    headers = {
        "Accept-Encoding": "gzip, deflate",
        "User-Agent": "Mozilla/5.0"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    soup = Soup(response.content, "html.parser")

    script = soup.select_one("script[data-component-name=\"LocationProviderDetail\"]")
    content = json.loads(script.string)

    return content["location"]["phone"], content["location"]["grossBuildingSize"]


def main():

    from itertools import islice
    
    for location in islice(get_locations(), 5):
        phone, sqft = get_additional_info(location)
        print("{}\n{}\n{}\n{}\n".format(
            location["name"],
            location["fullAddress"],
            sqft,
            phone
        ))
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

IAD39 44274 Round Table Plaza Data Center
Ashburn, VA, USA
1057000 Sqft
+1 877-882-7470

IAD40 44372 Round Table Plaza Data Center
Ashburn, VA, USA
223200 Sqft
+1 877-882-7470

AWS IAD71 Data Center
21263 Smith Switch Rd, Ashburn, VA, USA
Not Available
+1 844-902-4700

AWS IAD60  Ashburn Data Center
21267 Smith Switch Road, Ashburn, VA, USA
Not Available
+1 844-902-4700

Ashburn Data Center
21635 Red Rum Drive, Ashburn, VA, USA
71000 Sqft
+1 877-215-2422


Here I'm using itertools.islice to get only the first five results, but you get the idea.
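Since the question asks about CSV output (and contemplates pandas), the rows collected in main could be written out with the standard library's csv module; pandas.DataFrame(rows).to_csv(path, index=False) would work just as well. A sketch, where write_csv and its column names are my own choices, not anything the API dictates:

```python
import csv

def write_csv(rows, path="datacenters.csv"):
    """Write a list of dicts (one per location) to a CSV file."""
    fieldnames = ["name", "fullAddress", "sqft", "phone"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# In main(), instead of printing, collect the rows and write them once:
#
#     rows = []
#     for location in get_locations():
#         phone, sqft = get_additional_info(location)
#         rows.append({"name": location["name"],
#                      "fullAddress": location["fullAddress"],
#                      "sqft": sqft,
#                      "phone": phone})
#     write_csv(rows)
```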
