I'm currently building a scraper for this website: https://www.datacenters.com/locations?page=1&per_page=40&query=&withProducts=false&showHidden=false&nearby=false&radius=0&bounds=&circleBounds=&polygonPath= . It goes through the data center locations and writes out a CSV (run from the VS Code terminal with scrapy crawl datacenters -o datacenters.csv). Maybe I should be producing a JSON file instead? I'm also contemplating using pandas. The problem: no matter what I change, I can't get my code to scrape more than the first page. What else do I need to edit or add so I can scrape most if not all of the pages (possibly with a loop)? I'd appreciate any help at all, thanks.
import scrapy
import pandas as pd


class DatacentersSpider(scrapy.Spider):
    name = 'datacenters'
    allowed_domains = ['datacenters.com']
    start_urls = ['http://datacenters.com/locations']

    def parse(self, response):
        for link in response.css('div.LocationsSearch__location__J7LUu a::attr(href)'):
            yield response.follow('https://www.datacenters.com' + link.get(), callback=self.get_info)

    def get_info(self, response):
        yield {
            'Full Name': response.css('h1.LocationProviderDetail__locationNameXs__2UKtL::text').get(),
            'Number': response.xpath('//div[@class="LocationProviderDetail__phoneItemWrapper__3-SfO"]/div/span/text()').extract_first(),
            'SQFT': response.xpath('//div[@class="LocationProviderDetail__facilityDetails__M1ErX"]/div/span/text()').extract_first()
        }
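(For what it's worth, since the listing pages are rendered client-side, looping over the HTML pages from Scrapy won't see pages 2+; the data comes from a JSON API. Below is a minimal, untested sketch of the page loop itself, isolated as a pure helper plus a generator that talks to the /api/v1/locations endpoint discussed in the answer - the stop-on-empty-page condition is an assumption about how that API behaves past the last page.)

```python
API = "https://www.datacenters.com/api/v1/locations"


def next_page(page, payload):
    """Return the next page number to request, or None once the API
    payload no longer contains any locations (i.e. we ran past the
    last page). Assumed behaviour, based on the API described below."""
    return page + 1 if payload.get("locations") else None


def iter_locations(per_page=40):
    """Loop over result pages until the API returns an empty list.
    Network code - shown for illustration, not executed here."""
    import requests

    page = 1
    while page is not None:
        response = requests.get(
            API,
            params={"page": page, "per_page": per_page},
            headers={"Accept": "application/json", "User-Agent": "Mozilla/5.0"},
        )
        response.raise_for_status()
        payload = response.json()
        yield from payload.get("locations", [])
        page = next_page(page, payload)
```

The same incrementing-page idea carries over to Scrapy directly: yield a new scrapy.Request for page + 1 from your callback until a page comes back empty.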
If you view the page in a browser and log your network traffic while clicking through the result pages, you'll notice an XHR HTTP GET request being made to a REST API endpoint. Its response is JSON and contains a lot of information on every data center location for a given page of 40 results. You can imitate that request, and even tailor the query-string parameters to get more than 40 results at a time (notice the per_page key-value pair in the params dictionary in get_locations - the default is "40", but I've set it to "3000" to capture all 2591 locations in one request).

The API's response does not contain the phone numbers or square footage, however - but it does contain a relative slug URL for each location. You can navigate to each location's URL and scrape the missing information from these location-specific pages (where, conveniently, it sits as JSON inside a <script> tag):
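The script-tag extraction can be exercised offline on a minimal HTML snippet. The snippet below is hypothetical (modelled on the real page's structure, with values taken from the output further down); the code in the answer does the same lookup more robustly with BeautifulSoup's select_one, while this illustration sticks to the standard library:

```python
import json
import re

# Hypothetical snippet mimicking the real page: the location data is
# embedded as JSON inside a <script data-component-name=...> tag.
html = '''
<html><body>
<script data-component-name="LocationProviderDetail" type="application/json">
{"location": {"phone": "+1 877-882-7470", "grossBuildingSize": "1057000 Sqft"}}
</script>
</body></html>
'''

# Grab the body of the tagged <script> element and parse it as JSON.
match = re.search(
    r'<script[^>]*data-component-name="LocationProviderDetail"[^>]*>(.*?)</script>',
    html, re.DOTALL)
content = json.loads(match.group(1))

phone = content["location"]["phone"]
sqft = content["location"]["grossBuildingSize"]
```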
def get_locations():

    import requests

    url = "https://www.datacenters.com/api/v1/locations"

    params = {
        "page": "1",
        "per_page": "3000",
        "query": "",
        "withProducts": "false",
        "showHidden": "false",
        "nearby": "false",
        "radius": "0",
        "bounds": "",
        "circleBounds": "",
        "polygonPath": ""
    }

    headers = {
        "Accept": "application/json",
        "Accept-Encoding": "gzip, deflate",
        "User-Agent": "Mozilla/5.0"
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    yield from response.json()["locations"]


def get_additional_info(location):

    import requests
    from bs4 import BeautifulSoup as Soup
    import json

    url = "https://www.datacenters.com" + location["url"]

    headers = {
        "Accept-Encoding": "gzip, deflate",
        "User-Agent": "Mozilla/5.0"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    soup = Soup(response.content, "html.parser")
    script = soup.select_one("script[data-component-name=\"LocationProviderDetail\"]")
    content = json.loads(script.string)

    return content["location"]["phone"], content["location"]["grossBuildingSize"]


def main():

    from itertools import islice

    for location in islice(get_locations(), 5):
        phone, sqft = get_additional_info(location)
        print("{}\n{}\n{}\n{}\n".format(
            location["name"],
            location["fullAddress"],
            sqft,
            phone
        ))

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
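On the CSV-vs-JSON question from the post: once the locations are plain dicts, either output format is one call away with pandas. A sketch, using hypothetical sample records shaped like the output below (field names are illustrative, not necessarily what the API returns):

```python
import pandas as pd

# Hypothetical records, shaped like main()'s printed output.
records = [
    {"name": "IAD39 44274 Round Table Plaza Data Center",
     "fullAddress": "Ashburn, VA, USA",
     "sqft": "1057000 Sqft", "phone": "+1 877-882-7470"},
    {"name": "Ashburn Data Center",
     "fullAddress": "21635 Red Rum Drive, Ashburn, VA, USA",
     "sqft": "71000 Sqft", "phone": "+1 877-215-2422"},
]

df = pd.DataFrame(records)
df.to_csv("datacenters.csv", index=False)                    # like scrapy -o datacenters.csv
df.to_json("datacenters.json", orient="records", indent=2)   # the JSON alternative
```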
Output:
IAD39 44274 Round Table Plaza Data Center
Ashburn, VA, USA
1057000 Sqft
+1 877-882-7470
IAD40 44372 Round Table Plaza Data Center
Ashburn, VA, USA
223200 Sqft
+1 877-882-7470
AWS IAD71 Data Center
21263 Smith Switch Rd, Ashburn, VA, USA
Not Available
+1 844-902-4700
AWS IAD60 Ashburn Data Center
21267 Smith Switch Road, Ashburn, VA, USA
Not Available
+1 844-902-4700
Ashburn Data Center
21635 Red Rum Drive, Ashburn, VA, USA
71000 Sqft
+1 877-215-2422
Here I'm using itertools.islice to get only the first five results, but you get the idea.
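(islice works lazily on any iterable, so it takes the first few items from a generator like get_locations without consuming the rest - a quick stand-alone illustration:)

```python
from itertools import islice


def numbers():
    # Endless generator, a stand-in for get_locations().
    n = 0
    while True:
        yield n
        n += 1


first_five = list(islice(numbers(), 5))  # only 5 items are ever produced
```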