How to follow links in Scrapy when the href attribute of the HTML element is '#'?

I am trying to scrape Niche.com to extract all the schools listed there, along with the details on each school's page. However, the link element for each school has href="#", so Scrapy cannot follow it to the school page and collect the data.

import scrapy


class NicheschoolsSpider(scrapy.Spider):
    name = 'nicheschools'
    allowed_domains = ['www.niche.com']
    start_urls = ['https://www.niche.com/k12/search/best-schools/s/wisconsin/']

    def parse(self, response):
        # One <h2> per school card on the search results page
        schoollink = response.xpath("//div[@class='search-result__title-wrapper']/h2")
        for school in schoollink:
            name = school.xpath(".//text()").get()
            # This is where the problem shows up: the extracted href is '#'
            link = school.xpath(".//@href").get()
            yield {
                'name': name,
                'link': link
            }
            yield response.follow(url=link, callback=self.parse_schools)

    def parse_schools(self, response):
        # Expected to run on the individual school page
        name = response.xpath("//h1[@class='postcard__title postcard__title--claimed']/text()").get()
        website = response.xpath("(//a[@class='profile__website__link']/@href)[1]").get()
        address = response.xpath("(//address[@class='profile__address--compact']/text())[1]").get()

        yield {
            'name': name,
            'website': website,
            'address': address
        }

Output for one entry:

2023-01-25 16:33:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.niche.com/k12/search/best-schools/s/wisconsin/>
{'name': 'Brookfield Central High School', 'link': '#'}

When it tries to follow the link, I get the output shown below:

2023-01-25 16:33:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.niche.com/k12/search/best-schools/s/wisconsin/>
{'name': None, 'website': None, 'address': None}

I am trying to follow each school link and collect the school name, address, telephone number, tuition fees, and enrollment for that school.
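The second log line in the question is consistent with how relative URLs resolve: a bare '#' is only a fragment, so joining it with the search page URL points back at the search page itself, where none of the school-page XPaths match. The snippet below is a minimal sketch of that resolution using only the standard library; the `resolve` helper is a hypothetical name for illustration, and the fragment is dropped because fragments are never sent to the server.

```python
from urllib.parse import urldefrag, urljoin

SEARCH_PAGE = "https://www.niche.com/k12/search/best-schools/s/wisconsin/"


def resolve(base: str, href: str) -> str:
    """Resolve href against base and drop any fragment (hypothetical helper)."""
    return urldefrag(urljoin(base, href)).url


if __name__ == "__main__":
    # '#' resolves to the page we are already on
    print(resolve(SEARCH_PAGE, "#") == SEARCH_PAGE)
    # A real path resolves to a different page on the same host
    print(resolve(SEARCH_PAGE, "/k12/brookfield-academy-brookfield-wi/"))
```

So following href="#" re-requests the listing page, which would explain why `parse_schools` yields only `None` values.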

This is not really a job for Scrapy, although it can certainly be accomplished with Scrapy. The website is dynamic and pulls its data from an API endpoint. I won't set up a Scrapy project just to answer your question, but I will demonstrate how you can get the data using Requests and pandas (the code was run in a Jupyter notebook):

import requests
import pandas as pd
from tqdm.notebook import tqdm

pd.set_option('display.max_columns', None, 'display.max_colwidth', None)

headers = {
    'accept-language': 'en-US,en;q=0.9',
    'accept': 'application/json',
    'referer': 'https://www.niche.com/k12/search/best-schools/s/wisconsin/',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}

big_df = pd.DataFrame()
s = requests.Session()
s.headers.update(headers)

for x in tqdm(range(1, 5)):
    r = s.get(f'https://www.niche.com/api/renaissance/results/?state=wisconsin&listURL=best-schools&page={x}&searchType=school')
    df = pd.json_normalize(r.json(), record_path=['entities'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
display(big_df)

Result in terminal:

100%
4/4 [00:02<00:00, 2.12it/s]
guid    ctas    badge.display   badge.ordinal   badge.total badge.vanityURL badge.photoURLs.desktop badge.photoURLs.mobile  content.centroid.lat    content.centroid.lon    content.entity.abbreviation content.entity.alternates.nces  content.entity.character    content.entity.claimed  content.entity.displayable  content.entity.genus    content.entity.guid content.entity.isClaimed    content.entity.isPremium    content.entity.location content.entity.name content.entity.parentGUIDs.county   content.entity.parentGUIDs.metroArea    content.entity.parentGUIDs.state    content.entity.parentGUIDs.town content.entity.parentGUIDs.zipCode  content.entity.premium  content.entity.published    content.entity.shortName    content.entity.tagline  content.entity.type content.entity.url  content.entity.variation    content.facts   content.featuredReview.author   content.featuredReview.body content.featuredReview.categories   content.featuredReview.created  content.featuredReview.guid content.featuredReview.rating   content.grades  content.photos.default.crops.DesktopHeader  content.photos.default.crops.MobileHeader   content.photos.default.crops.Original   content.photos.default.guid content.photos.default.licenseName  content.photos.editorial.crops.Original content.photos.editorial.guid   content.photos.editorial.licenseName    content.photos.editorial.uploadTimestamp    content.photos.mapbox_header.author content.photos.mapbox_header.crops.DesktopHeader    content.photos.mapbox_header.crops.MobileHeader content.photos.mapbox_header.guid   content.photos.mapbox_header.licenseName    content.photos.mapbox_header.licenseUrl content.photos.mapbox_header.sourceUrl  content.photos.spotlight.crops.Original content.photos.spotlight.crops.Spotlight    content.photos.spotlight.guid   content.photos.spotlight.licenseName    content.photos.spotlight.uploadTimestamp    content.reviewAverage.average   content.reviewAverage.count content.virtualTour content.entity.alternates.ceeb  
content.photos.default.crops.Thumbnail  content.photos.default.uploadTimestamp  content.entity.parentGUIDs.parent   content.entity.parentGUIDs.schoolDistrict   content.entity.parentGUIDs.schoolNetwork    content.entity.parentGUIDs.neighborhood content.photos.default.crops.Spotlight
0   d6574ad4-6add-45c3-a90a-9d24f58b040e    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Private High Schools in Wisconsin  1   82  best-private-high-schools/s/wisconsin   https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   43.081737   -88.145195  Brookfield Academy  0   Private True    True    Private School  d6574ad4-6add-45c3-a90a-9d24f58b040e    True    True    BROOKFIELD, WI  Brookfield Academy  ba8709ae-856d-4583-83b7-4484b51ed4c2    3940b781-a9f6-4333-b607-6a6367e6af44    963a1085-efe7-45f5-81ee-d2bbf82a907c    cc01665b-5240-4885-b13d-a4ae0dd271fc    b802227a-e061-45e5-9dfd-6c3ddaf8bebb    True    True    Brookfield Academy  [Private School, BROOKFIELD, WI, PK, K-12]  School  brookfield-academy-brookfield-wi    1041    [{'config': {'format': ['comma'], 'rounding': ...   Parent  When my kids started school something just did...   [Overall Experience]    2022-07-28T18:37:49.017538Z e3bfc3ad-86eb-4bba-8c66-5eb95e4111f7    5.0 [{'description': 'Based on quality of academic...   https://d13b2ieg84qqce.cloudfront.net/d1d42e87...   https://d13b2ieg84qqce.cloudfront.net/c046f1e3...   https://d13b2ieg84qqce.cloudfront.net/d1d42e87...   a4125add-a984-4609-a879-ce1afa699db8    UNLICENSED  https://d13b2ieg84qqce.cloudfront.net/352e79e1...   53acd2d2-1185-49bd-9928-6a1f1054fba0    UNLICENSED  2022-02-10T21:15:52.569792Z © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   f696705b-0766-48e5-97b5-72370788f0c6    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  https://d13b2ieg84qqce.cloudfront.net/2273b7a3...   https://d13b2ieg84qqce.cloudfront.net/d512adbc...   2a0ddcf9-ae58-404f-9d91-13a5196c2217    UNLICENSED  2022-07-28T18:00:15.479792Z 4.333333    39  [{'label': 'Virtual Tour', 'value': 'https://w...   NaN NaN NaN NaN NaN NaN NaN NaN
1   c5ce3267-c2ed-4785-a5d8-66c61fcf6063    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Private High Schools in Wisconsin  2   82  best-private-high-schools/s/wisconsin   https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   43.186400   -87.935800  USM 01512787    Private True    True    Private School  c5ce3267-c2ed-4785-a5d8-66c61fcf6063    True    True    WI  University School of Milwaukee  8b295479-c31f-47a9-83b8-94b2100e2832    3940b781-a9f6-4333-b607-6a6367e6af44    963a1085-efe7-45f5-81ee-d2bbf82a907c    739d0594-0714-4d74-ad01-f07df19bc756    5d98fbca-9d9d-4219-8335-8dba54962ca7    True    True    University School   [Private School, WI, PK, K-12]  School  university-school-of-milwaukee-river-hills-wi   1041    [{'config': {'format': ['comma'], 'rounding': ...   Parent  It is clear to see, in the short time we’ve be...   [Overall Experience]    2022-10-28T07:41:07.70707Z  a7a94913-bb20-4def-9553-761720f5cac8    5.0 [{'description': 'Based on quality of academic...   https://d13b2ieg84qqce.cloudfront.net/184acaa6...   https://d13b2ieg84qqce.cloudfront.net/d98566c3...   https://d13b2ieg84qqce.cloudfront.net/c65ee0e3...   be9334fb-56d4-4c0c-a4b9-b2de53c46b09    UNLICENSED  https://d13b2ieg84qqce.cloudfront.net/887e0e98...   cfb21e87-82a7-4fa5-8af6-bf33d199039a    UNLICENSED  2022-02-10T21:12:07.916464Z © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   733bf01a-d21c-4374-bb52-42175a61a2c2    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  https://d13b2ieg84qqce.cloudfront.net/e80f6114...   https://d13b2ieg84qqce.cloudfront.net/e80f6114...   c8cb35e1-b83c-47f1-a649-eb3766a53de7    UNLICENSED  NaN 4.209524    105 [{'label': 'Virtual Tour'}] 501390  https://d13b2ieg84qqce.cloudfront.net/97d061e2...   2022-07-11T13:31:31.710239Z NaN NaN NaN NaN NaN
2   84ab245d-ad99-43c9-93d8-9e474a109434    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Private High Schools in Wisconsin  3   82  best-private-high-schools/s/wisconsin   https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   43.163916   -89.385004  MCDS    A9904507    Private True    True    Private School  84ab245d-ad99-43c9-93d8-9e474a109434    True    True    WAUNAKEE, WI    Madison Country Day School  4135e47a-62f6-4777-b514-d2e51894603f    1a1aaa73-65d0-490d-b3d3-d828716c5f6b    963a1085-efe7-45f5-81ee-d2bbf82a907c    NaN 3bca1e55-0153-485a-a337-03448396568b    True    True    MCDS    [Private School, WAUNAKEE, WI, PK, K-12]    School  madison-country-day-school-waunakee-wi  1041    [{'config': {'format': ['comma'], 'rounding': ...   Parent  The MCDS faculty is truly exceptional -- they ...   [Overall Experience]    2022-07-22T13:59:50.567397Z 6c714271-25ac-4206-9ef8-38d3ef1f92d6    5.0 [{'description': 'Based on quality of academic...   https://d13b2ieg84qqce.cloudfront.net/68e0beb3...   https://d13b2ieg84qqce.cloudfront.net/3a1cfdcf...   https://d13b2ieg84qqce.cloudfront.net/b2d1416c...   86a7a6ce-2538-4bf1-8703-6b3b44fda5a4    UNLICENSED  https://d13b2ieg84qqce.cloudfront.net/6cb7bdfd...   809aece8-55ce-4632-a3cf-d0a14417ffdc    UNLICENSED  2022-02-09T21:25:15.513499Z © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   dc5b6bd7-5a5c-48ee-bdfd-5780de198bc9    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  https://d13b2ieg84qqce.cloudfront.net/66e8fd60...   https://d13b2ieg84qqce.cloudfront.net/fb59b45f...   2e6d1bd0-7760-46ae-8ce2-02306508b864    UNLICENSED  2022-04-18T18:58:56.007652Z 3.882353    34  [{'label': 'Virtual Tour', 'value': 'https://w...   502396  https://d13b2ieg84qqce.cloudfront.net/6ea5d8cb...   2022-06-08T22:11:36.605259Z NaN NaN NaN NaN NaN
3   35ca6237-c994-4fe6-b5f9-f09142680d7b    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Private High Schools in Wisconsin  4   82  best-private-high-schools/s/wisconsin   https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   43.457700   -88.827400  Wayland Academy 01514944    Private, Boarding   True    True    Private School  35ca6237-c994-4fe6-b5f9-f09142680d7b    True    True    BEAVER DAM, WI  Wayland Academy 3c05ff22-e610-450d-8684-1b9f99edcd1f    NaN 963a1085-efe7-45f5-81ee-d2bbf82a907c    1d49bb1b-d2a1-45e2-ac8e-c8d16ab29f3e    f132a02a-1ead-4325-bf32-9079b435d74c    True    True    Wayland [Private School, BEAVER DAM, WI, 9-12]  School  wayland-academy-beaver-dam-wi   1040    [{'config': {'format': ['comma'], 'rounding': ...   Alum    Though I only attended Wayland for two years (...   [Overall Experience]    2022-08-14T20:05:05.231126Z a0bf7334-047c-4ee8-ab95-59c46dff42b3    5.0 [{'description': 'Based on quality of academic...   https://d13b2ieg84qqce.cloudfront.net/7cc728a3...   https://d13b2ieg84qqce.cloudfront.net/5e24f8a2...   https://d13b2ieg84qqce.cloudfront.net/d7835cfd...   99230263-8332-4b03-b475-b948546402b7    UNLICENSED  https://d13b2ieg84qqce.cloudfront.net/42561f2c...   697e0f82-7ccb-4651-877e-ffe881e188c5    UNLICENSED  NaN © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   c641160c-30c7-4b52-b336-e844ac8a059a    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  https://d13b2ieg84qqce.cloudfront.net/3aaf34d3...   https://d13b2ieg84qqce.cloudfront.net/f197c0a9...   124661d4-71cd-4d13-bfcc-926f3e074ade    UNLICENSED  2022-09-28T16:06:46.315837Z 3.833333    66  [{'label': 'Virtual Tour', 'value': 'https://y...   500170  https://d13b2ieg84qqce.cloudfront.net/9b54f4ea...   2022-07-26T17:26:09.050891Z NaN NaN NaN NaN NaN
4   9b394d9c-46a0-431d-8ae4-62b6142cd46b    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Private High Schools in Wisconsin  5   82  best-private-high-schools/s/wisconsin   https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   42.773585   -87.774410  TPS 01513124    Private True    True    Private School  9b394d9c-46a0-431d-8ae4-62b6142cd46b    True    False   WIND POINT, WI  The Prairie School  5455e716-0063-4d63-a0e2-a07d199cdee1    3940b781-a9f6-4333-b607-6a6367e6af44    963a1085-efe7-45f5-81ee-d2bbf82a907c    5ef4c7c2-c006-49ea-88e9-9f40a0da6ce6    0d949807-5d44-4fc8-8753-1ce81f4a5d67    False   True    Prairie [Private School, WIND POINT, WI, PK, K-12]  School  the-prairie-school-wind-point-wi    41  [{'config': {'format': ['comma'], 'rounding': ...   Alum    The teachers are awesome and so approachable! ...   [Overall Experience]    2020-06-23T03:25:59.897153Z 2d7de44a-38a7-493c-ac87-a024ba85d42d    5.0 [{'description': 'Based on quality of academic...   NaN NaN NaN NaN NaN https://d13b2ieg84qqce.cloudfront.net/608f2378...   a69ad3c5-f274-4bbe-ab1b-f1977c79c6f9    UNLICENSED  2022-02-10T20:38:50.869965Z © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   86b0f616-f1bf-4123-a2bc-f93255053083    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  https://d13b2ieg84qqce.cloudfront.net/a26a05f2...   https://d13b2ieg84qqce.cloudfront.net/a26a05f2...   ee7758c3-09ee-4d81-b4bd-c7f91c98652a    UNLICENSED  NaN 4.642857    70  [{'label': 'Virtual Tour', 'value': 'https://w...   501918  NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95  8365632c-160e-4beb-b75c-dfafca1c2441    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Public Middle Schools in Wisconsin 24  594 best-public-middle-schools/s/wisconsin  https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   44.891018   -87.290723  Sevastopol Middle School    551350000496    Public  False   True    Public School   8365632c-160e-4beb-b75c-dfafca1c2441    False   False   STURGEON BAY, WI    Sevastopol Middle School    caaa657e-9c5e-4740-b72f-bef5b2c75ac1    NaN 963a1085-efe7-45f5-81ee-d2bbf82a907c    65ab2591-75de-487d-8a82-bddd79e3d3bd    7ae55b50-154c-4e0e-aff7-ed2726f7ceb8    False   True    Sevastopol Middle School    [Sevastopol School District, WI, 6-8]   School  sevastopol-middle-school-sturgeon-bay-wi    45  [{'config': {'format': ['comma'], 'rounding': ...   NaN NaN NaN NaN NaN NaN [{'description': 'Based on quality of academic...   NaN NaN NaN NaN NaN NaN NaN NaN NaN © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   e750dc05-07ed-42b0-92e3-f24ab16f1b8b    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  NaN NaN NaN NaN NaN 0.000000    0   [{'label': 'Virtual Tour'}] NaN NaN NaN d4d24c63-d104-44cd-ad3f-0ded85522583    d4d24c63-d104-44cd-ad3f-0ded85522583    NaN NaN NaN
96  d46a53a4-62f4-4086-9a53-4c6f78f54915    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Public Elementary Schools in Wisconsin 36  1074    best-public-elementary-schools/s/wisconsin  https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   42.935063   -88.405594  Prairie View Elementary School  551006001321    Public  True    True    Public School   d46a53a4-62f4-4086-9a53-4c6f78f54915    True    False   NORTH PRAIRIE, WI   Prairie View Elementary School  ba8709ae-856d-4583-83b7-4484b51ed4c2    3940b781-a9f6-4333-b607-6a6367e6af44    963a1085-efe7-45f5-81ee-d2bbf82a907c    1a756678-9c81-4d89-8620-604b8e10507c    8afa0d18-4b1f-4052-a09a-d9bfe3e67295    False   True    Prairie View Elementary School  [Mukwonago Area School District, WI, PK, K-6]   School  prairie-view-elementary-school-north-prairie-wi 45  [{'config': {'format': ['comma'], 'rounding': ...   NaN NaN NaN NaN NaN NaN [{'description': 'Based on quality of academic...   NaN NaN NaN NaN NaN NaN NaN NaN NaN © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   51a79bc3-f67e-4b49-87b1-d36ef46e1145    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  NaN NaN NaN NaN NaN 0.000000    0   [{'label': 'Virtual Tour'}] NaN NaN NaN bda72d2a-3f49-4288-a9f2-d024898ca67b    bda72d2a-3f49-4288-a9f2-d024898ca67b    NaN NaN NaN
97  0ed32d0d-f062-4784-8dd4-57a9724209eb    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Public Middle Schools in Wisconsin 25  594 best-public-middle-schools/s/wisconsin  https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   43.624047   -87.786299  Oostburg Middle School  551107001464    Public  False   True    Public School   0ed32d0d-f062-4784-8dd4-57a9724209eb    False   False   OOSTBURG, WI    Oostburg Middle School  1db5c6d2-5b8f-44fa-87fc-f7471ee45443    NaN 963a1085-efe7-45f5-81ee-d2bbf82a907c    b44a651d-bcfc-4d95-a6ce-f00c0c42671e    d594fdef-3441-462c-93ad-981c8fd1f064    False   True    Oostburg Middle School  [Oostburg School District, WI, 6-8] School  oostburg-middle-school-oostburg-wi  45  [{'config': {'format': ['comma'], 'rounding': ...   Niche User  The middle school does a great job at preparin...   [Academics] 2015-02-12T14:29:22Z    0c61a295-b7c7-44e9-ab7a-64993190796f    5.0 [{'description': 'Based on quality of academic...   NaN NaN NaN NaN NaN NaN NaN NaN NaN © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   32fa503a-a377-44e8-bb10-f9a7fa0bb67c    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  NaN NaN NaN NaN NaN 4.800000    10  [{'label': 'Virtual Tour'}] NaN NaN NaN 3478a622-503f-47d5-93a0-c3207124cdd4    3478a622-503f-47d5-93a0-c3207124cdd4    NaN NaN NaN
98  17b8ac12-d893-4af4-bf79-3aa06bef648a    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Public High Schools in Wisconsin   24  496 best-public-high-schools/s/wisconsin    https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   42.993816   -88.224033  WEPA    551578002688    Public, Charter True    True    Charter School  17b8ac12-d893-4af4-bf79-3aa06bef648a    True    False   WAUKESHA, WI    Waukesha Engineering Preparatory Academy    ba8709ae-856d-4583-83b7-4484b51ed4c2    3940b781-a9f6-4333-b607-6a6367e6af44    963a1085-efe7-45f5-81ee-d2bbf82a907c    5a94913e-87ac-4e4e-9b76-a2330bf1a635    b88c94da-24d3-4004-b43b-547d9da55e0d    False   True    Waukesha Engineering Preparatory Academy    [School District of Waukesha, WI, 9-12] School  waukesha-engineering-preparatory-academy-wauke...   52  [{'config': {'format': ['comma'], 'rounding': ...   Senior  The Academy is well equipped and staffed, and ...   [Overall Experience]    2021-10-13T21:07:42.049714Z 8d8241ae-fc49-4bab-a146-48d77f1e6391    4.0 [{'description': 'Based on quality of academic...   NaN NaN NaN NaN NaN NaN NaN NaN NaN © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   b1db421b-6290-4a08-a854-82dd9089e116    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  NaN NaN NaN NaN NaN 3.681818    22  [{'label': 'Virtual Tour'}] 500331  NaN NaN a368f833-c451-45bb-a0f7-b656d02477f3    a368f833-c451-45bb-a0f7-b656d02477f3    NaN NaN NaN
99  c3b20454-cd71-45bb-ab33-0b3ea37527fb    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Public Elementary Schools in Wisconsin 37  1074    best-public-elementary-schools/s/wisconsin  https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   43.089194   -87.883770  Atwater Elementary School   551380001809    Public  True    True    Public School   c3b20454-cd71-45bb-ab33-0b3ea37527fb    True    False   SHOREWOOD, WI   Atwater Elementary School   8b295479-c31f-47a9-83b8-94b2100e2832    3940b781-a9f6-4333-b607-6a6367e6af44    963a1085-efe7-45f5-81ee-d2bbf82a907c    900b6b9c-206e-4c34-82a8-247fee552b49    542c1289-ad69-4fc3-afab-bf91c1a6110e    False   True    Atwater Elementary School   [Shorewood School District, WI, PK, K-6]    School  atwater-elementary-school-shorewood-wi  45  [{'config': {'format': ['comma'], 'rounding': ...   NaN NaN NaN NaN NaN NaN [{'description': 'Based on quality of academic...   NaN NaN NaN NaN NaN NaN NaN NaN NaN © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   b523d19c-fdd1-497b-bd6d-ab394cde0dbf    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  NaN NaN NaN NaN NaN 0.000000    0   [{'label': 'Virtual Tour', 'value': 'https://w...   NaN NaN NaN 84c36616-1b72-4d85-998d-c9795aadb726    84c36616-1b72-4d85-998d-c9795aadb726    NaN NaN NaN
100 rows × 73 columns
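The `content.entity.url` column in this output holds a slug such as `brookfield-academy-brookfield-wi`. A minimal sketch of turning those slugs into school page URLs follows. It assumes the `https://www.niche.com/k12/<slug>/` pattern seen in the profile links elsewhere in this thread; the `profile_url` helper is hypothetical, and some entity types (districts, for example) appear to use a different path, so treat the pattern as an assumption to verify.

```python
from urllib.parse import quote

# Assumed base path for individual school pages (not confirmed by the API itself)
PROFILE_BASE = "https://www.niche.com/k12/"


def profile_url(slug: str) -> str:
    """Build a school page URL from a content.entity.url slug (hypothetical helper)."""
    return f"{PROFILE_BASE}{quote(slug.strip('/'))}/"


if __name__ == "__main__":
    print(profile_url("brookfield-academy-brookfield-wi"))
```

Applied to the DataFrame, this could be something like `big_df['content.entity.url'].map(profile_url)`, which gives a list of pages to visit without relying on the href in the HTML.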

You can get all the data by adjusting the range (go up to 123 for the maximum number of records). You may also want to add a pause between requests, otherwise you may get blocked. You can also use Scrapy, if you wish.
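The pause between requests could be sketched roughly as follows. This is an illustration, not part of the original answer: `collect_entities` and its parameters are hypothetical names, the one-second default delay is an arbitrary assumption, and the page fetcher is passed in as a callable so the sketch runs offline with a stand-in.

```python
import time
from typing import Any, Callable, Dict, Iterable, List


def collect_entities(
    fetch_page: Callable[[int], Dict[str, Any]],
    pages: Iterable[int],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, Any]]:
    """Fetch pages one by one, pausing between requests, and merge their 'entities' lists."""
    entities: List[Dict[str, Any]] = []
    for index, page in enumerate(pages):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        payload = fetch_page(page)
        entities.extend(payload.get("entities", []))
    return entities


if __name__ == "__main__":
    # Stand-in for the real API call so the sketch runs without network access
    def fake_fetch(page: int) -> Dict[str, Any]:
        return {"entities": [{"page": page, "guid": f"school-{page}"}]}

    rows = collect_entities(fake_fetch, range(1, 4), delay_seconds=0.1)
    print(len(rows))
```

With the session `s` from the code above, `fetch_page` would be a small function that calls `s.get(...)` on the API URL for the given page and returns `r.json()`; the merged list can then be passed to `pd.json_normalize`.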

You need to inspect the HTML carefully: the URL for each school can be found inside one of the div elements.

import scrapy


class NicheschoolsSpider(scrapy.Spider):
    name = 'nicheschools'
    allowed_domains = ['www.niche.com']
    start_urls = ['https://www.niche.com/k12/search/best-schools/s/wisconsin/']

    def parse(self, response):
        school_links = response.xpath("//div[@class='card ']/a/@href").extract()

        for link in school_links:
            yield response.follow(url=link, callback=self.parse_schools)

    def parse_schools(self, response):
        name = response.xpath("//h1[@class='postcard__title postcard__title--claimed']/text()").extract_first()
        website = response.xpath("(//a[@class='profile__website__link']/@href)[1]").extract_first()
        address = response.xpath("(//address[@class='profile__address--compact']/text())[1]").extract_first()

        yield {
            'name': name,
            'link': response.url,
            'website': website,
            'address': address,
        }

Result as JSON:

{'name': 'Brookfield Academy', 'link': 'https://www.niche.com/k12/brookfield-academy-brookfield-wi/', 'website': 'https://www.brookfieldacademy.org', 'address': '3462 N BROOKFIELD RD'}
{'name': 'Wisconsin Lutheran High School', 'link': 'https://www.niche.com/k12/wisconsin-lutheran-high-school-milwaukee-wi/', 'website': 'https://www.wlhs.org', 'address': '330 N GLENVIEW AVE'}
{'name': 'Homestead High School', 'link': 'https://www.niche.com/k12/homestead-high-school-mequon-wi/', 'website': 'http://www.mtsd.k12.wi.us/homestead/', 'address': '5000 W MEQUON RD'}
{'name': 'Brookfield Central High School', 'link': 'https://www.niche.com/k12/brookfield-central-high-school-brookfield-wi/', 'website': 'https://www.elmbrookschools.org/brookfield-central-high-school', 'address': '16900 W GEBHARDT RD'}
{'name': 'Shorewood High School', 'link': 'https://www.niche.com/k12/shorewood-high-school-shorewood-wi/', 'website': 'https://www.shorewood.k12.wi.us/apps/pages/shs', 'address': '1701 E CAPITOL DR'}
{'name': 'School District of Waukesha', 'link': 'https://www.niche.com/k12/d/school-district-of-waukesha-wi/', 'website': 'https://sdw.waukesha.k12.wi.us', 'address': '222 MAPLE AVE'}
{'name': 'Pilgrim Park Middle School', 'link': 'https://www.niche.com/k12/pilgrim-park-middle-school-elm-grove-wi/', 'website': 'http://www.elmbrookschools.org/', 'address': '1500 PILGRIM PKWY'}
{'name': 'Marquette University High School', 'link': 'https://www.niche.com/k12/marquette-university-high-school-milwaukee-wi/', 'website': 'https://www.muhs.edu/', 'address': '3401 W WISCONSIN AVE'}
...

If you are new to web scraping, be careful not to hit the site too often: it may block you, and you would then have to solve a CAPTCHA to access the site.
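One way to slow a Scrapy spider down is through its throttling settings. The dictionary below is a sketch using standard Scrapy setting names; the specific values are illustrative assumptions rather than recommendations from the answer, and `POLITE_SETTINGS` is a hypothetical name.

```python
# Illustrative values only; tune them for the site you are crawling.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # respect the site's robots.txt
    "DOWNLOAD_DELAY": 2.0,                # base delay in seconds between requests
    "RANDOMIZE_DOWNLOAD_DELAY": True,     # vary the delay around the base value
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server response times
    "AUTOTHROTTLE_START_DELAY": 2.0,
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
}

if __name__ == "__main__":
    for key, value in POLITE_SETTINGS.items():
        print(f"{key} = {value}")
```

These can be attached to a single spider by setting `custom_settings = POLITE_SETTINGS` in the spider class, or copied into the project's `settings.py`.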

Also, if you want to expand your knowledge, there are web scraping cluster platforms such as Estela, where you can run your spiders and create cron jobs to run them every day.
