
How to follow links in Scrapy when the href attribute of the HTML element is href='#'?

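I am trying to crawl the Niche.com website to extract all the schools, along with the details found on each school's page. But when we try to follow a school link, the href attribute contains href="#", so Scrapy cannot reach each school's page and collect the data.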

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NicheschoolsSpider(scrapy.Spider):
    name = 'nicheschools'
    allowed_domains = ['www.niche.com']
    start_urls = ['https://www.niche.com/k12/search/best-schools/s/wisconsin/']

    def parse(self, response):
        schoollink = response.xpath("//div[@class='search-result__title-wrapper']/h2")
        for school in schoollink:
            name = school.xpath(".//text()").get()
            link = school.xpath(".//@href").get()
            yield {
                'name': name,
                'link': link
            }
            yield response.follow(url=link, callback=self.parse_schools)

    def parse_schools(self, response):
        name = response.xpath("//h1[@class='postcard__title postcard__title--claimed']/text()").get()
        website = response.xpath("(//a[@class='profile__website__link']/@href)[1]").get()
        address = response.xpath("(//address[@class='profile__address--compact']/text())[1]").get()

        yield {
            'name': name,
            'website': website,
            'address': address
        }

OUTPUT for one entry:

2023-01-25 16:33:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.niche.com/k12/search/best-schools/s/wisconsin/>
{'name': 'Brookfield Central High School', 'link': '#'}

When it tries to follow the inner link, the output looks like this:

2023-01-25 16:33:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.niche.com/k12/search/best-schools/s/wisconsin/>
{'name': None, 'website': None, 'address': None}

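I am trying to go into each school link and collect the school name, address, phone, tuition, and enrollment information from that specific link.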

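This is not really a job for Scrapy, although it can certainly be done with Scrapy. The website is dynamic and pulls its data from an API endpoint. I won't set up a Scrapy project to answer your question, but I will demonstrate how to get the data using requests and pandas (the code runs in a Jupyter notebook):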

import requests
import pandas as pd
from tqdm.notebook import tqdm

pd.set_option('display.max_columns', None, 'display.max_colwidth', None)

headers = {
    'accept-language': 'en-US,en;q=0.9',
    'accept': 'application/json',
    'referer': 'https://www.niche.com/k12/search/best-schools/s/wisconsin/',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}

big_df = pd.DataFrame()
s = requests.Session()
s.headers.update(headers)

for x in tqdm(range(1, 5)):
    r = s.get(f'https://www.niche.com/api/renaissance/results/?state=wisconsin&listURL=best-schools&page={x}&searchType=school')
    df = pd.json_normalize(r.json(), record_path=['entities'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
display(big_df)

Result in terminal:

100%
4/4 [00:02<00:00, 2.12it/s]
guid    ctas    badge.display   badge.ordinal   badge.total badge.vanityURL badge.photoURLs.desktop badge.photoURLs.mobile  content.centroid.lat    content.centroid.lon    content.entity.abbreviation content.entity.alternates.nces  content.entity.character    content.entity.claimed  content.entity.displayable  content.entity.genus    content.entity.guid content.entity.isClaimed    content.entity.isPremium    content.entity.location content.entity.name content.entity.parentGUIDs.county   content.entity.parentGUIDs.metroArea    content.entity.parentGUIDs.state    content.entity.parentGUIDs.town content.entity.parentGUIDs.zipCode  content.entity.premium  content.entity.published    content.entity.shortName    content.entity.tagline  content.entity.type content.entity.url  content.entity.variation    content.facts   content.featuredReview.author   content.featuredReview.body content.featuredReview.categories   content.featuredReview.created  content.featuredReview.guid content.featuredReview.rating   content.grades  content.photos.default.crops.DesktopHeader  content.photos.default.crops.MobileHeader   content.photos.default.crops.Original   content.photos.default.guid content.photos.default.licenseName  content.photos.editorial.crops.Original content.photos.editorial.guid   content.photos.editorial.licenseName    content.photos.editorial.uploadTimestamp    content.photos.mapbox_header.author content.photos.mapbox_header.crops.DesktopHeader    content.photos.mapbox_header.crops.MobileHeader content.photos.mapbox_header.guid   content.photos.mapbox_header.licenseName    content.photos.mapbox_header.licenseUrl content.photos.mapbox_header.sourceUrl  content.photos.spotlight.crops.Original content.photos.spotlight.crops.Spotlight    content.photos.spotlight.guid   content.photos.spotlight.licenseName    content.photos.spotlight.uploadTimestamp    content.reviewAverage.average   content.reviewAverage.count content.virtualTour content.entity.alternates.ceeb  
content.photos.default.crops.Thumbnail  content.photos.default.uploadTimestamp  content.entity.parentGUIDs.parent   content.entity.parentGUIDs.schoolDistrict   content.entity.parentGUIDs.schoolNetwork    content.entity.parentGUIDs.neighborhood content.photos.default.crops.Spotlight
0   d6574ad4-6add-45c3-a90a-9d24f58b040e    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Private High Schools in Wisconsin  1   82  best-private-high-schools/s/wisconsin   https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   43.081737   -88.145195  Brookfield Academy  0   Private True    True    Private School  d6574ad4-6add-45c3-a90a-9d24f58b040e    True    True    BROOKFIELD, WI  Brookfield Academy  ba8709ae-856d-4583-83b7-4484b51ed4c2    3940b781-a9f6-4333-b607-6a6367e6af44    963a1085-efe7-45f5-81ee-d2bbf82a907c    cc01665b-5240-4885-b13d-a4ae0dd271fc    b802227a-e061-45e5-9dfd-6c3ddaf8bebb    True    True    Brookfield Academy  [Private School, BROOKFIELD, WI, PK, K-12]  School  brookfield-academy-brookfield-wi    1041    [{'config': {'format': ['comma'], 'rounding': ...   Parent  When my kids started school something just did...   [Overall Experience]    2022-07-28T18:37:49.017538Z e3bfc3ad-86eb-4bba-8c66-5eb95e4111f7    5.0 [{'description': 'Based on quality of academic...   https://d13b2ieg84qqce.cloudfront.net/d1d42e87...   https://d13b2ieg84qqce.cloudfront.net/c046f1e3...   https://d13b2ieg84qqce.cloudfront.net/d1d42e87...   a4125add-a984-4609-a879-ce1afa699db8    UNLICENSED  https://d13b2ieg84qqce.cloudfront.net/352e79e1...   53acd2d2-1185-49bd-9928-6a1f1054fba0    UNLICENSED  2022-02-10T21:15:52.569792Z © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   f696705b-0766-48e5-97b5-72370788f0c6    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  https://d13b2ieg84qqce.cloudfront.net/2273b7a3...   https://d13b2ieg84qqce.cloudfront.net/d512adbc...   2a0ddcf9-ae58-404f-9d91-13a5196c2217    UNLICENSED  2022-07-28T18:00:15.479792Z 4.333333    39  [{'label': 'Virtual Tour', 'value': 'https://w...   NaN NaN NaN NaN NaN NaN NaN NaN
1   c5ce3267-c2ed-4785-a5d8-66c61fcf6063    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Private High Schools in Wisconsin  2   82  best-private-high-schools/s/wisconsin   https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   43.186400   -87.935800  USM 01512787    Private True    True    Private School  c5ce3267-c2ed-4785-a5d8-66c61fcf6063    True    True    WI  University School of Milwaukee  8b295479-c31f-47a9-83b8-94b2100e2832    3940b781-a9f6-4333-b607-6a6367e6af44    963a1085-efe7-45f5-81ee-d2bbf82a907c    739d0594-0714-4d74-ad01-f07df19bc756    5d98fbca-9d9d-4219-8335-8dba54962ca7    True    True    University School   [Private School, WI, PK, K-12]  School  university-school-of-milwaukee-river-hills-wi   1041    [{'config': {'format': ['comma'], 'rounding': ...   Parent  It is clear to see, in the short time we’ve be...   [Overall Experience]    2022-10-28T07:41:07.70707Z  a7a94913-bb20-4def-9553-761720f5cac8    5.0 [{'description': 'Based on quality of academic...   https://d13b2ieg84qqce.cloudfront.net/184acaa6...   https://d13b2ieg84qqce.cloudfront.net/d98566c3...   https://d13b2ieg84qqce.cloudfront.net/c65ee0e3...   be9334fb-56d4-4c0c-a4b9-b2de53c46b09    UNLICENSED  https://d13b2ieg84qqce.cloudfront.net/887e0e98...   cfb21e87-82a7-4fa5-8af6-bf33d199039a    UNLICENSED  2022-02-10T21:12:07.916464Z © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   733bf01a-d21c-4374-bb52-42175a61a2c2    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  https://d13b2ieg84qqce.cloudfront.net/e80f6114...   https://d13b2ieg84qqce.cloudfront.net/e80f6114...   c8cb35e1-b83c-47f1-a649-eb3766a53de7    UNLICENSED  NaN 4.209524    105 [{'label': 'Virtual Tour'}] 501390  https://d13b2ieg84qqce.cloudfront.net/97d061e2...   2022-07-11T13:31:31.710239Z NaN NaN NaN NaN NaN
2   84ab245d-ad99-43c9-93d8-9e474a109434    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Private High Schools in Wisconsin  3   82  best-private-high-schools/s/wisconsin   https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   43.163916   -89.385004  MCDS    A9904507    Private True    True    Private School  84ab245d-ad99-43c9-93d8-9e474a109434    True    True    WAUNAKEE, WI    Madison Country Day School  4135e47a-62f6-4777-b514-d2e51894603f    1a1aaa73-65d0-490d-b3d3-d828716c5f6b    963a1085-efe7-45f5-81ee-d2bbf82a907c    NaN 3bca1e55-0153-485a-a337-03448396568b    True    True    MCDS    [Private School, WAUNAKEE, WI, PK, K-12]    School  madison-country-day-school-waunakee-wi  1041    [{'config': {'format': ['comma'], 'rounding': ...   Parent  The MCDS faculty is truly exceptional -- they ...   [Overall Experience]    2022-07-22T13:59:50.567397Z 6c714271-25ac-4206-9ef8-38d3ef1f92d6    5.0 [{'description': 'Based on quality of academic...   https://d13b2ieg84qqce.cloudfront.net/68e0beb3...   https://d13b2ieg84qqce.cloudfront.net/3a1cfdcf...   https://d13b2ieg84qqce.cloudfront.net/b2d1416c...   86a7a6ce-2538-4bf1-8703-6b3b44fda5a4    UNLICENSED  https://d13b2ieg84qqce.cloudfront.net/6cb7bdfd...   809aece8-55ce-4632-a3cf-d0a14417ffdc    UNLICENSED  2022-02-09T21:25:15.513499Z © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   dc5b6bd7-5a5c-48ee-bdfd-5780de198bc9    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  https://d13b2ieg84qqce.cloudfront.net/66e8fd60...   https://d13b2ieg84qqce.cloudfront.net/fb59b45f...   2e6d1bd0-7760-46ae-8ce2-02306508b864    UNLICENSED  2022-04-18T18:58:56.007652Z 3.882353    34  [{'label': 'Virtual Tour', 'value': 'https://w...   502396  https://d13b2ieg84qqce.cloudfront.net/6ea5d8cb...   2022-06-08T22:11:36.605259Z NaN NaN NaN NaN NaN
3   35ca6237-c994-4fe6-b5f9-f09142680d7b    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Private High Schools in Wisconsin  4   82  best-private-high-schools/s/wisconsin   https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   43.457700   -88.827400  Wayland Academy 01514944    Private, Boarding   True    True    Private School  35ca6237-c994-4fe6-b5f9-f09142680d7b    True    True    BEAVER DAM, WI  Wayland Academy 3c05ff22-e610-450d-8684-1b9f99edcd1f    NaN 963a1085-efe7-45f5-81ee-d2bbf82a907c    1d49bb1b-d2a1-45e2-ac8e-c8d16ab29f3e    f132a02a-1ead-4325-bf32-9079b435d74c    True    True    Wayland [Private School, BEAVER DAM, WI, 9-12]  School  wayland-academy-beaver-dam-wi   1040    [{'config': {'format': ['comma'], 'rounding': ...   Alum    Though I only attended Wayland for two years (...   [Overall Experience]    2022-08-14T20:05:05.231126Z a0bf7334-047c-4ee8-ab95-59c46dff42b3    5.0 [{'description': 'Based on quality of academic...   https://d13b2ieg84qqce.cloudfront.net/7cc728a3...   https://d13b2ieg84qqce.cloudfront.net/5e24f8a2...   https://d13b2ieg84qqce.cloudfront.net/d7835cfd...   99230263-8332-4b03-b475-b948546402b7    UNLICENSED  https://d13b2ieg84qqce.cloudfront.net/42561f2c...   697e0f82-7ccb-4651-877e-ffe881e188c5    UNLICENSED  NaN © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   c641160c-30c7-4b52-b336-e844ac8a059a    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  https://d13b2ieg84qqce.cloudfront.net/3aaf34d3...   https://d13b2ieg84qqce.cloudfront.net/f197c0a9...   124661d4-71cd-4d13-bfcc-926f3e074ade    UNLICENSED  2022-09-28T16:06:46.315837Z 3.833333    66  [{'label': 'Virtual Tour', 'value': 'https://y...   500170  https://d13b2ieg84qqce.cloudfront.net/9b54f4ea...   2022-07-26T17:26:09.050891Z NaN NaN NaN NaN NaN
4   9b394d9c-46a0-431d-8ae4-62b6142cd46b    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Private High Schools in Wisconsin  5   82  best-private-high-schools/s/wisconsin   https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   42.773585   -87.774410  TPS 01513124    Private True    True    Private School  9b394d9c-46a0-431d-8ae4-62b6142cd46b    True    False   WIND POINT, WI  The Prairie School  5455e716-0063-4d63-a0e2-a07d199cdee1    3940b781-a9f6-4333-b607-6a6367e6af44    963a1085-efe7-45f5-81ee-d2bbf82a907c    5ef4c7c2-c006-49ea-88e9-9f40a0da6ce6    0d949807-5d44-4fc8-8753-1ce81f4a5d67    False   True    Prairie [Private School, WIND POINT, WI, PK, K-12]  School  the-prairie-school-wind-point-wi    41  [{'config': {'format': ['comma'], 'rounding': ...   Alum    The teachers are awesome and so approachable! ...   [Overall Experience]    2020-06-23T03:25:59.897153Z 2d7de44a-38a7-493c-ac87-a024ba85d42d    5.0 [{'description': 'Based on quality of academic...   NaN NaN NaN NaN NaN https://d13b2ieg84qqce.cloudfront.net/608f2378...   a69ad3c5-f274-4bbe-ab1b-f1977c79c6f9    UNLICENSED  2022-02-10T20:38:50.869965Z © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   86b0f616-f1bf-4123-a2bc-f93255053083    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  https://d13b2ieg84qqce.cloudfront.net/a26a05f2...   https://d13b2ieg84qqce.cloudfront.net/a26a05f2...   ee7758c3-09ee-4d81-b4bd-c7f91c98652a    UNLICENSED  NaN 4.642857    70  [{'label': 'Virtual Tour', 'value': 'https://w...   501918  NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95  8365632c-160e-4beb-b75c-dfafca1c2441    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Public Middle Schools in Wisconsin 24  594 best-public-middle-schools/s/wisconsin  https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   44.891018   -87.290723  Sevastopol Middle School    551350000496    Public  False   True    Public School   8365632c-160e-4beb-b75c-dfafca1c2441    False   False   STURGEON BAY, WI    Sevastopol Middle School    caaa657e-9c5e-4740-b72f-bef5b2c75ac1    NaN 963a1085-efe7-45f5-81ee-d2bbf82a907c    65ab2591-75de-487d-8a82-bddd79e3d3bd    7ae55b50-154c-4e0e-aff7-ed2726f7ceb8    False   True    Sevastopol Middle School    [Sevastopol School District, WI, 6-8]   School  sevastopol-middle-school-sturgeon-bay-wi    45  [{'config': {'format': ['comma'], 'rounding': ...   NaN NaN NaN NaN NaN NaN [{'description': 'Based on quality of academic...   NaN NaN NaN NaN NaN NaN NaN NaN NaN © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   e750dc05-07ed-42b0-92e3-f24ab16f1b8b    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  NaN NaN NaN NaN NaN 0.000000    0   [{'label': 'Virtual Tour'}] NaN NaN NaN d4d24c63-d104-44cd-ad3f-0ded85522583    d4d24c63-d104-44cd-ad3f-0ded85522583    NaN NaN NaN
96  d46a53a4-62f4-4086-9a53-4c6f78f54915    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Public Elementary Schools in Wisconsin 36  1074    best-public-elementary-schools/s/wisconsin  https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   42.935063   -88.405594  Prairie View Elementary School  551006001321    Public  True    True    Public School   d46a53a4-62f4-4086-9a53-4c6f78f54915    True    False   NORTH PRAIRIE, WI   Prairie View Elementary School  ba8709ae-856d-4583-83b7-4484b51ed4c2    3940b781-a9f6-4333-b607-6a6367e6af44    963a1085-efe7-45f5-81ee-d2bbf82a907c    1a756678-9c81-4d89-8620-604b8e10507c    8afa0d18-4b1f-4052-a09a-d9bfe3e67295    False   True    Prairie View Elementary School  [Mukwonago Area School District, WI, PK, K-6]   School  prairie-view-elementary-school-north-prairie-wi 45  [{'config': {'format': ['comma'], 'rounding': ...   NaN NaN NaN NaN NaN NaN [{'description': 'Based on quality of academic...   NaN NaN NaN NaN NaN NaN NaN NaN NaN © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   51a79bc3-f67e-4b49-87b1-d36ef46e1145    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  NaN NaN NaN NaN NaN 0.000000    0   [{'label': 'Virtual Tour'}] NaN NaN NaN bda72d2a-3f49-4288-a9f2-d024898ca67b    bda72d2a-3f49-4288-a9f2-d024898ca67b    NaN NaN NaN
97  0ed32d0d-f062-4784-8dd4-57a9724209eb    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Public Middle Schools in Wisconsin 25  594 best-public-middle-schools/s/wisconsin  https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   43.624047   -87.786299  Oostburg Middle School  551107001464    Public  False   True    Public School   0ed32d0d-f062-4784-8dd4-57a9724209eb    False   False   OOSTBURG, WI    Oostburg Middle School  1db5c6d2-5b8f-44fa-87fc-f7471ee45443    NaN 963a1085-efe7-45f5-81ee-d2bbf82a907c    b44a651d-bcfc-4d95-a6ce-f00c0c42671e    d594fdef-3441-462c-93ad-981c8fd1f064    False   True    Oostburg Middle School  [Oostburg School District, WI, 6-8] School  oostburg-middle-school-oostburg-wi  45  [{'config': {'format': ['comma'], 'rounding': ...   Niche User  The middle school does a great job at preparin...   [Academics] 2015-02-12T14:29:22Z    0c61a295-b7c7-44e9-ab7a-64993190796f    5.0 [{'description': 'Based on quality of academic...   NaN NaN NaN NaN NaN NaN NaN NaN NaN © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   32fa503a-a377-44e8-bb10-f9a7fa0bb67c    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  NaN NaN NaN NaN NaN 4.800000    10  [{'label': 'Virtual Tour'}] NaN NaN NaN 3478a622-503f-47d5-93a0-c3207124cdd4    3478a622-503f-47d5-93a0-c3207124cdd4    NaN NaN NaN
98  17b8ac12-d893-4af4-bf79-3aa06bef648a    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Public High Schools in Wisconsin   24  496 best-public-high-schools/s/wisconsin    https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   42.993816   -88.224033  WEPA    551578002688    Public, Charter True    True    Charter School  17b8ac12-d893-4af4-bf79-3aa06bef648a    True    False   WAUKESHA, WI    Waukesha Engineering Preparatory Academy    ba8709ae-856d-4583-83b7-4484b51ed4c2    3940b781-a9f6-4333-b607-6a6367e6af44    963a1085-efe7-45f5-81ee-d2bbf82a907c    5a94913e-87ac-4e4e-9b76-a2330bf1a635    b88c94da-24d3-4004-b43b-547d9da55e0d    False   True    Waukesha Engineering Preparatory Academy    [School District of Waukesha, WI, 9-12] School  waukesha-engineering-preparatory-academy-wauke...   52  [{'config': {'format': ['comma'], 'rounding': ...   Senior  The Academy is well equipped and staffed, and ...   [Overall Experience]    2021-10-13T21:07:42.049714Z 8d8241ae-fc49-4bab-a146-48d77f1e6391    4.0 [{'description': 'Based on quality of academic...   NaN NaN NaN NaN NaN NaN NaN NaN NaN © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   b1db421b-6290-4a08-a854-82dd9089e116    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  NaN NaN NaN NaN NaN 3.681818    22  [{'label': 'Virtual Tour'}] 500331  NaN NaN a368f833-c451-45bb-a0f7-b656d02477f3    a368f833-c451-45bb-a0f7-b656d02477f3    NaN NaN NaN
99  c3b20454-cd71-45bb-ab33-0b3ea37527fb    [{'label': 'View Nearby Homes', 'type': 'realE...   Best Public Elementary Schools in Wisconsin 37  1074    best-public-elementary-schools/s/wisconsin  https://d33a4decm84gsn.cloudfront.net/search/2...   https://d33a4decm84gsn.cloudfront.net/search/2...   43.089194   -87.883770  Atwater Elementary School   551380001809    Public  True    True    Public School   c3b20454-cd71-45bb-ab33-0b3ea37527fb    True    False   SHOREWOOD, WI   Atwater Elementary School   8b295479-c31f-47a9-83b8-94b2100e2832    3940b781-a9f6-4333-b607-6a6367e6af44    963a1085-efe7-45f5-81ee-d2bbf82a907c    900b6b9c-206e-4c34-82a8-247fee552b49    542c1289-ad69-4fc3-afab-bf91c1a6110e    False   True    Atwater Elementary School   [Shorewood School District, WI, PK, K-6]    School  atwater-elementary-school-shorewood-wi  45  [{'config': {'format': ['comma'], 'rounding': ...   NaN NaN NaN NaN NaN NaN [{'description': 'Based on quality of academic...   NaN NaN NaN NaN NaN NaN NaN NaN NaN © Mapbox    https://api.mapbox.com/styles/v1/niche-admin/c...   https://api.mapbox.com/styles/v1/niche-admin/c...   b523d19c-fdd1-497b-bd6d-ab394cde0dbf    © OpenStreetMap http://www.openstreetmap.org/copyright  https://www.mapbox.com/about/maps/  NaN NaN NaN NaN NaN 0.000000    0   [{'label': 'Virtual Tour', 'value': 'https://w...   NaN NaN NaN 84c36616-1b72-4d85-998d-c9795aadb726    84c36616-1b72-4d85-998d-c9795aadb726    NaN NaN NaN
100 rows × 73 columns

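You can get all the data by adjusting the range (the maximum number of records is 123). Also, you may want to add some pauses between requests, otherwise you will get blocked. You can use Scrapy for this as well, if you wish.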
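As a rough sketch of the "adjust the range and pause between requests" advice, the loop could be wrapped as below. The endpoint, query parameters and the 'entities' key come from the snippet above; the helper names `make_fetcher` and `fetch_pages`, the one-second default delay, and the "stop on an empty page" rule are assumptions made for illustration, not behaviour confirmed for this API.

```python
import time
from typing import Callable, Iterable, List

# Endpoint and query parameters are taken from the snippet above.
API_URL = 'https://www.niche.com/api/renaissance/results/'


def make_fetcher(session) -> Callable[[int], list]:
    """Build fetch_page(page) around a requests.Session-like object (hypothetical helper)."""
    def fetch_page(page: int) -> list:
        r = session.get(API_URL, params={
            'state': 'wisconsin',
            'listURL': 'best-schools',
            'page': page,
            'searchType': 'school',
        })
        r.raise_for_status()
        # The snippet above reads the records from the 'entities' key.
        return r.json().get('entities', [])
    return fetch_page


def fetch_pages(fetch_page: Callable[[int], list],
                pages: Iterable[int],
                delay_seconds: float = 1.0,
                sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    """Collect records page by page, pausing between consecutive requests."""
    records: List[dict] = []
    for index, page in enumerate(pages):
        if index > 0:
            sleep(delay_seconds)  # pause between requests, not before the first one
        entities = fetch_page(page)
        if not entities:
            break  # assumption: an empty page means there is nothing further
        records.extend(entities)
    return records
```

With requests installed, something like `fetch_pages(make_fetcher(s), range(1, 5))` would reuse the session `s` from the snippet above, and the returned list of dicts could then be passed to `pd.json_normalize`.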

You need to inspect the HTML carefully, because you can find the URL in a div:

import scrapy


class NicheschoolsSpider(scrapy.Spider):
    name = 'nicheschools'
    allowed_domains = ['www.niche.com']
    start_urls = ['https://www.niche.com/k12/search/best-schools/s/wisconsin/']

    def parse(self, response):
        school_links = response.xpath("//div[@class='card ']/a/@href").extract()

        for link in school_links:
            yield response.follow(url=link, callback=self.parse_schools)

    def parse_schools(self, response):
        name = response.xpath("//h1[@class='postcard__title postcard__title--claimed']/text()").extract_first()
        website = response.xpath("(//a[@class='profile__website__link']/@href)[1]").extract_first()
        address = response.xpath("(//address[@class='profile__address--compact']/text())[1]").extract_first()

        yield {
            'name': name,
            'link': response.url,
            'website': website,
            'address': address,
        }

Results in JSON:

{'name': 'Brookfield Academy', 'link': 'https://www.niche.com/k12/brookfield-academy-brookfield-wi/', 'website': 'https://www.brookfieldacademy.org', 'address': '3462 N BROOKFIELD RD'}
{'name': 'Wisconsin Lutheran High School', 'link': 'https://www.niche.com/k12/wisconsin-lutheran-high-school-milwaukee-wi/', 'website': 'https://www.wlhs.org', 'address': '330 N GLENVIEW AVE'}
{'name': 'Homestead High School', 'link': 'https://www.niche.com/k12/homestead-high-school-mequon-wi/', 'website': 'http://www.mtsd.k12.wi.us/homestead/', 'address': '5000 W MEQUON RD'}
{'name': 'Brookfield Central High School', 'link': 'https://www.niche.com/k12/brookfield-central-high-school-brookfield-wi/', 'website': 'https://www.elmbrookschools.org/brookfield-central-high-school', 'address': '16900 W GEBHARDT RD'}
{'name': 'Shorewood High School', 'link': 'https://www.niche.com/k12/shorewood-high-school-shorewood-wi/', 'website': 'https://www.shorewood.k12.wi.us/apps/pages/shs', 'address': '1701 E CAPITOL DR'}
{'name': 'School District of Waukesha', 'link': 'https://www.niche.com/k12/d/school-district-of-waukesha-wi/', 'website': 'https://sdw.waukesha.k12.wi.us', 'address': '222 MAPLE AVE'}
{'name': 'Pilgrim Park Middle School', 'link': 'https://www.niche.com/k12/pilgrim-park-middle-school-elm-grove-wi/', 'website': 'http://www.elmbrookschools.org/', 'address': '1500 PILGRIM PKWY'}
{'name': 'Marquette University High School', 'link': 'https://www.niche.com/k12/marquette-university-high-school-milwaukee-wi/', 'website': 'https://www.muhs.edu/', 'address': '3401 W WISCONSIN AVE'}
...
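To guard against placeholder links like the href='#' from the question, one option is to filter each href before following it. The helper below is only a sketch: the name `resolve_school_link` is hypothetical, and treating fragment-only and `javascript:` values as "not followable" is an assumption about what counts as a placeholder.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_school_link(page_url: str, href: Optional[str]) -> Optional[str]:
    """Return an absolute URL for href, or None if it looks like a placeholder."""
    if href is None:
        return None
    href = href.strip()
    # '#' and other fragment-only values point back at the current page,
    # so following them would only re-request the search results.
    if not href or href.startswith('#') or href.lower().startswith('javascript:'):
        return None
    return urljoin(page_url, href)
```

Inside `parse`, a spider could call `resolve_school_link(response.url, link)` and only yield `response.follow(...)` when the result is not None.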

If you are new to web scraping, be careful not to hit the site too hard: they may block you, and you would then have to solve a CAPTCHA to get back into the site.
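One way to slow a spider down is through Scrapy's built-in throttling settings. The setting names below are standard Scrapy settings, but the values are illustrative assumptions and have not been tuned against niche.com; `POLITE_SETTINGS` is just a name chosen for this sketch.

```python
# Illustrative values only (assumptions), not tuned for any particular site.
POLITE_SETTINGS = {
    'ROBOTSTXT_OBEY': True,               # respect the site's robots.txt
    'DOWNLOAD_DELAY': 2.0,                # seconds to wait between requests
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # one request at a time per domain
    'AUTOTHROTTLE_ENABLED': True,         # adapt the delay to server latency
    'AUTOTHROTTLE_START_DELAY': 2.0,      # initial AutoThrottle delay, in seconds
    'AUTOTHROTTLE_MAX_DELAY': 30.0,       # upper bound on the adaptive delay
}
```

The dict can be assigned to the spider class as `custom_settings = POLITE_SETTINGS`, or the same keys can be placed in the project's settings.py.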

Also, if you want to expand your knowledge, you can use a web scraping cluster such as Estela, where you can run your spiders and also create cron jobs to run them daily.

