
Extract URLs from a class using Scrapy

I am trying to use Scrapy to get a list of URLs from this website. I know the class of the div, and I want the href of every a tag directly inside it.

Here is the link to the website. I am trying to get the URL of each profile listed on it.

https://www.letsmakeaplan.org/find-a-cfp-professional?limit=10&pg=1&sort=random&distance=5

This is the code I use to try to pull the URLs from the page above:

sel = Selector(text=driver.page_source)
books1 = sel.xpath("//div[@class='faceted-search-results-container-listing']/a/@herf").extract()

This comes back empty.

This is the HTML from the website:

<div class="faceted-search-results-container-listing" style="">
        <a href="/find-a-cfp-professional/certified-professional-profile/a9a0ca36-3c70-4ea4-a853-7f704fe4cc98" class="find-cfp-item js-card-link">
          <div class="find-cfp-item-top">
            <div class="h5 find-cfp-item-name">C. H. Simmons, CFP®</div>
            <div class="find-cfp-item-read-more"><span>view details</span></div>
          </div>

          <div class="find-cfp-item-bottom">
            <div class="find-cfp-item-column" data-column="1">
              <img src="https://login.cfp.net/eweb/photos/91475.jpg" data-default-img="/-/media/feature/cfp/lmapprofile/default-profile-avatar.jpeg" data-default-img-backup="/images/default-profile-avatar.jpeg" alt="C. Simmons Headshot" class="find-cfp-item-headshot" onerror="handleImg(this, event);">
              <div class="find-cfp-item-text">
                
      Simmons and Starzl Wealth Management<br>
      110 Bay St<br>
      Gadsden, AL 35901-5229<br>
    
              </div>
            </div>

            <div class="find-cfp-item-column" data-column="2">
              <div class="h6 find-cfp-item-column-heading">Planning Services Offered</div>
              <div class="find-cfp-item-text" data-line-clamp="4">
                Investment Planning, Retirement Planning
              </div>
            </div>

            <div class="find-cfp-item-column" data-column="3">
              <div class="find-cfp-item-column-inner">
                <div class="h6 find-cfp-item-column-heading">Client Focus</div>
                <div class="find-cfp-item-text" data-line-clamp="1">
                  None Provided
                </div>
              </div>

              <div class="find-cfp-item-column-inner">
                <div class="h6 find-cfp-item-column-heading">Minimum Investable Assets</div>
                <div class="find-cfp-item-text" data-line-clamp="1">
                  $500,000
                </div>
              </div>

              

            </div>
          </div>
        </a>

It looks like the search results come from an AJAX call to a JSON API and are rendered dynamically.

You can get all of the information from the raw JSON data if you scrape the API URL instead:

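import scrapy

class CfpLinksSpider(scrapy.Spider):
    name = "cfp_links"  # class and spider names are arbitrary
    # Request the JSON API behind the search page, not the HTML page itself
    start_urls = ["https://www.letsmakeaplan.org/api/feature/lmapprofilesearch/search?limit=10&pg=1&sort=random&distance=5"]

    def parse(self, response):
        data = response.json()  # response.json() requires Scrapy 2.2 or later
        results = data["results"]
        links = [i["item_url"] for i in results]
        yield {'links': links}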

output:

'/find-a-cfp-professional/certified-professional-profile/b1a27bac-77f0-4796-ab7f-7e15c19d8421'
'/find-a-cfp-professional/certified-professional-profile/e493f31f-88c7-4fdd-9863-9712ba85c95c'
'/find-a-cfp-professional/certified-professional-profile/2d634f05-331e-4699-b1a8-96e7a20aa0bf'
'/find-a-cfp-professional/certified-professional-profile/d9074216-7321-469f-b42f-2988d84d4a2b'
'/find-a-cfp-professional/certified-professional-profile/7f55e98c-df27-4922-b3a4-07c341a87f65'
'/find-a-cfp-professional/certified-professional-profile/1b0377a2-4545-45af-9ac4-18a8af2ffecd'
'/find-a-cfp-professional/certified-professional-profile/66b78e79-608b-4079-86c2-d9ae84c3a762'
'/find-a-cfp-professional/certified-professional-profile/e884f42b-8239-475a-b55f-5bb6f1130a36'
'/find-a-cfp-professional/certified-professional-profile/b00abd44-5969-4f02-a052-e6ef34b60e9b'
'/find-a-cfp-professional/certified-professional-profile/10ae9e9f-f11e-4f79-91c4-05f24e0c7a0e'
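The paths above are relative, so they need the site's base URL before they can be requested. Here is a minimal sketch of that step using only the standard library. It assumes the JSON has the shape used above (a "results" list whose entries carry "item_url"); the function name, BASE_URL constant, and sample payload are made up for illustration.

```python
from urllib.parse import urljoin

# Assumed base: scheme and host of the search page from the question.
BASE_URL = "https://www.letsmakeaplan.org"


def absolute_profile_urls(data, base_url=BASE_URL):
    """Return absolute profile URLs from a parsed search response.

    Assumes data["results"] is a list of dicts that each hold a
    relative path under the "item_url" key.
    """
    return [urljoin(base_url, item["item_url"]) for item in data["results"]]


# Hand-written sample that mirrors only the fields used above.
sample = {
    "results": [
        {"item_url": "/find-a-cfp-professional/certified-professional-profile/b1a27bac-77f0-4796-ab7f-7e15c19d8421"},
        {"item_url": "/find-a-cfp-professional/certified-professional-profile/e493f31f-88c7-4fdd-9863-9712ba85c95c"},
    ]
}

urls = absolute_profile_urls(sample)
for url in urls:
    print(url)
```

Inside a Scrapy callback, response.urljoin(path) does the same join against the response's own URL, so the helper is mainly useful when handling the JSON outside a spider.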
