How to Nest If Statement Within For Loop When Scraping Div Class HTML

Below is a scraper that uses Beautiful Soup to scrape physician information from this web page. As you can see from the HTML code below, each physician has a profile on the page that displays their name, clinic, profession, taxonomy, and city.

<div class="views-field views-field-title practitioner__name" ><a href="/practitioners/41824">Marilyn Adams</a></div>
              <div class="views-field views-field-field-pract-clinic practitioner__clinic" ><a href="/clinic/fortius-sport-health">Fortius Sport &amp; Health</a></div>
              <div class="views-field views-field-field-pract-profession practitioner__profession" >Physiotherapist</div>
              <div class="views-field views-field-taxonomy-vocabulary-5 practitioner__region" >Fraser River Delta</div>
              <div class="views-field views-field-city practitioner__city" ></div>

As you can see from the sample HTML code, physician profiles are occasionally missing information. When that happens, I want the scraper to print "N/A". I need the "N/A" placeholders because I eventually want to put each div class category (name, clinic, profession, etc.) into an array where every column has exactly the same length, so that I can export the data to a CSV file correctly. Here is an example of the output I want, compared to what actually appears.

Actual            Expected

[Names]            [Names]
Greg               Greg
Bob                Bob

[Clinic]           [Clinic]
Sport/Health       Sport/Health
                   N/A

[Profession]       [Profession]
Physical Therapist  Physical Therapist
Physical Therapist  Physical Therapist

[Taxonomy]          [Taxonomy]
Fraser River        Fraser River
                    N/A

[City]              [City]
Vancouver           Vancouver
Vancouver           Vancouver

I have tried writing an if statement nested inside each for loop, but the code doesn't seem to loop correctly, since "N/A" prints only once per div class section. Does anyone know how to properly nest the if statement within the for loop so that I get the appropriate number of "N/A"s in each column? Thanks in advance!

import requests
import re
from bs4 import BeautifulSoup

page=requests.get('https://sportmedbc.com/practitioners')
soup=BeautifulSoup(page.text, 'html.parser')

#Find Doctor Info

for doctor in soup.find_all('div',attrs={'class':'views-field views-field-title practitioner__name'}):
    for a in doctor.find_all('a'):
        print(a.text)

for clinic_name in soup.find_all('div',attrs={'class':'views-field views-field-field-pract-clinic practitioner__clinic'}):
    for b in clinic_name.find_all('a'):
        if b==(''):
            print('N/A')

profession_links=soup.findAll('div',attrs={'class':'views-field views-field-field-pract-profession practitioner__profession'})
for profession in profession_links:
    if profession.text==(''):
        print('N/A')
    print(profession.text)

taxonomy_links=soup.findAll('div',attrs={'class':'views-field views-field-taxonomy-vocabulary-5 practitioner__region'})
for taxonomy in taxonomy_links:
    if taxonomy.text==(''):
        print('N/A')
    print(taxonomy.text)

city_links=soup.findAll('div',attrs={'class':'views-field views-field-taxonomy-vocabulary-5 practitioner__region'})
for city in city_links:
    if city.text==(''):
        print('N/A')
    print(city.text)
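(An editorial aside on why the checks above never fire: when a clinic link is missing, `clinic_name.find_all('a')` returns an empty list, so the inner loop body never runs at all; and `.text` on the other divs usually contains surrounding whitespace, so it never equals `''`. A minimal reproduction using hypothetical markup:)

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring the page: the div's text contains
# whitespace rather than being the empty string ''.
html = '<div class="practitioner__city">\n  </div>'
soup = BeautifulSoup(html, 'html.parser')

city = soup.find('div', class_='practitioner__city')
print(repr(city.text))          # whitespace, not ''
print(city.text == '')          # False: this comparison never matches
print(city.text.strip() == '')  # True: strip() reveals the field is empty
```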

For this problem you can use ChainMap from the collections module (see the documentation). With it you can define default values, `'n/a'` in this case, and fetch only the information that actually exists for each doctor:

from bs4 import BeautifulSoup
import requests
from collections import ChainMap

url = 'https://sportmedbc.com/practitioners'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

def get_data(soup):
    default_data = {'name': 'n/a', 'clinic': 'n/a', 'profession': 'n/a', 'region': 'n/a', 'city': 'n/a'}

    for doctor in soup.select('.view-practitioners .practitioner'):
        doctor_data = {}

        if doctor.select_one('.practitioner__name').text.strip():
            doctor_data['name'] = doctor.select_one('.practitioner__name').text

        if doctor.select_one('.practitioner__clinic').text.strip():
            doctor_data['clinic'] = doctor.select_one('.practitioner__clinic').text

        if doctor.select_one('.practitioner__profession').text.strip():
            doctor_data['profession'] = doctor.select_one('.practitioner__profession').text

        if doctor.select_one('.practitioner__region').text.strip():
            doctor_data['region'] = doctor.select_one('.practitioner__region').text

        if doctor.select_one('.practitioner__city').text.strip():
            doctor_data['city'] = doctor.select_one('.practitioner__city').text

        yield ChainMap(doctor_data, default_data)

for doctor in get_data(soup):
    print('name:\t\t', doctor['name'])
    print('clinic:\t\t',doctor['clinic'])
    print('profession:\t',doctor['profession'])
    print('city:\t\t',doctor['city'])
    print('region:\t\t',doctor['region'])
    print('-' * 80)

Prints:

name:        Jaimie Ackerman
clinic:      n/a
profession:  n/a
city:        n/a
region:      n/a
--------------------------------------------------------------------------------
name:        Marilyn Adams
clinic:      Fortius Sport & Health
profession:  Physiotherapist
city:        n/a
region:      Fraser River Delta
--------------------------------------------------------------------------------
name:        Mahsa Ahmadi
clinic:      Wellpoint Acupuncture (Sports Medicine)
profession:  Acupuncturist
city:        Vancouver
region:      Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name:        Tracie Albisser
clinic:      Pacific Sport Northern BC, Tracie Albisser
profession:  Strength and Conditioning Specialist, Exercise Physiologist
city:        n/a
region:      Cariboo - North East
--------------------------------------------------------------------------------
name:        Christine Alder
clinic:      n/a
profession:  n/a
city:        Vancouver
region:      Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name:        Steacy Alexander
clinic:      Go! Physiotherapy Sports and Wellness Centre
profession:  Physiotherapist
city:        Vancouver
region:      Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name:        Page Allison
clinic:      AET Clinic, .
profession:  Athletic Therapist
city:        Victoria
region:      Vancouver Island - Central Coast
--------------------------------------------------------------------------------
name:        Dana Alumbaugh
clinic:      n/a
profession:  Podiatrist
city:        Squamish
region:      Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name:        Manouch Amel
clinic:      Mountainview Kinesiology Ltd.
profession:  Strength and Conditioning Specialist
city:        Anmore
region:      Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name:        Janet Ames
clinic:      Dr. Janet Ames
profession:  Physician
city:        Prince George
region:      Cariboo - North East
--------------------------------------------------------------------------------
name:        Sandi Anderson
clinic:      n/a
profession:  n/a
city:        Coquitlam
region:      Fraser Valley
--------------------------------------------------------------------------------
name:        Greg Anderson
clinic:      University of the Fraser Valley
profession:  Exercise Physiologist
city:        Mission
region:      Fraser Valley
--------------------------------------------------------------------------------
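The key idea in `get_data()` is the lookup order of `ChainMap`: a key is searched in the scraped data first, and only falls back to `default_data` when it is missing. A standalone illustration (values borrowed from the output above):

```python
from collections import ChainMap

default_data = {'name': 'n/a', 'clinic': 'n/a', 'city': 'n/a'}
doctor_data = {'name': 'Marilyn Adams', 'clinic': 'Fortius Sport & Health'}

# Lookups try doctor_data first, then fall back to default_data
doctor = ChainMap(doctor_data, default_data)
print(doctor['name'])  # Marilyn Adams -- found in the scraped data
print(doctor['city'])  # n/a -- missing, so the default wins
```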

EDIT:

To get the output arranged in columns, you can use this example:

def print_data(header_text, data, key):
    print(header_text)
    for d in data:
        print(d[key])
    print()

data = list(get_data(soup))
print_data('[Names]', data, 'name')
print_data('[Clinic]', data, 'clinic')
print_data('[Profession]', data, 'profession')
print_data('[Taxonomy]', data, 'region')
print_data('[City]', data, 'city')

Prints:

[Names]
Jaimie Ackerman
Marilyn Adams
Mahsa Ahmadi
Tracie Albisser
Christine Alder
Steacy Alexander
Page Allison
Dana Alumbaugh
Manouch Amel
Janet Ames
Sandi Anderson
Greg Anderson

[Clinic]
n/a
Fortius Sport & Health
Wellpoint Acupuncture (Sports Medicine)
Pacific Sport Northern BC, Tracie Albisser
n/a
Go! Physiotherapy Sports and Wellness Centre
AET Clinic, .
n/a
Mountainview Kinesiology Ltd.
Dr. Janet Ames
n/a
University of the Fraser Valley

[Profession]
n/a
Physiotherapist
Acupuncturist
Strength and Conditioning Specialist, Exercise Physiologist
n/a
Physiotherapist
Athletic Therapist
Podiatrist
Strength and Conditioning Specialist
Physician
n/a
Exercise Physiologist

[Taxonomy]
n/a
Fraser River Delta
Vancouver & Sea to Sky
Cariboo - North East
Vancouver & Sea to Sky
Vancouver & Sea to Sky
Vancouver Island - Central Coast
Vancouver & Sea to Sky
Vancouver & Sea to Sky
Cariboo - North East
Fraser Valley
Fraser Valley

[City]
n/a
n/a
Vancouver
n/a
Vancouver
Vancouver
Victoria
Squamish
Anmore
Prince George
Coquitlam
Mission
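Since the end goal is a CSV file, the same per-doctor mappings can be written out with `csv.DictWriter`. A sketch using hypothetical rows in place of `list(get_data(soup))`:

```python
import csv
from collections import ChainMap

defaults = {'name': 'n/a', 'clinic': 'n/a', 'profession': 'n/a',
            'region': 'n/a', 'city': 'n/a'}

# Hypothetical rows standing in for list(get_data(soup))
data = [
    ChainMap({'name': 'Greg Anderson',
              'clinic': 'University of the Fraser Valley'}, defaults),
    ChainMap({'name': 'Sandi Anderson', 'city': 'Coquitlam'}, defaults),
]

fieldnames = ['name', 'clinic', 'profession', 'region', 'city']
with open('practitioners.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    # dict(row) flattens each ChainMap into a plain dict, so missing
    # fields are filled with 'n/a' before being written
    writer.writerows(dict(row) for row in data)
```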
