Scraping a table by ID with Python's BeautifulSoup

Question

I'm new to BeautifulSoup and still learning it, and I'm having trouble scraping a table. This is the HTML I'm trying to parse:

<table id="ctl00_mainContent_DataList1" cellspacing="0" style="width:80%;border-collapse:collapse;">
    <tbody>
        <tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
        <tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
        <tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
        <tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
        ...

My code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

quote_page = 'https://www.bcdental.org/yourdentalhealth/findadentist.aspx'
page = urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')

table = soup.find('table', id="ctl00_mainContent_DataList1")
rows = table.findAll('tr')

I get AttributeError: 'NoneType' object has no attribute 'findAll'. I'm using Python 3.6 in a Jupyter notebook.

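EDIT: The table data I want to parse only appears on the page after a search is submitted (in the city field, select Burnaby, then click Search). The table ctl00_mainContent_DataList1 is the list of dentists displayed after the search is submitted.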
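
The error itself is consistent with that: find() returns None when no matching element exists, and calling findAll on None raises the AttributeError. A minimal sketch, assuming a made-up HTML fragment that stands in for the page before any search has been submitted:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the search form exists, the results table does not yet.
html_before_search = '<form><select id="city"></select></form>'
soup = BeautifulSoup(html_before_search, 'html.parser')

# find() returns None when nothing matches
table = soup.find('table', id='ctl00_mainContent_DataList1')
table_missing = table is None

# Guard before calling find_all, otherwise AttributeError is raised on None
rows = table.find_all('tr') if table is not None else []
```

Checking for None before using the result makes the failure visible as "table not in this response" instead of an attribute error further down.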

Answer 1

First: I use requests because it makes working with cookies, headers, etc. easier.

The page is generated by ASP.NET, and it sends the values __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION, which you have to send back in the POST request.

You have to load the page with GET first to obtain these values.
You can also use requests.Session() to pick up the cookies, which may be needed as well.
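
As an illustration of reading those values out of the GET response, here is a minimal sketch that collects every hidden input from a form. The sample HTML and its short values are made up for the example (real ASP.NET values are long encoded strings), and hidden_fields is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the GET response body.
sample_html = '''
<form method="post">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="DEF456" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="ctl00$terms" value="" />
</form>
'''

def hidden_fields(html):
    """Return {name: value} for every hidden <input> that has a name."""
    soup = BeautifulSoup(html, 'html.parser')
    fields = {}
    for tag in soup.find_all('input', attrs={'type': 'hidden'}):
        name = tag.get('name')
        if name:
            fields[name] = tag.get('value', '')
    return fields

fields = hidden_fields(sample_html)
```

Collecting all hidden inputs at once avoids hard-coding the list of keys, at the cost of possibly sending a field the server does not need.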

Next you have to copy those values, add the parameters from the form, and send everything with POST.

In the code I included only the parameters that are always sent.

'526' is the code for Vancouver. You can find the other codes in the <select> tag.
If you want other options, you may have to add other parameters.
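
One way to look up the other city codes is to read the option values from that <select>. A minimal sketch: only the '526' / Vancouver pair and the field name ctl00$mainContent$drpCity come from this answer; the other options and the id attribute in the sample are assumptions for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt of the search form.
sample_html = '''
<select name="ctl00$mainContent$drpCity" id="ctl00_mainContent_drpCity">
  <option value="0">Select a city</option>
  <option value="111">Example City</option>
  <option value="526">Vancouver</option>
</select>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

# 'name' is find()'s first positional parameter, so the HTML name
# attribute has to be passed through attrs.
select = soup.find('select', attrs={'name': 'ctl00$mainContent$drpCity'})

# Map the visible city label to the code the form submits.
city_codes = {opt.get_text(strip=True): opt['value']
              for opt in select.find_all('option')}
```

With such a mapping, city_codes['Vancouver'] gives the value to put in 'ctl00$mainContent$drpCity'.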

E.g. ctl00$mainContent$chkUndr4Ref: on is for Children: 3 & Under - Diagnose & Refer

EDIT: Because each <tr> contains a <table>, find_all('tr') returns too many elements (the outer tr and the inner tr), and the later find_all('td') gives the same td many times. I changed find_all('tr') into find_all('table'), which should stop the duplicated data.
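
The effect of the nesting can be reproduced on a small made-up fragment. This is a sketch with hypothetical data, assuming the same outer-table / inner-table layout shown in the question:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: each outer row wraps one inner table per record.
sample_html = '''
<table id="ctl00_mainContent_DataList1">
  <tr><td><table><tr><td>Dr. A</td></tr><tr><td>111 1111</td></tr></table></td></tr>
  <tr><td><table><tr><td>Dr. B</td></tr><tr><td>222 2222</td></tr></table></td></tr>
</table>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
outer = soup.find('table', id='ctl00_mainContent_DataList1')

# find_all searches all descendants: outer rows AND inner rows are returned
all_rows = outer.find_all('tr')

# Iterating over the inner tables yields exactly one item per record
inner_tables = outer.find_all('table')
records = [[td.get_text(strip=True) for td in t.find_all('td')]
           for t in inner_tables]
```

Here all_rows has six entries for two records, while inner_tables has two, which is why iterating over the inner tables avoids the repeated cells.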

import requests
from bs4 import BeautifulSoup

url = 'https://www.bcdental.org/yourdentalhealth/findadentist.aspx'

# --- session ---

s = requests.Session() # to automatically copy cookies
#s.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0'})

# --- GET request ---

# get page to get cookies and params
response = s.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# --- set params ---

params = {
    # session - copy from GET request
    #'EktronClientManager': '',
    #'__VIEWSTATE': '',
    #'__VIEWSTATEGENERATOR': '',
    #'__EVENTVALIDATION': '',
    # main options
    'ctl00$terms': '',
    'ctl00$mainContent$drpCity': '526',
    'ctl00$mainContent$txtPostalCode': '',
    'ctl00$mainContent$drpSpecialty': 'GP',
    'ctl00$mainContent$drpLanguage': '0',
    'ctl00$mainContent$drpSedation': '0',
    'ctl00$mainContent$btnSearch': '+Search+',
    # other options
    #'ctl00$mainContent$chkUndr4Ref': 'on',
}

# copy from GET request
for key in ['EktronClientManager', '__VIEWSTATE', '__VIEWSTATEGENERATOR', '__EVENTVALIDATION']:
    value = soup.find('input', id=key)['value']
    params[key] = value
    #print(key, ':', value)

# --- POST request ---

# get page with table - using params
response = s.post(url, data=params)#, headers={'Referer': url})
soup = BeautifulSoup(response.text, 'html.parser')

# --- data ---

table = soup.find('table', id='ctl00_mainContent_DataList1')

if not table:
    print('no table')
    #table = soup.find_all('table')
    #print('count:', len(table))
    #print(response.text)
else:
    for row in table.find_all('table'):
        for column in row.find_all('td'):
            text = ', '.join(x.strip() for x in column.text.split('\n') if x.strip()).strip()
            print(text)

        # separator after each record (one inner table per dentist)
        print('-----')

Part of the result:

Map
Dr. Kashyap Vora, 6145 Fraser Street, Vancouver  V5W 2Z9
604 321 1869, www.voradental.ca
-----
Map
Dr. Niloufar Shirzad, Harbour Centre DentalL19 - 555 Hastings Street West, Vancouver  V6B 4N6
604 669 1195, www.harbourcentredental.com
-----
Map
Dr. Janice Brennan, 902 - 805 Broadway West, Vancouver  V5Z 1K1
604 872 2525
-----
Map
Dr. Rosemary Chang, 1240 Kingsway, Vancouver  V5V 3E1
604 873 1211
-----
Map
Dr. Mersedeh Shahabaldine, 3641 Broadway West, Vancouver  V6R 2B8
604 734 2114, www.westkitsdental.com
-----

Answer 1: score 2, accepted, 2018-01-03 10:02:25
