BeautifulSoup scraping table id with Python

I'm new to scraping and am learning to use BeautifulSoup, but I'm having trouble scraping a table. This is the HTML I'm trying to parse:

<table id="ctl00_mainContent_DataList1" cellspacing="0" style="width:80%;border-collapse:collapse;">
    <tbody>
        <tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
        <tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
        <tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
        <tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
        ...

My code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

quote_page = 'https://www.bcdental.org/yourdentalhealth/findadentist.aspx'
page = urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')

table = soup.find('table', id="ctl00_mainContent_DataList1")
rows = table.findAll('tr')

I get AttributeError: 'NoneType' object has no attribute 'findAll'. I'm using Python 3.6 and Jupyter Notebook, in case that matters.

EDIT: The table data that I'm trying to parse only shows up on the page after requesting a search (in the city field, select Burnaby and hit search). The table ctl00_mainContent_DataList1 is the list of dentists that shows up after the search is submitted.

First: I use requests because it is easier to work with cookies, headers, etc.


The page is generated by ASP.NET and it sends the values __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION, which you have to send back in the POST request too.

You have to load the page using GET first, and then you can read those values from it.
You can also use requests.Session() to keep the cookies, which may be needed too.
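The GET step above can be sketched as a small helper that collects the hidden inputs from the returned HTML. This is only an illustration: the function name `hidden_fields` and the sample markup are made up, and the real page may carry more or fewer hidden inputs.

```python
from bs4 import BeautifulSoup


def hidden_fields(html):
    """Return {name: value} for every <input type="hidden"> in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    fields = {}
    for tag in soup.find_all('input', attrs={'type': 'hidden'}):
        name = tag.get('name')
        if name:  # inputs without a name are not submitted with the form
            fields[name] = tag.get('value', '')
    return fields


# Made-up sample standing in for the HTML returned by the GET request
sample = '''
<form method="post">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="ctl00$terms" value="" />
</form>
'''

print(hidden_fields(sample))
```

Collecting every hidden input, rather than a fixed list of ids, is one possible design choice; it avoids a KeyError if the page adds or drops a field, at the cost of possibly sending fields the server does not need.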

Next you have to copy those values, add the parameters from the form, and send everything using POST.

In the code I put only the parameters which are always sent.
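To see roughly what such a POST body looks like on the wire, the standard library's `urlencode` can encode a couple of these parameters the way a form submission does. This is a sketch with a reduced parameter set, and the button value `' Search '` (with spaces) is a hypothetical value chosen for the demonstration, not something confirmed for this site. It shows that a space is sent as `+` while a literal `+` is sent as `%2B`, which may matter when values are copied from the raw request body in a browser's network panel.

```python
from urllib.parse import urlencode, parse_qs

params = {
    'ctl00$mainContent$drpCity': '526',
    'ctl00$mainContent$btnSearch': ' Search ',  # hypothetical value with spaces
}

# Form-encode the dict: '$' becomes %24 and each space becomes '+'
body = urlencode(params)
print(body)

# A literal '+' in a value is percent-encoded instead
print(urlencode({'btn': '+Search+'}))

# Decoding the body gives back the original keys and values
print(parse_qs(body))
```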

'526' is the code for Vancouver. You can find the other codes in the <select> tag.
If you want other options then you may have to add other parameters.
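One way to look up the codes is to read the `<option>` elements of the city drop-down from the GET response. The sketch below runs on a made-up fragment: the id `ctl00_mainContent_drpCity` is an assumption derived from the parameter name, the "All" entry is invented, and only the `526` / Vancouver pair comes from the answer.

```python
from bs4 import BeautifulSoup


def select_options(html, select_id):
    """Map each option's visible label to its value for one <select>."""
    soup = BeautifulSoup(html, 'html.parser')
    select = soup.find('select', id=select_id)
    if select is None:  # no such drop-down in this HTML
        return {}
    return {
        option.get_text(strip=True): option.get('value', '')
        for option in select.find_all('option')
    }


# Made-up fragment; on the real page the HTML would come from the GET response
sample = '''
<select name="ctl00$mainContent$drpCity" id="ctl00_mainContent_drpCity">
  <option value="0">All</option>
  <option value="526">Vancouver</option>
</select>
'''

codes = select_options(sample, 'ctl00_mainContent_drpCity')
print(codes)
```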

E.g. ctl00$mainContent$chkUndr4Ref: on is for Children: 3 & Under - Diagnose & Refer

EDIT: because there is a <table> inside each <tr>, find_all('tr') returns too many elements (both the outer tr and the inner tr), and the later find_all('td') gives the same td many times. I changed find_all('tr') into find_all('table') and it should stop duplicating data.
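The duplication can be reproduced on a small made-up document with the same shape, where each outer row wraps one inner table. The markup and the id `outer` are invented for the demonstration.

```python
from bs4 import BeautifulSoup

# Made-up markup: an outer table whose rows each contain one inner table
html = (
    '<table id="outer">'
    '<tr><td><table><tr><td>A</td><td>B</td></tr></table></td></tr>'
    '<tr><td><table><tr><td>C</td><td>D</td></tr></table></td></tr>'
    '</table>'
)

soup = BeautifulSoup(html, 'html.parser')
outer = soup.find('table', id='outer')

# find_all searches all descendants, so outer and inner rows are both matched
all_rows = outer.find_all('tr')
cells_via_rows = [td.get_text() for tr in all_rows for td in tr.find_all('td')]

# Iterating over the nested tables visits each record exactly once
inner_tables = outer.find_all('table')
cells_via_tables = [td.get_text() for t in inner_tables for td in t.find_all('td')]

print(len(all_rows), len(cells_via_rows))        # rows and cells, with repeats
print(len(inner_tables), cells_via_tables)       # one entry per record
```

Here the row-based loop sees every inner cell twice (once through the outer row and once through the inner row) plus the outer cells themselves, while the table-based loop yields each cell once.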

import requests
from bs4 import BeautifulSoup

url = 'https://www.bcdental.org/yourdentalhealth/findadentist.aspx'

# --- session ---

s = requests.Session() # to automatically copy cookies
#s.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0'})

# --- GET request ---

# get page to get cookies and params
response = s.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# --- set params ---

params = {
    # session - copy from GET request
    #'EktronClientManager': '',
    #'__VIEWSTATE': '',
    #'__VIEWSTATEGENERATOR': '',
    #'__EVENTVALIDATION': '',
    # main options
    'ctl00$terms': '',
    'ctl00$mainContent$drpCity': '526',
    'ctl00$mainContent$txtPostalCode': '',
    'ctl00$mainContent$drpSpecialty': 'GP',
    'ctl00$mainContent$drpLanguage': '0',
    'ctl00$mainContent$drpSedation': '0',
    'ctl00$mainContent$btnSearch': '+Search+',
    # other options
    #'ctl00$mainContent$chkUndr4Ref': 'on',
}

# copy from GET request
for key in ['EktronClientManager', '__VIEWSTATE', '__VIEWSTATEGENERATOR', '__EVENTVALIDATION']:
    value = soup.find('input', id=key)['value']
    params[key] = value
    #print(key, ':', value)

# --- POST request ---

# get page with table - using params
response = s.post(url, data=params)#, headers={'Referer': url})
soup = BeautifulSoup(response.text, 'html.parser')

# --- data ---

table = soup.find('table', id='ctl00_mainContent_DataList1')

if not table:
    print('no table')
    #table = soup.find_all('table')
    #print('count:', len(table))
    #print(response.text)
else:
    for row in table.find_all('table'):
        for column in row.find_all('td'):
            text = ', '.join(x.strip() for x in column.text.split('\n') if x.strip()).strip()
            print(text)

        # separator after each dentist's record
        print('-----')

Part of the result:

Map
Dr. Kashyap Vora, 6145 Fraser Street, Vancouver  V5W 2Z9
604 321 1869, www.voradental.ca
-----
Map
Dr. Niloufar Shirzad, Harbour Centre DentalL19 - 555 Hastings Street West, Vancouver  V6B 4N6
604 669 1195, www.harbourcentredental.com
-----
Map
Dr. Janice Brennan, 902 - 805 Broadway West, Vancouver  V5Z 1K1
604 872 2525
-----
Map
Dr. Rosemary Chang, 1240 Kingsway, Vancouver  V5V 3E1
604 873 1211
-----
Map
Dr. Mersedeh Shahabaldine, 3641 Broadway West, Vancouver  V6R 2B8
604 734 2114, www.westkitsdental.com
-----
