
Scraping a table with BeautifulSoup

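I'm new to this and have been fighting with this table for hours. I'm trying to pull some information about the exhibitors at an upcoming conference and was wondering whether anyone could help.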

Code:

import requests
from bs4 import BeautifulSoup as bs

profile = requests.get('https://annual.asaecenter.org/profile.cfm?profile_name=exhibitor&master_key=EF74CF1F-95BA-EC11-80F4-EC7F36E6C06A&inv_mast_key=93A17E5D-A46F-F21E-77E4-77B38A3B30EE')
soup = bs(profile.content, 'html.parser')

tds = soup.find_all("td")

print(tds)

Output:

[<td style="width: 60%;">
                        Alpharetta Convention and Visitors Bureau
                </td>, <td style="width: 40%; text-align: right;">

                                 

                                Booth 2116
                </td>, <td class="tb-text-left" colspan="2">
<div>
</div>
</td>, <td style="width: 40%;">
                Alpharetta Convention and Visitors Bureau                                                           <br/>
</td>, <td style="width: 60%;" valign="top">
</td>, <td colspan="2"><b>Sales Contact</b><br/>
                Beth Brown<br/>
                Vice President of Sales
                </td>, <td colspan="2">
<a class="bttn-form bttn-form-default" href="javascript:Pops('http://www.awesomealpharetta.com','website',750,650)">
<i aria-hidden="true" class="fa fa-globe fa-lg" title="#xlink_label#"></i><br/>Website
                </a>
</td>, <td colspan="2">
                                Description
                        </td>, <td class="bgcolorw" colspan="2" valign="top">
     Alpharetta, GA has 30 hotels w/ 3,940 + guest rooms, 44,000 sq. ft. conference center, 200+ restaurants, 250+ shops, &amp; 40+ attractions for your attendees
        </td>, <td align="left" class="cellGrad" style="vertical-align:text-top; font-weight:bold; width: 175px">
<label for="TBE573973_4058_EC11_80F3_D9AE4409EDD7ID" id="ROW1780B7E3F-A0FD-41FD-BB77-FD8AD8F6356ELabel">Product Categories<label>
</label></label></td>, <td class="bgcolorw" colspan="1" valign="top">
<a href="/profile.cfm?profile_name=match_exhibitor&amp;answer_key=D6573973-4058-EC11-80F3-D9AE4409EDD7&amp;xtemplate">

Desired output:

name = Alpharetta Convention and Visitors Bureau
booth = 2116
url = http://www.awesomealpharetta.com
description = Alpharetta, GA has 30 hotels w/ 3,940 + guest rooms, 44,000 sq. ft. conference center, 200+ restaurants, 250+ shops, &amp; 40+ attractions for your attendees

These are the XPath locations for each desired value:

name = //*[@id="exhibitor-profile"]/tbody/tr[1]/td[1]
booth = //*[@id="exhibitor-profile"]/tbody/tr[1]/td[2]
website = //*[@id="exhibitor-profile"]/tbody/tr[3]/td/a
description = //*[@id="ROW1466DD5DF-0695-4D68-B221-4941A5171EAB"]/td

Thanks!!

To get the desired output, you can try:

import requests
import re
from bs4 import BeautifulSoup


response = requests.get(
    "https://annual.asaecenter.org/profile.cfm?profile_name=exhibitor&master_key=EF74CF1F-95BA-EC11-80F4-EC7F36E6C06A&inv_mast_key=93A17E5D-A46F-F21E-77E4-77B38A3B30EE"
)

soup = BeautifulSoup(response.text, "html.parser")
data = soup.select_one("div.profile_contianer")
all_data = data.get_text(strip=True, separator="|").split("|")
print(all_data[0])
print(all_data[1])
print(all_data[all_data.index("Description") + 1])
print(re.search(r"\('(.*?)'", str(data)).group(1))

Prints:

Alpharetta Convention and Visitors Bureau
Booth 2116
Alpharetta, GA has 30 hotels w/ 3,940 + guest rooms, 44,000 sq. ft. conference center, 200+ restaurants, 250+ shops, & 40+ attractions for your attendees
http://www.awesomealpharetta.com
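
The text-splitting and regex steps above can be tried offline. The following is a minimal sketch, assuming a trimmed-down HTML fragment modeled on the `<td>` output shown in the question; `SAMPLE_HTML` and `parse_profile` are hypothetical names, and the fragment is not the real page. It swaps the `|` split for BeautifulSoup's `stripped_strings`, which yields the same whitespace-stripped text pieces without depending on a separator character, and it strips the `Booth ` prefix so the booth value matches the desired output.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the <td> output shown in the question.
SAMPLE_HTML = """
<div class="profile_contianer">
  <table>
    <tr>
      <td style="width: 60%;"> Alpharetta Convention and Visitors Bureau </td>
      <td style="width: 40%; text-align: right;"> Booth 2116 </td>
    </tr>
    <tr>
      <td colspan="2">
        <a href="javascript:Pops('http://www.awesomealpharetta.com','website',750,650)">Website</a>
      </td>
    </tr>
    <tr><td colspan="2"> Description </td></tr>
    <tr><td colspan="2"> Alpharetta, GA has 30 hotels &amp; 40+ attractions </td></tr>
  </table>
</div>
"""


def parse_profile(html):
    """Return name, booth, url and description from a profile fragment."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one("div.profile_contianer")
    # Every non-empty text node, stripped of surrounding whitespace.
    texts = list(container.stripped_strings)
    # The URL is the first quoted argument of the javascript:Pops(...) href.
    match = re.search(r"Pops\('([^']+)'", str(container))
    return {
        "name": texts[0],
        # Drop the leading "Booth " label, keeping only the number.
        "booth": re.sub(r"^Booth\s+", "", texts[1]),
        "url": match.group(1) if match else None,
        # The description is the text piece right after the "Description" label.
        "description": texts[texts.index("Description") + 1],
    }


profile = parse_profile(SAMPLE_HTML)
for key, value in profile.items():
    print(f"{key} = {value}")
```

Relying on list positions (`texts[0]`, `texts[1]`) is as fragile here as in the answer above: if the page layout changes, the indexes need to be rechecked.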

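The XPath probably doesn't work because whoever wrote the page used the same id for several tables, which is likely why your script fails. Here is another way to get the data: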

import requests
from bs4 import BeautifulSoup

page_url = "https://annual.asaecenter.org/profile.cfm?profile_name=exhibitor&master_key=EF74CF1F-95BA-EC11-80F4-EC7F36E6C06A&inv_mast_key=93A17E5D-A46F-F21E-77E4-77B38A3B30EE"
response = requests.get(page_url).text
soup = BeautifulSoup(response, 'lxml')  # the 'lxml' parser requires the lxml package

# Several tables share this id, so collect all of them.
tables = soup.find_all(id="exhibitor-profile")
tds = tables[0].find_all_next('td')
name = tds[0].text.strip()
booth = tds[1].text.strip()

# The href looks like javascript:Pops('<url>', ...); take the first quoted argument.
url = tables[1].find_next('a').get('href').split("'")[1]
description = tables[2].find_next(id="ROW1466DD5DF-0695-4D68-B221-4941A5171EAB").find_next('td').text.strip()

print(name)
print(booth)
print(url)
print(description)

Output:

Alpharetta Convention and Visitors Bureau
Booth 2116
http://www.awesomealpharetta.com
Alpharetta, GA has 30 hotels w/ 3,940 + guest rooms, 44,000 sq. ft. conference center, 200+ restaurants, 250+ shops, & 40+ attractions for your attendees
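
The duplicate-id situation described above can be reproduced on a small scale. This is a minimal sketch, assuming a hypothetical fragment (`DUPLICATE_ID_HTML`) in which three tables share the id `exhibitor-profile`; the contents are made up and it is not the real page. It shows that `find` returns only the first matching table while `find_all` returns all of them, which is why indexing into the result list works where a single id lookup does not.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: three tables reuse the same id, as the answer describes.
DUPLICATE_ID_HTML = """
<table id="exhibitor-profile"><tr><td>Example Exhibitor</td><td>Booth 1</td></tr></table>
<table id="exhibitor-profile"><tr><td><a href="javascript:Pops('http://example.com','website',750,650)">Website</a></td></tr></table>
<table id="exhibitor-profile"><tr><td>Description</td></tr><tr><td>Example description</td></tr></table>
"""

soup = BeautifulSoup(DUPLICATE_ID_HTML, "html.parser")

# find() stops at the first element with this id ...
first_only = soup.find(id="exhibitor-profile")
# ... while find_all() returns every element that carries it.
tables = soup.find_all(id="exhibitor-profile")
print(len(tables))

# find_all("td") searches only inside the given table.
name_cell, booth_cell = tables[0].find_all("td")
name = name_cell.get_text(strip=True)
booth = booth_cell.get_text(strip=True)

# Take the first quoted argument of the javascript:Pops(...) href.
url = tables[1].find("a")["href"].split("'")[1]

print(name, booth, url)
```

One design difference from the code above: `find_all_next('td')` walks the rest of the document after the table, whereas `find_all('td')` is limited to the table's own cells. Either can work here, but the scoped version is less likely to pick up cells from a later table.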

