Scraping Data from .ASPX Website URL with Python
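I have a static .aspx URL that I am trying to scrape. All of my attempts return the raw HTML of the regular page rather than the data I am querying for.
My understanding is that the headers I am using (taken from another post) are correct and generalizable: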
import urllib.request
from bs4 import BeautifulSoup
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}
class MyOpener(urllib.request.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'
myopener = MyOpener()
url = 'https://www.mytaxcollector.com/trSearch.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup_dummy = BeautifulSoup(f,"html5lib")
# parse and retrieve two vital form values
viewstate = soup_dummy.select("#__VIEWSTATE")[0]['value']
viewstategen = soup_dummy.select("#__VIEWSTATEGENERATOR")[0]['value']
Trying to submit the form data has no effect:
formData = (
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEGENERATOR', viewstategen),
    ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000'),
    ('__EVENTTARGET', 'ct100$MainContent$calculate')
)
encodedFields = urllib.parse.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)
soup = BeautifulSoup(f,"html5lib")
trans_emissions = soup.find("span", id="ctl00_MainContent_transEmissions")
print(trans_emissions.text)
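This returns almost exactly the same raw HTML as the "soup_dummy" variable. What I want to see is the data for the submitted field ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000'), which is the "Parcel Number" box.
I would really appreciate your help. If anything, a link to a good post about HTML requests (one that not only explains them but actually walks through an aspx page) would be great.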
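To get results using the parcel number, your parameters have to be different from the ones you have already tried. You also have to send the POST request to this URL: https://www.mytaxcollector.com/trSearchProcess.aspx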
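If it helps to see where such parameter names and the target URL come from: a browser submits form controls by their name attribute (not their id) to the URL in the form's action attribute. Below is a minimal sketch, using only the standard library, that lists both for the first form in a page. The SAMPLE_HTML markup, the FormFieldParser class and the describe_form helper are made up for illustration and are not copied from the real site; in practice you would feed it the HTML of the search page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical markup for illustration only; the real page will differ.
SAMPLE_HTML = """
<form id="aspnetForm" action="trSearchProcess.aspx" method="post">
  <input type="hidden" name="hidRedirect" value="">
  <input type="text" name="txtParcelNumber" id="ctl00_contentHolder_txtParcelNumber">
  <input type="submit" name="ctl00$contentHolder$cmdSearch" value="Search">
</form>
"""


class FormFieldParser(HTMLParser):
    """Collect the action and the named controls of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
        elif self._in_form and tag in ("input", "select", "textarea"):
            name = attrs.get("name")
            if name:
                # <select> options are not inspected here, so selects
                # are recorded with an empty default value.
                self.fields[name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def describe_form(html, page_url):
    """Return (absolute POST URL, {field name: default value})."""
    parser = FormFieldParser()
    parser.feed(html)
    return urljoin(page_url, parser.action or ""), parser.fields


post_url, fields = describe_form(
    SAMPLE_HTML, "https://www.mytaxcollector.com/trSearch.aspx"
)
print(post_url)
print(fields)
```

The keys printed here are what belongs in the POST payload. Comparing them with the request shown in the browser's network tab is a quick way to confirm them.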
Working code:
from urllib.request import Request, urlopen
from urllib.parse import urlencode
from bs4 import BeautifulSoup
url = 'https://www.mytaxcollector.com/trSearchProcess.aspx'
payload = {
    'hidRedirect': '',
    'hidGotoEstimate': '',
    'txtStreetNumber': '',
    'txtStreetName': '',
    'cboStreetTag': '(Any Street Tag)',
    'cboCommunity': '(Any City)',
    'txtParcelNumber': '0108301010000',  # your search term
    'txtPropertyID': '',
    'ctl00$contentHolder$cmdSearch': 'Search'
}
data = urlencode(payload)
data = data.encode('ascii')
req = Request(url,data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36')
res = urlopen(req)
soup = BeautifulSoup(res.read(),'html.parser')
for items in soup.select("table.propInfoTable tr"):
    data = [item.get_text(strip=True) for item in items.select("td")]
    print(data)
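To check the request before actually sending it, the construction step can be pulled into a small function. This is only a sketch: build_search_request and SEARCH_URL are names I made up, the short default User-Agent is a placeholder, and the payload fields are the same ones used in the code above. Nothing is sent over the network here.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = 'https://www.mytaxcollector.com/trSearchProcess.aspx'


def build_search_request(parcel_number, user_agent='Mozilla/5.0'):
    """Build (but do not send) the POST request for a parcel number."""
    payload = {
        'hidRedirect': '',
        'hidGotoEstimate': '',
        'txtStreetNumber': '',
        'txtStreetName': '',
        'cboStreetTag': '(Any Street Tag)',
        'cboCommunity': '(Any City)',
        'txtParcelNumber': parcel_number,
        'txtPropertyID': '',
        'ctl00$contentHolder$cmdSearch': 'Search',
    }
    # urlopen() needs the body as bytes; passing data makes it a POST.
    body = urlencode(payload).encode('ascii')
    req = Request(SEARCH_URL, data=body)
    req.add_header('User-Agent', user_agent)
    return req


req = build_search_request('0108301010000')
print(req.get_method())  # POST
print(req.data.decode('ascii'))
```

Passing the returned object to urlopen() sends it exactly as in the working code. Printing the encoded body first makes it easy to compare against what the browser sends.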