
Scraping Data from .ASPX Website URL with Python

I have a static .aspx URL that I am trying to scrape. All of my attempts return the raw HTML of the regular page instead of the data I am querying.

My understanding is that the headers I am using (which I found in another post) are correct and generalizable:

import urllib.request
from bs4 import BeautifulSoup

headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}

class MyOpener(urllib.request.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'https://www.mytaxcollector.com/trSearch.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup_dummy = BeautifulSoup(f,"html5lib")
# parse and retrieve two vital form values
viewstate = soup_dummy.select("#__VIEWSTATE")[0]['value']
viewstategen = soup_dummy.select("#__VIEWSTATEGENERATOR")[0]['value']

Submitting the form data has no effect:

formData = (
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEGENERATOR', viewstategen),
    ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000'),
    ('__EVENTTARGET', 'ct100$MainContent$calculate')
)

encodedFields =  urllib.parse.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)


soup = BeautifulSoup(f,"html5lib")
trans_emissions = soup.find("span", id="ctl00_MainContent_transEmissions")
print(trans_emissions.text)

This gives raw HTML almost identical to the contents of the "soup_dummy" variable. What I want to see instead is the result of submitting the field ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000'), which is the "parcel number" box.

I would really appreciate the help. If nothing else, a link to a good post about HTTP requests (one that not only explains but actually walks through scraping an .aspx page) would be great.

To get the result for a parcel number, your parameters have to be somewhat different from the ones you tried. You also have to send the POST request to a different URL: https://www.mytaxcollector.com/trSearchProcess.aspx
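One way to find the right URL and parameter names is to read the form's action attribute and the name attributes of its fields, because a browser submits fields by name rather than by id. In ASP.NET pages the two usually differ (ids are typically written with underscores, names with `$`), which may be why the `ctl00_contentHolder_...` key in the question was ignored. The snippet below is only a rough sketch of that inspection using the standard library. The `FormInspector` class and the `sample_html` markup are made up for illustration and are not copied from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormInspector(HTMLParser):
    """Record the first form's action and every named form field (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and self.action is None:
            self.action = attrs.get('action')
        elif tag in ('input', 'select', 'textarea') and attrs.get('name'):
            # Keyed by the name attribute, which is what gets submitted
            self.fields[attrs['name']] = attrs.get('value') or ''


# Hypothetical markup for illustration only, NOT the real page source
sample_html = '''
<form method="post" action="trSearchProcess.aspx">
  <input type="hidden" name="hidRedirect" value="">
  <input type="text" name="txtParcelNumber" id="ctl00_contentHolder_txtParcelNumber">
  <input type="submit" name="ctl00$contentHolder$cmdSearch" value="Search">
</form>
'''

inspector = FormInspector()
inspector.feed(sample_html)

# Resolve the (possibly relative) action against the page that served the form
post_url = urljoin('https://www.mytaxcollector.com/trSearch.aspx', inspector.action)
print(post_url)
print(inspector.fields)
```

Against the live site you would feed the HTML of the search page instead of `sample_html`. The browser's network tab shows the same information for a real submission.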

Working code:

from urllib.request import Request, urlopen
from urllib.parse import urlencode
from bs4 import BeautifulSoup

url = 'https://www.mytaxcollector.com/trSearchProcess.aspx'

payload = {
    'hidRedirect': '',
    'hidGotoEstimate': '',
    'txtStreetNumber': '',
    'txtStreetName': '',
    'cboStreetTag': '(Any Street Tag)',
    'cboCommunity': '(Any City)',
    'txtParcelNumber': '0108301010000',  # your search term
    'txtPropertyID': '',
    'ctl00$contentHolder$cmdSearch': 'Search'
}

# urlopen() needs the POST body as bytes, so encode the form fields
data = urlencode(payload).encode('ascii')
req = Request(url, data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36')
res = urlopen(req)
soup = BeautifulSoup(res.read(), 'html.parser')
# Print the cell text of every row in the results table
for row in soup.select("table.propInfoTable tr"):
    cells = [cell.get_text(strip=True) for cell in row.select("td")]
    print(cells)
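If you need to look up several parcel numbers, it may help to wrap the request construction in a function. The following is a minimal sketch under the same assumptions as the code above (same URL and form fields). The name `build_search_request` is hypothetical, and the function only builds the request without sending it, so the encoded body can be checked offline.

```python
from urllib.request import Request
from urllib.parse import urlencode, parse_qs

SEARCH_URL = 'https://www.mytaxcollector.com/trSearchProcess.aspx'
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36')


def build_search_request(parcel_number):
    """Return an unsent POST request for a parcel-number search (hypothetical helper)."""
    payload = {
        'hidRedirect': '',
        'hidGotoEstimate': '',
        'txtStreetNumber': '',
        'txtStreetName': '',
        'cboStreetTag': '(Any Street Tag)',
        'cboCommunity': '(Any City)',
        'txtParcelNumber': parcel_number,
        'txtPropertyID': '',
        'ctl00$contentHolder$cmdSearch': 'Search',
    }
    # Attaching a bytes body is what makes urllib issue a POST instead of a GET
    body = urlencode(payload).encode('ascii')
    req = Request(SEARCH_URL, data=body)
    req.add_header('User-Agent', USER_AGENT)
    return req


req = build_search_request('0108301010000')
print(req.get_method())
# Decode the body again to confirm which fields will be submitted
sent = parse_qs(req.data.decode('ascii'), keep_blank_values=True)
print(sent['txtParcelNumber'])
```

Passing the returned object to `urlopen()` sends it, after which the response can be parsed with BeautifulSoup exactly as in the code above.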
