How to scrape an ASP webpage in Python?

In this video, I give you a look at the dataset I want to scrape from the web. Very sorry about the audio, but I did the best with what I have. It is hard for me to describe what I am trying to do: I see a site with thousands of pages that obviously contain tables, but pd.read_html doesn't work! Then it hit me: this page has a form that has to be filled out first.

https://opir.fiu.edu/instructor_eval.asp

Going to this link lets you select a semester, and doing so shows thousands upon thousands of tables. I tried using the URL I got after selecting a semester, hoping to read the HTML from it, but no such luck. I still don't know what I'm even looking at (is it a webpage, or is it ASP? What even IS ASP?). If you follow the video link, you'll see that selecting the spring semester, copying the link, and pasting it into the address bar gives an ugly error. Some SQL error.

So this is my dilemma. I'm trying to GET this data: all these tables. In my last post, I brute-forced it by clicking and dragging for 10+ minutes, then pasting into Excel. That's an awful way of doing it, and it wasn't even particularly useful, because when I imported that Excel sheet into Python the data was very unstructured and difficult to work with.

So I thought, why not scrape it with bs4? That isn't easy either, it seems, because the URL you get after filtering to the spring semester just won't work, not for you in a browser, and not if you pass it to Python for bs4 to use. So I'm at a loss as to how to reasonably work with this data. I want to scrape it with bs4 and put it into DataFrames to be manipulated later. However, since it is ASP (or whatever it is), I haven't found a way to do so yet :\

ASP stands for Active Server Pages: the page is generated by a server-side script (usually VBScript). This shouldn't concern you, since you only want to scrape data from the rendered HTML.
To get a valid response from /instructor_evals/instr_eval_result.asp, you have to submit a POST request containing the form data from /instructor_eval.asp; otherwise the page returns an error message. Pasting the result URL into the address bar sends a plain GET request without that form data, which is why it fails.
If you submit the correct data with urllib, you should be able to get the tables with bs4.

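from urllib.parse import urlencode
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# The results page that the form on /instructor_eval.asp submits to.
url = 'https://opir.fiu.edu/instructor_evals/instr_eval_result.asp'
# Form fields, as submitted by the form on /instructor_eval.asp.
data = {'Term': '1171', 'Coll': '%', 'Dept': '', 'RefNum': '', 'Crse': '', 'Instr': ''}

# Passing data= makes urllib send a POST request instead of a GET.
req = Request(url, data=urlencode(data).encode())
with urlopen(req) as r:
    html = r.read().decode('utf-8', 'ignore')

# Parse the returned HTML and collect every <table> element.
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')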
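
Since the goal is to end up with DataFrames, each table element still has to be turned into rows of cell text. The following is only a minimal sketch of that step: the helper name table_to_rows is hypothetical, the sample markup is made up for illustration and does not reflect the real page's layout, and it assumes the tables are not nested inside one another.

```python
from bs4 import BeautifulSoup


def table_to_rows(table):
    """Return one list of cell strings per <tr> in a bs4 table element."""
    rows = []
    for tr in table.find_all('tr'):
        # Header cells (<th>) and data cells (<td>) are treated alike.
        cells = [cell.get_text(' ', strip=True)
                 for cell in tr.find_all(['td', 'th'])]
        if cells:  # skip rows that contain no cells
            rows.append(cells)
    return rows


# Made-up sample markup standing in for the real response (an assumption).
sample_html = """
<table>
  <tr><th>Question</th><th>Excellent</th></tr>
  <tr><td>Description of course objectives</td><td>45.5%</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_rows = [table_to_rows(t) for t in sample_soup.find_all('table')]
```

Each list of rows could then be handed to pandas, for example with pandas.DataFrame(rows[1:], columns=rows[0]), assuming the first row of a table holds its column names. Whether that holds for the real tables has to be checked against the actual response.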

By the way, this error message is a strong indication that the page is vulnerable to SQL injection, which is a very nasty bug, and I think you should inform the admin about it.
