
Scraping a Dynamic Website using Selenium or Beautiful Soup

I am trying to scrape this dynamic website to get the course names and lecture times offered during a semester: https://www.utsc.utoronto.ca/registrar/timetable

The problem is that when you first open the website, no courses are displayed. The courses only show up after you select the "Session(s)" and click "Search for Courses".

Here is where the problems start:

  1. I cannot do
html = urlopen(url).read()

using urllib.request, as it only returns the HTML of the page in its initial state, before any courses are shown.

  2. I did a quick search on how to scrape a dynamic website and came across code like this:
import requests
url = 'https://www.utsc.utoronto.ca/registrar/timetable'

r = requests.get(url)
data = r.json()
print(data)

However, when I run this, it raises "JSONDecodeError: Expecting value", and I have no idea why, since the same approach has worked on other dynamic websites.

I do not really have to use Selenium or Beautiful Soup, so if there are better alternatives I will gladly try them. Also, I was wondering about this call:

html = urlopen(url).read()

What is the format of the HTML that is returned? I want to know whether I can just copy the changed HTML from the browser's inspector after selecting the Session(s) and clicking search.

PS: this is my first time asking on Stack Overflow, so please let me know if my question is not clear enough. Sorry, and thanks in advance!

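from bs4 import BeautifulSoup
from selenium import webdriver

def render_page(url):
    # Render the page with the Chrome driver and return all the HTML of that page.
    # Recent Selenium releases locate the driver themselves; older releases
    # expected the chromedriver path here instead, e.g. webdriver.Chrome(PATH).
    driver = webdriver.Chrome()
    driver.get(url)
    r = driver.page_source
    driver.quit()
    return r

def create_soup(html_text):
    # Parse the rendered HTML; the 'lxml' parser has to be installed separately.
    soup = BeautifulSoup(html_text, 'lxml')
    return soup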

You will need to use Selenium for this if the content is loaded dynamically. Create a Beautiful Soup object from the value returned by render_page() and see if you can manipulate the data there.
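As a rough illustration of what "manipulating the data" could look like once the HTML is in hand, here is a minimal sketch that pulls course and lecture-time pairs out of a table. The markup in SAMPLE_HTML, the table id "courses", and the helper name extract_rows are all assumptions made for this example; the real timetable page may be structured quite differently, so the selectors would need to be adapted after inspecting it. The built-in 'html.parser' is used here only to keep the sketch free of the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may differ.
SAMPLE_HTML = """
<table id="courses">
  <tr><th>Course</th><th>Lecture time</th></tr>
  <tr><td>Course A</td><td>Monday 10:00-12:00</td></tr>
  <tr><td>Course B</td><td>Wednesday 13:00-15:00</td></tr>
</table>
"""

def extract_rows(html_text):
    """Return (course, lecture_time) tuples for each data row of the table."""
    soup = BeautifulSoup(html_text, "html.parser")
    rows = []
    for tr in soup.select("table#courses tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 2:  # the header row has <th> cells, so it is skipped
            rows.append((cells[0], cells[1]))
    return rows

print(extract_rows(SAMPLE_HTML))
```

In practice the argument would be the string returned by render_page() rather than a hard-coded sample.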

You can use this code to get the data you need:

import requests

url = "https://www.utsc.utoronto.ca/regoffice/timetable/view/api.php"

# for winter session
payload = "coursecode=&sessions%5B%5D=20219&instructor=&courseTitle="

headers = {
  'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'
}

response = requests.request("POST", url, headers=headers, data=payload)

print(response.text)
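The hand-written payload string above is just a URL-encoded form, so it can also be built from a dictionary, which makes it easier to change the session or add a course code. The sketch below does that with the standard library (urllib instead of requests, purely to keep it dependency-free) and parses the body as JSON only when that succeeds. The helper names build_payload, parse_body, and fetch_timetable are made up for this example, and the response format of the endpoint is not confirmed here, which is why parse_body falls back to the raw text. The same fallback shows the likely reason for the "JSONDecodeError: Expecting value" in the question: the timetable page itself most likely returns HTML rather than JSON.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_URL = "https://www.utsc.utoronto.ca/regoffice/timetable/view/api.php"

def build_payload(session_code, course_code="", instructor="", course_title=""):
    """URL-encode the form fields in the same order as the hand-written payload."""
    return urlencode({
        "coursecode": course_code,
        "sessions[]": session_code,
        "instructor": instructor,
        "courseTitle": course_title,
    })

def parse_body(text):
    """Return parsed JSON if the body is valid JSON, otherwise the raw text."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text

def fetch_timetable(session_code, timeout=30):
    """POST the form to the endpoint and return the decoded body."""
    request = urllib.request.Request(
        API_URL,
        data=build_payload(session_code).encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_body(response.read().decode("utf-8", errors="replace"))
```

Calling fetch_timetable("20219") would send the same form as the snippet above; other session codes can be found by watching the request in the browser's network tab while clicking "Search for Courses".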
