
Scraping data from a table using BeautifulSoup and Selenium

I am trying to build an application that scrapes course information from a university's course catalogue and then constructs a few possible schedules a student could choose from. The course catalogue URL doesn't change when a new course is searched for, which is why I am using Selenium to automatically search the catalogue and then Beautiful Soup to scrape the information. This is my first time using Beautiful Soup and Selenium, so apologies in advance if the solution is quite simple.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests

URL = "http://saasta.byu.edu/noauth/classSchedule/index.php"
driver = webdriver.Safari()
driver.get(URL)
element = driver.find_element_by_id("searchBar")
element.send_keys("C S 142", Keys.RETURN)
response = requests.get(URL);
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find_all("tbody")
print(table)

Currently, print(table) prints two objects. One (first picture) has the general information regarding the course (information I don't need to scrape). The second object is empty. As far as I can tell there are only two tables on the website, both pictured below. The second one is the one I am interested in scraping, but for some reason the second element in table is empty.

[screenshot: the first table, with the general course information]

The information pictured below is what I am trying to scrape.

[screenshot: the table I am trying to scrape]

Output from print(table):

[<tbody>
   \n
   <tr>
      <th scope="row">Hours</th>
      <td id="courseCredits"></td>
   </tr>
   \n
   <tr>
      <th scope="row">Prerequisites</th>
      <td id="coursePrereqs"></td>
   </tr>
   \n
   <tr>
      <th scope="row">Recommended</th>
      <td id="courseRec"></td>
   </tr>
   \n
   <tr>
      <th scope="row">Offered</th>
      <td id="courseOffered"></td>
   </tr>
   \n
   <tr>
      <th scope="row">Headers</th>
      <td id="courseHeaders"></td>
   </tr>
   \n
   <tr>
      <th scope="row">Note</th>
      <td id="courseNote"></td>
   </tr>
   \n
   <tr>
      <th scope="row">When\xa0Taught</th>
      <td id="courseWhenTaught"></td>
   </tr>
   \n
</tbody>
, 
<tbody></tbody>
]
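
One detail that seems relevant here: requests.get(URL) downloads a fresh copy of the page without running any of its JavaScript, so the section table (which the page only fills in after a search) is empty in that response. Below is a minimal sketch of the alternative, handing the HTML that the Selenium driver has already rendered to BeautifulSoup instead; it reuses the driver from the snippet above and uses a plain time.sleep as a placeholder for a proper explicit wait:

import time
from bs4 import BeautifulSoup

# Give the page's JavaScript a moment to populate the tables after the search.
time.sleep(5)

# Parse the DOM the Selenium-controlled browser is actually showing, rather
# than a separately fetched (and therefore unpopulated) copy of the page.
soup = BeautifulSoup(driver.page_source, 'html.parser')
section_table = soup.find('table', id='sectionTable')
if section_table is not None:
    for row in section_table.find_all('tr'):
        print([cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])])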

Here's a technique for parsing tables like that:

from requests import get

# Inject jQuery and the table-to-json plugin into the page (this uses the
# `driver` from the question's code, after the search has been performed),
# then convert the section table to JSON.
for js in ["http://code.jquery.com/jquery-1.11.3.min.js",
           "https://cdn.jsdelivr.net/npm/table-to-json@0.13.0/lib/jquery.tabletojson.min.js"]:
    body = get(js).content.decode('utf8')
    driver.execute_script(body)

data = driver.execute_script("return $('table#sectionTable').tableToJSON()")

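A self-contained sketch of the same idea, combining the question's setup with the script injection above and adding an explicit wait so the scripts are only injected once the search results have rendered (the table#sectionTable selector is the one used above; the wait condition is an assumption about when the table counts as populated):

from requests import get
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Safari()
driver.get("http://saasta.byu.edu/noauth/classSchedule/index.php")
driver.find_element_by_id("searchBar").send_keys("C S 142", Keys.RETURN)

# Wait (up to 10 s) until the section table contains at least one data cell.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#sectionTable td")))

# Inject jQuery and the table-to-json plugin, then convert the table to JSON.
for js in ["http://code.jquery.com/jquery-1.11.3.min.js",
           "https://cdn.jsdelivr.net/npm/table-to-json@0.13.0/lib/jquery.tabletojson.min.js"]:
    driver.execute_script(get(js).content.decode("utf8"))

# tableToJSON() returns one dict per table row, keyed by the column headers.
data = driver.execute_script("return $('table#sectionTable').tableToJSON()")
for row in data:
    print(row)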

This is pretty easy with just Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

URL = "http://saasta.byu.edu/noauth/classSchedule/index.php"
driver = webdriver.Safari()
driver.get(URL)
element = driver.find_element_by_id("searchBar")
element.send_keys("C S 142", Keys.RETURN)

# get table
table = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//table[@id='sectionTable']")))

# iterate rows and cells
rows = table.find_elements_by_xpath(".//tr")  # ".//" keeps the search inside this table
for row in rows:

    # get cells
    cells = row.find_elements_by_tag_name("td")

    # iterate cells
    for cell in cells:
        print(cell.text)
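
If you want structured records instead of loose cell text, one possible extension is to pair the table's header cells with each row's data cells (a sketch only; the actual column names depend on whatever the page puts in its <th> elements):

# Build one dict per data row, keyed by the table's header cells,
# reusing the `table` element located above.
headers = [th.text for th in table.find_elements_by_xpath(".//th")]
records = []
for row in table.find_elements_by_xpath(".//tr"):
    cells = [td.text for td in row.find_elements_by_tag_name("td")]
    if cells:  # skip the header row, which has no <td> cells
        records.append(dict(zip(headers, cells)))

for record in records:
    print(record)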

Hopefully this gets you started.

I'll just leave this here in case you want a solution without Selenium, using only the requests module:

import json
import requests

# AJAX endpoints the page itself calls to populate its class and section tables
url_classes = 'https://saasta.byu.edu/noauth/classSchedule/ajax/getClasses.php'
url_sections = 'https://saasta.byu.edu/noauth/classSchedule/ajax/getSections.php'

# search parameters: year/term code plus the department and catalog number
data_classes = {
    'searchObject[yearterm]':'20195',
    'searchObject[dept_name]':'C S',
    'searchObject[catalog_number]':'142',
    'sessionId':''
}

# the courseId is filled in below from the getClasses response
data_sections = {
    'courseId':'',
    'sessionId':'',
    'yearterm':'20195',
    'no_outcomes':'true'
}

classes = requests.post(url_classes, data=data_classes).json()
data_sections['courseId'] = next(iter(classes))
sections = requests.post(url_sections, data=data_sections).json()

# print(json.dumps(sections, indent=4)) # <-- uncomment this to see all data
# print(json.dumps(classes, indent=4))

for section in sections['sections']:
    print(section)
    print('-' * 80)

This prints all sections (but there's more data if you uncomment the print statements):

{'curriculum_id': '01489', 'title_code': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'section_number': '001', 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'section_type': 'DAY', 'credit_type': 'S', 'start_date': '2019-09-03', 'end_date': '2019-12-12', 'year_term': '20195', 'instructors': [{'person_id': '241223832', 'byu_id': '821566504', 'net_id': 'bretted', 'surname': 'Decker', 'sort_name': 'Decker, Brett E', 'rest_of_name': 'Brett E', 'preferred_first_name': 'Brett', 'phone_number': '801-380-4463', 'attribute_type': 'PRIMARY', 'year_term': '20195', 'curriculum_id': '01489', 'title_code': '002', 'section_number': '001', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'credit_type': 'S', 'section_type': 'DAY'}], 'times': [{'begin_time': '0900', 'end_time': '0950', 'building': 'TMCB', 'room': '1170', 'sequence_number': '2', 'mon': 'M', 'tue': '', 'wed': 'W', 'thu': '', 'fri': 'F', 'sat': '', 'sun': ''}], 'headers': [], 'availability': {'seats_available': '51', 'class_size': '203', 'waitlist_size': '0'}}
--------------------------------------------------------------------------------
{'curriculum_id': '01489', 'title_code': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'section_number': '002', 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'section_type': 'DAY', 'credit_type': 'S', 'start_date': '2019-09-03', 'end_date': '2019-12-12', 'year_term': '20195', 'instructors': [{'person_id': '241223832', 'byu_id': '821566504', 'net_id': 'bretted', 'surname': 'Decker', 'sort_name': 'Decker, Brett E', 'rest_of_name': 'Brett E', 'preferred_first_name': 'Brett', 'phone_number': '801-380-4463', 'attribute_type': 'PRIMARY', 'year_term': '20195', 'curriculum_id': '01489', 'title_code': '002', 'section_number': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'credit_type': 'S', 'section_type': 'DAY'}], 'times': [{'begin_time': '1000', 'end_time': '1050', 'building': 'TMCB', 'room': '1170', 'sequence_number': '2', 'mon': 'M', 'tue': '', 'wed': 'W', 'thu': '', 'fri': 'F', 'sat': '', 'sun': ''}], 'headers': [], 'availability': {'seats_available': '34', 'class_size': '203', 'waitlist_size': '0'}}
--------------------------------------------------------------------------------

...and so on.
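
Building on the fields visible in the sample output above (section_number, times, instructors, availability), a possible post-processing step condenses each section into a schedule-friendly line (a sketch that assumes those keys are populated the same way as in the sample):

for section in sections['sections']:
    # Each entry in 'times' carries per-day flags ('M', 'W', ... or '') plus
    # begin/end times and the room.
    days = ''.join(t['mon'] + t['tue'] + t['wed'] + t['thu'] + t['fri'] + t['sat'] + t['sun']
                   for t in section['times'])
    times = ', '.join(f"{t['begin_time']}-{t['end_time']} {t['building']} {t['room']}"
                      for t in section['times'])
    instructors = '; '.join(i['sort_name'] for i in section['instructors'])
    seats = section['availability']['seats_available']
    print(f"Section {section['section_number']}: {days} {times} | "
          f"{instructors} | {seats} seats available")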
