Ruby Script unable to gather data
#!/usr/bin/ruby
# Fetches all Virginia Tech classes from the timetable and spits them out into a nice JSON object
# Can be run with option of which file to save output to or will save to classes.json by default
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'json'

#Create Mechanize Browser and Class Data hash to load data into
agent = Mechanize.new
classData = Hash.new

#Get Subjects from Timetable page
page = agent.get("https://banweb.banner.vt.edu/ssb/prod/HZSKVTSC.P_ProcRequest")
subjects = page.forms.first.field_with(:name => 'subj_code').options

#Loop subjects
subjects.each do |subject|
  #Get the Timetable Request page & Form
  timetableSearch = agent.get("https://banweb.banner.vt.edu/ssb/prod/HZSKVTSC.P_ProcRequest")
  searchDetails = page.forms.first

  #Submit with specific subject
  searchDetails.set_fields({
    :SUBJ_CODE => subject,
    :TERMYEAR => '201401',
    :CAMPUS => 0
  })

  #Submit the form and store results into courseListings
  courseListings = Nokogiri::HTML(
    searchDetails.submit(searchDetails.buttons[0]).body
  )

  #Create Array in Hash to store all classes for subjects
  classData[subject] = []

  #For every Class
  courseListings.css('table.dataentrytable/tr').collect do |course|
    subjectClassesDetails = Hash.new

    #Map Table Cells for each course to appropriate values
    [
      [ :crn, 'td[1]/p/a/b/text()'],
      [ :course, 'td[2]/font/text()'],
      [ :title, 'td[3]/text()'],
      [ :type, 'td[4]/p/text()'],
      [ :hrs, 'td[5]/p/text()'],
      [ :seats, 'td[6]/text()'],
      [ :instructor, 'td[7]/text()'],
      [ :days, 'td[8]/text()'],
      [ :begin, 'td[9]/text()'],
      [ :end, 'td[10]/text()'],
      [ :location, 'td[11]/text()'],
      # [ :exam, 'td[12]/text()']
    ].collect do |name, xpath|
      #Not an additional time session (2nd row)
      if (course.at_xpath('td[1]/p/a/b/text()').to_s.strip.length > 2)
        subjectClassesDetails[name] = course.at_xpath(xpath).to_s.strip
      end
    end

    #Add class to Array for Subject!
    classData[subject].push(subjectClassesDetails)
  end
end

#Write Data to JSON file
open(ARGV[0] || "classes.json", 'w') do |file|
  file.print JSON.pretty_generate(classData)
end
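The code above should retrieve data from https://banweb.banner.vt.edu/ssb/prod/HZSKVTSC.P_ProcRequest, but when I print subjects.length it is 0, so it is clearly not getting the right data. The given term code, "201401", is definitely correct.
I noticed that when I open the link manually in a browser, the subject field does not let you choose an option until a term has been selected, yet when I view the page source the data is clearly already there. What can I do to retrieve this data?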
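I'm looking at the VT page, and I can see that you need to select a TERMYEAR first; only then is the subj_code dropdown populated with options. Unfortunately, this happens in JavaScript, in function dropdownlist(listindex). Mechanize cannot execute JavaScript, so this script is bound to fail.
One option is to run a browser automation tool such as Watir or Selenium, as discussed here: How do I use Mechanize to process JavaScript?
Alternatively, read the page source and parse the option values out of lines like these: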
document.ttform.subj_code.options[0]=new Option("All Subjects","%",false, false);
document.ttform.subj_code.options[1]=new Option("AAEC - Agricultural and Applied Economics","AAEC",false, false);
document.ttform.subj_code.options[2]=new Option("ACIS - Accounting and Information Systems","ACIS",false, false);
Those lines contain the options. You can fetch the page source simply by using open-uri:
require 'open-uri'
# Use URI.open: calling bare open() on a URL was removed in Ruby 3.0
page = URI.open("https://banweb.banner.vt.edu/ssb/prod/HZSKVTSC.P_ProcRequest")
page_source = page.read
Now you can scan for all the options with a regular expression:
page_source.scan /document\.ttform.+;/
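This gives you an array of all the lines of JavaScript code that contain options. Tighten the regex a bit and you can extract the option text from them. I'll see whether I can come up with some suggestions for that and post back. Hopefully this gets you headed in the right direction.
I'm back. I was able to parse out all the subj_code options with this regex: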
subjects = page_source.scan(/Option\("(.*?)"/).uniq # remove duplicates
subjects.shift # get rid of the first option because it's just "All Subjects"
subjects.size == 137
Hope that helps.
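One thing to watch for: the regex above captures only the option label (for example "AAEC - Agricultural and Applied Economics"), and scan with a single capture group returns an array of one-element arrays. The form in the question needs the subject code for SUBJ_CODE, so it may be more convenient to capture the label and the value together. The following is only a sketch under some assumptions: it runs against the three sample lines quoted earlier rather than the live page, the variable names option_line, subject_options and subject_codes are made up for illustration, and the regex assumes every option line keeps the exact new Option("label","value",...) shape shown above.

```ruby
# Sample lines copied from the page source quoted in the answer.
# On the real page, page_source would come from URI.open(...).read instead.
page_source = <<~JS
  document.ttform.subj_code.options[0]=new Option("All Subjects","%",false, false);
  document.ttform.subj_code.options[1]=new Option("AAEC - Agricultural and Applied Economics","AAEC",false, false);
  document.ttform.subj_code.options[2]=new Option("ACIS - Accounting and Information Systems","ACIS",false, false);
JS

# Hypothetical pattern: first group is the label, second group is the value.
option_line = /subj_code\.options\[\d+\]\s*=\s*new Option\("([^"]*)"\s*,\s*"([^"]*)"/

# With two capture groups, scan returns an array of [label, value] pairs.
subject_options = page_source.scan(option_line)
                             .reject { |_label, value| value == "%" } # drop "All Subjects"
                             .uniq                                    # remove duplicates

# The codes are what a SUBJ_CODE form field would expect.
subject_codes = subject_options.map { |_label, value| value }

subject_options.each { |label, value| puts "#{value}: #{label}" }
```

Filtering on the "%" value instead of calling shift means the result does not depend on "All Subjects" being the first option in the source.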