Ruby script unable to gather data

#!/usr/bin/ruby
# Fetches all Virginia Tech classes from the timetable and spits them out into a nice JSON object
# Can be run with option of which file to save output to or will save to classes.json by default
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'json'

#Create Mechanize Browser and Class Data hash to load data into
agent = Mechanize.new
classData = Hash.new

#Get Subjects from Timetable page
page = agent.get("https://banweb.banner.vt.edu/ssb/prod/HZSKVTSC.P_ProcRequest")
subjects = page.forms.first.field_with(:name => 'subj_code').options

#Loop subjects
subjects.each do |subject|

    #Get the Timetable Request page & Form
    timetableSearch = agent.get("https://banweb.banner.vt.edu/ssb/prod/HZSKVTSC.P_ProcRequest")
    searchDetails = page.forms.first

    #Submit with specific subject
    searchDetails.set_fields({
        :SUBJ_CODE => subject,
        :TERMYEAR => '201401',
        :CAMPUS => 0
    })

    #Submit the form and store results into courseListings
    courseListings = Nokogiri::HTML(
        searchDetails.submit(searchDetails.buttons[0]).body
    )

    #Create Array in Hash to store all classes for subjects
    classData[subject] = []

    #For every Class
    courseListings.css('table.dataentrytable/tr').collect do |course|

        subjectClassesDetails = Hash.new

        #Map Table Cells for each course to appropriate values
        [
            [ :crn, 'td[1]/p/a/b/text()'],
            [ :course, 'td[2]/font/text()'],
            [ :title, 'td[3]/text()'],
            [ :type, 'td[4]/p/text()'],
            [ :hrs, 'td[5]/p/text()'],
            [ :seats, 'td[6]/text()'],
            [ :instructor, 'td[7]/text()'],
            [ :days, 'td[8]/text()'],
            [ :begin, 'td[9]/text()'],
            [ :end, 'td[10]/text()'],
            [ :location, 'td[11]/text()'],
        #   [ :exam, 'td[12]/text()']
        ].collect do |name, xpath|
            #Not an additional time session (2nd row)
            if (course.at_xpath('td[1]/p/a/b/text()').to_s.strip.length > 2)
                subjectClassesDetails[name] = course.at_xpath(xpath).to_s.strip
            end
        end

        #Add class to Array for Subject!
        classData[subject].push(subjectClassesDetails)
    end
end

#Write Data to JSON file
File.open(ARGV[0] || "classes.json", 'w') do |file|
    file.print JSON.pretty_generate(classData)
end

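The code above is supposed to retrieve data from https://banweb.banner.vt.edu/ssb/prod/HZSKVTSC.P_ProcRequest, but when I print subjects.length I get 0, so it is clearly not fetching the right data. The term code '201401' is definitely correct.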

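I noticed that when I open the link manually in a browser, the subject field does not let you pick an option until a term has been selected, yet when I view the page source the data is clearly already there. What can I do to retrieve this data?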

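I looked at the Virginia Tech page and can see that you need to select a TERMYEAR first; only then is the subj_code drop-down populated with options. Unfortunately, this happens in the JavaScript function dropdownlist(listindex). Mechanize cannot execute JavaScript, so the script as written is bound to fail.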
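To see why an HTML-only client comes up empty, compare what is in the static markup with what the script would add. The sketch below uses a made-up, heavily simplified stand-in for the page (the `sample_html` string is an assumption, not the real page source) and only the Ruby standard library:

```ruby
# Hypothetical, simplified stand-in for the timetable page: the <select>
# is empty in the markup and its options are created by JavaScript.
sample_html = <<~HTML
  <form name="ttform">
    <select name="subj_code"></select>
  </form>
  <script>
    document.ttform.subj_code.options[0]=new Option("All Subjects","%",false, false);
    document.ttform.subj_code.options[1]=new Option("AAEC - Agricultural and Applied Economics","AAEC",false, false);
  </script>
HTML

# What an HTML-only client sees: <option> elements present in the markup.
static_options = sample_html.scan(/<option\b/i).size

# What a browser would add after running the script.
scripted_options = sample_html.scan(/new Option\(/).size

puts "static: #{static_options}, scripted: #{scripted_options}"
```

With this sample the static count is 0 and the scripted count is 2, which matches the symptom in the question: the option data is visible in the source, but not as `<option>` elements a form parser can pick up.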

One option is to run a browser automation tool such as Watir or Selenium, as discussed here: How do I use Mechanize to process JavaScript?

The other is to read the page source and parse the values out of lines like these:

document.ttform.subj_code.options[0]=new Option("All Subjects","%",false, false);
document.ttform.subj_code.options[1]=new Option("AAEC - Agricultural and Applied Economics","AAEC",false, false);
document.ttform.subj_code.options[2]=new Option("ACIS - Accounting and Information Systems","ACIS",false, false);

to get the options. You can do this simply by using open-uri:

require 'open-uri'
# Kernel#open no longer accepts URLs as of Ruby 3.0; use URI.open instead
page = URI.open("https://banweb.banner.vt.edu/ssb/prod/HZSKVTSC.P_ProcRequest")
page_source = page.read

Now you can scan for all the options with a regular expression:

page_source.scan(/document\.ttform.+;/)

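This gives you an array of every line of JavaScript that contains an option. Refine the regex a bit and you can extract the option text from those lines. I'll see whether I can come up with some suggestions and post back. Hopefully this points you in the right direction.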
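As a small offline illustration of that scan, the sketch below runs the same regex against the three sample lines quoted earlier instead of the live page (the `page_source` string here is just that excerpt, not the full source):

```ruby
# Sample lines copied from the page source excerpt quoted above.
page_source = <<~JS
  document.ttform.subj_code.options[0]=new Option("All Subjects","%",false, false);
  document.ttform.subj_code.options[1]=new Option("AAEC - Agricultural and Applied Economics","AAEC",false, false);
  document.ttform.subj_code.options[2]=new Option("ACIS - Accounting and Information Systems","ACIS",false, false);
JS

# `.` does not match newlines by default, so each match stays on one line.
option_lines = page_source.scan(/document\.ttform.+;/)

puts option_lines.size
puts option_lines.first
```

Each element of `option_lines` is one complete JavaScript statement, so a second, narrower regex can then pull the quoted strings out of it.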

I'm back. I was able to parse out all the subj_code options with this regex:

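# scan with a capture group returns one-element arrays, so flatten before removing duplicates
subjects = page_source.scan(/Option\("(.*?)"/).flatten.uniq
subjects.shift # get rid of the first option because it's just "All Subjects"
subjects.size # => 137 when this answer was written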
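The regex above captures only the first argument of each Option(...) call, which is the visible label. If the subject codes are needed as well (they are probably what the search form expects as the subj_code value, though that is an assumption), a variation that captures both arguments might look like this, again run against the quoted sample lines rather than the live page:

```ruby
# Sample lines copied from the page source excerpt quoted above.
page_source = <<~JS
  document.ttform.subj_code.options[0]=new Option("All Subjects","%",false, false);
  document.ttform.subj_code.options[1]=new Option("AAEC - Agricultural and Applied Economics","AAEC",false, false);
  document.ttform.subj_code.options[2]=new Option("ACIS - Accounting and Information Systems","ACIS",false, false);
JS

# Capture both arguments of Option(...): the visible label and the form value.
pairs = page_source.scan(/new Option\("([^"]*)","([^"]*)"/)

# Map code => label, dropping the "%" wildcard entry ("All Subjects").
subjects = pairs.each_with_object({}) do |(label, code), result|
  result[code] = label unless code == "%"
end

subjects.each { |code, label| puts "#{code}: #{label}" }
```

Filtering on the "%" value instead of calling shift avoids depending on "All Subjects" being the first entry.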

Hope that helps.

