
Data scraping information from all of the search results pages

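I'm trying to scrape data from the UCAS website to display all of the uni names from every page of a basic search.

So far, without the loop, it displays the names of all the universities on the first page, plus some random information, as shown below: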

"The University of Aberdeen
Abertay University
Aberystwyth University
ABI College
Abingdon and Witney College
The Academy of Contemporary Music
Access to Music
Accrington & Rossendale College
Activate Learning (Oxford, Reading, Banbury & Bicester)
The College of Agriculture, Food and Rural Enterprise
Amersham & Wycombe College
Amsterdam Fashion Academy
Anglia Ruskin University
Anglo European College of Chiropractic
Arden University (RDI)
University of the Arts London
Arts University Bournemouth (formerly University College)
ARU London
Askham Bryan College
Aston University, Birmingham
Availability
Applying through Extra
Single/Combined subjects
Provider types
How you study
Qualification level
Conservatoire specialism"

Here is my code:

require 'rubygems' 
require 'nokogiri'  
require 'open-uri'  
require 'mechanize'

mechanize = Mechanize.new

doc = mechanize.get('http://search.ucas.com/')

form = doc.forms.first

form['Vac'] = '2'  
form['AvailableIn'] = '2016'  
doc = form.submit
doc.search('li.results clearfix').each do |h3|  
  puts h3.text.strip  


  while a = doc.at('div.pagerclearfix a')   
    doc = Nokogiri::HTML(open(a[:href]))    
    doc.search('results clearfix').each do |h3|    
      puts h3.text.strip   

    end 
  end 
end

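Don't require 'rubygems', because that's an anti-pattern. You don't need to require 'nokogiri' either, since Mechanize already does that, and you don't need OpenURI.

Pagination isn't working because the div.pagerclearfix selector doesn't match anything: pager and clearfix are separate classes. Also, the while loop is in the wrong place; it shouldn't be inside the each loop that displays the results.

You should end up with something like this: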

require 'mechanize'

mechanize = Mechanize.new

# Load the search page and fill in the first form.
page = mechanize.get('http://search.ucas.com/')

form = page.forms.first
form['Vac'] = '2'
form['AvailableIn'] = '2016'

page = form.submit

# Print the provider names on the first page of results.
page.search('li.result h3').each do |h3|
  puts h3.text.strip
end

# Follow the ">" link in the pager until there is no next page.
while next_page_link = page.at('.pager a[text()=">"]')
  page = mechanize.get(next_page_link['href'])

  page.search('li.result h3').each do |h3|
    puts h3.text.strip
  end
end

There are many ways to implement pagination; looking for the "next page" link is usually the simplest.
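As a rough sketch of the "follow the next link" idea in isolation, the loop can be pulled into a small helper. Everything here is illustrative: each_page, fetch, next_url and the fake page data are hypothetical names and values, not part of Mechanize or the UCAS site. The helper also stops if a URL repeats, which guards against a pager that links back to a page already visited.

```ruby
require 'set'

# Walks a chain of pages, yielding each one once.
# fetch:    callable that turns a URL into a page object
# next_url: callable that returns the next page's URL, or nil on the last page
def each_page(start_url, fetch:, next_url:, max_pages: 100)
  seen = Set.new
  url = start_url
  while url && !seen.include?(url) && seen.size < max_pages
    seen << url
    page = fetch.call(url)
    yield page
    url = next_url.call(page)
  end
end

# Fake two-page site (hypothetical data) so the loop runs without a network.
fake_site = {
  '/results?page=1' => { names: ['The University of Aberdeen', 'Abertay University'],
                         next: '/results?page=2' },
  '/results?page=2' => { names: ['Aberystwyth University'], next: nil }
}

fetch = ->(url) { fake_site.fetch(url) }
next_url = ->(page) { page[:next] }

names = []
each_page('/results?page=1', fetch: fetch, next_url: next_url) do |page|
  names.concat(page[:names])
end

puts names
```

With Mechanize, fetch could presumably wrap mechanize.get(url), and next_url could return the href of the pager's ">" link when that link exists and nil otherwise.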

