Data scraping information from all of the search results pages

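I'm trying to scrape data from the UCAS website to display all the uni names from every page of a basic search.

So far, without the loop, it displays the names of all the universities from the first page, along with some random information, as shown below: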

"The University of Aberdeen
Abertay University
Aberystwyth University
ABI College
Abingdon and Witney College
The Academy of Contemporary Music
Access to Music
Accrington & Rossendale College
Activate Learning (Oxford, Reading, Banbury & Bicester)
The College of Agriculture, Food and Rural Enterprise
Amersham & Wycombe College
Amsterdam Fashion Academy
Anglia Ruskin University
Anglo European College of Chiropractic
Arden University (RDI)
University of the Arts London
Arts University Bournemouth (formerly University College)
ARU London
Askham Bryan College
Aston University, Birmingham
Availability
Applying through Extra
Single/Combined subjects
Provider types
How you study
Qualification level
Conservatoire specialism"

This is my code:

require 'rubygems' 
require 'nokogiri'  
require 'open-uri'  
require 'mechanize'

mechanize = Mechanize.new

doc = mechanize.get('http://search.ucas.com/')

form = doc.forms.first

form['Vac'] = '2'  
form['AvailableIn'] = '2016'  
doc = form.submit
doc.search('li.results clearfix').each do |h3|  
  puts h3.text.strip  


  while a = doc.at('div.pagerclearfix a')   
    doc = Nokogiri::HTML(open(a[:href]))    
    doc.search('results clearfix').each do |h3|    
      puts h3.text.strip   

    end 
  end 
end

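Don't require 'rubygems', as that is an anti-pattern. You don't need to require 'nokogiri' because Mechanize already does, and you don't need OpenURI either.

Pagination doesn't work because the div.pagerclearfix selector doesn't match anything: pager and clearfix are separate classes, so a selector that requires both would be written div.pager.clearfix. Also, the while loop is in the wrong place; it shouldn't be inside the each loop that displays the results.

You should end up with something like this: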
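To illustrate why the selector fails, here is a minimal plain-Ruby sketch of how compound class selectors are matched. It assumes the pager element is marked up as class="pager clearfix" (inferred from the explanation above, not checked against the live page); class_selector_matches? is a hypothetical helper written for this illustration and is not part of Nokogiri or Mechanize.

```ruby
# Hypothetical helper: returns true when every class named in a compound
# class selector (e.g. ".pager.clearfix") is present in the element's
# whitespace-separated class attribute.
def class_selector_matches?(class_attribute, selector)
  element_classes  = class_attribute.split                 # "pager clearfix" -> ["pager", "clearfix"]
  required_classes = selector.split('.').reject(&:empty?)  # ".pager.clearfix" -> ["pager", "clearfix"]
  (required_classes - element_classes).empty?
end

# Assumed markup: <div class="pager clearfix">
pager_classes = 'pager clearfix'

puts class_selector_matches?(pager_classes, '.pagerclearfix')   # looks for one class named "pagerclearfix"
puts class_selector_matches?(pager_classes, '.pager.clearfix')  # requires both classes
puts class_selector_matches?(pager_classes, '.pager')           # requires only "pager"
```

Only the first check is false, which is why matching on .pager alone is enough to find the pager.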

require 'mechanize'

mechanize = Mechanize.new

page = mechanize.get('http://search.ucas.com/')

# Fill in and submit the basic search form.
form = page.forms.first
form['Vac'] = '2'
form['AvailableIn'] = '2016'

page = form.submit

# Print the provider names from the first page of results.
page.search('li.result h3').each do |h3|
  puts h3.text.strip
end

# Follow the ">" (next page) link until there isn't one.
while next_page_link = page.at('.pager a[text()=">"]')
  page = mechanize.get(next_page_link['href'])

  page.search('li.result h3').each do |h3|
    puts h3.text.strip
  end
end

You can implement pagination in a number of ways; looking for the "next page" link is usually the simplest.
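The "follow the next link until it disappears" pattern can be sketched independently of any scraping library. The example below is only an illustration: PAGES is made-up in-memory data standing in for a paginated site, and collect_all_results and fetch_page are hypothetical names. With Mechanize, the fetch step would be mechanize.get and the next key would come from the pager link.

```ruby
# Made-up stand-in for a paginated site: each page holds some results
# and the key of the next page (nil on the last page).
PAGES = {
  'page1' => { results: ['Alpha University', 'Beta College'], next: 'page2' },
  'page2' => { results: ['Gamma Institute'],                  next: 'page3' },
  'page3' => { results: ['Delta Academy'],                    next: nil }
}.freeze

# Starts at start_key and keeps following :next until a page has none,
# collecting the results from every page visited.
# fetch is any callable that returns a page hash for a given key.
def collect_all_results(start_key, fetch)
  results = []
  key = start_key
  while key
    page = fetch.call(key)
    results.concat(page[:results])
    key = page[:next] # nil on the last page ends the loop
  end
  results
end

fetch_page = ->(key) { PAGES.fetch(key) }
all_names = collect_all_results('page1', fetch_page)
puts all_names
```

Collecting the results into an array rather than printing inside the loop keeps the paging logic in one place and lets the caller decide what to do with the names.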

