Data scraping information from all of the search results pages
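I am trying to scrape data from the UCAS website so that I can display all the university names from every page of a basic search.
So far, without a loop, it displays the names of all the universities on the first page, along with some unrelated text, as shown below: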
"The University of Aberdeen
Abertay University
Aberystwyth University
ABI College
Abingdon and Witney College
The Academy of Contemporary Music
Access to Music
Accrington & Rossendale College
Activate Learning (Oxford, Reading, Banbury & Bicester)
The College of Agriculture, Food and Rural Enterprise
Amersham & Wycombe College
Amsterdam Fashion Academy
Anglia Ruskin University
Anglo European College of Chiropractic
Arden University (RDI)
University of the Arts London
Arts University Bournemouth (formerly University College)
ARU London
Askham Bryan College
Aston University, Birmingham
Availability
Applying through Extra
Single/Combined subjects
Provider types
How you study
Qualification level
Conservatoire specialism"
Here is my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'mechanize'

mechanize = Mechanize.new
doc = mechanize.get('http://search.ucas.com/')
form = doc.forms.first
form['Vac'] = '2'
form['AvailableIn'] = '2016'
doc = form.submit

doc.search('li.results clearfix').each do |h3|
  puts h3.text.strip
  while a = doc.at('div.pagerclearfix a')
    doc = Nokogiri::HTML(open(a[:href]))
    doc.search('results clearfix').each do |h3|
      puts h3.text.strip
    end
  end
end
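You shouldn't require 'rubygems', as it's an anti-pattern. You don't need to require 'nokogiri' because Mechanize already loads it, and you don't need OpenURI either.
The pagination doesn't work because the div.pagerclearfix selector doesn't match anything: pager and clearfix are separate classes. Also, the while loop is in the wrong place; it shouldn't be inside the each loop that prints the results.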
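To illustrate why div.pagerclearfix matches nothing, here is a toy model of how class selectors are matched, written in plain Ruby. It is only a sketch for illustration, not Nokogiri's actual implementation, and class_selector_matches? is a hypothetical helper name. It assumes a simple selector of the form tag.class1.class2.

```ruby
# Toy model of CSS class matching (hypothetical helper, not part of Nokogiri).
# A selector such as "div.pager.clearfix" requires every listed class to be
# present as a separate whitespace-delimited token in the class attribute.
def class_selector_matches?(selector, class_attribute)
  required = selector.split('.').drop(1) # "div.pager.clearfix" -> ["pager", "clearfix"]
  tokens = class_attribute.split         # "pager clearfix"     -> ["pager", "clearfix"]
  required.all? { |name| tokens.include?(name) }
end

class_attr = 'pager clearfix' # i.e. <div class="pager clearfix">

puts class_selector_matches?('div.pagerclearfix', class_attr)  # false
puts class_selector_matches?('div.pager.clearfix', class_attr) # true
puts class_selector_matches?('div.pager', class_attr)          # true
```

The same reasoning applies to 'li.results clearfix' in the question: the space there is a descendant combinator, so the selector looks for a <clearfix> element inside li.results rather than an li with both classes.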
You should end up with something like this:
require 'mechanize'

mechanize = Mechanize.new
page = mechanize.get('http://search.ucas.com/')

# Fill in and submit the basic search form.
form = page.forms.first
form['Vac'] = '2'
form['AvailableIn'] = '2016'
page = form.submit

# Print the provider names from the first results page.
page.search('li.result h3').each do |h3|
  puts h3.text.strip
end

# Follow the ">" (next page) link in the pager until there isn't one.
while next_page_link = page.at('.pager a[text()=">"]')
  page = mechanize.get(next_page_link['href'])
  page.search('li.result h3').each do |h3|
    puts h3.text.strip
  end
end
There are various ways to implement pagination; searching for the "next page" link is usually the simplest.
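The "follow the next link until it is gone" pattern can also be factored into a small helper so that the result-printing code is not duplicated. The following is a minimal sketch, not part of Mechanize: each_page, FAKE_SITE and the max_pages safety limit are all assumptions made for illustration, and plain hashes stand in for real pages so the example runs without network access.

```ruby
# Hypothetical helper: yields the first page, then keeps asking `next_page`
# for the following one until it returns nil or max_pages pages were yielded.
def each_page(first_page, next_page, max_pages = 50)
  return enum_for(:each_page, first_page, next_page, max_pages) unless block_given?

  page = first_page
  max_pages.times do
    yield page
    page = next_page.call(page)
    break if page.nil?
  end
end

# Stand-in data: plain hashes instead of Mechanize pages.
# :next holds the key of the following page, or nil on the last page.
FAKE_SITE = {
  'p1' => { names: ['The University of Aberdeen', 'Abertay University'], next: 'p2' },
  'p2' => { names: ['Aston University, Birmingham'], next: nil }
}.freeze

# Returns the following page, or nil when there is no next link.
follow_next = ->(page) { FAKE_SITE[page[:next]] }

all_names = each_page(FAKE_SITE['p1'], follow_next).flat_map { |page| page[:names] }
puts all_names
```

With Mechanize, the lambda would instead look up the ">" link with page.at and return mechanize.get(link['href']) when a link is found, or nil otherwise. The max_pages limit is an added guard against a pager that never runs out of "next" links.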