
Scraping information from all of the search results pages

I am trying to scrape data from the UCAS website to show the names of all the universities on every page returned by a basic search.

So far, without the loop working, it displays the names of all the universities from page one, along with some unrelated information, as can be seen below:

"The University of Aberdeen
Abertay University
Aberystwyth University
ABI College
Abingdon and Witney College
The Academy of Contemporary Music
Access to Music
Accrington & Rossendale College
Activate Learning (Oxford, Reading, Banbury & Bicester)
The College of Agriculture, Food and Rural Enterprise
Amersham & Wycombe College
Amsterdam Fashion Academy
Anglia Ruskin University
Anglo European College of Chiropractic
Arden University (RDI)
University of the Arts London
Arts University Bournemouth (formerly University College)
ARU London
Askham Bryan College
Aston University, Birmingham
Availability
Applying through Extra
Single/Combined subjects
Provider types
How you study
Qualification level
Conservatoire specialism"

This is my code:

require 'rubygems' 
require 'nokogiri'  
require 'open-uri'  
require 'mechanize'

mechanize = Mechanize.new

doc = mechanize.get('http://search.ucas.com/')

form = doc.forms.first

form['Vac'] = '2'  
form['AvailableIn'] = '2016'  
doc = form.submit
doc.search('li.results clearfix').each do |h3|  
  puts h3.text.strip  


  while a = doc.at('div.pagerclearfix a')   
    doc = Nokogiri::HTML(open(a[:href]))    
    doc.search('results clearfix').each do |h3|    
      puts h3.text.strip   

    end 
  end 
end

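You don't need to require 'rubygems': RubyGems has been loaded automatically since Ruby 1.9, and requiring it explicitly is considered an anti-pattern. You don't need to require 'nokogiri' either, as Mechanize already requires it, and you don't need OpenURI because Mechanize can fetch the pages itself.

The pagination isn't working because the div.pagerclearfix selector doesn't match anything: pager and clearfix are separate classes, so the selector would have to be div.pager.clearfix (or simply .pager). The while loop is also in the wrong place; it shouldn't be inside the each loop that prints the results.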
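
To illustrate why the fused selector fails, here is a rough pure-Ruby model of class matching. The class_selector_matches? helper is hypothetical (it is not part of Nokogiri or Mechanize), and the "pager clearfix" class attribute is assumed from the selectors discussed above:

```ruby
# Hypothetical helper modelling how a CSS class selector is matched:
# every class named in the selector must appear as a separate,
# whitespace-delimited token in the element's class attribute.
def class_selector_matches?(class_attr, selector)
  wanted = selector.split('.').reject(&:empty?) # ".pager.clearfix" => ["pager", "clearfix"]
  tokens = class_attr.split                     # "pager clearfix"  => ["pager", "clearfix"]
  wanted.all? { |name| tokens.include?(name) }
end

class_attr = 'pager clearfix' # assumed markup: <div class="pager clearfix">

puts class_selector_matches?(class_attr, '.pagerclearfix')  # false
puts class_selector_matches?(class_attr, '.pager.clearfix') # true
puts class_selector_matches?(class_attr, '.pager')          # true
```

The fused name is treated as one class that the element does not have, which is why at returns nil and the loop body never runs.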

What you should end up with is something like this:

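require 'mechanize'

mechanize = Mechanize.new

# Load the search page and fill in the first form on it.
page = mechanize.get('http://search.ucas.com/')

form = page.forms.first
form['Vac'] = '2'
form['AvailableIn'] = '2016'

# Submitting the form returns the first page of results.
page = form.submit

# Print the provider names on the first page.
page.search('li.result h3').each do |h3|
  puts h3.text.strip
end

# Follow the ">" (next page) link until there isn't one.
while next_page_link = page.at('.pager a[text()=">"]')
  page = mechanize.get(next_page_link['href'])

  page.search('li.result h3').each do |h3|
    puts h3.text.strip
  end
end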

There are a variety of ways you could implement the pagination; searching for the "next page" link is usually the most straightforward.
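
As one possible variation, the "follow the next link until it disappears" loop can be pulled into a small helper so the printing code isn't written twice. This is only a sketch: each_page and FAKE_SITE are hypothetical names, and the fake pages are stand-in data (reusing names from the question's output) so the example runs without network access:

```ruby
# Hypothetical helper: yields the first page, then keeps following the
# "next" link until next_link returns nil.
def each_page(first_page, next_link:, fetch:)
  page = first_page
  loop do
    yield page
    link = next_link.call(page)
    break unless link
    page = fetch.call(link)
  end
end

# Stand-in data: each fake page lists some names and the key of the
# next page (nil on the last page).
FAKE_SITE = {
  'p1' => { names: ['The University of Aberdeen', 'Abertay University'], next: 'p2' },
  'p2' => { names: ['Aston University, Birmingham'], next: nil }
}.freeze

names = []
each_page(FAKE_SITE['p1'],
          next_link: ->(page) { page[:next] },
          fetch: ->(key) { FAKE_SITE[key] }) do |page|
  names.concat(page[:names])
end

puts names
```

With Mechanize, next_link would presumably be something like ->(page) { page.at('.pager a[text()=">"]') } and fetch something like ->(link) { mechanize.get(link['href']) }, with the block printing each li.result h3.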


 