
Pagination in web scraping using jsoup in Java Swing

private void EducationWorld_Webscrap_jButtonActionPerformed(java.awt.event.ActionEvent evt)
{                                                                
     try
     {
         Document doc=Jsoup.connect("http://www.educationworld.in/institution/mumbai/schools").userAgent("Mozilla/17.0").get();
         Elements  links=doc.select("div.instnm.litblue_bg");
         StringBuilder sb1 = new StringBuilder ();
         links.stream().forEach(e->sb1.append(e.text()).append(System.getProperty("line.separator")));
         jTextArea1.setText(sb1.toString());
     }
     catch(Exception e)
     {
         JOptionPane.showMessageDialog(null, e);
     }
} 

This displays the data from the first page, but the results are paginated. How can I fetch the data from the next five pages?

Fortunately, I've managed to do what you're after, as you can see in the code block below. I've added comments that describe each step in case it isn't clear what is happening.

I tried playing around with the site's pagination settings, but it seems to return only 5 results per request, so there wasn't much leeway. You also need to pass the starting point before it can retrieve the next 5 results.
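To make "passing the starting point" concrete: the request body is just a form-encoded string of key/value pairs. Here is a minimal sketch of building one. The parameter names qstart and limit are the ones used in the answer's code; the class and method names (FormBody, encode) are made up for illustration and are not part of jsoup or the site.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class FormBody
{
    // Hypothetical helper: joins the pairs as key=value&key=value, URL-encoding both sides
    static String encode( Map<String, String> params )
    {
        final StringBuilder body = new StringBuilder();
        try
        {
            for( Map.Entry<String, String> entry : params.entrySet() )
            {
                if( body.length() > 0 )
                {
                    body.append( '&' );
                }
                body.append( URLEncoder.encode( entry.getKey(), "UTF-8" ) )
                    .append( '=' )
                    .append( URLEncoder.encode( entry.getValue(), "UTF-8" ) );
            }
        }
        catch( UnsupportedEncodingException e )
        {
            // UTF-8 is always available on a Java platform, so this should not happen
            throw new IllegalStateException( e );
        }
        return body.toString();
    }

    public static void main( String[] args )
    {
        // LinkedHashMap keeps the parameters in insertion order
        final Map<String, String> params = new LinkedHashMap<>();
        params.put( "qstart", "5" );
        params.put( "limit", "5" );
        System.out.println( encode( params ) ); // prints qstart=5&limit=5
    }
}
```

For two numeric parameters plain string concatenation works just as well; the encoding step only matters once a value can contain spaces or symbols.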

Therefore, I've had to wrap the request in a for loop that runs 32 times: 158 schools divided by 5 per page is 31.6, which rounds up to 32. Of course, if you only want the first 5 pages, you can change the loop to run only 5 times.
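The arithmetic can be checked with a few lines of Java. This is only a sketch: the figures 158 and 5 come from the paragraph above, while the class and method names (PageMath, pageCount, startOffset) are invented for the example.

```java
final class PageMath
{
    // Integer ceiling division: pages needed to list `total` items, `perPage` at a time
    static int pageCount( int total, int perPage )
    {
        return ( total + perPage - 1 ) / perPage;
    }

    // Start offset of a zero-based page, i.e. the value to send as qstart
    static int startOffset( int page, int perPage )
    {
        return page * perPage;
    }

    public static void main( String[] args )
    {
        final int pages = pageCount( 158, 5 );
        System.out.println( pages );                       // prints 32
        System.out.println( startOffset( pages - 1, 5 ) ); // prints 155
    }
}
```

Computing the count this way avoids hard-coding 32 if the number of schools on the site changes.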

Anyway, on to the juicy bit:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.*;
import java.net.*;

public class Loop
{
    public static void main( String[] args )
    {
        final StringBuilder sb1 = new StringBuilder();
        BufferedReader bufferedReader = null;
        OutputStream outputStream = null;

        try
        {
            // Parameter pagination counts
            int startCount = 0;
            int limitCount = 5;

            // Loop 32 times, 158 schools / 5 (pagination amount)
            for( int i = 0; i < 32; i++ )
            {
                // Open a connection to the supplied URL
                final URLConnection urlConnection = new URL( "http://www.educationworld.in/institution/mumbai/schools" ).openConnection();
                // Tell the URL we are sending output
                urlConnection.setDoOutput( true );
                // The stream we will be writing to the URL
                outputStream = urlConnection.getOutputStream();

                // Setup parameters for pagination
                final String params = "qstart=" + startCount + "&limit=" + limitCount;
                // Get the bytes of the pagination parameters
                final byte[] outputInBytes = params.getBytes( "UTF-8" );
                // Write the bytes to the URL
                outputStream.write( outputInBytes );

                // Get and read the URL response
                bufferedReader = new BufferedReader( new InputStreamReader( urlConnection.getInputStream() ) );
                StringBuilder response = new StringBuilder();
                String inputLine;

                // Loop over the response and read each line appending it to the StringBuilder
                while( (inputLine = bufferedReader.readLine()) != null )
                {
                    // Keep the line break so text from adjacent lines is not glued together
                    response.append( inputLine ).append( '\n' );
                }

                // Do the same as before just with a String instead
                final Document doc = Jsoup.parse( response.toString() );
                Elements links = doc.select( "div.instnm.litblue_bg" );
                links.forEach( e -> sb1.append( e.text() ).append( System.getProperty( "line.separator" ) ) );

                // Increment the pagination parameters
                startCount += 5;
                limitCount += 5;
            }

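            System.out.println( sb1.toString() );
            // jTextArea1 does not exist in this standalone class. Inside the Swing
            // handler from the question, call jTextArea1.setText( sb1.toString() ) here instead.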
        }
        catch( Exception e )
        {
            e.printStackTrace();
        }
        finally
        {
            try
            {
                // Close the bufferedReader
                if( bufferedReader != null )
                {
                    bufferedReader.close();
                }

                // Close the outputStream
                if( outputStream != null )
                {
                    outputStream.close();
                }
            }
            catch( IOException e )
            {
                e.printStackTrace();
            }
        }
    }
}
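One thing to be aware of in the class above: bufferedReader and outputStream are reassigned on every iteration, so the finally block only closes the last pair. A possible way around that, sketched here under the assumption that you are free to restructure the loop body, is to read each response inside a try-with-resources block so it is closed before the next request. The helper name ReadAll.readAll is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ReadAll
{
    // Reads everything from `source` and closes it, even if reading fails part-way
    static String readAll( Reader source )
    {
        final StringBuilder text = new StringBuilder();
        try( BufferedReader reader = new BufferedReader( source ) )
        {
            String line;
            while( (line = reader.readLine()) != null )
            {
                text.append( line ).append( '\n' );
            }
        }
        catch( IOException e )
        {
            // Rethrow unchecked so callers (e.g. a lambda) need no throws clause
            throw new UncheckedIOException( e );
        }
        return text.toString();
    }

    public static void main( String[] args )
    {
        // A StringReader stands in for the InputStreamReader of a real response
        System.out.print( readAll( new StringReader( "<div>one</div>\n<div>two</div>" ) ) );
    }
}
```

In the loop this would be called as readAll( new InputStreamReader( urlConnection.getInputStream() ) ), with the result passed to Jsoup.parse as before.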

Hopefully this helps and you get the outcome you want. If you need anything explained, just ask!


 