Pagination in web scraping using jsoup in Java Swing

private void EducationWorld_Webscrap_jButtonActionPerformed(java.awt.event.ActionEvent evt)
{                                                                
     try
     {
         Document doc=Jsoup.connect("http://www.educationworld.in/institution/mumbai/schools").userAgent("Mozilla/17.0").get();
         Elements  links=doc.select("div.instnm.litblue_bg");
         StringBuilder sb1 = new StringBuilder ();
         links.stream().forEach(e->sb1.append(e.text()).append(System.getProperty("line.separator")));
         jTextArea1.setText(sb1.toString());
     }
     catch(Exception e)
     {
         JOptionPane.showMessageDialog(null, e);
     }
} 

This displays the data, but the site is paginated. How do I fetch the data from the next five pages?

Fortunately, I've achieved what you're after, as you can see in the code block below. I've added comments that should describe each step in case you're not sure what is happening.

I tried playing around with the site's pagination settings, but they seem to allow only 5 results per request, so there wasn't much leeway, and you need to pass the starting point before it can retrieve the next 5 results.

Therefore, I've had to put the request in a for loop that runs 32 times: 158 schools divided by 5 is 31.6, which rounds up to 32. Of course, if you only want the first 5 pages, you can change the loop to run only 5 times.
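The loop count above is a ceiling division. As a minimal sketch of that arithmetic (the class and method names `PageMath`, `pageCount` and `startOffsets` are made up for illustration, and the totals are the 158 schools and 5 results per request mentioned above):

```java
import java.util.ArrayList;
import java.util.List;

public class PageMath
{
    // Number of requests needed to cover totalItems when each request returns pageSize items
    public static int pageCount( int totalItems, int pageSize )
    {
        // Integer ceiling division: 158 / 5 = 31.6, which becomes 32
        return ( totalItems + pageSize - 1 ) / pageSize;
    }

    // Starting offset for each request: 0, 5, 10, ...
    public static List<Integer> startOffsets( int totalItems, int pageSize )
    {
        final List<Integer> offsets = new ArrayList<>();
        for( int start = 0; start < totalItems; start += pageSize )
        {
            offsets.add( start );
        }
        return offsets;
    }

    public static void main( String[] args )
    {
        System.out.println( pageCount( 158, 5 ) );            // 32
        System.out.println( startOffsets( 158, 5 ).size() );  // 32
    }
}
```

If the number of schools on the site changes, only the first argument needs updating; the last offset for 158 schools is 155.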

Anyway, on to the juicy bit:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.*;
import java.net.*;

public class Loop
{
    public static void main( String[] args )
    {
        final StringBuilder sb1 = new StringBuilder();
        BufferedReader bufferedReader = null;
        OutputStream outputStream = null;

        try
        {
            // Parameter pagination counts
            int startCount = 0;
            int limitCount = 5;

            // Loop 32 times: 158 schools / 5 per request = 31.6, rounded up to 32
            for( int i = 0; i < 32; i++ )
            {
                // Open a connection to the supplied URL
                final URLConnection urlConnection = new URL( "http://www.educationworld.in/institution/mumbai/schools" ).openConnection();
                // Tell the URL we are sending output
                urlConnection.setDoOutput( true );
                // The stream we will be writing to the URL
                outputStream = urlConnection.getOutputStream();

                // Setup parameters for pagination
                final String params = "qstart=" + startCount + "&limit=" + limitCount;
                // Get the bytes of the pagination parameters
                final byte[] outputInBytes = params.getBytes( "UTF-8" );
                // Write the bytes to the URL
                outputStream.write( outputInBytes );

                // Get and read the URL response
                bufferedReader = new BufferedReader( new InputStreamReader( urlConnection.getInputStream() ) );
                StringBuilder response = new StringBuilder();
                String inputLine;

                // Loop over the response and read each line appending it to the StringBuilder
                while( (inputLine = bufferedReader.readLine()) != null )
                {
                    response.append( inputLine );
                }

                // Do the same as before just with a String instead
                final Document doc = Jsoup.parse( response.toString() );
                Elements links = doc.select( "div.instnm.litblue_bg" );
                links.forEach( e -> sb1.append( e.text() ).append( System.getProperty( "line.separator" ) ) );

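                // Close this request's streams before the next iteration reuses the variables
                bufferedReader.close();
                outputStream.close();

                // Increment the pagination parameters
                startCount += 5;
                limitCount += 5;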
            }

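            System.out.println( sb1.toString() );
            // jTextArea1 does not exist in this standalone class; inside the Swing
            // button handler, set the text area here instead:
            // jTextArea1.setText( sb1.toString() );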
        }
        catch( Exception e )
        {
            e.printStackTrace();
        }
        finally
        {
            try
            {
                // Close the bufferedReader
                if( bufferedReader != null )
                {
                    bufferedReader.close();
                }

                // Close the outputStream
                if( outputStream != null )
                {
                    outputStream.close();
                }
            }
            catch( IOException e )
            {
                e.printStackTrace();
            }
        }
    }
}
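Since the question runs this from a Swing button handler, 32 sequential requests would freeze the UI if they ran on the event dispatch thread. One possible way around that is a `SwingWorker`. The sketch below is only an illustration: `PagedFetchWorker`, `collect` and the `fetchPage` function (start offset in, list of school names out) are hypothetical names, and the actual HTTP request and parsing would go inside whatever you pass as `fetchPage`.

```java
import javax.swing.JTextArea;
import javax.swing.SwingWorker;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.function.Function;

public class PagedFetchWorker extends SwingWorker<String, Void>
{
    private final int pages;
    private final int pageSize;
    // Hypothetical page fetcher: takes a start offset, returns the names found on that page
    private final Function<Integer, List<String>> fetchPage;
    private final JTextArea target;

    public PagedFetchWorker( int pages, int pageSize,
                             Function<Integer, List<String>> fetchPage, JTextArea target )
    {
        this.pages = pages;
        this.pageSize = pageSize;
        this.fetchPage = fetchPage;
        this.target = target;
    }

    // Fetches each page in turn and joins all names, one per line
    public static String collect( int pages, int pageSize,
                                  Function<Integer, List<String>> fetchPage )
    {
        final StringBuilder sb = new StringBuilder();
        for( int i = 0; i < pages; i++ )
        {
            for( String name : fetchPage.apply( i * pageSize ) )
            {
                sb.append( name ).append( System.lineSeparator() );
            }
        }
        return sb.toString();
    }

    @Override
    protected String doInBackground()
    {
        // Runs on a background thread, so slow network calls do not block the UI
        return collect( pages, pageSize, fetchPage );
    }

    @Override
    protected void done()
    {
        // Runs on the event dispatch thread, where touching Swing components is safe
        try
        {
            target.setText( get() );
        }
        catch( InterruptedException e )
        {
            Thread.currentThread().interrupt();
        }
        catch( ExecutionException e )
        {
            target.setText( "Fetch failed: " + e.getCause() );
        }
    }
}
```

In the button handler this could be started with something like `new PagedFetchWorker( 5, 5, start -> fetchNames( start ), jTextArea1 ).execute();`, where `fetchNames` is a placeholder for a method that performs one request and returns the selected element texts.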

Hopefully this helps and you get the outcome you want. If you need anything explained, just ask!
