Pagination in web scraping using jsoup in Java Swing

private void EducationWorld_Webscrap_jButtonActionPerformed(java.awt.event.ActionEvent evt)
{                                                                
     try
     {
         Document doc=Jsoup.connect("http://www.educationworld.in/institution/mumbai/schools").userAgent("Mozilla/17.0").get();
         Elements  links=doc.select("div.instnm.litblue_bg");
         StringBuilder sb1 = new StringBuilder ();
         links.stream().forEach(e->sb1.append(e.text()).append(System.getProperty("line.separator")));
         jTextArea1.setText(sb1.toString());
     }
     catch(Exception e)
     {
         JOptionPane.showMessageDialog(null, e);
     }
} 

This displays the data, but the results are paginated. How do I get the data for the next five pages?

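Fortunately, I have managed to achieve what you are after, as you can see in the code block below. In case you are unsure of what is happening, I have added comments that should describe each step.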

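I tried playing with the site's pagination settings, but they only seem to allow 5 more results per request, so there isn't much leeway. You need to pass the starting point in order to retrieve the next 5 results.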

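Therefore, I have had to wrap the request in an indexed for loop that runs 32 times: 158 schools divided by 5 per request is 31.6, which rounds up to 32. Of course, if you only want the first 5 pages, you can change the loop to run only 5 times.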
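The arithmetic above can be sketched as a small helper. This is only an illustration: the class and method names (`PaginationParams`, `pageCount`, `buildParams`) are hypothetical, and the `qstart`/`limit` parameter names are taken from the answer's code rather than from any documented API of the site. It computes the number of requests with a ceiling division and lists the form bodies the loop would send.

```java
import java.util.ArrayList;
import java.util.List;

class PaginationParams
{
    // Ceiling division: how many requests are needed to cover every item.
    static int pageCount( int totalItems, int pageSize )
    {
        return ( totalItems + pageSize - 1 ) / pageSize;
    }

    // One form body per request. Both qstart and limit advance by pageSize
    // each time, mirroring what the answer's loop sends (an assumption about
    // how the site interprets them, not a documented contract).
    static List<String> buildParams( int totalItems, int pageSize )
    {
        final List<String> params = new ArrayList<>();
        final int pages = pageCount( totalItems, pageSize );
        int start = 0;
        int limit = pageSize;
        for( int i = 0; i < pages; i++ )
        {
            params.add( "qstart=" + start + "&limit=" + limit );
            start += pageSize;
            limit += pageSize;
        }
        return params;
    }

    public static void main( String[] args )
    {
        // 158 schools at 5 per request
        System.out.println( pageCount( 158, 5 ) );
        // Form bodies for a smaller, made-up total of 25 items
        buildParams( 25, 5 ).forEach( System.out::println );
    }
}
```

Computing the count from the total keeps the loop bound correct if the number of schools changes, instead of hardcoding 32.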

Anyway, on to the juicy bit:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.*;
import java.net.*;

public class Loop
{
    public static void main( String[] args )
    {
        final StringBuilder sb1 = new StringBuilder();
        BufferedReader bufferedReader = null;
        OutputStream outputStream = null;

        try
        {
            // Parameter pagination counts
            int startCount = 0;
            int limitCount = 5;

            // Loop 32 times, 158 schools / 5 (pagination amount)
            for( int i = 0; i < 32; i++ )
            {
                // Open a connection to the supplied URL
                final URLConnection urlConnection = new URL( "http://www.educationworld.in/institution/mumbai/schools" ).openConnection();
                // Tell the URL we are sending output
                urlConnection.setDoOutput( true );
                // The stream we will be writing to the URL
                outputStream = urlConnection.getOutputStream();

                // Setup parameters for pagination
                final String params = "qstart=" + startCount + "&limit=" + limitCount;
                // Get the bytes of the pagination parameters
                final byte[] outputInBytes = params.getBytes( "UTF-8" );
                // Write the bytes to the URL, then close the stream so it is not leaked on the next iteration
                outputStream.write( outputInBytes );
                outputStream.close();

                // Get and read the URL response
                bufferedReader = new BufferedReader( new InputStreamReader( urlConnection.getInputStream() ) );
                StringBuilder response = new StringBuilder();
                String inputLine;

                // Loop over the response and read each line, appending it to the StringBuilder
                while( (inputLine = bufferedReader.readLine()) != null )
                {
                    response.append( inputLine );
                }
                // Close this page's reader before the next request
                bufferedReader.close();

                // Do the same as before just with a String instead
                final Document doc = Jsoup.parse( response.toString() );
                Elements links = doc.select( "div.instnm.litblue_bg" );
                links.forEach( e -> sb1.append( e.text() ).append( System.getProperty( "line.separator" ) ) );

                // Increment the pagination parameters
                startCount += 5;
                limitCount += 5;
            }

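            System.out.println( sb1.toString() );
            // In the Swing application, pass this text to jTextArea1.setText( ... ) from the
            // button handler; jTextArea1 is not in scope inside this standalone main method.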
        }
        catch( Exception e )
        {
            e.printStackTrace();
        }
        finally
        {
            try
            {
                // Close the bufferedReader
                if( bufferedReader != null )
                {
                    bufferedReader.close();
                }

                // Close the outputStream
                if( outputStream != null )
                {
                    outputStream.close();
                }
            }
            catch( IOException e )
            {
                e.printStackTrace();
            }
        }
    }
}
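As a possible variation on the fixed 32-iteration loop, the sketch below stops as soon as a page comes back empty. It assumes the site returns no matching elements once the start offset passes the last school, which I have not verified. `PagedCollector`, `collect` and the `fetchPage` function are hypothetical names; in the real program `fetchPage` would wrap the request and the `doc.select(...)` call from the code above, while here a fake in-memory list stands in for it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

class PagedCollector
{
    // Requests pages until one comes back empty or maxPages is reached.
    // fetchPage receives the start offset and returns the names on that page.
    static List<String> collect( IntFunction<List<String>> fetchPage, int pageSize, int maxPages )
    {
        final List<String> all = new ArrayList<>();
        for( int page = 0; page < maxPages; page++ )
        {
            final List<String> names = fetchPage.apply( page * pageSize );
            if( names.isEmpty() )
            {
                break; // No more results, stop early
            }
            all.addAll( names );
        }
        return all;
    }

    public static void main( String[] args )
    {
        // Stand-in for the real HTTP request: 12 made-up schools served 5 at a time
        final List<String> fake = new ArrayList<>();
        for( int i = 1; i <= 12; i++ )
        {
            fake.add( "School " + i );
        }
        final IntFunction<List<String>> fakeFetch = start ->
            start >= fake.size()
                ? Collections.<String>emptyList()
                : fake.subList( start, Math.min( start + 5, fake.size() ) );

        System.out.println( collect( fakeFetch, 5, 32 ).size() );
    }
}
```

The `maxPages` cap is kept so that a site which never returns an empty page cannot make the loop run forever.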

I hope this helps and you get the result you are after. If you need anything explained, please ask!
