Pagination in web scraping using jsoup in Java Swing
private void EducationWorld_Webscrap_jButtonActionPerformed(java.awt.event.ActionEvent evt)
{
try
{
Document doc=Jsoup.connect("http://www.educationworld.in/institution/mumbai/schools").userAgent("Mozilla/17.0").get();
Elements links=doc.select("div.instnm.litblue_bg");
StringBuilder sb1 = new StringBuilder ();
links.stream().forEach(e->sb1.append(e.text()).append(System.getProperty("line.separator")));
jTextArea1.setText(sb1.toString());
}
catch(Exception e)
{
JOptionPane.showMessageDialog(null, e);
}
}
This displays the data, but the site is paginated. How can I fetch the data from the next five pages?
Fortunately, I've achieved what you're after, as you can see in the code block below. I've added comments that should describe each step in case you are not sure what is happening.
I tried playing around with the pagination settings of the site, but they seem to allow only increments of 5 results per request, so there wasn't much leeway, and you need to pass the starting point before it can retrieve the next 5 results.
Therefore, I've had to wrap the request in a for loop that runs 32 times: 158 schools divided by 5 equals 31.6, which rounds up to 32. Of course, if you only want the first 5 pages, you can change the loop to run only 5 times.
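The page arithmetic can be sketched as follows. This is only an illustration under stated assumptions: the class `PageOffsets` and its methods are hypothetical names, not part of jsoup or the site, and the figures (158 schools, 5 per page) are the ones quoted in this answer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the names are illustrative only.
final class PageOffsets
{
    // Ceiling division: 158 items at 5 per page gives 32 pages (31.6 rounded up)
    static int pageCount( int totalItems, int pageSize )
    {
        return (totalItems + pageSize - 1) / pageSize;
    }

    // Starting offsets (the value sent as qstart) for at most maxPages pages
    static List<Integer> startOffsets( int totalItems, int pageSize, int maxPages )
    {
        final int pages = Math.min( pageCount( totalItems, pageSize ), maxPages );
        final List<Integer> offsets = new ArrayList<>();
        for( int i = 0; i < pages; i++ )
        {
            offsets.add( i * pageSize );
        }
        return offsets;
    }

    public static void main( String[] args )
    {
        System.out.println( pageCount( 158, 5 ) );       // prints 32
        System.out.println( startOffsets( 158, 5, 5 ) ); // prints [0, 5, 10, 15, 20]
    }
}
```

Passing 5 as `maxPages` yields only the offsets for the first five pages; passing 32 or more covers the whole list.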
Anyway, on to the juicy bit:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import java.io.*;
import java.net.*;
public class Loop
{
public static void main( String[] args )
{
final StringBuilder sb1 = new StringBuilder();
BufferedReader bufferedReader = null;
OutputStream outputStream = null;
try
{
// Parameter pagination counts
int startCount = 0;
int limitCount = 5;
// Loop 32 times, 158 schools / 5 (pagination amount)
for( int i = 0; i < 32; i++ )
{
// Open a connection to the supplied URL
final URLConnection urlConnection = new URL( "http://www.educationworld.in/institution/mumbai/schools" ).openConnection();
// Tell the URL we are sending output
urlConnection.setDoOutput( true );
// The stream we will be writing to the URL
outputStream = urlConnection.getOutputStream();
// Setup parameters for pagination
final String params = "qstart=" + startCount + "&limit=" + limitCount;
// Get the bytes of the pagination parameters
final byte[] outputInBytes = params.getBytes( "UTF-8" );
// Write the bytes to the URL
outputStream.write( outputInBytes );
// Get and read the URL response
bufferedReader = new BufferedReader( new InputStreamReader( urlConnection.getInputStream() ) );
StringBuilder response = new StringBuilder();
String inputLine;
// Loop over the response and read each line appending it to the StringBuilder
while( (inputLine = bufferedReader.readLine()) != null )
{
response.append( inputLine );
}
// Do the same as before just with a String instead
final Document doc = Jsoup.parse( response.toString() );
Elements links = doc.select( "div.instnm.litblue_bg" );
links.forEach( e -> sb1.append( e.text() ).append( System.getProperty( "line.separator" ) ) );
// Increment the pagination parameters
startCount += 5;
limitCount += 5;
}
System.out.println( sb1.toString() );
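// jTextArea1 does not exist in this standalone class; inside the Swing
// handler, call jTextArea1.setText( sb1.toString() ) here instead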
}
catch( Exception e )
{
e.printStackTrace();
}
finally
{
try
{
// Close the bufferedReader
if( bufferedReader != null )
{
bufferedReader.close();
}
// Close the outputStream
if( outputStream != null )
{
outputStream.close();
}
}
catch( IOException e )
{
e.printStackTrace();
}
}
}
}
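One caveat about the listing: the reader and output stream are reassigned on every loop iteration, so the finally block closes only the last pair. A possible alternative, sketched here rather than taken from the original answer, is to read each response inside try-with-resources. The class `PageReader` and the method `readAll` are hypothetical names; the method accepts any `Reader`, so it can be tried with a `StringReader` before passing in `new InputStreamReader( urlConnection.getInputStream() )`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper; the names are illustrative only.
final class PageReader
{
    // Reads the whole source into a String; the reader is closed
    // automatically when the try block exits, even on failure
    static String readAll( Reader source )
    {
        final StringBuilder response = new StringBuilder();
        try( BufferedReader reader = new BufferedReader( source ) )
        {
            String line;
            while( (line = reader.readLine()) != null )
            {
                // Keep a newline between lines so adjacent text does not run together
                response.append( line ).append( '\n' );
            }
        }
        catch( IOException e )
        {
            throw new UncheckedIOException( e );
        }
        return response.toString();
    }

    public static void main( String[] args )
    {
        System.out.print( readAll( new StringReader( "<div>a</div>\n<div>b</div>" ) ) );
    }
}
```

The resulting String can be handed to `Jsoup.parse` exactly as in the loop above.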
Hopefully this helps and you get the outcome you want. If you need anything explained, just ask!