简体   繁体   中英

Pagination with Web Driver Selenium and JSoup

I'm developing an app that takes data from a website with JSoup. I was able to get the normal data.

But now I need to implement a pagination on it. I was told it would have to be with Web Driver, Selenium. But I do not know how to work with him, could someone tell me how I can do it?

public class MainActivity extends AppCompatActivity {

   private String url = "http://www.yudiz.com/blog/";
   private ArrayList<String> mAuthorNameList = new ArrayList<>();
   private ArrayList<String> mBlogUploadDateList = new ArrayList<>();
   private ArrayList<String> mBlogTitleList = new ArrayList<>();

   @Override
   protected void onCreate(Bundle savedInstanceState) {
       super.onCreate(savedInstanceState);
       setContentView(R.layout.activity_main);
       new Description().execute();

   }

   private class Description extends AsyncTask<Void, Void, Void> {

       @Override
       protected Void doInBackground(Void... params) {
           try {
               // Connect to the web site
               Document mBlogDocument = Jsoup.connect(url).get();
               // Using Elements to get the Meta data
               Elements mElementDataSize = mBlogDocument.select("div[class=author-date]");
               // Locate the content attribute
               int mElementSize = mElementDataSize.size();

               for (int i = 0; i < mElementSize; i++) {
                   Elements mElementAuthorName = mBlogDocument.select("span[class=vcard author post-author test]").select("a").eq(i);
                   String mAuthorName = mElementAuthorName.text();

                   Elements mElementBlogUploadDate = mBlogDocument.select("span[class=post-date updated]").eq(i);
                   String mBlogUploadDate = mElementBlogUploadDate.text();

                   Elements mElementBlogTitle = mBlogDocument.select("h2[class=entry-title]").select("a").eq(i);
                   String mBlogTitle = mElementBlogTitle.text();

                   mAuthorNameList.add(mAuthorName);
                   mBlogUploadDateList.add(mBlogUploadDate);
                   mBlogTitleList.add(mBlogTitle);
               }
           } catch (IOException e) {
               e.printStackTrace();
           }
           return null;
       }

       @Override
       protected void onPostExecute(Void result) {
           // Set description into TextView

           RecyclerView mRecyclerView = (RecyclerView)findViewById(R.id.act_recyclerview);

           DataAdapter mDataAdapter = new DataAdapter(MainActivity.this, mBlogTitleList, mAuthorNameList, mBlogUploadDateList);
           RecyclerView.LayoutManager mLayoutManager = new LinearLayoutManager(getApplicationContext());
           mRecyclerView.setLayoutManager(mLayoutManager);
           mRecyclerView.setAdapter(mDataAdapter);

       }
   }
}

Problem statement (as per my understanding): Scraper should be able to go to the next page until all pages are done using the pagination options available at the end of the blog page.

Now if we inspect the next button in the pagination, we can see the following html. a class="next_page" href="http://www.yudiz.com/blog/page/2/"

Now we need to instruct Jsoup to pick up this dynamic url in the next iteration of the loop to scrap data. This can be done using the following approach:

        String url = "http://www.yudiz.com/blog/";
        while (url!=null){
            try {
                Document doc = Jsoup.connect(url).get();
                url = null;
                System.out.println(doc.getElementsByTag("title").text());
                for (Element urls : doc.getElementsByClass("next_page")){
                    //perform your data extractions here.
                    url = urls != null ? urls.absUrl("href") : null;
                }               
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM