
Pagination with Web Driver Selenium and JSoup

I'm developing an app that scrapes data from a website with JSoup, and I was able to get the basic data.

But now I need to implement pagination. I was told it would have to be done with Selenium WebDriver, but I don't know how to work with it. Could someone tell me how I can do it?

public class MainActivity extends AppCompatActivity {

   private String url = "http://www.yudiz.com/blog/";
   private ArrayList<String> mAuthorNameList = new ArrayList<>();
   private ArrayList<String> mBlogUploadDateList = new ArrayList<>();
   private ArrayList<String> mBlogTitleList = new ArrayList<>();

   @Override
   protected void onCreate(Bundle savedInstanceState) {
       super.onCreate(savedInstanceState);
       setContentView(R.layout.activity_main);
       new Description().execute();

   }

   private class Description extends AsyncTask<Void, Void, Void> {

       @Override
       protected Void doInBackground(Void... params) {
           try {
               // Connect to the web site
               Document mBlogDocument = Jsoup.connect(url).get();
               // Using Elements to get the Meta data
               Elements mElementDataSize = mBlogDocument.select("div[class=author-date]");
               // Locate the content attribute
               int mElementSize = mElementDataSize.size();

               for (int i = 0; i < mElementSize; i++) {
                   Elements mElementAuthorName = mBlogDocument.select("span[class=vcard author post-author test]").select("a").eq(i);
                   String mAuthorName = mElementAuthorName.text();

                   Elements mElementBlogUploadDate = mBlogDocument.select("span[class=post-date updated]").eq(i);
                   String mBlogUploadDate = mElementBlogUploadDate.text();

                   Elements mElementBlogTitle = mBlogDocument.select("h2[class=entry-title]").select("a").eq(i);
                   String mBlogTitle = mElementBlogTitle.text();

                   mAuthorNameList.add(mAuthorName);
                   mBlogUploadDateList.add(mBlogUploadDate);
                   mBlogTitleList.add(mBlogTitle);
               }
           } catch (IOException e) {
               e.printStackTrace();
           }
           return null;
       }

       @Override
       protected void onPostExecute(Void result) {
           // Set description into TextView

           RecyclerView mRecyclerView = (RecyclerView)findViewById(R.id.act_recyclerview);

           DataAdapter mDataAdapter = new DataAdapter(MainActivity.this, mBlogTitleList, mAuthorNameList, mBlogUploadDateList);
           RecyclerView.LayoutManager mLayoutManager = new LinearLayoutManager(getApplicationContext());
           mRecyclerView.setLayoutManager(mLayoutManager);
           mRecyclerView.setAdapter(mDataAdapter);

       }
   }
}

Problem statement (as I understand it): the scraper should be able to move to the next page, using the pagination links at the end of the blog page, until all pages are done.

Now, if we inspect the "next" button in the pagination, we see the following HTML:

    <a class="next_page" href="http://www.yudiz.com/blog/page/2/">
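As an aside, the next-page URL can be pulled out of that anchor markup even without a full HTML parser. A minimal stdlib-only sketch (the class and helper name are ours, not from the original code, and a regex is only safe here because the markup shape is known):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NextPageLink {
    // Matches href="..." inside an anchor carrying class="next_page".
    private static final Pattern NEXT_HREF =
            Pattern.compile("class=\"next_page\"\\s+href=\"([^\"]+)\"");

    // Returns the next-page URL, or null when no next_page link is present.
    public static String extractNextPageUrl(String html) {
        Matcher m = NEXT_HREF.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

In practice, Jsoup's own selector (`getElementsByClass("next_page")`, as shown below in the answer) is the more robust choice; the regex version is only meant to make the extraction step concrete.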

Now we need to instruct Jsoup to pick up this dynamic URL in the next iteration of the loop to scrape the data. This can be done with the following approach:

        String url = "http://www.yudiz.com/blog/";
        while (url != null) {
            try {
                Document doc = Jsoup.connect(url).get();
                url = null; // stays null (ending the loop) unless a next_page link is found
                System.out.println(doc.getElementsByTag("title").text());
                // Perform your data extractions here.
                for (Element next : doc.getElementsByClass("next_page")) {
                    // absUrl() resolves the href against the page's base URL;
                    // it returns "" when the link cannot be made absolute.
                    String href = next.absUrl("href");
                    url = href.isEmpty() ? null : href;
                }
            } catch (IOException e) {
                e.printStackTrace();
                url = null; // stop rather than retrying the same URL forever
            }
        }
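The termination logic of this loop can be checked without touching the network by feeding it an in-memory "site" instead of Jsoup.connect. A sketch, where a plain map of page-to-next-page links stands in for the HTTP responses (all names here are ours, for illustration only):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PaginationWalk {
    // Visits the start page, follows each "next" link, and stops when a page
    // has none (null) -- the same while (url != null) pattern as the Jsoup loop.
    public static List<String> crawl(String start, Map<String, String> nextLinks) {
        List<String> visited = new ArrayList<>();
        String url = start;
        while (url != null) {
            visited.add(url);          // stands in for the per-page extraction
            url = nextLinks.get(url);  // stands in for absUrl("href"); null ends the loop
        }
        return visited;
    }
}
```

Starting from `/blog/` with links `/blog/ -> /blog/page/2/ -> /blog/page/3/`, the walk visits all three pages and then stops, which is exactly the behavior wanted from the scraper.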
