
Use jsoup to parse div tags in a forum

I am using the jsoup library together with Processing to retrieve certain text from a forum. I only want to scrape the posts made by a particular user in a particular thread.

These are the tags containing the username and the post text:

username:

<span itemprop="creator name" class="author vcard"><a hovercard-ref="member" hovercard-id="104291" data-ipb="noparse" class="url fn name  ___hover___member _hoversetup" href="[link to user's profile here]" title="" id="anonymous_element_4"><span itemprop="name">djrajio</span></a></span>

posts:

<div itemprop="commentText" class="post entry-content ">[post text here]</div>

I tried following this tutorial, but the selector syntax for div tags wasn't clear to me.

Can someone point me in the right direction so that I can scrape only the posts from a specific user?

Here is the HTML containing the two div tags:

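// Imports needed at the top of the sketch
import java.util.ArrayList;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

try {
    ArrayList<String> arr = new ArrayList<String>();

    // Fetch and parse the page
    Document page = Jsoup.connect("http://illtellyoulater.com/div.txt").get();

    // Select the post containers: elements whose id starts with "post_id_"
    Elements posts = page.getElementsByAttributeValueStarting("id", "post_id_");

    for (Element post : posts) {
        // first() returns null when nothing matches, whereas get(0) would throw
        Element author = post.getElementsByAttributeValue("itemprop", "creator name").first();
        Element body = post.getElementsByAttributeValue("itemprop", "commentText").first();

        // Keep only the posts written by the wanted user
        if (author != null && body != null && author.text().trim().equals("djrajio")) {
            arr.add(body.text());
        }
    }

    System.out.println(arr);
} catch (Exception e) {
    e.printStackTrace();
}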
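
Since the question was about selector syntax, the same filtering can also be written with jsoup's CSS-style selectors (select and selectFirst). The following is only a sketch: the class name SelectorSketch, the method postsBy, and the sample markup in main (in particular the wrapper div with an id starting with "post_id_") are assumptions made for illustration, not taken from the forum's real HTML.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorSketch {
    // Returns the text of every post whose author name equals `user`.
    public static List<String> postsBy(String html, String user) {
        Document page = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        // [id^=post_id_] matches any element whose id starts with "post_id_"
        for (Element post : page.select("[id^=post_id_]")) {
            // span.author matches <span class="author ...">;
            // [itemprop=name] matches the inner span holding the user name
            Element author = post.selectFirst("span.author [itemprop=name]");
            // div[itemprop=commentText] matches the div holding the post text
            Element body = post.selectFirst("div[itemprop=commentText]");
            if (author != null && body != null && author.text().trim().equals(user)) {
                texts.add(body.text());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical markup: the wrapper divs and their ids are assumptions
        String html =
            "<div id=\"post_id_1\">"
          + "<span itemprop=\"creator name\" class=\"author vcard\">"
          + "<a href=\"#\"><span itemprop=\"name\">djrajio</span></a></span>"
          + "<div itemprop=\"commentText\" class=\"post entry-content \">first post</div>"
          + "</div>"
          + "<div id=\"post_id_2\">"
          + "<span itemprop=\"creator name\" class=\"author vcard\">"
          + "<a href=\"#\"><span itemprop=\"name\">someone_else</span></a></span>"
          + "<div itemprop=\"commentText\" class=\"post entry-content \">other post</div>"
          + "</div>";
        System.out.println(postsBy(html, "djrajio"));
    }
}
```

A selector in square brackets tests an attribute, a dot tests a class, and a space means "descendant of", so each lookup above reads much like the HTML it targets.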

This handles only one page. To visit every page of the thread, or every thread of the forum, you will have to use a crawler.
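
As a rough idea of what such a crawler involves, here is a minimal breadth-first sketch of the bookkeeping only: a queue of pages still to visit and a set of pages already seen. The class CrawlSketch, the method crawl, the linksOf callback, and the page names in main are all hypothetical. In a real sketch, linksOf would presumably fetch the page with jsoup and return its outgoing links (for example the "abs:href" attribute of the pagination anchors), but how this particular forum marks up its pagination is not known here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class CrawlSketch {
    // Visits pages breadth-first starting at `start`, asking `linksOf` for the
    // links found on each page. Stops after `maxPages` pages.
    // Returns the URLs in the order they were visited.
    public static List<String> crawl(String start,
                                     Function<String, List<String>> linksOf,
                                     int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(); // pages still to visit
        Set<String> seen = new HashSet<>();          // pages already queued
        List<String> visited = new ArrayList<>();
        frontier.add(start);
        seen.add(start);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url); // this is where the posts of `url` would be scraped
            for (String next : linksOf.apply(url)) {
                // Set.add returns false for duplicates, so no page is queued twice
                if (seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Stand-in link graph instead of real HTTP requests
        Map<String, List<String>> site = Map.of(
            "page1", List.of("page2", "page3"),
            "page2", List.of("page1", "page3"),
            "page3", List.of());
        System.out.println(crawl("page1", u -> site.getOrDefault(u, List.of()), 10));
    }
}
```

The maxPages limit is there so that a mistake in link extraction cannot make the crawl run unbounded; a real crawler would also want a delay between requests.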
