
Use jsoup to parse div tags in a forum

I am using the jsoup library together with Processing to retrieve text from a forum. I only want to scrape the posts made by one particular user in one particular thread.

These are the tags containing the username and the post text:

username:

<span itemprop="creator name" class="author vcard"><a hovercard-ref="member" hovercard-id="104291" data-ipb="noparse" class="url fn name  ___hover___member _hoversetup" href="[link to user's profile here]" title="" id="anonymous_element_4"><span itemprop="name">djrajio</span></a></span>

posts:

<div itemprop="commentText" class="post entry-content ">[post text here]</div>

I tried following this tutorial but the selector syntax for div tags wasn't so clear to me.

Can someone point me in the right direction so that I can scrape only the text posted by a specific user?

Here is the html containing the two div tags:

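import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

try {
    List<String> arr = new ArrayList<String>();

    // Fetch and parse the page.
    Document page = Jsoup.connect("http://illtellyoulater.com/div.txt").get();

    // Each post is wrapped in an element whose id starts with "post_id_".
    Elements posts = page.getElementsByAttributeValueStarting("id", "post_id_");

    for (Element post : posts) {
        Elements author = post.getElementsByAttributeValue("itemprop", "creator name");
        Elements body = post.getElementsByAttributeValue("itemprop", "commentText");

        // Skip wrappers that lack either element instead of failing on get(0).
        if (author.isEmpty() || body.isEmpty()) continue;

        // Keep the post text only when the author name matches.
        if (author.first().text().trim().equals("djrajio")) {
            arr.add(body.first().text());
        }
    }

    System.out.println(arr);
} catch (Exception e) {
    e.printStackTrace();
}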
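
Since the question asks about selector syntax, the same filter can also be sketched with jsoup's CSS selectors through select and selectFirst. This is only an illustration: it assumes, like the code above, that every post wrapper has an id starting with "post_id_", and the class name PostFilter, the method postsBy, and the sample HTML in main are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PostFilter {

    // Returns the text of every post in the given HTML written by the given user.
    // (Hypothetical helper, not part of jsoup.)
    static List<String> postsBy(String html, String user) {
        Document page = Jsoup.parse(html);
        List<String> result = new ArrayList<>();

        // [id^=post_id_] matches any element whose id starts with "post_id_".
        for (Element post : page.select("[id^=post_id_]")) {
            // The value is quoted because it contains a space.
            Element author = post.selectFirst("[itemprop=\"creator name\"]");
            // tag[attr=value] restricts the match to div elements.
            Element body = post.selectFirst("div[itemprop=commentText]");

            if (author != null && body != null
                    && author.text().trim().equals(user)) {
                result.add(body.text());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up markup that mimics the structure shown in the question.
        String html =
              "<div id='post_id_1'>"
            + "<span itemprop='creator name'><a><span itemprop='name'>djrajio</span></a></span>"
            + "<div itemprop='commentText' class='post entry-content '>first post</div></div>"
            + "<div id='post_id_2'>"
            + "<span itemprop='creator name'><a><span itemprop='name'>someone</span></a></span>"
            + "<div itemprop='commentText' class='post entry-content '>other post</div></div>";

        System.out.println(postsBy(html, "djrajio"));
    }
}
```

selectFirst returns null when nothing matches, so a wrapper without an author or a body is simply skipped.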

This handles just one page. To visit all the pages of a thread, or all the threads of the forum, you will need a crawler.
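
As a rough sketch of what a crawler for a single thread might look like: load a page, collect the matching posts, then follow the pagination link until there is none left. The a[rel=next] selector is an assumption about how the forum marks its "next page" link and has to be checked against the real HTML. ThreadCrawler, crawl, the fetch parameter, and maxPages are hypothetical names. The page loader is passed in as a function so the loop can be tried without network access.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ThreadCrawler {

    // Walks the pages of one thread, starting at startUrl, and returns the
    // text of every post by the given user. Stops after maxPages pages,
    // when a URL repeats, or when no "next" link is found.
    static List<String> crawl(String startUrl, String user,
                              Function<String, Document> fetch, int maxPages) {
        List<String> result = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        String url = startUrl;

        while (url != null && !url.isEmpty()
                && seen.size() < maxPages && seen.add(url)) {
            Document page = fetch.apply(url);

            for (Element post : page.select("[id^=post_id_]")) {
                Element author = post.selectFirst("[itemprop=\"creator name\"]");
                Element body = post.selectFirst("[itemprop=commentText]");
                if (author != null && body != null
                        && author.text().trim().equals(user)) {
                    result.add(body.text());
                }
            }

            // ASSUMPTION: the forum exposes the next page as <a rel="next" href="...">.
            Element next = page.selectFirst("a[rel=next]");
            url = (next == null) ? null : next.absUrl("href");
        }
        return result;
    }
}
```

For real pages, fetch could wrap Jsoup.connect(url).get() and rethrow the IOException as an unchecked exception. A short pause between requests is a sensible addition so the forum is not hammered.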
