
Unable to parse value from HTML using jsoup

I'm relatively new to jsoup, and I can't seem to find the correct query to parse out the value I'm looking for. The HTML is as follows.

    <img src='http://rootzwiki.com/public/style_images/ginger/t_unread.png' alt='New Replies' /><br />

</a>
</td>
<td class='col_f_content '>



    <h4><a id="tid-link-12251" href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/" title='View topic, started  17 December 2011 - 09:32 AM' class='topic_title'>[ROM][LTE] RootzBoat 4.0.3 V6.1</a></h4>
    <br />
    <span class='desc lighter blend_links'>
        Started by <a hovercard-ref="member" hovercard-id="5" class="_hovertrigger url fn " href='http://rootzwiki.com/user/5-birdman/'>birdman</a>, 17 Dec 2011

    </span>

        <ul class='mini_pagination'>


                    <li><a href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/" title='Go to page 1'>1</a></li>




                    <li><a href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/page__st__10" title='Go to page 2'>2</a></li>




                    <li><a href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/page__st__20" title='Go to page 3'>3</a></li>




                    <li><a href="http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/page__st__1990" title='Go to page 200'>200 &rarr;</a></li>


        </ul>

</td>
<td class='col_f_preview __topic_preview'>

        <a href='http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/' class='expander closed' title='Preview this topic'>&nbsp;</a>

</td>
<td class='col_f_views desc blend_links'>
    <ul>
        <li>



                    <span class='ipsBadge ipsBadge_orange'>Hot</span>&nbsp;

                <a href="http://rootzwiki.com/index.php?app=forums&amp;module=extras&amp;section=stats&amp;do=who&amp;t=12251" onclick="return ipb.forums.retrieveWhoPosted( 12251 );">1,999 replies</a>
        </li>
        <li class='views desc'>180,213 views</li>
    </ul>
</td>
<td class='col_f_post'>
    <a href='http://rootzwiki.com/user/49940-jakeday/' class='ipsUserPhotoLink left'>
        <img src='http://rootzwiki.com/uploads/profile/photo-thumb-49940.jpg' class='ipsUserPhoto ipsUserPhoto_mini' />
    </a>
    <ul class='last_post ipsType_small'>
        <li><a hovercard-ref="member" hovercard-id="49940" class="_hovertrigger url fn " href='http://rootzwiki.com/user/49940-jakeday/'>jakeday</a></li>
        <li>
            <a href='http://rootzwiki.com/topic/12251-romlte-rootzboat-403-v61/page__view__getlastpost' title='Go to last post'>Today, 04:20 AM</a>
        </li>                               
    </ul>
</td>

I need to parse out birdman from there. I know that once I've defined the element, I can get "birdman" out with author.text(), but I can't figure out how to define the author element. I thought perhaps the following block of code would work, but as I mentioned, I'm pretty new to jsoup and HTML, and it obviously didn't work. There's nothing wrong with the connection, and jsoup is working for the other values I parsed out.

            TitleResults titleArray =  new TitleResults();
                Document doc = null;
                try {
                    doc = Jsoup.connect(Constants.FORUM).get();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                Elements threads = doc.select(".topic_title");
                for (Element thread : threads) {
                    titleArray =  new TitleResults();
                    //Thread title
                    threadTitle = thread.text();
                    titleArray.setItemName(threadTitle);

                    //Thread link
                    String threadStr = thread.attr("abs:href");
                    String endTag = "/page__view__getnewpost"; //trim link
                    threadStr = new String(threadStr.replace(endTag, ""));
                    threadArray.add(threadStr);

                    titleArray.setAuthorDate("Author/Date");
                    results.add(titleArray);
                }
                Elements authors = doc.select("a[hovercard-ref]");
                for (Element author : authors) {
                    if (author.attr("abs:href").contains("/user/")){
                        Log.d("POC", "SUCCESS " + author.attr("abs:href"));
                    } else {
                        Log.d("POC", "FAILURE " + author.text());                           
                    }
                }
        } 

I think you're thinking too hard ;)

To get the birdman portion of the link, just use the following:

    Elements authors = doc.select("a");
    for (Element author : authors) {
        Log.d("POC", author.text());
    }

The "a" selector retrieves all links. After that, you can just use .text(), like you said, to retrieve the value.
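Because "a" matches every link on the page, the loop also logs topic titles, pagination links, and the last poster. A narrower selector can isolate just the "Started by" link. The following is a minimal sketch, not part of the original answer: the class name AuthorSelectorSketch and the helper extractAuthors are hypothetical, and the selector assumes the markup looks like the snippet in the question (a member link inside span.desc).

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AuthorSelectorSketch {

    // Hypothetical helper: returns the text of each "Started by" member link.
    static List<String> extractAuthors(String html) {
        Document doc = Jsoup.parse(html);
        List<String> authors = new ArrayList<>();
        // Restrict the match to member links inside the "Started by" span,
        // which excludes the last-poster link in the last_post list.
        for (Element link : doc.select("span.desc a[hovercard-ref=member]")) {
            authors.add(link.text());
        }
        return authors;
    }

    public static void main(String[] args) {
        // Trimmed copy of the markup from the question, wrapped in a table
        // so the <td> elements survive HTML parsing.
        String html = "<table><tr>"
            + "<td class='col_f_content'>"
            + "<span class='desc lighter blend_links'>Started by "
            + "<a hovercard-ref=\"member\" hovercard-id=\"5\" class=\"_hovertrigger url fn \" "
            + "href='http://rootzwiki.com/user/5-birdman/'>birdman</a>, 17 Dec 2011</span>"
            + "</td>"
            + "<td class='col_f_post'><ul class='last_post ipsType_small'>"
            + "<li><a hovercard-ref=\"member\" hovercard-id=\"49940\" class=\"_hovertrigger url fn \" "
            + "href='http://rootzwiki.com/user/49940-jakeday/'>jakeday</a></li>"
            + "</ul></td>"
            + "</tr></table>";
        System.out.println(extractAuthors(html));
    }
}
```

On this trimmed sample the helper should return only birdman, since the jakeday link sits in the last_post list rather than in span.desc. On the live page the same selector would return one author per topic row, assuming every row uses this markup.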

Selvin answered it in the comments. I wasn't getting the source correctly, and it was causing errors. http://pastebin.com/xfUQkGw0
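The pastebin contents are not reproduced here, so the exact fix is unknown. As a generic sanity check for this kind of problem, it can help to confirm that the fetched document actually contains the expected elements before iterating over it. This is a minimal sketch under that assumption: SourceCheckSketch and hasMatch are hypothetical names, and the sample strings stand in for real responses.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SourceCheckSketch {

    // Hypothetical helper: true when the document was fetched (non-null)
    // and contains at least one element matching the CSS query.
    static boolean hasMatch(Document doc, String cssQuery) {
        return doc != null && !doc.select(cssQuery).isEmpty();
    }

    public static void main(String[] args) {
        // Stand-in for a correctly fetched forum page.
        Document good = Jsoup.parse("<h4><a class='topic_title' href='#'>Title</a></h4>");
        // Stand-in for an unexpected response, such as a login or error page.
        Document bad = Jsoup.parse("<p>Please log in</p>");

        System.out.println(hasMatch(good, ".topic_title"));
        System.out.println(hasMatch(bad, ".topic_title"));
        // A null document models the case where the fetch threw an IOException.
        System.out.println(hasMatch(null, ".topic_title"));
    }
}
```

In the question's code, doc stays null when the IOException is caught, so a guard like this would also avoid a NullPointerException on the later doc.select call.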
