JSoup HTML parsing question

Question

I am trying to use JSoup to parse some HTML that looks roughly like this:

<div class="mod qmy_text withanno">
    <ul class="yfncnhl mytext"></ul>
    <h3>
        <span>Monday, August 12, 1999</span>
    </h3>
    <ul>
        <li><a href="some_link_here">Title of My Article</a><cite>News
                Source<span>(&nbsp;(Sun, Aug 12)</span>
        </cite></li>
    </ul>

My question is: how can I parse that HTML so that I get back only the date inside <cite>, i.e. Sun, Aug 12?

Right now I am only able to output the date inside h3, using this expression:

Elements links = doc.select("div[class=mod qmy_text withanno] > h3");
System.out.println(links.text());

Answer 1

Let's reduce your HTML to just the path to the element you want to find. It will look like this:

<div class="mod qmy_text withanno">
    ...
    <ul>
        <li>
            ...
            <cite>News Source
                 <span>(&nbsp;(Sun, Aug 12)</span>
            </cite>
        </li>
    </ul>

So your selector can look like div.mod.qmy_text.withanno > ul > li > cite > span. With code like

Elements span = doc.select("div.mod.qmy_text.withanno > ul > li > cite > span");
String spanText = span.text();

spanText will contain ( (Sun, Aug 12).

If you only want the part between the last ( and the last ), i.e. Sun, Aug 12, you can use

String date = spanText.substring(spanText.lastIndexOf('(')+1, spanText.lastIndexOf(')'));
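
The one-liner above throws a StringIndexOutOfBoundsException if the text has no parentheses at all. As a rough sketch of a more defensive variant (the class and method names here are hypothetical, not part of JSoup or the original answer, and the fallback of returning the trimmed input is just one possible choice), the same extraction could be wrapped like this:

```java
class LastParenExtractor {

    // Hypothetical helper: returns the text between the last '(' and the last ')'.
    // Assumption: if there is no such pair, fall back to the trimmed input
    // instead of throwing.
    static String lastParenthesized(String text) {
        int open = text.lastIndexOf('(');
        int close = text.lastIndexOf(')');
        if (open < 0 || close < open) {
            return text.trim();
        }
        return text.substring(open + 1, close).trim();
    }

    public static void main(String[] args) {
        // \u00A0 stands in for the &nbsp; entity from the question's HTML;
        // the exact whitespace between the two '(' does not affect the result.
        String spanText = "(\u00A0(Sun, Aug 12)";
        System.out.println(lastParenthesized(spanText));
    }
}
```

In the answer's code you would then call lastParenthesized(span.text()) instead of the inline substring expression.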

Answer 1: 0, accepted, 2015-09-07 17:33:34
