JSoup Parsing HTML Questions
I am trying to use JSoup to parse some HTML that looks roughly like this:
<div class="mod qmy_text withanno">
<ul class="yfncnhl mytext"></ul>
<h3>
<span>Monday, August 12, 1999</span>
</h3>
<ul>
<li><a href="some_link_here">Title of My Article</a><cite>News
Source<span>( (Sun, Aug 12)</span>
</cite></li>
</ul>
My question is: how can I parse that HTML so that I get back only the date inside <cite>, i.e. Sun, Aug 12?
Right now I can only output the date that follows h3, using this expression:
Elements links = doc.select("div[class=mod qmy_text withanno] > h3");
System.out.println(links.text());
Let's reduce your HTML to just the path to the element you want to find. It looks like this:
<div class="mod qmy_text withanno">
...
<ul>
<li>
...
<cite>News Source
<span>( (Sun, Aug 12)</span>
</cite>
</li>
</ul>
So your selector can be div.mod.qmy_text.withanno > ul > li > cite > span. With code like
Elements span = doc.select("div.mod.qmy_text.withanno > ul > li > cite > span");
String spanText = span.text();
spanText will contain ( (Sun, Aug 12).
If you want only the part between the last ( and the last ), that is Sun, Aug 12, you can use
String date = spanText.substring(spanText.lastIndexOf('(')+1, spanText.lastIndexOf(')'));
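One caveat with the one-liner: if the span text ever lacks a ( or a ), lastIndexOf returns -1 and substring may throw a StringIndexOutOfBoundsException. Here is a minimal sketch of a more defensive version, using only the standard library. The class name ParenText and the method betweenLastParens are made up for illustration and are not part of JSoup.

```java
import java.util.Optional;

// Hypothetical helper (not part of JSoup): extracts the text between the
// last '(' and the last ')' without throwing when either one is missing.
final class ParenText {
    private ParenText() {}

    static Optional<String> betweenLastParens(String text) {
        if (text == null) {
            return Optional.empty();
        }
        int close = text.lastIndexOf(')');
        if (close < 0) {
            return Optional.empty();
        }
        // Search backwards from the closing parenthesis so that the '('
        // we find is guaranteed to come before it.
        int open = text.lastIndexOf('(', close);
        if (open < 0) {
            return Optional.empty();
        }
        return Optional.of(text.substring(open + 1, close).trim());
    }

    public static void main(String[] args) {
        // Same input as spanText in the answer above.
        System.out.println(betweenLastParens("( (Sun, Aug 12)").orElse("<no date>"));
        // prints "Sun, Aug 12"
    }
}
```

You would call it as ParenText.betweenLastParens(span.text()) and decide what to do when the result is empty, rather than letting the exception propagate.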