JSoup core web text extraction
I am new to Jsoup, so sorry if my question is too trivial. I am trying to extract article text from http://www.nytimes.com/, but when I print the parsed document I am not able to see any articles in the output:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class App
{
    public static void main( String[] args )
    {
        String url = "http://www.nytimes.com/";
        Document document;
        try {
            document = Jsoup.connect(url).get();
            System.out.println(document.html()); // Articles not getting printed
            //System.out.println(document.toString()); // Same here
            String title = document.title();
            System.out.println("title : " + title); // Title is fine
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I have also tried to parse http://en.wikipedia.org/wiki/Big_data to retrieve the wiki data; same issue there as well, I am not getting the wiki data in the output. Any help or hint will be much appreciated.

Thanks.
Here's how to get the text of all <p class="summary"> tags:
final String url = "http://www.nytimes.com/";
Document doc = Jsoup.connect(url).get();
for( Element element : doc.select("p.summary") )
{
if( element.hasText() ) // Skip those tags without text
{
System.out.println(element.text());
}
}
If you need all <p> tags, without any filtering, you can use doc.select("p") instead. But in most cases it's better to select only those you need (see the Jsoup Selector documentation).
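As a minimal offline sketch of the same selector logic, you can parse an HTML string with Jsoup.parse() instead of fetching a live page; the snippet and class name below are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo
{
    // Collect the text of every element matched by the given CSS query,
    // skipping elements without text (the same hasText() check as above).
    static List<String> textsOf( String html, String cssQuery )
    {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for( Element element : doc.select(cssQuery) )
        {
            if( element.hasText() )
            {
                texts.add(element.text());
            }
        }
        return texts;
    }

    public static void main( String[] args )
    {
        String html = "<p class=\"summary\">A summary</p>"
                    + "<p>A plain paragraph</p>"
                    + "<p class=\"summary\"></p>"; // empty, skipped by hasText()

        System.out.println(textsOf(html, "p.summary")); // only non-empty summaries
        System.out.println(textsOf(html, "p"));         // all non-empty paragraphs
    }
}
```

This keeps the selector behaviour easy to verify without a network connection; swapping Jsoup.parse(html) back to Jsoup.connect(url).get() applies the same loop to a live page.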