Extracting text with Jsoup

I am trying to get information from the following page: http://fantasynews.cbssports.com/fantasyfootball/players/updates/187741

I need to get a separate string for each of these items:

  1. News Title
  2. News
  3. Analysis

Right now I can get the contents of the whole table using:

 doc = Jsoup.connect("http://fantasynews.cbssports.com/fantasyfootball/players/updates/" + playerId).timeout(30000).get();
 Element title = doc.select("[id*=newsPage1]").first(); 

But this returns all of the articles run together as a single block of text.

Can anyone advise?

Thanks, Josh

You need to use more elaborate CSS selectors. Maybe something like:

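import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NewsScraper {
  public static void main(String[] args) {
    // The regex assumes each article's text reads "<title> News: <news> Analysis: <analysis>".
    // \p{Zs} matches any space separator, including a non-breaking space.
    Pattern pat = Pattern.compile("(.*)News:\\p{Zs}(.*)Analysis:\\p{Zs}(.*)");
    Document doc = null;
    try {
      doc = Jsoup.connect("http://fantasynews.cbssports.com/fantasyfootball/players/updates/187741").userAgent("Mozilla").get();
    } catch (IOException e1) {
      e1.printStackTrace();
      System.exit(1); // non-zero exit status signals the failure
    }

    // Select every <h3> inside a table; its parent element holds the full article text.
    Elements titles = doc.select("table h3");
    for (Element title : titles) {
      Element td = title.parent();
      String innerTxt = td.text();
      Matcher mat = pat.matcher(innerTxt);
      if (mat.find()) {
        System.out.println("title = " + mat.group(1));
        System.out.println("news = " + mat.group(2));
        System.out.println("analysis = " + mat.group(3));
      }
    }
  }
}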

I suggest you look into CSS selectors and the Jsoup documentation.
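
To try the splitting step on its own, without fetching the page, here is a minimal sketch that uses only java.util.regex. The class name NewsSplitter, the method split, and the sample string are all made up for illustration; the sample only stands in for what td.text() might return. It also swaps the greedy groups above for reluctant ones, so each field stops at the first label that follows it, and it tolerates any amount of whitespace (including non-breaking spaces) around the labels.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NewsSplitter {
    // Hypothetical helper, not part of Jsoup.
    // Reluctant groups (.*?) stop at the first "News:" / "Analysis:" label;
    // [\s\p{Zs}]* absorbs ordinary and non-breaking spaces around the labels.
    private static final Pattern ITEM = Pattern.compile(
            "(.*?)[\\s\\p{Zs}]*News:[\\s\\p{Zs}]*(.*?)[\\s\\p{Zs}]*Analysis:[\\s\\p{Zs}]*(.*)",
            Pattern.DOTALL);

    /** Returns {title, news, analysis}, or empty if either label is missing. */
    public static Optional<String[]> split(String cellText) {
        Matcher m = ITEM.matcher(cellText);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {
            m.group(1).trim(), m.group(2).trim(), m.group(3).trim()
        });
    }

    public static void main(String[] args) {
        // Made-up sample text, assumed to resemble one article cell.
        String sample = "Player X questionable News: Player X missed practice. "
                + "Analysis: Monitor his status.";
        split(sample).ifPresent(parts -> {
            System.out.println("title = " + parts[0]);
            System.out.println("news = " + parts[1]);
            System.out.println("analysis = " + parts[2]);
        });
    }
}
```

In the loop above you could call NewsSplitter.split(td.text()) in place of the inline Matcher. If the real page wraps each part in its own element, selecting those elements directly would be more robust than a regex.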
