
Java jsoup link extracting (wrong output)

I am trying to get all the links in the <a class="subHover"> elements, but the code I wrote returns every link on the page. Here is my code:

String website = "http://www.svensktnaringsliv.se/english/publications/?start=" + maxPage;
Document docOne = Jsoup.connect(website).get();
Elements elem = docOne.getElementsByAttributeValue("class", "search-result");
Elements el = elem.attr("class", "subHover");
System.out.println(el.select("a[href]"));

I don't really know what I am doing wrong. The output of the code is:

<a href="http://www.svensktnaringsliv.se/english/publications/corporate-governance-internal-control-and-compliance-from-an-info_578545.html"> <img class="border" src="http://www.svensktnaringsliv.se/migration_catalog/Rapporter_och_opinionsmaterial/Rapporters/corporate_governance_10017apdf_579280.html/ALTERNATES/PORTRAIT_170/Corporate_Governance_10017a.pdf"> </a>
<a class="subHover" href="http://www.svensktnaringsliv.se/english/publications/corporate-governance-internal-control-and-compliance-from-an-info_578545.html"> <h2> Corporate Governance, Internal Control and Compliance - - From an Information Security Perspective</h2> </a>
<a class="noHover" href="http://www.svensktnaringsliv.se/personer/christer-magnusson_538711.html"><span class="entypo entypo-user"></span><span>Christer Magnusson</span></a>
<a href="http://www.svensktnaringsliv.se/english/publications/from-stagnation-to-acceleration-proposed-guidelines-for-a-europea_595930.html"> <img class="border" src="http://www.svensktnaringsliv.se/migration_catalog/Rapporter_och_opinionsmaterial/Rapporter/proposed_guidelines_for_a_european_research_policypng_595932.html/ALTERNATES/PORTRAIT_170/Proposed_guidelines_for_a_European_research_policy.png"> </a>
<a class="subHover" href="http://www.svensktnaringsliv.se/english/publications/from-stagnation-to-acceleration-proposed-guidelines-for-a-europea_595930.html"> <h2>From stagnation to acceleration - Proposed guidelines for a European research policy</h2> </a>
<a class="noHover" href="http://www.svensktnaringsliv.se/medarbetare/emil-gornerup_566685.html"><span class="entypo entypo-user"></span><span>Emil Görnerup</span></a>
<a href="http://www.svensktnaringsliv.se/english/publications/decision-usefulness-explored-an-investigation-of-capital-market-a_588531.html"> <img class="border" src="http://www.svensktnaringsliv.se/migration_catalog/decision-usefulness_omslagjpg_588538.html/ALTERNATES/PORTRAIT_170/Decision%20usefulness_omslag.jpg"> </a>
<a class="subHover" href="http://www.svensktnaringsliv.se/english/publications/decision-usefulness-explored-an-investigation-of-capital-market-a_588531.html"> <h2>Decision usefulness explored - An investigation of capital market actors´ use of financial reports</h2> </a>
<a class="subHover" href="http://www.svensktnaringsliv.se/english/publications/tax-reductions-and-public-resources_590643.html"> <h2>Tax reductions and public resources</h2> </a>
<a class="noHover" href="http://www.svensktnaringsliv.se/english/staff/mikael-witterblad_572108.html"><span class="entypo entypo-user"></span><span>Mikael Witterblad</span></a>
<a class="noHover" href="http://www.svensktnaringsliv.se/medarbetare/johan-fall_551949.html"><span class="entypo entypo-user"></span><span>Johan Fall</span></a>

The reason for your results is that the document contains HTML like this:

<div class="subHover"> 
 <span class="subject">PUBLICATION</span>
 <span class="subject-info"><b>Publicerad:</b> <time datetime="2005-06-30">30 June 2005 </time></span> 
 <div class="result-content clearfix"> 
  <a class="subHover" href="http://www.svensktnaringsliv.se/material/rapporter/internationell-utblick-loner-och-arbetskraftskostnader-juni-2005-_565749.html"> <h2>Internationell utblick - Löner och arbetskraftskostnader juni 2005 / International Outlook - Wages, Salaries, Labour Costs June 2005</h2> </a> 
  <div class="info-block"> 
   <p><a class="noHover" href="http://www.svensktnaringsliv.se/medarbetare/krister-b-andersson_560480.html"><span class="entypo entypo-user"></span><span>Krister B Andersson</span></a></p> 
  </div> 
 </div> 
</div>

You can see that the outer div has the class subHover, and that div is what your code is working with. You then select every a inside it that has an href attribute, but nothing forces that a to have the class subHover as well. Note also that elem.attr("class", "subHover") does not filter: in jsoup, Elements.attr(key, value) sets the attribute on every matched element and returns the same elements.
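To make the point about the question's code concrete, here is a minimal sketch that runs the same three calls against an inline HTML string. The markup, the class name AttrSetsNotFilters, and the method countLinksAfterAttr are made up for illustration; only the jsoup calls themselves come from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class AttrSetsNotFilters {
    // Hypothetical markup modeled on the snippet above; not fetched from the real site.
    static final String SAMPLE_HTML =
        "<div class=\"search-result\">"
      + "<a href=\"http://example.com/report.html\"><img src=\"http://example.com/cover.png\"></a>"
      + "<a class=\"subHover\" href=\"http://example.com/report.html\"><h2>Report</h2></a>"
      + "<a class=\"noHover\" href=\"http://example.com/author.html\"><span>Author</span></a>"
      + "</div>";

    // Repeats the calls from the question and counts the links they return.
    static int countLinksAfterAttr() {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        Elements results = doc.getElementsByAttributeValue("class", "search-result");
        // attr(key, value) overwrites the class on each matched div; it does not filter.
        Elements afterAttr = results.attr("class", "subHover");
        // Every anchor with an href inside the div matches, whatever its class.
        return afterAttr.select("a[href]").size();
    }

    public static void main(String[] args) {
        System.out.println(countLinksAfterAttr()); // 3 links, not 1
    }
}
```

All three anchors are returned because a[href] only checks for the href attribute, which matches the output shown in the question.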

Why not just use CSS selectors? This should work:

String website = "http://www.svensktnaringsliv.se/english/publications/?start=" + maxPage;
Document docOne = Jsoup.connect(website).get();
Elements els = docOne.select("a.subHover");
for (Element el : els) {
  System.out.println(el);
}

I would recommend learning the power of CSS selectors, as described in the jsoup documentation.
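As one possible extension of the selector approach, the sketch below narrows the match to subHover anchors inside a search-result container and collects only the href values rather than printing whole elements. The inline markup, the class name SubHoverLinks, and the method extract are hypothetical; the combined selector assumes the page nests the anchors inside an element with class search-result, as the question's code suggests.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SubHoverLinks {
    // Hypothetical markup modeled on the snippet in the answer; not fetched from the real site.
    static final String SAMPLE_HTML =
        "<div class=\"search-result\">"
      + "<a href=\"http://example.com/report.html\"><img src=\"http://example.com/cover.png\"></a>"
      + "<a class=\"subHover\" href=\"http://example.com/report.html\"><h2>Report</h2></a>"
      + "<a class=\"noHover\" href=\"http://example.com/author.html\"><span>Author</span></a>"
      + "</div>";

    // Returns the href of every a.subHover found inside a .search-result element.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        // Descendant combinator plus class and attribute selectors in one expression.
        for (Element a : doc.select(".search-result a.subHover[href]")) {
            links.add(a.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        for (String link : extract(SAMPLE_HTML)) {
            System.out.println(link);
        }
    }
}
```

Against the live page, Jsoup.connect(website).get() would replace Jsoup.parse(html); the selector stays the same.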

