
Scraping using Jsoup

I am scraping an e-commerce website using Jsoup. Here I want to extract tags such as the price from the markup below. After calling Jsoup.parse(), I cannot get them.

<div id="ctl00_ContentPlaceHolder1_ctl00_ctl03_Showcase">
 <div class="controlcontent_r">
  <div class="bucketgroup">
   <div class="prod_viewsparent">
    <div class="bucket" style="width: 175px; height: 280px;">
     <div class="bucket_left">
      <a href="/Products/Buy-Online-Electronics-Cameras-Digital-Cameras/Nikon/Nikon-Coolpix-L27-Point--Shoot/pid-2849731.aspx">
       <img class="mtb-img" style="width: 150px; height: 150px;" src="http://resources-images.martjackhosting.com/s3/martjack-resources/5d4b3aa1-119a-4d82-b9bb-1b6bdbd62002/Images/ProductImages/Source/NikonL27-BLK.jpg;width=150;height=150;scale=canvas" alt="Nikon Coolpix L27 Point & Shoot" title="Digital Cameras, Nikon, Nikon Coolpix L27 Point & Shoot"></a>
      <div id="2849731" class="btn_quick_view" style="display:none">
      <a rel="2849731,0,2466375,5d4b3aa1-119a-4d82-b9bb-1b6bdbd62002" href="#">Quick View</a></div>
   <h4 class="mtb-title">Nikon Coolpix L27 Point & Shoot</h4>
    <div class="mtb-desc">
      <span class="mtb-price">
        <label class="mtb-mrp">
        <b class="lb1"> MRP </b>
        <span class="WebRupee">Rs. </span>
          4,990
       </label>
        <label class="mtb-ofr">
        <b class="lb2"> Now At </b>
        <span class="WebRupee">Rs. </span>
          4,700
       </label>
        </span>
           <span class="offer_block">
          <a class="mtb-more" href="/Products/Buy-Online-Electronics-Cameras-Digital-Cameras/Nikon/Nikon-Coolpix-L27-Point--Shoot/pid-2849731.aspx" title="Click for more details"></div>

After parsing, I cannot see the <div class="bucket"> tag.

How should I handle this?

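Can you show us your code?

By the way, if you want to parse a website, it is better to use connect() rather than parse(): Jsoup.connect(url).get() fetches the page and parses it in one step.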
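The difference between the two entry points can be sketched roughly as follows. `Jsoup.parse(String)` only parses markup that is already in memory, while `Jsoup.connect(url).get()` fetches the page and parses the response. The class name `ParseVsConnect`, the helper methods, and the sample markup are hypothetical and only serve as an illustration.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseVsConnect {

    // Parses markup that is already in memory; no network access happens here.
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // Fetches the page over HTTP and parses the response body.
    static Document fromUrl(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    public static void main(String[] args) {
        // Made-up sample markup, parsed offline.
        Document doc = fromString(
            "<div class=\"controlcontent_r\"><p>Digital Cameras</p></div>");
        System.out.println(doc.select("div.controlcontent_r p").text());
    }
}
```

Either way you end up with a `Document`, so the `select(...)` calls are the same afterwards.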

Here is an example of how to get the <div class="controlcontent_r">...</div> tags:

final String url = "http://www.jabraat.com/categories/Buy-Digital-Cameras-Online/cid-CU00084377.aspx";
Document doc = Jsoup.connect(url).get();

for( Element element : doc.select("div.controlcontent_r") )
{
    System.out.println(element);
    System.out.println();
}

This code prints three elements (separated by blank lines):

<div class="controlcontent_r">
 <div class="mtc-menu">
  <ul class="mtc-cat">
   <li class="mtc-block"><a class="mtc-a mtc-selected" title="Go To Digital Cameras" href="http://www.jabraat.com/categories/Buy-Digital-Cameras-Online/cid-CU00084377.aspx">Digital Cameras</a></li>
   <li class="mtc-block"><a class="mtc-a" title="Go To Camcoders" href="http://www.jabraat.com/categories/Buy-Camcorders-Online/cid-CU00084380.aspx">Camcoders</a></li>
   <li class="mtc-block1"><a class="mtc-a" title="Go To Camera Accessories" href="http://www.jabraat.com/categories/Buy-Camera-Accessories-Online/cid-CU00084381.aspx">Camera Accessories</a></li>
  </ul>
 </div>
</div>

<div class="controlcontent_r">
 <div class="mtc-menu">
  <ul class="mtc-cat">
   <li class="mtc-block"><a class="mtc-a" title="Go To Camera" href="http://www.jabraat.com/categories/Buy-Cameras-Online/cid-CU00084376.aspx">Camera</a></li>
   <li class="mtc-block"><a class="mtc-a" title="Go To Digital Photo Frames" href="http://www.jabraat.com/categories/Buy-Digital-Photo-Frames-Online/cid-CU00084382.aspx">Digital Photo Frames</a></li>
   <li class="mtc-block1"><a class="mtc-a" title="Go To Mobiles" href="http://www.jabraat.com/categories/Buy-Mobiles-Online/cid-CU00084383.aspx">Mobiles</a></li>
  </ul>
 </div>
</div>

<div class="controlcontent_r"> 
 <div class="mtc-menu"> 
  <ul class="mtc-cat">
   <li class="mtc-block"><a class="mtc-a" title="Go to Watches" href="http://www.jabraat.com/categories/Buy-Watches-Online/cid-CU00084370.aspx">Watches</a></li>
   <li class="mtc-block"><a class="mtc-a" title="Go to Clothing" href="http://www.jabraat.com/categories/Buy-Online-Clothing/cid-CU00084420.aspx">Clothing</a></li>
   <li class="mtc-block"><a class="mtc-a" title="Go to Mobiles" href="http://www.jabraat.com/categories/Buy-Mobiles-Online/cid-CU00084383.aspx">Mobiles</a></li>
   <li class="mtc-block"><a class="mtc-a" title="Go to Cameras" href="http://www.jabraat.com/categories/Buy-Cameras-Online/cid-CU00084376.aspx">Cameras</a></li>
   <li class="mtc-block"><a class="mtc-a" title="Go to Home &amp; Kitchen" href="http://www.jabraat.com/categories/Buy-Home-Kitchen-Appliances-Online/cid-CU00084391.aspx">Home &amp; Kitchen</a></li>
   <li class="mtc-block"><a class="mtc-a" title="Go to Personal Care" href="http://www.jabraat.com/categories/Buy-Online-Personal-Care/cid-CU00084413.aspx">Personal Care</a></li>
   <li class="mtc-block"><a class="mtc-a" title="Go to Jewellery" href="http://www.jabraat.com/categories/Buy-Online-Jewellery/cid-CU00084429.aspx">Jewellery</a></li>
   <li class="mtc-block1"><a class="mtc-a" title="Go to Footwear" href="http://www.jabraat.com/categories/Buy-Online-Footwear/cid-CK00101771.aspx">Footwear</a></li>
  </ul> 
 </div> 
</div>


Edit:

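As mentioned in the comments, the <div class='bucket'> tag makes things more complicated. While you can easily parse the controlcontent_r tags with jsoup, bucket appears to be generated by a script.

You can run a simple test: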

final String url = "http://www.jabraat.com/categories/Buy-Digital-Cameras-Online/cid-CU00084377.aspx";
Document doc = Jsoup.connect(url).get(); // Connect and parse the document (as above)

System.out.println(doc); // Output the document (= how jsoup "sees" the website)

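There is no bucket tag in that output, which means you cannot retrieve it with jsoup. The solution is to use another library that executes the script.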
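The same check can be made programmatically instead of reading the printed document. This is a minimal sketch, assuming the selector `div.bucket` from the question; the class `BucketCheck`, the helper `hasElement`, and the inline sample markup (a stand-in for the fetched page) are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BucketCheck {

    // True if at least one element in the parsed document matches the CSS selector.
    static boolean hasElement(Document doc, String cssSelector) {
        return !doc.select(cssSelector).isEmpty();
    }

    public static void main(String[] args) {
        // Stand-in for the static HTML the server returns, before any script runs.
        String staticHtml =
            "<div class=\"controlcontent_r\"><div class=\"mtc-menu\"></div></div>";
        Document doc = Jsoup.parse(staticHtml);

        System.out.println(hasElement(doc, "div.controlcontent_r"));
        System.out.println(hasElement(doc, "div.bucket"));
    }
}
```

With a live page, pass the result of `Jsoup.connect(url).get()` instead. If `div.bucket` is reported as missing there but shows up in the browser, the element is most likely added by a script.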

For convenience, I have posted a short list of such libraries here: Trying to parse html hidden by javascript

To interact with JavaScript, use the Selenium framework (google it). You can then parse the elements as Jsoup elements. Selenium is easy; I learned it on the fly.
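Once a browser-driving tool has produced the rendered markup as a string (with Selenium this would typically come from `WebDriver.getPageSource()`), jsoup can parse it like any other HTML. The sketch below assumes the rendered page contains markup like that shown in the question; the class `BucketParser`, its methods, and the shortened sample string are hypothetical, and no browser is started here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BucketParser {

    // Own text of the first match under parent, or "" if nothing matches.
    // ownText() skips nested elements such as <b> and <span>.
    static String ownTextOf(Element parent, String cssSelector) {
        Element match = parent.select(cssSelector).first();
        return match == null ? "" : match.ownText().trim();
    }

    // Returns "title | MRP | offer price" for the first product bucket,
    // or null if the markup contains no div.bucket at all.
    static String firstProduct(String renderedHtml) {
        Document doc = Jsoup.parse(renderedHtml);
        Element bucket = doc.select("div.bucket").first();
        if (bucket == null) {
            return null;
        }
        String title = bucket.select("h4.mtb-title").text();
        String mrp = ownTextOf(bucket, "label.mtb-mrp");
        String offer = ownTextOf(bucket, "label.mtb-ofr");
        return title + " | " + mrp + " | " + offer;
    }

    public static void main(String[] args) {
        // Assumption: in real use this string would be the rendered page source.
        String renderedHtml =
            "<div class=\"bucket\">"
            + "<h4 class=\"mtb-title\">Nikon Coolpix L27 Point &amp; Shoot</h4>"
            + "<label class=\"mtb-mrp\"><b class=\"lb1\"> MRP </b>"
            + "<span class=\"WebRupee\">Rs. </span> 4,990 </label>"
            + "<label class=\"mtb-ofr\"><b class=\"lb2\"> Now At </b>"
            + "<span class=\"WebRupee\">Rs. </span> 4,700 </label>"
            + "</div>";
        System.out.println(firstProduct(renderedHtml));
    }
}
```

Keeping the parsing in a method that takes a plain string means the selectors can be tried against saved markup without launching a browser each time.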
