
Jsoup web scraping

I am trying to use jSoup to scrape a website that contains the HTML below. I am very new to jSoup and am still figuring it out. I would like to take each product name and price and put them into an Excel file, with the name in column A and the price in column B. The $0.00 value can either be ignored or placed in column C, whichever is easier. Any help would be great, and because I know someone will ask: this is NOT a homework assignment.
Thanks in advance, I really appreciate it.

<tr>
        <td class="sku" width="40" align="center">AAN13097</td>
        <td class="productName" width="440"><a name="<!-- Empty field [Field4]  -->"></a> 
                                American Antler Dog Chew Large (40-60 lb Dogs)                                          </td>
        <!--<td id="weight_816">0</td>-->
        <td class="quantity" width="20" align="center">
            <input type="text" name="816:qnty" id="qnty_816" class="inputQuantity">
            <input type="checkbox" name="itemnum" value="816" id="itemnum_816" class="itemnum">
        </td>
        <!--<td class="extWeight" id="extWeight_816">0.0</td>-->
        <td width="80" align="center" id="price_816">$9.70</td>
        <td width="120" align="center" class="extPrice" id="extPrice_816">$0.00</td>
    </tr>
                                                                                                                <!-- rec 815 -->

<tr>
        <td class="sku" width="40" align="center">AAN13096</td>
        <td class="productName" width="440"><a name="<!-- Empty field [Field4]  -->"></a> 
                                American Antler Dog Chew Medium (20-40 lb Dogs)                                         </td>
        <!--<td id="weight_815">0</td>-->
        <td class="quantity" width="20" align="center">
            <input type="text" name="815:qnty" id="qnty_815" class="inputQuantity">
            <input type="checkbox" name="itemnum" value="815" id="itemnum_815" class="itemnum">
        </td>
        <!--<td class="extWeight" id="extWeight_815">0.0</td>-->
        <td width="80" align="center" id="price_815">$7.15</td>
        <td width="120" align="center" class="extPrice" id="extPrice_815">$0.00</td>
    </tr>

Would this be the table element? It is the "table" code that appears before the list. If not, what should I be looking for in the HTML code?

<table border="0" cellpadding="8" cellspacing="0" id="orderForm" width="700">
<thead>
<tr>
<th width="40px" align="center">Line</th>
<th width="420" align="center">Item description&nbsp;</th>
<th width="40px" align="center">Quantity</th>
<th width="80px" align="center">Unit Price</th>
<th width="120px" align="center">Amount</th>
</tr>
</table><div class="tableCont"><table border="0" cellpadding="8" cellspacing="0"    
id="orderForm" width="700" height="350px">
<tbody>                                                                                                           
<!-- rec 1638 -->
<a name="1638"></a>

This should do it. Note, however, that the HTML you posted does not include the parent table element of the tr rows. The table must be present in the HTML you parse for this code to work; otherwise Jsoup drops the tr/td elements and the selectors match nothing.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// The html variable must hold the tr rows wrapped in their <table> element
Document doc = Jsoup.parse(html);
String productName = doc.select("tr .productName").first().text(); // Name of the first product
Element extPriceElement = doc.select("tr td.extPrice").first();
String id = extPriceElement.id().replace("extPrice_", ""); // Row id, e.g. "816"
String productPrice = doc.select("tr #price_" + id).first().text(); // Unit price
String productExtPrice = extPriceElement.text(); // Extended price
System.out.println("Product name : " + productName);
System.out.println("Price : " + productPrice);
System.out.println("Ext price : " + productExtPrice);
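The snippet above reads only the first row. Since the question asks for every product, with the name in column A and the price in column B, here is a rough sketch of one way to loop over all rows and write them out. It rests on a few assumptions that are not in the original post: the class name ProductCsvExport, the helper names extractRows and toCsv, the output file products.csv, and the choice of CSV (which Excel can open) instead of a native .xlsx file. It also assumes a jsoup version that provides selectFirst (1.11.1 or later) and that each price cell has an id starting with price_, as in the posted HTML.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class; all names here are illustrative, not from the original post.
final class ProductCsvExport {

    /** Returns one {name, price, extPrice} array per product row found in the HTML. */
    static List<String[]> extractRows(String html) {
        Document doc = Jsoup.parse(html); // rows must sit inside a <table>
        List<String[]> rows = new ArrayList<>();
        // Only rows that contain a product name cell are treated as product rows.
        for (Element row : doc.select("tr:has(td.productName)")) {
            Element name = row.selectFirst("td.productName");
            Element price = row.selectFirst("td[id^=price_]"); // id starts with "price_"
            Element extPrice = row.selectFirst("td.extPrice");
            if (name == null || price == null) {
                continue; // skip rows that lack a price cell
            }
            rows.add(new String[] {
                name.text(),                          // text() trims and collapses whitespace
                price.text(),
                extPrice == null ? "" : extPrice.text()
            });
        }
        return rows;
    }

    /** Quotes a field and doubles embedded quotes so commas in names are safe. */
    static String csvField(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /** Builds CSV text: column A = name, column B = price, column C = ext price. */
    static String toCsv(List<String[]> rows) {
        StringBuilder sb = new StringBuilder("Name,Price,Ext price\r\n");
        for (String[] r : rows) {
            sb.append(csvField(r[0])).append(',')
              .append(csvField(r[1])).append(',')
              .append(csvField(r[2])).append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ProductCsvExport <saved-page.html>");
            return;
        }
        // Reads a locally saved copy of the page; the path is supplied by the caller.
        String html = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        Files.write(Paths.get("products.csv"),
                    toCsv(extractRows(html)).getBytes(StandardCharsets.UTF_8));
    }
}
```

Opening products.csv in Excel should place the three fields in columns A, B and C. If a real .xlsx file is required, a spreadsheet library such as Apache POI would be needed in place of the toCsv step.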
