
R - Advanced Web Scraping: bypassing aspNetHidden using xmlTreeParse()

This question takes a bit of time to introduce, so bear with me. It will be fun to solve if you can get there. The scrape will be repeated over thousands of pages on this website using a loop.

I'm trying to scrape the page http://www.digikey.com/product-detail/en/207314-1/A25077-ND/ to capture the data in the table with Digi-Key Part Number, Quantity Available, etc., including the right-hand side with Price Break, Unit Price, and Extended Price.

Using the R function readHTMLTable() doesn't work and only returns NULL values. The reason (I believe) is that the website has hidden its content using the "aspNetHidden" tag in the HTML code.

For this reason I also had difficulty using htmlTreeParse() and xmlTreeParse(), with the whole section parented by not appearing in the results.

Using the R function scrape() from the scrapeR package

require(scrapeR)

URL<-scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")

does return the full HTML code, including the lines of interest:

<th align="right">Digi-Key Part Number</th>
<td id="reportpartnumber">
<meta itemprop="productID" content="sku:A25077-ND">A25077-ND</td>

<th>Price Break</th>
<th>Unit Price</th>
<th>Extended Price
</th>
</tr>
<tr>
<td align="center">1</td>
<td align="right">2.75000</td>
<td align="right">2.75</td>

However, I haven't been able to select the nodes out of this block of code. The error returned is:

no applicable method for 'xpathApply' applied to an object of class "list"

I've received that error using different functions, such as:

xpathSApply(URL,'//*[@id="pricing"]/tbody/tr[2]')

getNodeSet(URL,"//html[@class='rd-product-details-page']")

I'm not very familiar with XPath, but I have been identifying the XPath using Inspect Element on the webpage and Copy XPath.

Any help you can give on this would be much appreciated!

You've not read the help for scrape, have you? It returns a list, so you need to get parts of that list (if parse=TRUE) and so on.
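A minimal sketch of why the list matters. It does not call scrape() itself: as an assumption, a list holding one document parsed with htmlParse() from the XML package stands in for the result, and the inline HTML string is a made-up stand-in for the downloaded page so the example runs offline.

```r
library(XML)

# Stand-in for the downloaded page (an assumption, so no network is needed).
html <- paste0(
  '<html><body><table id="pricing">',
  '<tr><th>Price Break</th></tr><tr><td>1</td></tr>',
  '</table></body></html>'
)

# Mimic the shape described above: a list whose first element is the parsed document.
pages <- list(htmlParse(html, asText = TRUE))

# Passing the whole list fails, because xpathApply has no method for class "list".
err <- tryCatch(
  xpathSApply(pages, "//table"),
  error = function(e) conditionMessage(e)
)
print(err)

# Passing one element of the list works.
tables <- xpathSApply(pages[[1]], "//table")
print(xmlGetAttr(tables[[1]], "id"))
```

The same indexing applies to getNodeSet(): give it one parsed document, not the list.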

Also, I think that web page is doing some heavy, heavy browser detection. If I try to wget the page from the command line I get an error page, the scrape function gets something usable (but seemingly different from what you get), and Chrome gets the full junk with all the encoded stuff. Yuck. Here's what works for me:

> URL<-scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")
> tables = xpathSApply(URL[[1]],'//table')
> tables[[2]]
<table class="product-details" border="1" cellspacing="1" cellpadding="2">
  <tr class="product-details-top"/>
  <tr class="product-details-bottom">
    <td class="pricing-description" colspan="3" align="right">All prices are in US dollars.</td>
  </tr>
  <tr>
    <th align="right">Digi-Key Part Number</th>
    <td id="reportpartnumber"><meta itemprop="productID" content="sku:A25077-ND"/>A25077-ND</td>
    <td class="catalog-pricing" rowspan="6" align="center" valign="top">
      <table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
        <tr>
          <th>Price Break</th>
          <th>Unit Price</th>
          <th>Extended Price&#13;
</th>
        </tr>
        <tr>
          <td align="center">1</td>
          <td align="right">2.75000</td>
          <td align="right">2.75</td>

Adjust to your use case. Here I'm getting all the tables and showing the second one, which has the info you want. Some of it is in the pricing table, which you can get directly with:

> pricing = xpathSApply(URL[[1]],'//table[@id="pricing"]')[[1]]
> pricing
<table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
  <tr>
    <th>Price Break</th>
    <th>Unit Price</th>
    <th>Extended Price&#13;
</th>
  </tr>
  <tr>
    <td align="center">1</td>
    <td align="right">2.75000</td>
    <td align="right">2.75</td>
  </tr>

and so on.
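Once the pricing node is selected, one possible way to turn it into a data frame is sketched below. This is not part of the original answer: the inline HTML is a trimmed copy of the pricing table printed above (header plus the one data row shown), used as a stand-in for the live page, and the name pricing_df is made up for illustration.

```r
library(XML)

# Trimmed copy of the pricing table shown above, used as an offline stand-in.
html <- paste0(
  '<html><body><table id="pricing">',
  '<tr><th>Price Break</th><th>Unit Price</th><th>Extended Price</th></tr>',
  '<tr><td align="center">1</td><td align="right">2.75000</td>',
  '<td align="right">2.75</td></tr>',
  '</table></body></html>'
)
doc <- htmlParse(html, asText = TRUE)

# Column names come from the <th> cells; data comes from rows that contain <td>.
header <- xpathSApply(doc, '//table[@id="pricing"]//th', xmlValue)
rows   <- getNodeSet(doc, '//table[@id="pricing"]//tr[td]')
cells  <- lapply(rows, function(r) xpathSApply(r, "./td", xmlValue))

# Bind the rows into a data frame and convert the text cells to numbers.
pricing_df <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(pricing_df) <- trimws(header)
pricing_df[] <- lapply(pricing_df, as.numeric)
print(pricing_df)
```

Note that the parsed tree printed above has no tbody element under the table, so an XPath copied from the browser's inspector that includes /tbody/ may match nothing. Using //tr, as here, avoids depending on it.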

