
R: get HTML data from a JavaScript action

I would like to scrape some data from a page where I have to click a (JavaScript) button to get access to a table.

On http://www.si-vitifrance.com/docs/cvi/cvi13/cartes_inter/c_vin01_coop_com07/ you have access to a map, and a small 'table' button on the left opens the data table.

The button opens a new window with the results, and I would like to get those results into R. The URL of this new page is http://www.si-vitifrance.com/docs/cvi/cvi13/cartes_inter/c_vin01_coop_com07/embfiles/table.html?th0 but I can't access this page unless I come from the map page.

So I would like to know whether it is possible to simulate in R something that has the same effect as a click on this button, so that I can access the data.

I have tried

library(RCurl)  # provides getURL()

path <- "http://www.si-vitifrance.com/docs/cvi/cvi13/cartes_inter/c_vin01_coop_com07/embfiles/table.html?th0"
webpage <- getURL(path)
# split the downloaded page into one string per line
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

but the result obviously does not contain the table:

 [1] "<!DOCTYPE HTML>"                                                                                                  
 [2] "<html>"                                                                                                              
 [3] "<meta http-equiv=\"Content-type\" content=\"text/html; charset=UTF-8\" />"                                           
 [4] "<link rel=\"stylesheet\" href=\"style.css\" />"                                                                      
 [5] "<link rel=\"stylesheet\" href=\"rectable.css\" />"                                                                   
 [6] "<script language=\"JavaScript\" type=\"text/javascript\">"                                                           
 [7] "<!--"                                                                                                                
 [8] "function sortTable(theColumn,datatype,orderby) {"                                                                    
 [9] "  document.getElementById(\"content\").innerHTML = \"Veuillez patientez ...\";"                                      
[10] "  var themaId = window.location.search.substr(1,window.location.search.length);"                                     
[11] "  var xslFile = \"styletable.xsl\";"                                                                                 
[12] "  window.opener.mv_loadAttrTableFile(themaId,true);"                                                                 
[13] "  try {"                                                                                                             
[14] "\ttry {"                                                                                                             
[15] "      var xslt = new ActiveXObject(\"Msxml2.XSLTemplate.4.0\");"                                                     
[16] "      var xslDoc = new ActiveXObject(\"Msxml2.FreeThreadedDOMDocument.4.0\");"                                       
[17] "    } catch(e) {"                                                                                                    
[18] "      var xslt = new ActiveXObject(\"Msxml2.XSLTemplate\");"                                                         
[19] "      var xslDoc = new ActiveXObject(\"Msxml2.FreeThreadedDOMDocument\");"                                           
[20] "    }"                                                                                                               
[21] "    xslDoc.async = false;"                                                                                           
[22] "    xslDoc.resolveExternals = false;"                                                                                
[23] "    xslDoc.load(xslFile);"                                                                                           
[24] "    xslt.stylesheet = xslDoc;"                                                                                       
[25] "    var xslProc = xslt.createProcessor();"                                                                           
[26] "    xslProc.input = window.opener.mv_XMLFileArray[themaId].XMLFile;"                                                 
[27] "    if (theColumn) {"                                                                                                
[28] "      xslProc.addParameter(\"field\",\"f\" + (parseInt(theColumn) - 1));"                                            
[29] "      xslProc.addParameter(\"datatype\",datatype);"                                                                  
[30] "      xslProc.addParameter(\"orderby\",orderby);"                                                                    
[31] "    }"                                                                                                               
[32] "    xslProc.transform();"                                                                                            
[33] "    content.innerHTML = xslProc.output;"                                                                             
[34] "  } "                                                                                                                
[35] ""                                                                                                                    
[36] "  catch(e) {"                                                                                                        
[37] "    var xsltProcessor = new XSLTProcessor(); "                                                                       
[38] "    var xslStylesheet = window.opener.mv_loadXMLDoc(window.opener.mv_Doc.BaseURL + \"embfiles/\" + xslFile,\"xml\");"
[39] "    try {"                                                                                                           
[40] "      xsltProcessor.importStylesheet(xslStylesheet);"                                                                
[41] "    }"                                                                                                               
[42] "    catch(err) {"                                                                                                    
[43] "      var xslStylesheet = document.implementation.createDocument(\"\", \"\", null);"                                 
[44] "      xslStylesheet.async = false;"                                                                                  
[45] "      xslStylesheet.load(xslFile);"                                                                                  
[46] "      xsltProcessor.importStylesheet(xslStylesheet);"                                                                
[47] "    }"                                                                                                               
[48] "    if (theColumn) {"                                                                                                
[49] "      xsltProcessor.setParameter(null,\"field\",\"f\" + (parseInt(theColumn) - 1));"                                 
[50] "      xsltProcessor.setParameter(null,\"datatype\",datatype);"                                                       
[51] "      xsltProcessor.setParameter(null,\"orderby\",orderby);"                                                         
[52] "    }"                                                                                                               
[53] "    var resultFragment = xsltProcessor.transformToFragment(window.opener.mv_XMLFileArray[themaId].XMLFile,document);"
[54] "    document.getElementById(\"content\").innerHTML = \"\";"                                                          
[55] "    document.getElementById(\"content\").appendChild(resultFragment);"                                               
[56] "  }"                                                                                                                 
[57] "}"                                                                                                                   
[58] "//-->"                                                                                                               
[59] "</script>"                                                                                                           
[60] "<title>Table attributaire</title>"                                                                                   
[61] "</head>"                                                                                                             
[62] "<body onload=\"sortTable();\">"                                                                                      
[63] "<div id=\"content\">Veuillez patientez ...</div>"                                                                    
[64] "</body>"                                                                                                             
[65] "</html>"                                                                                                             
[66] "" 

Any ideas?

Thank you.

You can use Chrome's Inspect Element tool to see which requests are triggered when you click the table button.

You can then retrieve the data directly from the URL requested by that AJAX call:

http://www.si-vitifrance.com/docs/cvi/cvi13/cartes_inter/c_vin01_coop_com07/embfiles/th0.xml

Then you can start parsing the XML from there.

To parse XML or HTML, the XML package is a useful tool. Here is a proof of concept that gets the title using the XPath of the element you want.
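The query string of the table page (`th0` in `table.html?th0`) is what the page's script reads as `themaId`, and the data file found above is `th0.xml` in the same `embfiles/` directory. Here is a minimal sketch in Python of deriving one URL from the other. It assumes that every theme follows this `<themaId>.xml` naming (only `th0` is confirmed here), and `data_url_for` is a hypothetical helper name:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def data_url_for(table_url):
    """Guess the XML data URL behind a table.html?<themaId> page.

    Assumption: the data file is named <themaId>.xml and sits next to
    table.html; only th0 has actually been observed.
    """
    parts = urlsplit(table_url)
    thema_id = parts.query  # the page script uses everything after "?"
    if not thema_id:
        raise ValueError("no theme id in the query string")
    directory = posixpath.dirname(parts.path)
    data_path = directory + "/" + thema_id + ".xml"
    # Rebuild the URL without query string or fragment
    return urlunsplit((parts.scheme, parts.netloc, data_path, "", ""))


table_url = ("http://www.si-vitifrance.com/docs/cvi/cvi13/cartes_inter/"
             "c_vin01_coop_com07/embfiles/table.html?th0")
print(data_url_for(table_url))
```

If other maps on the site use a different naming scheme, the request shown in the browser's network log remains the reliable source for the real URL.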

> library(XML)
> library(RCurl)
> url <- "http://www.si-vitifrance.com/docs/cvi/cvi13/cartes_inter/c_vin01_coop_com07/embfiles/th0.xml"
> doc <- htmlTreeParse(url, useInternalNodes = TRUE)
> title <- xpathSApply(doc, "//title[@id='titth0']", fun=xmlValue)
> title
[1] "Quantité livrée à la cave coopérative (hl)"

Python BeautifulSoup for scraping:

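from urllib.request import urlopen  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup

url = "http://www.si-vitifrance.com/docs/cvi/cvi13/cartes_inter/c_vin01_coop_com07/embfiles/th0.xml"
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(urlopen(url), "html.parser")
# print the text of every <f0> element
for f0 in soup.find_all("f0"):
    print(f0.text)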

Output:

Commune
07- BOURG-SAINT-ANDEOL
07- VILLENEUVE-DE-BERG
07- LABLACHERE
... 
07- BERRIAS-ET-CASTELJAU
07- BESSAS
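
The sort parameters in the page's script (`"f" + (column - 1)`) suggest that each table column is stored as a field `f0`, `f1`, and so on. To rebuild whole rows rather than a single column, something like the following may work. It is only a sketch, in Python with the standard library: the sample XML, its `rec` wrapper element and the second column are invented for illustration, since the layout of `th0.xml` beyond its `f0` elements is not shown here, and `rows_from_xml` is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: only the <f0> elements are known from the real file.
SAMPLE = """<table>
  <rec><f0>Commune</f0><f1>Volume</f1></rec>
  <rec><f0>07- BESSAS</f0><f1>12</f1></rec>
  <rec><f0>07- LABLACHERE</f0><f1>34</f1></rec>
</table>"""

FIELD = re.compile(r"f(\d+)$")  # matches tags such as f0, f1, f12


def rows_from_xml(xml_text):
    """Return one list of cell texts per element that directly holds f0, f1, ..."""
    root = ET.fromstring(xml_text)
    rows = []
    for parent in root.iter():
        cells = {}
        for child in parent:
            match = FIELD.match(child.tag)
            if match:
                cells[int(match.group(1))] = (child.text or "").strip()
        if cells:
            # order the cells by field number, not by position in the file
            rows.append([cells[i] for i in sorted(cells)])
    return rows


rows = rows_from_xml(SAMPLE)
header, data = rows[0], rows[1:]
print(header)
print(data)
```

The function does not depend on the name of the wrapper element: it collects the numbered fields from any element that contains them, so it should tolerate a different record tag in the real file.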
