Scrape a table from a website with javascript:subOpen href links

Question

For each link on this page, I would like to scrape the details page behind it.

I can get all the information on this page: PAGE

However, I would also like to get all the information on the details pages, and the href of each link looks like this, for example:

href="javascript:subOpen('9ca8ed0fae15d43dc1257e7300345b99')"

Here is my sample spreadsheet, which uses the ImportHTML function to get the general overview.

Google Spreadsheet

Any suggestions on how to get the details pages?

UPDATE

I implemented the method as follows:

function doGet(e){
  var base = 'http://www.ediktsdatei.justiz.gv.at/edikte/ex/exedi3.nsf/'
  var feed =  UrlFetchApp.fetch(base + 'suche?OpenForm&subf=e&query=%28%5BVKat%5D%3DEH%20%7C%20%5BVKat%5D%3DZH%20%7C%20%5BVKat%5D%3DMH%20%7C%20%5BVKat%5D%3DMW%20%7C%20%5BVKat%5D%3DMSH%20%7C%20%5BVKat%5D%3DGGH%20%7C%20%5BVKat%5D%3DRH%20%7C%20%5BVKat%5D%3DHAN%20%7C%20%5BVKat%5D%3DWE%20%7C%20%5BVKat%5D%3DEW%20%7C%20%5BVKat%5D%3DMAI%20%7C%20%5BVKat%5D%3DDTW%20%7C%20%5BVKat%5D%3DDGW%20%7C%20%5BVKat%5D%3DGA%20%7C%20%5BVKat%5D%3DGW%20%7C%20%5BVKat%5D%3DUL%20%7C%20%5BVKat%5D%3DBBL%20%7C%20%5BVKat%5D%3DLF%20%7C%20%5BVKat%5D%3DGL%20%7C%20%5BVKat%5D%3DSE%20%7C%20%5BVKat%5D%3DSO%29%20AND%20%5BBL%5D%3D0').getContentText();

       var d = document.createElement('div'); //assuming you can do this
       d.innerHTML = feed;//make the text a dom structure
       var arr = d.getElementsByTagName('a') //iterate over the page links
       var response = "";
       for(var i = 0;i<arr.length;i++){
         var atr = arr[i].getAttribute('onclick');
         if(atr) atr = atr.match(/subOpen\((.*?)\)/) //if onclick calls subOpen
         if(atr && atr.length > 1){ //get the id
            var detail = UrlFetchApp.fetch(base + '0/'+atr[1]).getContentText();
            response += detail//process the relevant part of the content and append to the reposnse text
         }
        }      
       return ContentService.createTextOutput(response);
}

However, I get an error when running the method:

ReferenceError: "document" is not defined. (line 6, file "")

What object does document belong to?

I have updated the Google Spreadsheet with a web app.

Answer 1

You can use Firebug to inspect the page contents and JavaScript. For instance, you can find that subOpen is actually an alias for subOpenXML, which is declared in xmlhttp01.js.

function subOpenXML(unid) {/*open found doc from search view*/
 if (waiting) return alert(bittewar);
 var wState = dynDoc.getElementById('windowState');
 wState.value = 'H';/*httpreq pending*/
 var last = '';
 if (unid==docLinks[0]) {last += '&f=1'; thisdocnum = 1;}
 if (unid==docLinks[docLinks.length-1]) {
  last += '&l=1';
  thisdocnum = docLinks.length;
 } else {
  for (var i=1;i<docLinks.length-1;i++)
   if (unid==docLinks[i]) {thisdocnum = i+1; break;}
 }
 var url = unid + html_delim + 'OpenDocument'+last + '&bm=2';
 httpreq.open('GET',    // &rand=' + Math.random();
  /*'/edikte/test/ex/exedi31.nsf/0/'+*/ '0/'+url, true);
 httpreq.onreadystatechange=onreadystatechange;
// httpreq.setRequestHeader('Accept','text/xml');
 httpreq.send(null);
 waiting = true;
 title2src = firstTextChild(dynDoc.getElementById('title2')).nodeValue;
}

So, copy the function source and modify it in Firebug's Console tab to add a console.log(url) before the HTTP call, like this:

 var url = unid + html_delim + 'OpenDocument'+last + '&bm=2';
 console.log(url)
 httpreq.open('GET',    // &rand=' + Math.random();
  /*'/edikte/test/ex/exedi31.nsf/0/'+*/ '0/'+url, true);

You can then execute the function declaration in Firebug's Console tab to overwrite subOpen with the modified source. Clicking the link will then show that the invoked URL is composed of the id passed as a parameter to subOpen, prefixed by '0/', so in the example you posted it would be a GET to:

http://www.ediktsdatei.justiz.gv.at/edikte/ex/exedi3.nsf/0/1fd2313c2e0095bfc1257e49004170ca?OpenDocument&f=1&bm=2

You can also verify this by opening the Network tab in Firebug and clicking the link.
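The URL composition described above can be sketched as a small function. This is only an illustration of what subOpenXML appears to do: the name buildDetailUrl is hypothetical, and the value '?' for html_delim is an assumption inferred from the example URL, not something read from the site's source.

```javascript
// ASSUMPTION: html_delim is '?', inferred from the example URL above.
var HTML_DELIM = '?';

// Hypothetical helper mirroring the URL logic of subOpenXML.
// docLinks is the list of document ids on the overview page, in order.
function buildDetailUrl(base, unid, docLinks) {
  var last = '';
  if (unid === docLinks[0]) last += '&f=1';                   // first document
  if (unid === docLinks[docLinks.length - 1]) last += '&l=1'; // last document
  return base + '0/' + unid + HTML_DELIM + 'OpenDocument' + last + '&bm=2';
}
```

Called with the base URL of the database and an id that is first in docLinks, this yields a URL of the same shape as the example GET above.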

Therefore, in order to scrape the details page you'd need to:

Parse the id passed to subOpen
Make a GET call to '0/' followed by that id
Parse the response to the request
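The three steps above could be sketched roughly as follows. The function names and the fetchText and parseDetail callbacks are hypothetical and are injected so that the string handling can be tried outside Apps Script; the regular expression assumes the ids appear in the HTML as subOpen('...') with single quotes.

```javascript
// Step 1: collect every id passed to subOpen('...') in the overview HTML.
function extractSubOpenIds(html) {
  var ids = [];
  var re = /subOpen\('([^']+)'\)/g;
  var m;
  while ((m = re.exec(html)) !== null) {
    if (ids.indexOf(m[1]) === -1) ids.push(m[1]); // skip duplicate ids
  }
  return ids;
}

// Steps 2 and 3: GET base + '0/' + id, then pass each body to a parser.
// fetchText(url) returns the response body; parseDetail(text) extracts data.
// In Apps Script, fetchText could be:
//   function (u) { return UrlFetchApp.fetch(u).getContentText(); }
function scrapeDetails(html, base, fetchText, parseDetail) {
  return extractSubOpenIds(html).map(function (id) {
    return parseDetail(fetchText(base + '0/' + id));
  });
}
```

Keeping the fetch behind a callback is a design choice for this sketch only: it lets the id extraction be checked with a plain string before any requests are sent.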

Looking at the response in Firebug's Network tab suggests that you will probably need to do similar parsing to get the displayed contents, but I haven't looked deeply into it.
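The exact format of that response is not shown here, so any parser is guesswork. Assuming it is HTML- or XML-like markup, a crude first pass could strip the tags and collapse whitespace to see what text is left; markupToText is a hypothetical helper, and it decodes only two common entities.

```javascript
// Hypothetical helper: reduce markup to plain text for a first inspection.
// ASSUMPTION: the detail response is HTML- or XML-like markup.
function markupToText(markup) {
  return markup
    .replace(/<[^>]*>/g, ' ')  // replace every tag with a space
    .replace(/&nbsp;/g, ' ')   // decode non-breaking spaces
    .replace(/&amp;/g, '&')    // decode ampersands
    .replace(/\s+/g, ' ')      // collapse runs of whitespace
    .trim();
}
```

Once the real structure of the response is known, a targeted extraction of the relevant fields would be more robust than this blanket approach.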

UPDATE The importHTML function is not suitable for the kind of scraping you want. Google's HTML or Content Services are better suited for this. You'll need to create a web app and implement the doGet function:

function doGet(e){
  var base = 'http://www.ediktsdatei.justiz.gv.at/edikte/ex/exedi3.nsf/'
  var feed =  UrlFetchApp.fetch(base + 'suche?OpenForm&subf=e&query=%28%5BVKat%5D%3DEH%20%7C%20%5BVKat%5D%3DZH%20%7C%20%5BVKat%5D%3DMH%20%7C%20%5BVKat%5D%3DMW%20%7C%20%5BVKat%5D%3DMSH%20%7C%20%5BVKat%5D%3DGGH%20%7C%20%5BVKat%5D%3DRH%20%7C%20%5BVKat%5D%3DHAN%20%7C%20%5BVKat%5D%3DWE%20%7C%20%5BVKat%5D%3DEW%20%7C%20%5BVKat%5D%3DMAI%20%7C%20%5BVKat%5D%3DDTW%20%7C%20%5BVKat%5D%3DDGW%20%7C%20%5BVKat%5D%3DGA%20%7C%20%5BVKat%5D%3DGW%20%7C%20%5BVKat%5D%3DUL%20%7C%20%5BVKat%5D%3DBBL%20%7C%20%5BVKat%5D%3DLF%20%7C%20%5BVKat%5D%3DGL%20%7C%20%5BVKat%5D%3DSE%20%7C%20%5BVKat%5D%3DSO%29%20AND%20%5BBL%5D%3D0').getContentText();
  var response = "";
  // Find every subOpen('<id>') call in the raw HTML text
  var match = feed.match(/subOpen\('.*?'\)/g);
  if (match) {
    for (var i = 0; i < match.length; i++) {
      var m = match[i].match(/\('(.*)'\)/); // capture the id between the quotes
      if (m && m.length > 1) {
        var detailText = UrlFetchApp.fetch(base + '0/' + m[1]).getContentText();
        // Process the relevant part of detailText here;
        // as a placeholder, append it to the response unchanged
        response += detailText + '\n';
      }
    }
  }
  return ContentService.createTextOutput(response);
}

Solution 1
6 Accepted 2015-07-24 12:28:23