
To exceed the ImportXML limit on Google Spreadsheet

I am stuck on a "scraping problem" right now. Specifically, I want to extract the name of the author from a web page into a Google Spreadsheet. The function =IMPORTXML(A2,"//span[@class='author vcard meta-item']") does work, but once I increase the number of links to scrape, it just loads endlessly.

So I researched and found out that this problem is caused by a limit imposed by Google.

Does anybody know how to get around the limit, or have a script that I could "easily copy"? I really have no clue about coding.

I created a custom import function that overcomes the limits of IMPORTXML. I have a sheet using this in about 800 cells and it works great.

It makes use of Google Sheets' custom scripts (Tools > Script editor…) and searches through content using regex instead of XPath.

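function importRegex(url, regexInput) {
  var output = '';
  // muteHttpExceptions returns error pages (404, 500, ...) instead of throwing
  var response = UrlFetchApp.fetch(url, {muteHttpExceptions: true});
  if (response) {
    var html = response.getContentText();
    if (html.length && regexInput.length) {
      // match() returns null when nothing matches, so guard before reading group 1
      var match = html.match(new RegExp(regexInput, 'i'));
      if (match && match[1] !== undefined) {
        output = match[1];
      }
    }
  }
  // Grace period to avoid overloading the target server
  Utilities.sleep(1000);
  return output;
}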

You can then use this function like any other function.

=importRegex("https://example.com", "<title>(.*?)</title>")

Of course, you can also reference cells.

=importRegex(A2, "<title>(.*?)</title>")
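For the author span from the question, a regex roughly equivalent to the XPath could look like the sketch below. The helper name `extractFirst` and the sample markup are assumptions made for illustration; the real page may nest or order its attributes differently, in which case the pattern would need adjusting.

```javascript
// Hypothetical helper: the matching step of importRegex without the
// network fetch, so it can be tried outside Google Sheets.
function extractFirst(html, regexInput) {
  var match = html.match(new RegExp(regexInput, 'i'));
  // match() returns null when nothing matches
  return match && match[1] !== undefined ? match[1] : '';
}

// Assumed markup; the actual page structure is not shown in the question.
var sampleHtml =
  '<div><span class="author vcard meta-item">' +
  '<a href="/authors/jane">Jane Doe</a></span></div>';

// [\s\S]*? matches lazily across line breaks up to the first closing </span>.
var authorPattern = '<span class="author vcard meta-item">([\\s\\S]*?)</span>';

var rawAuthor = extractFirst(sampleHtml, authorPattern);

// Strip any nested tags (here the <a> link) to keep only the text.
var author = rawAuthor.replace(/<[^>]+>/g, '').trim();
```

Inside a Sheets formula the same pattern is written with doubled quotes and single backslashes, for example =importRegex(A2, "<span class=""author vcard meta-item"">([\s\S]*?)</span>"), because Sheets escapes a quote by doubling it and does not treat the backslash as an escape character.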

If you don't want to see HTML entities in the output, you can use this function.

var htmlEntities = {
  nbsp:  ' ',
  cent:  '¢',
  pound: '£',
  yen:   '¥',
  euro:  '€',
  copy:  '©',
  reg:   '®',
  lt:    '<',
  gt:    '>',
  mdash: '\u2014', // em dash
  ndash: '\u2013', // en dash
  quot:  '"',
  amp:   '&',
  apos:  '\''
};

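function unescapeHTML(str) {
    // Replace each &name; or &#number; sequence in a single pass
    return str.replace(/&([^;]+);/g, function (entity, entityCode) {
        var match;

        if (Object.prototype.hasOwnProperty.call(htmlEntities, entityCode)) {
            // Named entity such as &amp;
            return htmlEntities[entityCode];
        } else if ((match = entityCode.match(/^#x([\da-fA-F]+)$/))) {
            // Hexadecimal character reference such as &#x41;
            return String.fromCharCode(parseInt(match[1], 16));
        } else if ((match = entityCode.match(/^#(\d+)$/))) {
            // Decimal character reference such as &#65;
            return String.fromCharCode(parseInt(match[1], 10));
        } else {
            // Unknown entity: leave it untouched
            return entity;
        }
    });
}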
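To see what the decoder does with a typical scraped title, here is a trimmed-down, self-contained sketch. The names `sampleEntities` and `decodeSample` and the input string are made up for illustration, and the entity table is deliberately shortened.

```javascript
// Shortened entity table, for illustration only.
var sampleEntities = { amp: '&', lt: '<', gt: '>', quot: '"' };

function decodeSample(str) {
  return str.replace(/&([^;]+);/g, function (entity, code) {
    // Named entity found in the table
    if (Object.prototype.hasOwnProperty.call(sampleEntities, code)) {
      return sampleEntities[code];
    }
    // Hexadecimal character reference, e.g. &#xE9;
    var hex = code.match(/^#x([\da-fA-F]+)$/);
    if (hex) {
      return String.fromCharCode(parseInt(hex[1], 16));
    }
    // Decimal character reference, e.g. &#8211;
    var dec = code.match(/^#(\d+)$/);
    if (dec) {
      return String.fromCharCode(parseInt(dec[1], 10));
    }
    // Unknown entity: leave it untouched
    return entity;
  });
}

// Mixes a named entity, a decimal and a hex reference, and an unknown entity.
var decodedTitle = decodeSample('Tom &amp; Jerry &#8211; caf&#xE9; &bogus;');
```

Named and numeric references are decoded in one pass, and anything the table does not know (here `&bogus;`) is passed through unchanged rather than dropped.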

All together…

function importRegex(url, regexInput) {
  var output = '';
  // muteHttpExceptions returns error pages (404, 500, ...) instead of throwing
  var response = UrlFetchApp.fetch(url, {muteHttpExceptions: true});
  if (response) {
    var html = response.getContentText();
    if (html.length && regexInput.length) {
      // match() returns null when nothing matches, so guard before reading group 1
      var match = html.match(new RegExp(regexInput, 'i'));
      if (match && match[1] !== undefined) {
        output = match[1];
      }
    }
  }
  // Grace period to avoid overloading the target server
  Utilities.sleep(1000);
  return unescapeHTML(output);
}

var htmlEntities = {
  nbsp:  ' ',
  cent:  '¢',
  pound: '£',
  yen:   '¥',
  euro:  '€',
  copy:  '©',
  reg:   '®',
  lt:    '<',
  gt:    '>',
  mdash: '\u2014', // em dash
  ndash: '\u2013', // en dash
  quot:  '"',
  amp:   '&',
  apos:  '\''
};

function unescapeHTML(str) {
    // Replace each &name; or &#number; sequence in a single pass
    return str.replace(/&([^;]+);/g, function (entity, entityCode) {
        var match;

        if (Object.prototype.hasOwnProperty.call(htmlEntities, entityCode)) {
            return htmlEntities[entityCode];
        } else if ((match = entityCode.match(/^#x([\da-fA-F]+)$/))) {
            return String.fromCharCode(parseInt(match[1], 16));
        } else if ((match = entityCode.match(/^#(\d+)$/))) {
            return String.fromCharCode(parseInt(match[1], 10));
        } else {
            // Unknown entity: leave it untouched
            return entity;
        }
    });
}

There is no script that lets you exceed the limits. Since the code runs on a Google machine (server), you cannot cheat. Some limits are bound to your spreadsheet, so you could try using multiple spreadsheets, if that helps.
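Although the quotas themselves cannot be raised, a script can avoid spending them on repeat requests for the same URL. The following is a minimal sketch of that idea, not a tested Apps Script deployment: `fetchWithCache`, `makeMemoryCache`, and `fakeFetch` are hypothetical names, and the cache and fetcher are passed in as parameters so the logic can be tried outside Google Sheets.

```javascript
// cache:     any object with get(key) and put(key, value, seconds),
//            the shape offered by CacheService.getScriptCache()
// fetchText: function (url) that returns the page text
function fetchWithCache(url, cache, fetchText) {
  var cached = cache.get(url);
  if (cached !== null && cached !== undefined) {
    return cached; // cache hit: no network request is made
  }
  var text = fetchText(url);
  // 21600 seconds (6 hours) is the longest lifetime CacheService accepts
  cache.put(url, text, 21600);
  return text;
}

// In-memory stand-in for the Apps Script cache (assumption, for local runs)
function makeMemoryCache() {
  var store = {};
  return {
    get: function (key) {
      return Object.prototype.hasOwnProperty.call(store, key) ? store[key] : null;
    },
    put: function (key, value) {
      store[key] = value;
    }
  };
}

// Fake fetcher that counts how often it is called
var fetchCount = 0;
function fakeFetch(url) {
  fetchCount += 1;
  return '<title>Page for ' + url + '</title>';
}

var memoryCache = makeMemoryCache();
var firstResult = fetchWithCache('https://example.com', memoryCache, fakeFetch);
var secondResult = fetchWithCache('https://example.com', memoryCache, fakeFetch);
```

In Apps Script itself one would pass CacheService.getScriptCache() as the cache and a small wrapper around UrlFetchApp.fetch(url, {muteHttpExceptions: true}).getContentText() as the fetcher. CacheService restricts key and value sizes, so caching the extracted match rather than the whole page is probably the safer choice.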
