How to check whether a URL is indexed by Google?

Question

I want to create a Google Apps Script that checks whether a given URL is indexed by Google, so I wrote the following function:

function CheckURLForGoogleIndex(url, activesheet) {
  // Delete the https:// and http:// prefix
  var cururl = url.replace("https://", "");
  cururl = cururl.replace("http://", "");
  var googlesearchurl = "https://www.google.com/search?q=site:" + encodeURIComponent(cururl);
  var page = UrlFetchApp.fetch(googlesearchurl, {muteHttpExceptions: true}).getContentText();
  // Wait for 1 second before starting another fetch
  Utilities.sleep(1000);
  var number = page.match("did not match any documents");
  if (number) {
    activesheet.getSheetByName("Not Google Index").appendRow([url]);
  } else {
    activesheet.getSheetByName("Google Index").appendRow([url]);
  }
}

However, when debugging the code, after calling UrlFetchApp.fetch I can only see the header of the variable page.

I tested the function with both a Google-indexed URL and a non-indexed URL, but page.match returns null in both cases, so both URLs are put in the "Google Index" sheet.

What is wrong with my function?

Thanks

Note:

I asked this question at https://groups.google.com/g/google-apps-script-community/c/gs1qUuKwgn4 but nobody answered, so I have to ask it here.

Sample Input & Output

Input 1:

url = https://www.datanumen.com/

activesheet = a Google Sheet containing the sheets "Google Index" and "Not Google Index"

Expected output 1: since https://www.datanumen.com/ is indexed by Google, it will be added to the "Google Index" sheet.

page = "<!doctype html><html lang="en"><head><meta charset="UTF-8"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>site:www.datanumen.com/ - Google Search…"

Input 2:

url = https://www.datanumen.com/notindexedurl/

activesheet = a Google Sheet containing the sheets "Google Index" and "Not Google Index"

Expected output 2: since https://www.datanumen.com/notindexedurl/ is not indexed by Google, it will be added to the "Not Google Index" sheet.

page = "<!doctype html><html lang="en"><head><meta charset="UTF-8"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>site:www.datanumen.com/notindexurl/ - G…"

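Problem: currently, for both Input 1 and Input 2, the actual result is that the URL is always added to the "Google Index" sheet, because the search result never contains the text "did not match any documents".

Update

I added console.log(page); and debugged again. For Input 1, I got the following result: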

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="content-type" content="text/html; charset=utf-8"><meta name="viewport" content="initial-scale=1"><title>https://www.google.com/search?q=site:www.datanumen.com%2F</title></head>
<body style="font-family: arial, sans-serif; background-color: #fff; color: #000; padding:20px; font-size:18px;" onload="e=document.getElementById('captcha');if(e){e.focus();}">
<div style="max-width:400px;">
<hr noshade size="1" style="color:#ccc; background-color:#ccc;"><br>
<form id="captcha-form" action="index" method="post">
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
<script>var submitCallback = function(response) {document.getElementById('captcha-form').submit();};</script>
<div id="recaptcha" class="g-recaptcha" data-sitekey="6LfwuyUTAAAAAOAmoS0fdqijC2PbbdH4kjq62Y1b" data-callback="submitCallback" data-s="c5Hy4maqTFv3SzYRiWhpsqYF2isZmauUQnLVljOiED_PiaVWJWCsHMzRAZyh8HLCBHJ_mjET7yODJu8AlZ33_xGAQ8TcKuXAd7rQpsYakaGKPD8USiGSFhiII2ai-Cf_B26i1Ufpko-qYQ8V3rezhiSXxi5J2yHZ-_WwEj8ukzy5znxzVurTM_2cY243Q4ofwP7E7eWBaHIg6N3ofmPuFXd-uRIUU4z0cU_pas8"></div>
<input type='hidden' name='q' value='EgRrsuB5GKmx6oUGIhBKAdWty9nssg-nAtyy9n7hMgFy'><input type="hidden" name="continue" value="https://www.google.com/search?q=site:www.datanumen.com%2F">
</form>
<hr noshade size="1" style="color:#ccc; background-color:#ccc;">

<div style="font-size:13px;">
<b>About this page</b><br><br>

Our systems have detected unusual traffic from your computer network.  This page checks to see if it&#39;s really you sending the requests, and not a robot.  <a href="#" onclick="document.getElementById('infoDiv').style.display='block';">Why did this happen?</a><br><br>

<div id="infoDiv" style="display:none; background-color:#eee; padding:10px; margin:0 0 15px 0; line-height:1.4em;">
This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the <a href="//www.google.com/policies/terms/">Terms of Service</a>. The block will expire shortly after those requests stop.  In the meantime, solving the above CAPTCHA will let you continue to use our services.<br><br>This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests.  If you share your network connection, ask your administrator for help &mdash; a different computer using the same IP address may be responsible.  <a href="//support.google.com/websearch/answer/86640">Learn more</a><br><br>Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.
</div>

IP address: 107.178.224.121<br>Time: 2021-06-04T21:18:34Z<br>URL: https://www.google.com/search?q=site:www.datanumen.com%2F<br>
</div>
</div>
</body>
</html>

Answer 1

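Answer:

Unfortunately, doing this directly by scraping web search results with UrlFetchApp will not work. You can, however, use a third-party tool to get the number of search results.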
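To make the failure mode concrete, here is a minimal sketch of how the fetched HTML could be classified before any conclusion is drawn from it. The function name classifySearchPage and the marker strings are assumptions for illustration, taken from the pages quoted in the question. Google can change that markup at any time, so this is not a reliable detection method.

```javascript
// Hypothetical helper: classify the HTML returned for a "site:" query.
// The marker strings are assumptions based on the pages quoted in the question.
function classifySearchPage(html) {
  // The block page in the question contains a reCAPTCHA form and this notice.
  if (html.includes('id="captcha-form"') || html.includes("unusual traffic")) {
    return "blocked";
  }
  // Only an explicit "no results" message is treated as evidence of no index.
  if (html.includes("did not match any documents")) {
    return "not indexed";
  }
  // The absence of the message proves nothing, so do not report "indexed".
  return "unknown";
}

console.log(classifySearchPage('<form id="captcha-form" action="index" method="post">'));
```

In Apps Script the argument would be the result of UrlFetchApp.fetch(...).getContentText(). The original function treats every page that lacks the "did not match" text as indexed, which is why the CAPTCHA page ends up in the "Google Index" sheet.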

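What you can do:

There are many third-party APIs you can use to do this; I suggest searching for one that meets your needs.

I tested one called Authoritas, which returns search-engine indexing for different keywords. The API is asynchronous, so it can take up to a minute to get a response, which is why a Web App solution is needed.

The process I used is as follows:

Get an API key from Authoritas (free)
Create a new Apps Script project to make the API call: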

function makeApiCall(url, method, site) {
  const public_key = ""
  const private_key = ""
  const salt = "" 
  let timestamp = Date.now()

  const hash = Utilities.computeHmacSha256Signature(timestamp + public_key + salt, private_key)
  const headers = {
    "Authorization": "KeyAuth publicKey=" + public_key + " hash=" + toHexString(hash) + " ts=" + timestamp,
    "Content-Type": "application/json"
  }

  const requestParameters = {
    "search_engine": "google",
    "region": "us",
    "language": "en",
    "max_results": 100,
    "phrase": site,
    "search_type": "web",
    "user_agent": "pc",
    "parameters": {
      "priority": "standard"
    },
    "callback_type": "full",
    "callback": "script-web-app-exec-url"
  }

  const options = {
    "method": method,
    "headers": headers,
    "muteHttpExceptions": true,
    "payload": JSON.stringify(requestParameters)
  }

  const response = UrlFetchApp.fetch(url, options)
  return response
}

function toHexString(byteArray) {
  const hexString = Array.from(byteArray, function(byte) {
    return ('0' + (byte & 0xFF).toString(16)).slice(-2)
  }).join('')
  return hexString
}
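
As a side note on toHexString: Utilities.computeHmacSha256Signature returns an array of signed bytes (values from -128 to 127), which is why each value is masked with 0xFF before it is converted to hex. The following standalone sketch repeats the helper and runs it on a made-up sample array (not a real signature) to show the effect:

```javascript
// Same conversion as above, repeated so the snippet runs on its own.
function toHexString(byteArray) {
  return Array.from(byteArray, function (byte) {
    // byte & 0xFF maps a signed byte (-128..127) to 0..255;
    // the "0" prefix plus slice(-2) pads single-digit values to two characters.
    return ("0" + (byte & 0xff).toString(16)).slice(-2);
  }).join("");
}

// Made-up sample bytes, chosen to cover negative values and padding.
const sampleBytes = [-1, 0, 15, 127, -128];
console.log(toHexString(sampleBytes)); // prints "ff000f7f80"
```

Without the mask, a negative byte such as -1 would be rendered as "-1" and the resulting hash string would be malformed.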

There is also a doPost(e) function, so that the data can be processed when the API returns it:

function doPost(e) {
  const jsonData = JSON.parse(e.postData.contents)
  const pages = jsonData.response.summary.pages
  const ss = SpreadsheetApp.openById("1QBzDdGn1yaUxFJciLH_Ru-BbLHuBIZTUk2UnrUShGw0") 
  
  if (Object.keys(pages).length == 0) {
    ss.getSheetByName("Not Google Index").appendRow([jsonData.request.phrase])
  }
  else {
    ss.getSheetByName("Google Index").appendRow([jsonData.request.phrase])
  }  
}
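
The decision inside doPost can also be pulled out into a small pure function, which makes it testable without a deployed Web App. This is only a sketch: pickSheetName is a hypothetical name, the sample payloads below are invented, and the only fields assumed are the ones the handler above already reads (response.summary.pages and request.phrase).

```javascript
// Hypothetical helper: decide which sheet a callback payload belongs to.
// Returns null when the payload does not have the expected shape.
function pickSheetName(jsonData) {
  const response = jsonData && jsonData.response;
  if (!response || !response.summary) {
    return null; // malformed or unexpected payload: do not guess
  }
  const pages = response.summary.pages || {};
  // No result pages is treated as "not indexed", as in doPost above.
  return Object.keys(pages).length === 0 ? "Not Google Index" : "Google Index";
}

// Invented sample payloads, for illustration only.
const emptyResult = { request: { phrase: "example.com" }, response: { summary: { pages: {} } } };
const someResult = { request: { phrase: "example.com" }, response: { summary: { pages: { "1": [] } } } };
console.log(pickSheetName(emptyResult), "/", pickSheetName(someResult));
```

Inside doPost, a null result could be logged instead of written to either sheet, so that an error response from the API is not recorded as "not indexed".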

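I then deployed this project as a Web App with the following settings:

Execute as: Me
Who has access: Anyone (not Anyone with a Google account)

Remember to copy the Web App URL when it is provided and paste it into the "callback": "script-web-app-exec-url" part of the payload. (Normally you could use ScriptApp.getService().getUrl(), but according to this question, when the code is run from the script editor this method returns the /dev link rather than the /exec link, which will not work.)

It can then be run simply like this: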
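
Because the /dev versus /exec mix-up is easy to miss, a small guard can check the callback URL before the request is sent. This is a minimal sketch; isExecUrl is a hypothetical helper and EXAMPLE_ID is a placeholder, not a real deployment ID.

```javascript
// Hypothetical guard: true only if the Web App URL points at the /exec endpoint.
function isExecUrl(url) {
  // Drop any query string or fragment before checking the last path segment.
  const path = url.split(/[?#]/)[0];
  return /\/exec$/.test(path);
}

// EXAMPLE_ID is a placeholder for illustration.
console.log(isExecUrl("https://script.google.com/macros/s/EXAMPLE_ID/exec"));
console.log(isExecUrl("https://script.google.com/macros/s/EXAMPLE_ID/dev"));
```

Calling this on the callback value before building the payload lets the script fail early with a clear message instead of waiting for a callback that never arrives.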

function run() {
  // UrlFetchApp.fetch expects a full URL, including the scheme
  const req = makeApiCall("https://v3.api.analyticsseo.com/serps/", "POST", "asdhfdhdfgdsfser.com")
  console.log(req.getContentText())
}

The request will run, and a response from the API containing the request object will be logged. When the request is ready, the Authoritas API calls the script URL you provided in the callback parameter, which runs the doPost() function.

This is a complex workaround, but unfortunately web scraping is becoming harder and harder these days.

References:

Class UrlFetchApp | Apps Script | Google Developers
Web Apps | Apps Script | Google Developers
Google IP ranges
rafa-guillermo/Testing-Binance-API-in-Google-Apps-Script - GitHub
Authoritas Structured Keyword Ranking API documentation
- Detailed Request Object - Authoritas
- Detailed Response Object - Authoritas
