
How to query whether a URL is indexed by Google?

I want to create a Google Apps Script that checks whether a given URL is indexed by Google, so I wrote the following function:

function CheckURLForGoogleIndex(url, activesheet) {
  // Delete the https:// and http:// prefix
  var cururl = url.replace("https://", "");
  cururl = cururl.replace("http://", "");
  var googlesearchurl = "https://www.google.com/search?q=site:" + encodeURIComponent(cururl);
  var page = UrlFetchApp.fetch(googlesearchurl, {muteHttpExceptions: true}).getContentText();
  // Wait for 1 second before starting another fetch
  Utilities.sleep(1000);
  var number = page.match("did not match any documents");
  if (number) {
    activesheet.getSheetByName("Not Google Index").appendRow([url]);
  } else {
    activesheet.getSheetByName("Google Index").appendRow([url]);
  }
}

But when debugging the code, after the call to UrlFetchApp.fetch I can only see the header of the page variable.

I tested the function with a URL that Google has indexed and one that it has not, but page.match returns null for both, so both are put in the "Google Index" sheet.

What is wrong with my function?

Thanks

Note:

I asked this question at https://groups.google.com/g/google-apps-script-community/c/gs1qUuKwgn4, but nobody answered, so I have to ask here.

Sample Input & Output

Input 1:

url = https://www.datanumen.com/

activesheet = a Google Sheet containing the sheets "Google Index" and "Not Google Index"

Expected output 1: since https://www.datanumen.com/ is indexed by Google, it will be added to the "Google Index" sheet.

page = "<!doctype html><html lang="en"><head><meta charset="UTF-8"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>site:www.datanumen.com/ - Google Search…"

Input 2:

url = https://www.datanumen.com/notindexedurl/

activesheet = a Google Sheet containing the sheets "Google Index" and "Not Google Index"

Expected output 2: since https://www.datanumen.com/notindexedurl/ is not indexed by Google, it will be added to the "Not Google Index" sheet.

page = "<!doctype html><html lang="en"><head><meta charset="UTF-8"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>site:www.datanumen.com/notindexurl/ - G…"

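Problem: currently, for both Input 1 and Input 2, the actual result is that the URL is always added to the "Google Index" sheet, because the search result never contains the text "did not match any documents".

Update

I added console.log(page); and debugged again. For Input 1, I get the following result: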

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="content-type" content="text/html; charset=utf-8"><meta name="viewport" content="initial-scale=1"><title>https://www.google.com/search?q=site:www.datanumen.com%2F</title></head>
<body style="font-family: arial, sans-serif; background-color: #fff; color: #000; padding:20px; font-size:18px;" onload="e=document.getElementById('captcha');if(e){e.focus();}">
<div style="max-width:400px;">
<hr noshade size="1" style="color:#ccc; background-color:#ccc;"><br>
<form id="captcha-form" action="index" method="post">
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
<script>var submitCallback = function(response) {document.getElementById('captcha-form').submit();};</script>
<div id="recaptcha" class="g-recaptcha" data-sitekey="6LfwuyUTAAAAAOAmoS0fdqijC2PbbdH4kjq62Y1b" data-callback="submitCallback" data-s="c5Hy4maqTFv3SzYRiWhpsqYF2isZmauUQnLVljOiED_PiaVWJWCsHMzRAZyh8HLCBHJ_mjET7yODJu8AlZ33_xGAQ8TcKuXAd7rQpsYakaGKPD8USiGSFhiII2ai-Cf_B26i1Ufpko-qYQ8V3rezhiSXxi5J2yHZ-_WwEj8ukzy5znxzVurTM_2cY243Q4ofwP7E7eWBaHIg6N3ofmPuFXd-uRIUU4z0cU_pas8"></div>
<input type='hidden' name='q' value='EgRrsuB5GKmx6oUGIhBKAdWty9nssg-nAtyy9n7hMgFy'><input type="hidden" name="continue" value="https://www.google.com/search?q=site:www.datanumen.com%2F">
</form>
<hr noshade size="1" style="color:#ccc; background-color:#ccc;">

<div style="font-size:13px;">
<b>About this page</b><br><br>

Our systems have detected unusual traffic from your computer network.  This page checks to see if it&#39;s really you sending the requests, and not a robot.  <a href="#" onclick="document.getElementById('infoDiv').style.display='block';">Why did this happen?</a><br><br>

<div id="infoDiv" style="display:none; background-color:#eee; padding:10px; margin:0 0 15px 0; line-height:1.4em;">
This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the <a href="//www.google.com/policies/terms/">Terms of Service</a>. The block will expire shortly after those requests stop.  In the meantime, solving the above CAPTCHA will let you continue to use our services.<br><br>This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests.  If you share your network connection, ask your administrator for help &mdash; a different computer using the same IP address may be responsible.  <a href="//support.google.com/websearch/answer/86640">Learn more</a><br><br>Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.
</div>

IP address: 107.178.224.121<br>Time: 2021-06-04T21:18:34Z<br>URL: https://www.google.com/search?q=site:www.datanumen.com%2F<br>
</div>
</div>
</body>
</html>

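Answer:

Unfortunately, doing this directly by scraping web search results with UrlFetchApp will not work. You can, however, use a third-party tool to get the number of search results.

More Information:

I tested this using the exponential backoff method, which is sometimes able to get past 429 errors when UrlFetchApp invokes a fetch request.

When using UrlFetchApp to scrape the web or connect to an API, the server may reject the request with "too many requests", that is, HTTP Error 429.

Google Apps Script runs in the cloud, from a set of IP addresses in a pool owned by Google. The full list of these IP ranges is published by Google. Most websites (especially those of large companies like Google) have architecture in place to stop bots from scraping them and slowing down traffic.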

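Sometimes this error can be overcome with a mixture of exponential backoff and random time intervals, as shown with the Binance API (full disclosure: this GitHub repository was written by me).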
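To make the idea of exponential backoff with random intervals concrete, here is a minimal sketch. The names backoffDelayMs and fetchWithBackoff, the 1000 ms base delay and the 1000 ms jitter ceiling are all assumptions made for illustration; this is not the code from the repository mentioned above. The fetch and sleep functions are passed in so the retry logic can be tried outside Apps Script.

```javascript
// Sketch only: backoffDelayMs and fetchWithBackoff are hypothetical helpers,
// not part of Apps Script or of any library.

// Delay before the retry that follows attempt number `attempt` (0-based):
// baseMs * 2^attempt, plus a random jitter in [0, maxJitterMs).
function backoffDelayMs(attempt, baseMs, maxJitterMs, random) {
  const rand = random || Math.random
  return baseMs * Math.pow(2, attempt) + Math.floor(rand() * maxJitterMs)
}

// Calls fetchFn() until it returns a status other than 429 or the attempts
// run out. fetchFn must return an object that has getResponseCode().
function fetchWithBackoff(fetchFn, sleepFn, maxAttempts) {
  let response = null
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    response = fetchFn()
    if (response.getResponseCode() !== 429) {
      return response
    }
    // Do not sleep after the final attempt
    if (attempt < maxAttempts - 1) {
      sleepFn(backoffDelayMs(attempt, 1000, 1000))
    }
  }
  return response // still 429 after maxAttempts tries
}
```

In Apps Script this could be called as fetchWithBackoff(() => UrlFetchApp.fetch(url, {muteHttpExceptions: true}), ms => Utilities.sleep(ms), 5). Backoff only helps with rate limiting; it will not get past a CAPTCHA page.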

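I assume Google either blocks the Apps Script IP pool outright or too many people are trying the same thing, because with the same technique I could not get any response that did not involve entering a CAPTCHA, as we saw in the comments above and as can be seen in the log of the page string.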
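This also explains why the function in the question files every URL under "Google Index": the CAPTCHA page never contains "did not match any documents", so the match is always null. A minimal sketch of a check that separates the blocked case from the other two is shown below. classifySearchPage is a hypothetical helper, and the marker strings are taken from the logged page in the question; Google may change that markup at any time.

```javascript
// Hypothetical helper: classifies the HTML returned for a "site:" query.
// The marker strings come from the CAPTCHA page logged in the question.
function classifySearchPage(statusCode, html) {
  const blocked =
    statusCode === 429 ||
    html.indexOf('id="captcha-form"') !== -1 ||
    html.indexOf("detected unusual traffic") !== -1
  if (blocked) {
    return "blocked" // no conclusion can be drawn about indexing
  }
  if (html.indexOf("did not match any documents") !== -1) {
    return "not indexed"
  }
  return "indexed"
}
```

Writing "blocked" rows to a separate sheet, or skipping them, would at least avoid recording a blocked request as an indexed URL.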

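What can be done:

There are many third-party APIs you can use to do this, and I suggest searching for one that fits your needs.

I tested one called Authoritas, which returns search engine indexing for different keywords. The API is asynchronous, so it can take up to a minute to get a response, which means a Web App solution needs to be made.

The flow I used is as follows: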

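function makeApiCall(url, method, site) {
  // Fill in the credentials for your own Authoritas account
  const public_key = ""
  const private_key = ""
  const salt = ""
  let timestamp = Date.now()

  // HMAC-SHA256 of timestamp + public key + salt, keyed with the private key
  const hash = Utilities.computeHmacSha256Signature(timestamp + public_key + salt, private_key)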
  const headers = {
    "Authorization": "KeyAuth publicKey=" + public_key + " hash=" + toHexString(hash) + " ts=" + timestamp,
    "Content-Type": "application/json"
  }

  const requestParameters = {
    "search_engine": "google",
    "region": "us",
    "language": "en",
    "max_results": 100,
    "phrase": site,
    "search_type": "web",
    "user_agent": "pc",
    "parameters": {
      "priority": "standard"
    },
    "callback_type": "full",
    "callback": "script-web-app-exec-url"
  }

  const options = {
    "method": method,
    "headers": headers,
    "muteHttpExceptions": true,
    "payload": JSON.stringify(requestParameters)
  }

  const response = UrlFetchApp.fetch(url, options)
  return response
}

function toHexString(byteArray) {
  const hexString = Array.from(byteArray, function(byte) {
    return ('0' + (byte & 0xFF).toString(16)).slice(-2)
  }).join('')
  return hexString
}
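
The & 0xFF mask in toHexString matters because Apps Script returns the signature as an array of signed bytes (-128 to 127). The following sketch compares the masked conversion with an unmasked one; hexMasked, hexUnmasked and the sample bytes are made up for illustration and are not a real signature.

```javascript
// Same conversion as toHexString above: mask each signed byte to 0..255 first.
function hexMasked(byteArray) {
  return Array.from(byteArray, function(byte) {
    return ('0' + (byte & 0xFF).toString(16)).slice(-2)
  }).join('')
}

// Without the mask, negative bytes keep their minus sign and corrupt the output.
function hexUnmasked(byteArray) {
  return Array.from(byteArray, function(byte) {
    return ('0' + byte.toString(16)).slice(-2)
  }).join('')
}

const sampleBytes = [-1, 0, 15, -128, 127] // hypothetical sample values
console.log(hexMasked(sampleBytes))   // ff000f807f
console.log(hexUnmasked(sampleBytes)) // contains "-" characters, not valid hex
```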

There is also a doPost(e) function, so that the data can be processed when the API returns it:

function doPost(e) {
  const jsonData = JSON.parse(e.postData.contents)
  const pages = jsonData.response.summary.pages
  const ss = SpreadsheetApp.openById("1QBzDdGn1yaUxFJciLH_Ru-BbLHuBIZTUk2UnrUShGw0") 
  
  if (Object.keys(pages).length == 0) {
    ss.getSheetByName("Not Google Index").appendRow([jsonData.request.phrase])
  }
  else {
    ss.getSheetByName("Google Index").appendRow([jsonData.request.phrase])
  }  
}

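I then published this as a Web App with the following settings:

  • Execute as: me
  • Who has access: Anyone (not Anyone with a Google account)

Remember to copy the Web App URL when it is provided and paste it into the "callback": "script-web-app-exec-url" part of the payload. (Normally ScriptApp.getService().getUrl() could be used, but as per this issue, when the code is run from the script editor this method returns the /dev link rather than the /exec link, which will not work.)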
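
Because a /dev link in the callback fails silently (the API simply never reaches doPost), it may be worth checking the URL before sending the request. The sketch below is one way to do that. isExecUrl is a hypothetical helper, and the pattern assumes the usual https://script.google.com/.../exec shape of a deployed Web App URL.

```javascript
// Hypothetical guard: true only if the URL looks like a deployed /exec endpoint.
function isExecUrl(webAppUrl) {
  // Ignore any query string or fragment before checking the path
  const path = webAppUrl.split("?")[0].split("#")[0]
  return /^https:\/\/script\.google\.com\/.+\/exec$/.test(path)
}

// DEPLOYMENT_ID is a placeholder, not a real deployment
console.log(isExecUrl("https://script.google.com/macros/s/DEPLOYMENT_ID/exec")) // true
console.log(isExecUrl("https://script.google.com/macros/s/DEPLOYMENT_ID/dev"))  // false
```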

It can then be run simply like this:

function run() {
  const req = makeApiCall("v3.api.analyticsseo.com/serps/", "POST", "asdhfdhdfgdsfser.com")
  console.log(req.getContentText())
}

The request will run, and a response from the API containing the request object will be logged. Then, when the result is ready, the Authoritas API will call the script URL you provided in the callback parameter, which runs the doPost() function.

This is a convoluted workaround, but unfortunately web scraping is getting harder and harder these days.
