
How to retrieve the ultimate URL for a doi URL?

When I visit the doi URL, it redirects to the following URL.

https://linkinghub.elsevier.com/retrieve/pii/S1550413115002715

But this is not the final URL, which is https://www.sciencedirect.com/science/article/pii/S1550413115002715?via%3Dihub

$ wget --user-agent 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36' https://doi.org/10.1016/j.cmet.2015.06.004
$ grep Redirect j.cmet.2015.06.004.html |grep meta
<meta HTTP-EQUIV="REFRESH" content="2; url='/retrieve/articleSelectPrefsPerm?Redirect=https%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS1550413115002715%3Fvia%253Dihub&amp;key=f0d7d908599d0c4f0ee467d0e225836b1927eb91'"/>
$ wget -S -o /dev/stderr --user-agent 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36' https://doi.org/10.1016/j.cmet.2015.06.004 > /dev/null
--2019-08-08 06:01:13--  https://doi.org/10.1016/j.cmet.2015.06.004
Resolving doi.org (doi.org)... 104.26.9.237, 104.26.8.237, 2606:4700:20::681a:8ed, ...
Connecting to doi.org (doi.org)|104.26.9.237|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 302 
  Date: Thu, 08 Aug 2019 11:01:14 GMT
  Content-Type: text/html;charset=utf-8
  Content-Length: 209
  Connection: keep-alive
  Set-Cookie: __cfduid=d1dd9844bf9c103fcc56abf104a78957b1565262073; expires=Fri, 07-Aug-20 11:01:13 GMT; path=/; domain=.doi.org; HttpOnly
  Vary: Accept
  Location: https://linkinghub.elsevier.com/retrieve/pii/S1550413115002715
  Expires: Thu, 08 Aug 2019 11:27:57 GMT
  Link: <https://dul.usage.elsevier.com/doi/>; rel=dul
  Strict-Transport-Security: max-age=86400; includeSubDomains
  Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
  Server: cloudflare
  CF-RAY: 5030fdb9c8dde04d-DFW
Location: https://linkinghub.elsevier.com/retrieve/pii/S1550413115002715 [following]
--2019-08-08 06:01:14--  https://linkinghub.elsevier.com/retrieve/pii/S1550413115002715
Resolving linkinghub.elsevier.com (linkinghub.elsevier.com)... 18.204.111.22, 34.198.26.18
Connecting to linkinghub.elsevier.com (linkinghub.elsevier.com)|18.204.111.22|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 
  Date: Thu, 08 Aug 2019 11:01:14 GMT
  Content-Type: text/html;charset=UTF-8
  Content-Length: 8144
  Connection: keep-alive
  Set-Cookie: JSESSIONID=9EB99263F2DD8482804BE74C0DDBAE51; Path=/retrieve; Secure; HttpOnly
  Pragma: no-cache
  Cache-Control: no-cache, no-store, must-revalidate
  Expires: Thu, 01 Jan 1970 00:00:00 GMT
  Set-Cookie: visitorId=vOzKJBQjOR53unZLGF8y; Max-Age=2147483647; Expires=Tue, 26-Aug-2087 14:15:21 GMT; Path=/
  P3P: CP="NON DSP COR CUR ADM DEV TAI PSA PSD OUR IND UNI NAV STA PRE COM INT CNT",policyref="https://linkinghub.elsevier.com/retrieve/static/P3P/IHUB-p3p.xml"
  Content-Language: en-US
Length: 8144 (8.0K) [text/html]
Saving to: ‘j.cmet.2015.06.004’

     0K .......                                               100%  123M=0s

2019-08-08 06:01:14 (123 MB/s) - ‘j.cmet.2015.06.004’ saved [8144/8144]

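I tried the following puppeteer code to handle this automatically, but it failed. Does anybody know how to follow the redirects automatically to the final page?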

$ cat puptr2cntnt.js 
#!/usr/bin/env node
// vim: set noexpandtab tabstop=2:

const puppeteer = require('puppeteer');
const fs = require('fs');

const url = process.argv[2];

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    const content = await page.content();
    console.log(content);
    await browser.close();
})();
$ ./puptr2cntnt.js  https://doi.org/10.1016/j.cmet.2015.06.004
(node:73532) UnhandledPromiseRejectionWarning: Error: Execution context was destroyed, most likely because of a navigation.
    at rewriteError (/usr/local/lib/node_modules/puppeteer/lib/ExecutionContext.js:161:15)
    at processTicksAndRejections (internal/process/task_queues.js:89:5)
    at async ExecutionContext.evaluateHandle (/usr/local/lib/node_modules/puppeteer/lib/ExecutionContext.js:119:56)
    at async ExecutionContext.evaluate (/usr/local/lib/node_modules/puppeteer/lib/ExecutionContext.js:48:20)
    at async DOMWorld.content (/usr/local/lib/node_modules/puppeteer/lib/DOMWorld.js:185:12)
    at async Page.content (/usr/local/lib/node_modules/puppeteer/lib/Page.js:612:12)
    at async /Users/pengy/linux/bin/wrappercomposite/src/xplat/puptrxplat/src/puptr2cntnt/node/default/puptr2cntnt.js:13:18
  -- ASYNC --
    at ExecutionContext.<anonymous> (/usr/local/lib/node_modules/puppeteer/lib/helper.js:110:27)
    at ExecutionContext.evaluate (/usr/local/lib/node_modules/puppeteer/lib/ExecutionContext.js:48:31)
    at ExecutionContext.<anonymous> (/usr/local/lib/node_modules/puppeteer/lib/helper.js:111:23)
    at DOMWorld.evaluate (/usr/local/lib/node_modules/puppeteer/lib/DOMWorld.js:112:20)
    at processTicksAndRejections (internal/process/task_queues.js:89:5)
    at async DOMWorld.content (/usr/local/lib/node_modules/puppeteer/lib/DOMWorld.js:185:12)
  -- ASYNC --
    at Frame.<anonymous> (/usr/local/lib/node_modules/puppeteer/lib/helper.js:110:27)
    at Page.content (/usr/local/lib/node_modules/puppeteer/lib/Page.js:612:49)
    at Page.<anonymous> (/usr/local/lib/node_modules/puppeteer/lib/helper.js:111:23)
    at /Users/pengy/linux/bin/wrappercomposite/src/xplat/puptrxplat/src/puptr2cntnt/node/default/puptr2cntnt.js:13:29
    at processTicksAndRejections (internal/process/task_queues.js:89:5)
(node:73532) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:73532) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

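You have to wait for the navigation to finish. You can add a waitForNavigation call after the goto call. Note that the option name is case-sensitive: it is waitUntil, not waituntil.

await page.waitForNavigation({waitUntil: 'domcontentloaded'});

Or just pass {waitUntil: 'domcontentloaded'} as the second argument of the goto method.

await page.goto(url, {waitUntil: 'domcontentloaded'});

Full script: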

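const puppeteer = require('puppeteer');

// URL to open, taken from the first command-line argument
const url = process.argv[2];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // The option name is case-sensitive: waitUntil, not waituntil
  await page.goto(url, {waitUntil: 'domcontentloaded'});
  const content = await page.content();
  console.log(content);
  await browser.close();
})();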

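waitUntil: when to consider navigation succeeded, defaults to load. Given an array of event strings, navigation is considered to be successful after all events have been fired. Events can be either:

  • load - consider navigation to be finished when the load event is fired.
  • domcontentloaded - consider navigation to be finished when the DOMContentLoaded event is fired.
  • networkidle0 - consider navigation to be finished when there are no more than 0 network connections for at least 500 ms.
  • networkidle2 - consider navigation to be finished when there are no more than 2 network connections for at least 500 ms.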

Read more in the Puppeteer documentation.
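As an illustration of the array form described above, here is a minimal sketch. The helper name gotoAndReportUrl is hypothetical, and `page` is assumed to be a Puppeteer Page (any object with goto() and url() methods will do). Passing several events makes goto wait until all of them have fired.

```javascript
// Hypothetical helper: navigate, wait for several events, then report where we are.
// `page` is assumed to be a Puppeteer Page (anything exposing goto() and url()).
async function gotoAndReportUrl(page, url) {
  // With an array, navigation counts as finished only after every listed event fired.
  await page.goto(url, { waitUntil: ['load', 'networkidle0'] });
  // page.url() is the address of the document that is loaded right now.
  return page.url();
}

// Usage sketch (needs puppeteer installed, so it is left commented out):
// const puppeteer = require('puppeteer');
// (async () => {
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   console.log(await gotoAndReportUrl(page, process.argv[2]));
//   await browser.close();
// })();
```

Note that this may still report the intermediate page: the meta refresh in the question fires after 2 seconds, while networkidle0 only needs 500 ms of quiet, so goto can resolve before the next hop starts.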

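You will have to call page.waitForNavigation multiple times, because in this case the website redirects to a page that waits for some time before redirecting to another page. To automate this, you can use the following function: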

async function waitForMoreNavigation(page) {
  try {
    while (true) {
      await page.waitForNavigation({ timeout: 2000 });
    }
  } catch (err) {} // timeout is thrown, abort the progress
}

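The function keeps waiting for more navigations inside the loop until no further navigation event happens and the timeout is hit. Keep in mind that this will always wait at least two seconds before continuing. Depending on your task, you might want to change the value of timeout.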
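If the fixed two-second timeout or the catch-all error handling is a concern, a variant along these lines might help. It is only a sketch: the name waitForRedirectsToSettle and the options quietMs and maxHops are made up for illustration, and it assumes that Puppeteer's timeout error carries the name 'TimeoutError'.

```javascript
// Hypothetical variant of waitForMoreNavigation with tunable limits.
// `page` is assumed to be a Puppeteer Page (anything exposing waitForNavigation()).
async function waitForRedirectsToSettle(page, { quietMs = 2000, maxHops = 10 } = {}) {
  let hops = 0;
  while (hops < maxHops) {
    try {
      // Resolves when another navigation happens within quietMs milliseconds.
      await page.waitForNavigation({ timeout: quietMs });
      hops += 1;
    } catch (err) {
      // Assumption: a timeout is reported as an error named 'TimeoutError'.
      // It means no further redirect happened, so the page has settled.
      if (err && err.name === 'TimeoutError') break;
      // Anything else (for example a closed browser) is a real failure.
      throw err;
    }
  }
  // Number of navigations that were observed after the initial one.
  return hops;
}
```

The maxHops cap keeps a page that refreshes itself forever from blocking the script, and rethrowing other errors avoids hiding real failures behind the timeout.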

Code sample

Use the function after the page.goto call, like this:

await page.goto('https://linkinghub.elsevier.com/retrieve/pii/S1550413115002715');
await waitForMoreNavigation(page);
console.log(page.url());
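For comparison, the hop that wget does not follow is the meta refresh tag shown in the question's grep output, and it can also be read without a browser. The sketch below is not part of the answers above. The helper name metaRefreshTarget is hypothetical, the parsing is regex-based and only handles a double-quoted content attribute like the one in the question, and treating the Redirect query parameter as the final URL is an assumption about how this particular page was built at the time.

```javascript
// Hypothetical helper: extract the target of a <meta http-equiv="refresh"> tag.
// Returns an absolute URL string, or null when no such tag is found.
function metaRefreshTarget(html, baseUrl) {
  // Find a <meta> tag whose http-equiv attribute is "refresh" (case-insensitive).
  const tag = html.match(/<meta\b[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*>/i);
  if (!tag) return null;
  // The content attribute looks like: 2; url='/path?x=y'
  const content = tag[0].match(/content\s*=\s*"([^"]*)"/i);
  if (!content) return null;
  const target = content[1].match(/url\s*=\s*['"]?([^'"]+)['"]?/i);
  if (!target) return null;
  // Undo the HTML escaping of "&" and resolve relative to the page URL.
  return new URL(target[1].replace(/&amp;/g, '&'), baseUrl).href;
}

// The tag from the question, fetched from the linkinghub URL.
const sampleHtml = `<meta HTTP-EQUIV="REFRESH" content="2; url='/retrieve/articleSelectPrefsPerm?Redirect=https%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS1550413115002715%3Fvia%253Dihub&amp;key=f0d7d908599d0c4f0ee467d0e225836b1927eb91'"/>`;
const nextHop = metaRefreshTarget(
  sampleHtml,
  'https://linkinghub.elsevier.com/retrieve/pii/S1550413115002715'
);
// searchParams.get() percent-decodes the parameter value once.
console.log(new URL(nextHop).searchParams.get('Redirect'));
// prints https://www.sciencedirect.com/science/article/pii/S1550413115002715?via%3Dihub
```

A real HTML parser would be more robust than regular expressions here, and the page layout may have changed since the question was asked, so the browser-based approach above is the safer general solution.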

