
Google Scholar blocked me from using search_pubs

I am using PyCharm Community Edition 2020.3.2, scholarly version 1.0.2, and Tor version 1.0.0. I tried to scrape 700 articles to find their citation counts, but Google Scholar blocked me from using search_pubs (a scholarly function). However, another scholarly function, search_author, still works well. In the beginning, search_pubs worked properly. I tried this code:

from scholarly import scholarly
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')

After a few tries, it shows the error below.

Traceback (most recent call last):
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\IPython\core\interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-9-3bbcfb742cb5>", line 1, in <module>
    scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_scholarly.py", line 121, in search_pubs
    return self.__nav.search_publications(url)
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_navigator.py", line 256, in search_publications
    return _SearchScholarIterator(self, url)
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\publication_parser.py", line 53, in __init__
    self._load_url(url)
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\publication_parser.py", line 58, in _load_url
    self._soup = self._nav._get_soup(url)
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_navigator.py", line 200, in _get_soup
    html = self._get_page('https://scholar.google.com{0}'.format(url))
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_navigator.py", line 152, in _get_page
    raise Exception("Cannot fetch the page from Google Scholar.")
Exception: Cannot fetch the page from Google Scholar.
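By contrast, the search_author call mentioned above still returns results. A minimal sketch of such a call (the author name is only an illustration):

from scholarly import scholarly

# search_author still responds even though search_pubs is blocked;
# the author name here is just an illustration
search_query = scholarly.search_author('A Einstein')
print(next(search_query))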

Then I figured out that the reason is that I need to pass Google's CAPTCHA in order to keep fetching info from Google Scholar. Many people suggest using a proxy, since my IP was blocked by Google. I tried changing the proxy using FreeProxies():

from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
pg.FreeProxies()
scholarly.use_proxy(pg)
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')

That did not work, and PyCharm froze for a long time. Then I installed Tor (pip install Tor) and tried again:

from scholarly import scholarly, ProxyGenerator
pg = ProxyGenerator()
pg.Tor_External(tor_sock_port=9050, tor_control_port=9051, tor_password="scholarly_password")
scholarly.use_proxy(pg)
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')

That did not work either. Then I tried SingleProxy():

from scholarly import scholarly, ProxyGenerator
pg = ProxyGenerator()
pg.SingleProxy(https='socks5://127.0.0.1:9050', http='socks5://127.0.0.1:9050')
scholarly.use_proxy(pg)
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')

It also does not work. I have never tried Luminati, since I am not familiar with it. If anyone knows a solution, please help!
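One sanity check for the Tor and SingleProxy attempts above is to confirm that something is actually listening on 127.0.0.1:9050 before pointing scholarly at it. A minimal sketch, assuming requests with SOCKS support (pip install requests[socks]) is available:

import requests

# Route a request through the local SOCKS port the configs above assume is open
proxies = {
    "http": "socks5://127.0.0.1:9050",
    "https": "socks5://127.0.0.1:9050",
}
try:
    r = requests.get("https://check.torproject.org", proxies=proxies, timeout=10)
    print(r.status_code)
except requests.exceptions.ConnectionError as err:
    print("Nothing reachable on 127.0.0.1:9050:", err)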

As an alternative to a scholarly solution, you can try the Google Scholar Organic Results API from SerpApi.

It's a paid API with a free plan that handles bypassing blocks from Google and other search engines on its back end by solving CAPTCHAs and rotating proxies, so you don't have to.

Code and example in the online IDE:

import os, json
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl

params = {
    # os.getenv(): https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),                 # your Serpapi API key
    "engine": "google_scholar",                      # search engine
    "q": "blizzard",                                 # search query
    "hl": "en",                                      # language
    # "as_ylo": "2017",                              # from 2017
    # "as_yhi": "2021",                              # to 2021
    "start": "0"                                     # first page
}

search = GoogleSearch(params)         # where data extraction happens

organic_results_data = []

papers_is_present = True
while papers_is_present:
    results = search.get_dict()      # JSON -> Python dictionary

    for publication in results["organic_results"]:
        organic_results_data.append({
            "page_number": results.get("serpapi_pagination", {}).get("current"),
            "result_type": publication.get("type"),
            "title": publication.get("title"),
            "link": publication.get("link"),
            "result_id": publication.get("result_id"),
            "summary": publication.get("publication_info").get("summary"),
            "snippet": publication.get("snippet"),
        })

    # paginate to the next page if one is present
    if "next" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
    else:
        papers_is_present = False

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

Output, in this case up to the 100th page:

[
  {
    "page_number": 1,
    "result_type": null,
    "title": "Base catalyzed ring opening reactions of erythromycin A",
    "link": "https://www.sciencedirect.com/science/article/pii/S004040390074754X",
    "result_id": "RRGl5nJoIi4J",
    "summary": "ST Waddell, TA Blizzard - Tetrahedron letters, 1992 - Elsevier",
    "snippet": "While the direct opening of the lactone of erythromycin A by hydroxide to give the seco acid has so far proved elusive, two types of base catalyzed reactions which lead to rupture of the …"
  }, ... other results
  {
    "page_number": 100,
    "result_type": null,
    "title": "Síndrome de Johanson-Blizzard: importância do diagnóstico diferencial em pediatria",
    "link": "https://www.scielo.br/j/jped/a/Mc3X8DGcZSYQVqnL99kTtBH/abstract/?lang=pt",
    "result_id": "z_CmhVgEW2oJ",
    "summary": "MW Vieira, VLGS Lopes, H Teruya… - Jornal de …, 2002 - SciELO Brasil",
    "snippet": "… Description: we describe a Brazilian girl affected by Johanson-blizzard syndrome and review the literature. Comments: Johanson-Blizzard syndrome is an autosomal recessive …"
  }
]
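The pagination in the loop above works by splitting the query string out of serpapi_pagination["next"] and merging it back into the search parameters. A minimal sketch of that step (the URL here is illustrative):

from urllib.parse import urlsplit, parse_qsl

# Illustrative next-page URL of the shape serpapi_pagination["next"] has
next_url = "https://serpapi.com/search.json?engine=google_scholar&q=blizzard&start=10"

# Split off the query string and turn it into a dict of parameters,
# which the loop merges into search.params_dict for the next request
print(dict(parse_qsl(urlsplit(next_url).query)))
# {'engine': 'google_scholar', 'q': 'blizzard', 'start': '10'}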

Disclaimer: I work for SerpApi.
