
How to scrape full paper citation from Google Scholar search results (Python)?

I am trying to web scrape some useful data on academic papers from Google Scholar. So far I've had no problem getting the Title, Year of publication, Citation count, and "Cited by" URL.

I would now like to get the full citation, which includes the full authors' list, journal, pages (if any), etc. (see the snapshot below: the full APA citation that appears when clicking on the double-quote icon, circled in red).

I use ScraperAPI to handle proxies and CAPTCHAs (they offer 5000 requests for free).

Below is the code I have (I'm aware it's very heavy and not optimal at all, but it does the job for now):

import requests
import numpy as np
import pandas as pd
import re
from bs4 import BeautifulSoup

APIKEY = "????????????????????"
BASE_URL = f"http://api.scraperapi.com?api_key={APIKEY}&url="

def scraper_api(query, n_pages):
    """Uses scraperAPI to scrape Google Scholar for 
    papers' Title, Year, Citations, Cited By url returns a dataframe
    ---------------------------
    parameters:
    query: in the following format "automation+container+terminal"
    n_pages: number of pages to scrape
    ---------------------------
    returns:
    dataframe with the following columns: 
    "Title": title of each papers
    "Year": year of publication of each paper
    "Citations": citations count
    "cited_by_url": URL given by "cited by" button, reshaped to allow further
                    scraping
    ---------------------------"""

    pages = np.arange(0,(n_pages*10),10)
    papers = []
    for page in pages:
        print(f"Scraping page {int(page/10) + 1}")
        webpage = f"https://scholar.google.com/scholar?start={page}&q={query}&hl=fr&as_sdt=0,5"
        url = BASE_URL + webpage
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        
        for paper in soup.find_all("div", class_="gs_ri"):
            # get the title of each paper; fall back to the <span> when there is
            # no link (e.g. [CITATION] entries)
            title_tag = paper.find("h3", class_="gs_rt").find("a")
            if title_tag is None:
                title_tag = paper.find("h3", class_="gs_rt").find("span")
            title = title_tag.text
            # get the year of publication of each paper
            txt_year = paper.find("div", class_="gs_a").text
            year = re.findall('[0-9]{4}', txt_year)
            if year:
                year = list(map(int,year))[0]
            else:
                year = 0
            # get number of citations for each paper
            txt_cite = paper.find("div", class_="gs_fl").find_all("a")[2].string
            if txt_cite:
                citations = re.findall('[0-9]+', txt_cite)
                if citations:
                    citations = list(map(int,citations))[0]
                else:
                    citations = 0
            else:
                citations = 0
            # get the "cited_by" url for later scraping of citing papers
            # had to extract the "href" tag and then reshuffle the url as not
            # following same pattern for pagination
            urls = paper.find("div", class_="gs_fl").find_all(href=True)
            if urls:
                for url in urls:
                    if "cites" in url["href"]:
                        cited_url = url["href"]
                        index1 = cited_url.index("?")
                        url_slices = []
                        url_slices.append(cited_url[:index1+1])
                        url_slices.append(cited_url[index1+1:])

                        index_and = url_slices[1].index("&")
                        url_slices.append(url_slices[1][:index_and+1])
                        url_slices.append(url_slices[1][index_and+1:])
                        url_slices.append(url_slices[3][:23])
                        del url_slices[1]
                        new_url = "https://scholar.google.com.tw"+url_slices[0]+"start=00&hl=en&"+url_slices[3]+url_slices[1]+"scipsc="
            else:
                new_url = "no citations"
            # appends everything in a list of dictionaries    
            papers.append({'title': title, 'year': year, 'citations': citations, 'cited_by_url': new_url})
    # converts the list of dict to a pandas df
    papers_df = pd.DataFrame(papers)
    return papers_df
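
A minimal usage sketch of the function above (the query string, page count, and output file name are just placeholders):

df = scraper_api("automation+container+terminal", n_pages=2)  # scrape the first two result pages
print(df.head())
df.to_csv("scholar_results.csv", index=False)  # save the results for later use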

I would like to retrieve the full APA citation, but it seems it's not on the same HTML page and there is no associated href.

Any lead would help me a lot! Thanks :)

Open F12 (the browser developer tools), go to the Network tab, then click on the citation symbol. You should see a request appear. The URL of the request looks like:

"https://scholar.google.com/scholar?q=info:dgGDGDdf5:scholar.google.com/&output=cite&scirp=0&hl=fr" “https://scholar.google.com/scholar?q=info:dgGDGDdf5:scholar.google.com/&output=cite&scirp=0&hl=fr”

where "dgGDGDdf5" is the "data-cid" findable in each div-row of the main page.其中“dgGDGDdf5”是主页每个 div 行中可找到的“data-cid”。 Each "data-cid" correspond to an unique article.每个“data-cid”对应一篇独特的文章。

So, extract this "data-cid", make a sub-request to this URL, and then extract the APA (or other) citation format.

Implementation example:

import requests as rq
from bs4 import BeautifulSoup as bs
from urllib.parse import urlencode

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"}
def google_scholar(query, n_pages, since_year):
    data = []
    encoded_query = urlencode({"q": query})
    for start in range(0, n_pages*10, 10):
        url = "https://scholar.google.com/scholar?as_ylo=%s&%s&hl=fr&start=%s" % (since_year, encoded_query, start)
        resp = rq.get(url, headers=headers)
        soup = bs(resp.content, "lxml")
        print(soup)
        main_div = soup.find_all('div', {'id': 'gs_res_ccl_mid'})[0]
        divs = main_div.find_all('div', {'class': 'gs_r gs_or gs_scl'})
        for div in divs:
            data_cid = div['data-cid']
            print(data_cid)
            title = div.find_all('h3', {'class': 'gs_rt'})[0].text
            infos = div.find_all('div', {'class': 'gs_a'})[0].text
            
            # APA citation
            url_cite = "https://scholar.google.com/scholar?q=info:%s:scholar.google.com/&output=cite&scirp=0&hl=fr" % (data_cid)
            resp2 = rq.get(url_cite, headers=headers)
            
            # --> extract apa here from resp2
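            # A possible way to finish this step (a sketch, not part of the original
            # answer): the output=cite pop-up is a small HTML table in which each <tr>
            # holds a format name in a <th> (APA, MLA, ...) and the formatted citation
            # in the matching <td>.
            soup2 = bs(resp2.content, "lxml")
            apa = None
            for row in soup2.select("tr"):
                th = row.select_one("th")
                if th is not None and th.get_text(strip=True) == "APA":
                    apa = row.select_one("td").get_text(strip=True)
                    break
            data.append({"title": title, "infos": infos, "apa": apa})
    # return the collected rows once all pages have been processed
    return data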

The data-cid attribute is a unique publication ID. You need to parse all of them from the page and then make another request to the citation URL with the parsed data-cid, as ce.teuf stated.


The example below will work for ~10-20 requests; after that, Google will throw a CAPTCHA or you'll hit the rate limit. The ideal solution is to have a CAPTCHA solving service as well as proxies.


Example code:

from bs4 import BeautifulSoup
import requests, lxml

params = {
    "q": "automated container terminal",  # search query
    "hl": "en"                            # language
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.3538.102 Safari/537.36 Edge/18.19582",
    'accept-language': 'en-US,en',
    'accept': 'text/html,application/xhtml+xml,application/xml',
    "server": "scholar",
    "referer": f"https://scholar.google.com/scholar?hl={params['hl']}&q={params['q']}",
}


def cite_ids() -> list[str]:
    response = requests.get("https://scholar.google.com/scholar", params=params, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")

    # returns a list of publication ID's -> U8bh6Ca9uwQJ
    return [result["data-cid"] for result in soup.select(".gs_or")]

def scrape_cite_results() -> list[dict[str, str]]:
    cited_authors = []

    for cite_id in cite_ids():
        response = requests.get(f"https://scholar.google.com/scholar?output=cite&q=info:{cite_id}:scholar.google.com", headers=headers)
        soup = BeautifulSoup(response.text, "lxml")

        for result in soup.select("tr"):
            # each row of the citation table holds the format name in <th> and the citation in <td>
            th = result.select_one("th")
            if th is not None and "APA" in th.text:
                title = th.text
                authors = result.select_one("td").text

                cited_authors.append({"title": title, "cited_authors": authors})

    return cited_authors
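
If you do hit the CAPTCHA or rate limit mentioned above, one possible workaround (a sketch, not part of the original answer) is to route the same citation request through the ScraperAPI endpoint the question already uses; SCRAPERAPI_KEY and the example cite_id below are placeholders:

import requests

SCRAPERAPI_KEY = "YOUR_SCRAPERAPI_KEY"  # placeholder, replace with your own key

def fetch_via_scraperapi(target_url: str) -> requests.Response:
    # ScraperAPI forwards the request through its own proxies (same base URL
    # pattern as in the question), so Google sees a different IP for each call.
    return requests.get("http://api.scraperapi.com",
                        params={"api_key": SCRAPERAPI_KEY, "url": target_url})

# Example: fetch the citation pop-up for one publication ID through the proxy.
cite_id = "U8bh6Ca9uwQJ"  # a data-cid parsed from the results page
response = fetch_via_scraperapi(
    f"https://scholar.google.com/scholar?output=cite&q=info:{cite_id}:scholar.google.com")
print(response.status_code)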

Alternatively, you can achieve this using the Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in this scenario is that you don't have to tinker with selectors to find the right ones, or figure out how to bypass blocks from Google when you send a bunch of requests and hit an IP rate limit or get a CAPTCHA.

Code to integrate:

import os, json
from serpapi import GoogleSearch


def organic_results() -> list[str]:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar",
        "q": "automated container terminal",  # search query
        "hl": "en"                            # language
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    return [result["result_id"] for result in results["organic_results"]]


def cite_results() -> list[dict[str, str]]:

    citation_results = []

    for citation_id in organic_results():
        params = {
            "api_key": os.getenv("API_KEY"),
            "engine": "google_scholar_cite",
            "q": citation_id
        }

        search = GoogleSearch(params)
        results = search.get_dict()

        for result in results["citations"]:
            if "APA" in result["title"]:
                institution = result["title"]
                authors = result["snippet"]

                citation_results.append({
                    "institution": institution,
                    "authors": authors
                })

    return citation_results

print(json.dumps(cite_results(), indent=2))

'''
[
  {
    "institution": "APA",
    "authors": "Vis, I. F., & Harika, I. (2004). Comparison of vehicle types at an automated container terminal. OR Spectrum, 26(1), 117-143."
  },
  {
    "institution": "APA",
    "authors": "Vis, I. F., De Koster, R., Roodbergen, K. J., & Peeters, L. W. (2001). Determination of the number of automated guided vehicles required at a semi-automated container terminal. Journal of the Operational research Society, 52(4), 409-417."
  },
  {
    "institution": "APA",
    "authors": "Zhen, L., Lee, L. H., Chew, E. P., Chang, D. F., & Xu, Z. X. (2011). A comparative study on two types of automated container terminal systems. IEEE Transactions on Automation Science and Engineering, 9(1), 56-69."
  },
  {
    "institution": "APA",
    "authors": "Liu, C. I., Jula, H., & Ioannou, P. A. (2002). Design, simulation, and evaluation of automated container terminals. IEEE Transactions on intelligent transportation systems, 3(1), 12-26."
  },
  {
    "institution": "APA",
    "authors": "Park, T., Choe, R., Kim, Y. H., & Ryu, K. R. (2011). Dynamic adjustment of container stacking policy in an automated container terminal. International Journal of Production Economics, 133(1), 385-392."
  },
  {
    "institution": "APA",
    "authors": "Bae, H. Y., Choe, R., Park, T., & Ryu, K. R. (2011). Comparison of operations of AGVs and ALVs in an automated container terminal. Journal of Intelligent Manufacturing, 22(3), 413-426."
  },
  {
    "institution": "APA",
    "authors": "Luo, J., Wu, Y., & Mendes, A. B. (2016). Modelling of integrated vehicle scheduling and container storage problems in unloading process at an automated container terminal. Computers & Industrial Engineering, 94, 32-44."
  },
  {
    "institution": "APA",
    "authors": "Zhu, M., Fan, X., Cheng, H., & He, Q. (2010). Modeling and Simulation of Automated Container Terminal Operation. J. Comput., 5(6), 951-957."
  },
  {
    "institution": "APA",
    "authors": "Luo, J., & Wu, Y. (2020). Scheduling of container-handling equipment during the loading process at an automated container terminal. Computers & Industrial Engineering, 149, 106848."
  },
  {
    "institution": "APA",
    "authors": "Yang, X., Mi, W., Li, X., An, G., Zhao, N., & Mi, C. (2015). A simulation study on the design of a novel automated container terminal. IEEE Transactions on Intelligent Transportation Systems, 16(5), 2889-2899."
  }
]
'''

Disclaimer: I work for SerpApi.
