Web Scraping in Google Scholar with beautifulsoup and selenium in python

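I am trying to scrape Google Scholar profiles. I need the profiles that match specific criteria I define. I am using BeautifulSoup and Selenium in Python. For example, I need the professors at one university who work on certain subjects that I specify. Do you have any ideas?

My approach is slow because it has to visit every individual profile page to check my criteria. If you know a faster method, please tell me.

If there is a faster and better way to do this job, please say so.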

You can add the subjects you need to the URL like this:

https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:computer_vision+label:machine_learning

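Here I am searching for authors in two fields: computer vision and machine learning.

You can restrict the results to one university by adding its name in double quotes, "<univ. name>", after the label, for example: label:computer_vision "Michigan State University". This returns only authors from Michigan State University, matched by workplace or by email (e.g. msu.edu), whose interests directly include computer vision.

Note: authors sometimes write a short abbreviation of the university name, e.g. Michigan University -> U.Michigan, as Honglak Lee does.

To cover this exception as well, you can use the pipe | symbol, which I believe stands for OR. The search query then becomes: label:computer_vision "Michigan State University"|"U.Michigan", which translates to Michigan State University OR U.Michigan.

The only place I found that explains how to write this kind of search query is the Google Scholar search tips, under "How do I search by title?", but that section says nothing about searching for authors who work at a particular university. The results shown here were obtained by trial and error, and the query seems to work.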
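The query syntax described above can be assembled programmatically. The following is a minimal sketch using only the standard library; the helper names `build_author_query` and `build_author_search_url` are hypothetical, lowercasing the labels is my own assumption, and the pipe-as-OR behaviour is the same trial-and-error assumption as in the text.

```python
from urllib.parse import urlencode


def build_author_query(labels, affiliations=()):
    """Build the value of the `mauthors` parameter (hypothetical helper)."""
    # Labels use underscores instead of spaces, e.g. label:computer_vision
    label_part = " ".join(
        f"label:{label.strip().lower().replace(' ', '_')}" for label in labels
    )
    # Each affiliation is double-quoted; alternatives are joined with a pipe,
    # which is assumed (not documented) to act as OR.
    affiliation_part = "|".join(f'"{name}"' for name in affiliations)
    return " ".join(part for part in (label_part, affiliation_part) if part)


def build_author_search_url(labels, affiliations=(), language="en"):
    """Build a full author-search URL from labels and affiliations."""
    params = {
        "hl": language,
        "view_op": "search_authors",
        "mauthors": build_author_query(labels, affiliations),
    }
    return "https://scholar.google.com/citations?" + urlencode(params)


query = build_author_query(
    ["computer vision"], ["Michigan State University", "U.Michigan"]
)
print(query)
print(build_author_search_url(["computer vision", "machine learning"]))
```

Note that `urlencode` percent-encodes the colon (`label%3A...`) and turns spaces into `+`, so the generated URL looks slightly different from a hand-typed one while carrying the same parameter values.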


Code and example in the online IDE:

from parsel import Selector
import requests, json

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "mauthors": 'label:computer_vision "Michigan State University"|"U.Michigan"', # search query 
    "hl": "en",                  # language
    "view_op": "search_authors"  # author results
}

# https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
# Make sure you're using your user-agent: https://www.whatismybrowser.com/detect/what-is-my-user-agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

profiles = []

for profile in selector.css(".gs_ai_chpr"):
    profile_name = profile.css(".gs_ai_name a::text").get()
    profile_link = f'https://scholar.google.com{profile.css(".gs_ai_name a::attr(href)").get()}'
    profile_affiliation = profile.css('.gs_hlt::text').get()  # selects only university name without additional affiliation, e.g: Assistant Professor
    profile_email = profile.css(".gs_ai_eml::text").get()
    profile_interests = profile.css(".gs_ai_one_int::text").getall()

    profiles.append({
        "profile_name": profile_name,
        "profile_link": profile_link,
        "profile_affiliations": profile_affiliation,
        "profile_email": profile_email,
        "profile_interests": profile_interests
    })

print(json.dumps(profiles, indent=2))


# part of the output:
'''
[
  {
    "profile_name": "Anil K. Jain",
    "profile_link": "https://scholar.google.com/citations?hl=en&user=g-_ZXGsAAAAJ",
    "profile_affiliations": "Michigan State University",
    "profile_email": "Verified email at cse.msu.edu",
    "profile_interests": [
      "Biometrics",
      "Computer vision",
      "Pattern recognition",
      "Machine learning",
      "Image processing"
    ]
  } # ...other profiles
]
'''

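Note: I am using the Parsel library rather than BeautifulSoup, the most popular parsing library. It is very similar, supports XPath, and has its own CSS pseudo-element support such as ::text and ::attr(<attribute>).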
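Because the author-search results already include each profile's interests and verified email domain, extra criteria can be checked locally without opening every profile page. Below is a rough sketch of such a filter over a `profiles` list shaped like the one built in the code above; the function name `matches_criteria` and the second sample record are hypothetical and exist only for illustration.

```python
def matches_criteria(profile, required_interests=(), email_domain=None):
    """Return True if a scraped profile has all required interests
    and (optionally) a verified email under the given domain."""
    interests = {i.strip().lower() for i in profile.get("profile_interests") or []}
    wanted = {i.strip().lower() for i in required_interests}
    if not wanted <= interests:
        return False
    if email_domain is not None:
        email = (profile.get("profile_email") or "").lower()
        # "Verified email at cse.msu.edu" -> "cse.msu.edu"
        domain = email.rsplit(" ", 1)[-1]
        target = email_domain.lower()
        if domain != target and not domain.endswith("." + target):
            return False
    return True


profiles = [
    {   # taken from the sample output above
        "profile_name": "Anil K. Jain",
        "profile_email": "Verified email at cse.msu.edu",
        "profile_interests": ["Biometrics", "Computer vision", "Pattern recognition"],
    },
    {   # hypothetical record, made up for this example
        "profile_name": "Example Author",
        "profile_email": "Verified email at example.org",
        "profile_interests": ["Computer vision"],
    },
]

selected = [
    p for p in profiles
    if matches_criteria(p, ["computer vision", "biometrics"], email_domain="msu.edu")
]
print([p["profile_name"] for p in selected])
```

Interest matching here is exact after lowercasing, so a profile that lists "CV" instead of "Computer vision" would be missed; looser matching would be a separate design choice.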


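Alternatively, you can achieve the same thing with the Google Scholar Profiles API from SerpApi. It is a paid API with a free plan.

The difference in this case is that you don't have to figure out the scraping part yourself, such as picking the right selectors/XPath to extract the data, bypassing blocks from search engines, or scaling the number of requests.

Example code to integrate: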

from serpapi import GoogleSearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),     # SerpApi API key
    "engine": "google_scholar_profiles", # SerpApi profiles parsing engine
    "hl": "en",                          # language
    "mauthors": 'label:computer_vision "Michigan State University"|"U.Michigan"' # search query
}

search = GoogleSearch(params)
results = search.get_dict()

for profile in results["profiles"]:
    print(json.dumps(profile, indent=2))

# part of the output:
'''
{
  "name": "Anil K. Jain",
  "link": "https://scholar.google.com/citations?hl=en&user=g-_ZXGsAAAAJ",
  "serpapi_link": "https://serpapi.com/search.json?author_id=g-_ZXGsAAAAJ&engine=google_scholar_author&hl=en",
  "author_id": "g-_ZXGsAAAAJ",
  "affiliations": "Michigan State University",
  "email": "Verified email at cse.msu.edu",
  "cited_by": 233876,
  "interests": [
    {
      "title": "Biometrics",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Abiometrics",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:biometrics"
    },
    {
      "title": "Computer vision",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Acomputer_vision",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:computer_vision"
    },
    {
      "title": "Pattern recognition",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Apattern_recognition",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:pattern_recognition"
    },
    {
      "title": "Machine learning",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Amachine_learning",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:machine_learning"
    },
    {
      "title": "Image processing",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Aimage_processing",
      "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:image_processing"
    }
  ],
  "thumbnail": "https://scholar.googleusercontent.com/citations?view_op=small_photo&user=g-_ZXGsAAAAJ&citpid=1"
} ... other profiles

'''
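In the SerpApi output shown above, each interest is a dictionary rather than a plain string. If only the interest titles are needed, the records can be flattened as in the sketch below. It assumes the response shape matches the sample output; `flatten_profile` is a hypothetical helper, and the sample record is trimmed from that output.

```python
def flatten_profile(profile):
    """Reduce a SerpApi-style profile dict to plain fields,
    keeping only the title of each interest."""
    return {
        "name": profile.get("name"),
        "link": profile.get("link"),
        "affiliations": profile.get("affiliations"),
        "email": profile.get("email"),
        "cited_by": profile.get("cited_by"),
        "interests": [
            interest["title"]
            for interest in profile.get("interests", [])
            if "title" in interest
        ],
    }


# Trimmed copy of the sample record from the output above
sample = {
    "name": "Anil K. Jain",
    "link": "https://scholar.google.com/citations?hl=en&user=g-_ZXGsAAAAJ",
    "affiliations": "Michigan State University",
    "email": "Verified email at cse.msu.edu",
    "cited_by": 233876,
    "interests": [
        {"title": "Biometrics"},
        {"title": "Computer vision"},
    ],
}

flat = flatten_profile(sample)
print(flat["name"], flat["interests"])
```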

Disclaimer, I work for SerpApi.
