Web scraping Google Scholar profiles with BeautifulSoup and Selenium in Python
I am trying to scrape Google Scholar profiles. I need profiles that match specific criteria I define, and I'm using BeautifulSoup and Selenium in Python. For example, I need professors at a given university who work on certain subjects I specify. Any ideas?
My current approach is slow: it has to visit every individual profile page to check my criteria. If you know of a faster way, please share it.
If a faster and better way to do this exists, please describe it.
You can add the topics you need directly in the URL. For example, to look for authors in two fields, computer vision and machine learning, you can chain labels: label:computer_vision label:machine_learning.
You can restrict results to a university by adding its name in double quotes after the label, e.g.: label:computer_vision "Michigan State University". This way you will only get authors from Michigan State University, identified by their workplace or email (e.g. msu.edu), whose interests are directly related to computer vision.
Note: sometimes authors write a short university abbreviation, e.g. Michigan University -> U.Michigan, as Honglak Lee does. To include this variant as well, you can use the pipe | symbol, which stands for OR, I believe. The search query then becomes: label:computer_vision "Michigan State University"|"U.Michigan", which reads as Michigan State University OR U.Michigan.
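To see what such a query looks like as an actual request URL, here is a small sketch using only the standard library; urlencode takes care of percent-escaping the colons, quotes, and the pipe character:

```python
from urllib.parse import urlencode

# The mauthors query discussed above: a topic label plus two quoted
# university spellings joined by | (OR).
query = 'label:computer_vision "Michigan State University"|"U.Michigan"'

# view_op=search_authors switches Google Scholar to author results;
# urlencode percent-escapes the special characters in the query.
url = "https://scholar.google.com/citations?" + urlencode({
    "view_op": "search_authors",
    "mauthors": query,
    "hl": "en",
})
print(url)
```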
The only place I found where you can learn how to build such search queries is the Google Scholar search tips page, under "How do I search by title?", but it says nothing about searching for authors who work at a particular university. The query shown here was worked out by trial and error, and it appears to work.
from parsel import Selector
import requests, json

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "mauthors": 'label:computer_vision "Michigan State University"|"U.Michigan"',  # search query
    "hl": "en",                      # language
    "view_op": "search_authors"      # author results
}

# https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
# Make sure you're using your own user-agent: https://www.whatismybrowser.com/detect/what-is-my-user-agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

profiles = []

for profile in selector.css(".gs_ai_chpr"):
    profile_name = profile.css(".gs_ai_name a::text").get()
    profile_link = f'https://scholar.google.com{profile.css(".gs_ai_name a::attr(href)").get()}'
    profile_affiliation = profile.css(".gs_hlt::text").get()  # university name only, without the extra affiliation text, e.g. Assistant Professor
    profile_email = profile.css(".gs_ai_eml::text").get()
    profile_interests = profile.css(".gs_ai_one_int::text").getall()

    profiles.append({
        "profile_name": profile_name,
        "profile_link": profile_link,
        "profile_affiliations": profile_affiliation,
        "profile_email": profile_email,
        "profile_interests": profile_interests
    })

print(json.dumps(profiles, indent=2))
# part of the output:
'''
[
{
    "profile_name": "Anil K. Jain",
    "profile_link": "https://scholar.google.com/citations?hl=en&user=g-_ZXGsAAAAJ",
    "profile_affiliations": "Michigan State University",
    "profile_email": "Verified email at cse.msu.edu",
    "profile_interests": [
"Biometrics",
"Computer vision",
"Pattern recognition",
"Machine learning",
"Image processing"
]
  } # ...other profiles
]
'''
Note: I'm using the Parsel library rather than the more popular parsing library BeautifulSoup, but it is very similar, supports XPath, and has its own CSS pseudo-element support such as ::text or ::attr(<attribute>).
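For comparison, here is roughly how the same extraction would look with BeautifulSoup, run against a minimal hand-written snippet of the profile markup (the class names are taken from the code above; the HTML itself is a made-up stand-in for a real results page):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a profile card, using the same class names
# the Parsel code selects on.
html = """
<div class="gs_ai_chpr">
  <h3 class="gs_ai_name"><a href="/citations?user=abc123">Jane Doe</a></h3>
  <div class="gs_ai_eml">Verified email at example.edu</div>
  <a class="gs_ai_one_int">Computer vision</a>
  <a class="gs_ai_one_int">Machine learning</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for profile in soup.select(".gs_ai_chpr"):
    name_link = profile.select_one(".gs_ai_name a")
    name = name_link.get_text()        # Parsel equivalent: ...::text
    href = name_link["href"]           # Parsel equivalent: ...::attr(href)
    email = profile.select_one(".gs_ai_eml").get_text()
    interests = [a.get_text() for a in profile.select(".gs_ai_one_int")]
    print(name, href, email, interests)
```

The main practical difference is that Parsel returns text and attributes straight from the CSS expression, while BeautifulSoup returns tag objects you then call get_text() or index on.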
Alternatively, you can achieve the same thing with the Google Scholar Profiles API from SerpApi. It's a paid API with a free plan.
The difference in this case is that you don't have to figure out the scraping part, such as picking the right selectors/XPath to extract the data, how to bypass blocks from the search engine, or how to scale the number of requests.
Example code to integrate:
from serpapi import GoogleSearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),      # SerpApi API key
    "engine": "google_scholar_profiles",  # SerpApi profiles parsing engine
    "hl": "en",                           # language
    "mauthors": 'label:computer_vision "Michigan State University"|"U.Michigan"'  # search query
}

search = GoogleSearch(params)
results = search.get_dict()

for profile in results["profiles"]:
    print(json.dumps(profile, indent=2))
# part of the output:
'''
{
"name": "Anil K. Jain",
"link": "https://scholar.google.com/citations?hl=en&user=g-_ZXGsAAAAJ",
"serpapi_link": "https://serpapi.com/search.json?author_id=g-_ZXGsAAAAJ&engine=google_scholar_author&hl=en",
"author_id": "g-_ZXGsAAAAJ",
"affiliations": "Michigan State University",
"email": "Verified email at cse.msu.edu",
"cited_by": 233876,
"interests": [
{
"title": "Biometrics",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Abiometrics",
"link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:biometrics"
},
{
"title": "Computer vision",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Acomputer_vision",
"link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:computer_vision"
},
{
"title": "Pattern recognition",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Apattern_recognition",
"link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:pattern_recognition"
},
{
"title": "Machine learning",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Amachine_learning",
"link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:machine_learning"
},
{
"title": "Image processing",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Aimage_processing",
"link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:image_processing"
}
],
"thumbnail": "https://scholar.googleusercontent.com/citations?view_op=small_photo&user=g-_ZXGsAAAAJ&citpid=1"
} ... other profiles
'''
Disclaimer: I work for SerpApi.