Web scraping Google Scholar profiles with BeautifulSoup and Selenium in Python
I am trying to scrape Google Scholar profiles. I need profiles that match specific criteria I define, and I'm using BeautifulSoup and Selenium in Python. For example, I need professors at a given university who work on certain subjects I specify. Any ideas?
My current approach is slow: it has to visit every individual profile page to check my criteria. If you know of a faster way, please share it.
If a faster and better way to do this exists, please describe it.
You can add the topics you need directly in the URL. For example, to look for authors in two fields, computer vision and machine learning, you can chain labels: label:computer_vision label:machine_learning.
You can restrict results to a university by adding its name in double quotes after the label, e.g.: label:computer_vision "Michigan State University". This way you will only get authors from Michigan State University, identified by their workplace or email (e.g. msu.edu), whose interests are directly related to computer vision.
Note: sometimes authors write a short university abbreviation, e.g. Michigan University -> U.Michigan, as Honglak Lee does. To include this variant as well, you can use the pipe | symbol, which stands for OR, I believe. The search query then becomes: label:computer_vision "Michigan State University"|"U.Michigan", which reads as Michigan State University OR U.Michigan.
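To see what such a query looks like as an actual request URL, here is a small sketch using only the standard library; urlencode takes care of percent-escaping the colons, quotes, and the pipe character:

```python
from urllib.parse import urlencode

# The mauthors query discussed above: a topic label plus two quoted
# university spellings joined by | (OR).
query = 'label:computer_vision "Michigan State University"|"U.Michigan"'

# view_op=search_authors switches Google Scholar to author results;
# urlencode percent-escapes the special characters in the query.
url = "https://scholar.google.com/citations?" + urlencode({
    "view_op": "search_authors",
    "mauthors": query,
    "hl": "en",
})
print(url)
```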
The only place I found where you can learn how to build such search queries is the Google Scholar search tips page, under "How do I search by title?", but it says nothing about searching for authors who work at a particular university. The query shown here was worked out by trial and error, and it appears to work.
from parsel import Selector
import requests, json

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "mauthors": 'label:computer_vision "Michigan State University"|"U.Michigan"',  # search query
    "hl": "en",                      # language
    "view_op": "search_authors"      # author results
}

# https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
# Make sure you're using your own user-agent: https://www.whatismybrowser.com/detect/what-is-my-user-agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

profiles = []

for profile in selector.css(".gs_ai_chpr"):
    profile_name = profile.css(".gs_ai_name a::text").get()
    profile_link = f'https://scholar.google.com{profile.css(".gs_ai_name a::attr(href)").get()}'
    profile_affiliation = profile.css(".gs_hlt::text").get()  # university name only, without the extra affiliation text, e.g. Assistant Professor
    profile_email = profile.css(".gs_ai_eml::text").get()
    profile_interests = profile.css(".gs_ai_one_int::text").getall()

    profiles.append({
        "profile_name": profile_name,
        "profile_link": profile_link,
        "profile_affiliations": profile_affiliation,
        "profile_email": profile_email,
        "profile_interests": profile_interests
    })

print(json.dumps(profiles, indent=2))
# part of the output:
'''
[
{
    "profile_name": "Anil K. Jain",
    "profile_link": "https://scholar.google.com/citations?hl=en&user=g-_ZXGsAAAAJ",
    "profile_affiliations": "Michigan State University",
    "profile_email": "Verified email at cse.msu.edu",
    "profile_interests": [
"Biometrics",
"Computer vision",
"Pattern recognition",
"Machine learning",
"Image processing"
]
  } # ...other profiles
]
'''
Note: I'm using the Parsel library rather than the more popular parsing library BeautifulSoup, but it is very similar, supports XPath, and has its own CSS pseudo-element support such as ::text or ::attr(<attribute>).
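For comparison, here is roughly how the same extraction would look with BeautifulSoup, run against a minimal hand-written snippet of the profile markup (the class names are taken from the code above; the HTML itself is a made-up stand-in for a real results page):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a profile card, using the same class names
# the Parsel code selects on.
html = """
<div class="gs_ai_chpr">
  <h3 class="gs_ai_name"><a href="/citations?user=abc123">Jane Doe</a></h3>
  <div class="gs_ai_eml">Verified email at example.edu</div>
  <a class="gs_ai_one_int">Computer vision</a>
  <a class="gs_ai_one_int">Machine learning</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for profile in soup.select(".gs_ai_chpr"):
    name_link = profile.select_one(".gs_ai_name a")
    name = name_link.get_text()        # Parsel equivalent: ...::text
    href = name_link["href"]           # Parsel equivalent: ...::attr(href)
    email = profile.select_one(".gs_ai_eml").get_text()
    interests = [a.get_text() for a in profile.select(".gs_ai_one_int")]
    print(name, href, email, interests)
```

The main practical difference is that Parsel returns text and attributes straight from the CSS expression, while BeautifulSoup returns tag objects you then call get_text() or index on.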
Alternatively, you can achieve the same thing with the Google Scholar Profiles API from SerpApi. It's a paid API with a free plan.
The difference in this case is that you don't have to figure out the scraping part, such as picking the right selectors/XPath to extract the data, how to bypass blocks from the search engine, or how to scale the number of requests.
Example code to integrate:
from serpapi import GoogleSearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),      # SerpApi API key
    "engine": "google_scholar_profiles",  # SerpApi profiles parsing engine
    "hl": "en",                           # language
    "mauthors": 'label:computer_vision "Michigan State University"|"U.Michigan"'  # search query
}

search = GoogleSearch(params)
results = search.get_dict()

for profile in results["profiles"]:
    print(json.dumps(profile, indent=2))
# part of the output:
'''
{
"name": "Anil K. Jain",
"link": "https://scholar.google.com/citations?hl=en&user=g-_ZXGsAAAAJ",
"serpapi_link": "https://serpapi.com/search.json?author_id=g-_ZXGsAAAAJ&engine=google_scholar_author&hl=en",
"author_id": "g-_ZXGsAAAAJ",
"affiliations": "Michigan State University",
"email": "Verified email at cse.msu.edu",
"cited_by": 233876,
"interests": [
{
"title": "Biometrics",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Abiometrics",
"link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:biometrics"
},
{
"title": "Computer vision",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Acomputer_vision",
"link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:computer_vision"
},
{
"title": "Pattern recognition",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Apattern_recognition",
"link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:pattern_recognition"
},
{
"title": "Machine learning",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Amachine_learning",
"link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:machine_learning"
},
{
"title": "Image processing",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Aimage_processing",
"link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:image_processing"
}
],
"thumbnail": "https://scholar.googleusercontent.com/citations?view_op=small_photo&user=g-_ZXGsAAAAJ&citpid=1"
} ... other profiles
'''
Disclaimer: I work for SerpApi.