Is there a way to scrape Google Scholar based on keywords using Python?
I am new to web scraping and was wondering if there was a way where the end result would be the title, abstract, year, publisher, and authors of the papers that come up when I scrape Google Scholar for keywords. I am not really sure where to go from here. I assume I need to keep a list of all the attributes I want, but how do I search for them when web scraping?
from bs4 import BeautifulSoup
import requests, lxml, os, json
import pandas as pd

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "Mental Health in Women",
    "hl": "en",
}

html = requests.get('https://scholar.google.com/scholar', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')
Your code is only getting the soup object. You need to use the gs_ri selector, which wraps all of a result's contents (title, link, snippet, etc.). To find the desired elements, you can use the select() method. This method accepts a CSS selector to search for and returns a list of all matched HTML elements.
To iterate over all results on the page, you can use a for loop over the list of matched elements that the select() method returned.
To find the title, link, and so on, you can use the select_one() method. This method is very similar to select(), but it returns only the first matched HTML element. To extract text from an element, use the .text attribute, and use get('href') or ['href'] to extract an attribute if you want to get the link.
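The select()/select_one() usage above can be sketched on toy markup (the class names mimic Google Scholar's, but this HTML is illustrative only, not the site's real markup):

```python
from bs4 import BeautifulSoup

# Toy HTML imitating the ".gs_ri" result containers described above.
html = """
<div class="gs_ri"><h3 class="gs_rt"><a href="https://example.com/1">Paper one</a></h3></div>
<div class="gs_ri"><h3 class="gs_rt"><a href="https://example.com/2">Paper two</a></h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

results = soup.select(".gs_ri")            # select() returns a list of all matches
print(len(results))                        # 2
print(results[0].select_one("a").text)     # Paper one  (.text extracts the text)
print(results[0].select_one("a")["href"])  # https://example.com/1  (['href'] extracts the attribute)
```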
To extract the authors, publisher, year, and abstract, you can use a regular expression. This is very convenient for parsing data that follows a given pattern.
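To see what those patterns do, here is a standalone sketch run against a hypothetical ".gs_a" byline (the line under a result's title); the real byline format can vary between results:

```python
import re

# Hypothetical byline in the typical "authors - venue, year - publisher" shape.
byline = "U Niaz, S Hassan - Pak J Med Sci, 2006 - ncbi.nlm.nih.gov"

authors = re.search(r"^(.*?)-", byline).group(1).strip()   # everything before the first "-"
year = re.search(r"\d+", byline)[0]                        # first run of digits
publisher = re.search(r"\d+\s?-\s?(.*)", byline).group(1)  # what follows "year -"

print(authors)    # U Niaz, S Hassan
print(year)       # 2006
print(publisher)  # ncbi.nlm.nih.gov
```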
Below is a modified snippet of your code:
import re  # needed for the regular expressions below

for result in soup.select(".gs_ri"):
    title = 'Title: ' + result.select_one(".gs_rt a").text
    link = 'Link: ' + result.select_one("a")["href"]
    # https://regex101.com/r/PdMQU6/1
    authors = 'Authors: ' + re.search(r'^(.*?)-', result.select_one(".gs_a").text).group(1)
    # https://regex101.com/r/JoQigB/1
    publisher = 'Publisher: ' + re.search(r'\d+\s?-\s?(.*)', result.select_one(".gs_a").text).group(1)
    # https://regex101.com/r/E6KGbS/1
    year = 'Year: ' + re.search(r'\d+', result.select_one(".gs_a").text)[0]
    abstract = 'Abstract: ' + result.select_one(".gs_rs").text
    print(title, link, authors, publisher, year, abstract, sep="\n", end="\n\n")
Also, make sure you're passing a user-agent request header to act like a "real" user visit. The default requests user-agent is python-requests, and websites understand that such a request is most likely coming from a script. Check what your user-agent is.
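As a quick illustration, this is the kind of header dict meant above; the exact default version string requests sends will depend on your installed version:

```python
# requests' default User-Agent looks like "python-requests/2.x.y", which many
# sites flag as a bot. Passing a browser-like header instead avoids that:
browser_headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/100.0.4896.88 Safari/537.36"
    )
}
# requests.get(url, headers=browser_headers) then sends this header with the request.
print(browser_headers["User-Agent"].startswith("Mozilla/5.0"))  # True
```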
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml, re

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "Mental Health in Women",
    "hl": "en",  # language
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
for result in soup.select(".gs_ri"):
    title = 'Title: ' + result.select_one(".gs_rt a").text
    link = 'Link: ' + result.select_one("a")["href"]
    # https://regex101.com/r/PdMQU6/1
    authors = 'Authors: ' + re.search(r'^(.*?)-', result.select_one(".gs_a").text).group(1)
    # https://regex101.com/r/JoQigB/1
    publisher = 'Publisher: ' + re.search(r'\d+\s?-\s?(.*)', result.select_one(".gs_a").text).group(1)
    # https://regex101.com/r/E6KGbS/1
    year = 'Year: ' + re.search(r'\d+', result.select_one(".gs_a").text)[0]
    abstract = 'Abstract: ' + result.select_one(".gs_rs").text
    print(title, link, authors, publisher, year, abstract, sep="\n", end="\n\n")
Output:
Title: Culture and mental health of women in South-East Asia
Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1525125/
Authors: U Niaz, S Hassan
Publisher: ncbi.nlm.nih.gov
Year: 2006
Abstract: … on mental health of South Asian women. Marked gender discrimination in South Asia has led
to second class status of women in … Women's lack of empowerment and both financial and …
Title: Violence against women and mental health
Link: https://www.sciencedirect.com/science/article/pii/S2215036616302619
Authors: S Oram, H Khalifeh, LM Howard
Publisher: Elsevier
Year: 2017
Abstract: … violence experienced and perpetrated by women and men, and provide … how mental health
services can address violence against women but will also be relevant to how mental health …
... other results
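Since your question imports pandas and asks about keeping a list of attributes, one approach is to collect each result as a dict and then write the rows out. A minimal sketch with the standard csv module and hypothetical sample rows (pandas.DataFrame(rows).to_csv(...) works the same way on the same list of dicts):

```python
import csv
import io

# Hypothetical rows, shaped like the fields parsed in the loop above.
rows = [
    {"title": "Culture and mental health of women in South-East Asia",
     "authors": "U Niaz, S Hassan", "year": "2006",
     "publisher": "ncbi.nlm.nih.gov"},
]

buf = io.StringIO()  # swap in open("results.csv", "w", newline="") for a real file
writer = csv.DictWriter(buf, fieldnames=["title", "authors", "year", "publisher"])
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue().splitlines()[0])  # title,authors,year,publisher
```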
If you don't want to fiddle around with finding the proper selectors or regular expression patterns to extract the data, have a look at the Scrape historic Google Scholar results using Python blog post at SerpApi, which shows how to do it with API examples, plus how to extract data from all pages.