
Is there a way to scrape Google Scholar based on keywords using Python?

I am new to web scraping and was wondering if there is a way to scrape Google Scholar for keywords so that the end result is the title, abstract, year, publisher, and authors of the papers that come up. I am not really sure where to go from here. I assume I need to keep a list of all the attributes I want, but how do I search for them when web scraping?

from bs4 import BeautifulSoup
import requests, lxml, os, json
import pandas as pd


headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "Mental Health in Women",
  "hl": "en",
}

html = requests.get('https://scholar.google.com/scholar', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

Your code is only getting the soup object. You need to take the gs_ri selector, which wraps all of a result's contents (title, link, snippet, etc.). To find the desired elements, you can use the select() method. This method accepts a CSS selector to search for and returns a list of all matched HTML elements.
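
For example, a minimal sketch of what select() gives you, assuming soup was built as in the snippet above:

results = soup.select(".gs_ri")    # one element per organic search result
print(len(results))                # typically 10 results per page
print(results[0].get_text()[:80])  # raw text of the first result block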


To iterate over all results on the page, we can use a for loop over the list of matched elements that select() returned.

To find the title, link, and so on, you can use the select_one() method. It is very similar to select(), but it returns only the first matched HTML element. To extract the text from an element, use its text attribute; to get a link, extract the href attribute with get('href') or ['href'].

To extract the authors, publisher, year, and abstract, you can use a regular expression, which is very convenient for parsing data that follows a given pattern.
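
To see how those patterns behave, here is a standalone sketch on a made-up .gs_a string (the real text on the page follows the same "Authors - Publisher, Year - Domain" shape):

import re

# Hypothetical example of what a .gs_a line looks like
gs_a = "A Author, B Author - Journal of Examples, 2006 - example.com"

authors = re.search(r'^(.*?)-', gs_a).group(1).strip()   # everything before the first dash
year = re.search(r'\d+', gs_a)[0]                        # first run of digits
publisher = re.search(r'\d+\s?-\s?(.*)', gs_a).group(1)  # everything after "<year> -"

print(authors, year, publisher, sep="\n")
# A Author, B Author
# 2006
# example.com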

Below is a modified snippet of your code:

for result in soup.select(".gs_ri"):
    title = 'Title: ' + result.select_one(".gs_rt a").text
    link = 'Link: ' + result.select_one("a")["href"]

    # https://regex101.com/r/PdMQU6/1
    authors = 'Authors: ' + re.search(r'^(.*?)-', result.select_one(".gs_a").text).group(1)
    
    # https://regex101.com/r/JoQigB/1
    publisher = 'Publisher: ' + re.search(r'\d+\s?-\s?(.*)', result.select_one(".gs_a").text).group(1)
    
    # https://regex101.com/r/E6KGbS/1
    year = 'Year: ' + re.search(r'\d+', result.select_one(".gs_a").text)[0]
    abstract = 'Abstract: ' + result.select_one(".gs_rs").text

    print(title, link, authors, publisher, year, abstract, sep="\n", end="\n\n")
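
Keep in mind that re.search() returns None when a pattern does not match, and select_one() returns None when a selector finds nothing (citation-only results, for example, have no link), so calling .group(1) or .text there raises an AttributeError. A small helper along these lines (my addition, not part of the snippet above) makes the loop more defensive:

def safe_group(pattern, text, group=1):
    # Return the matched group, or None instead of raising AttributeError
    match = re.search(pattern, text)
    return match.group(group) if match else None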

Also, make sure you're passing a user-agent in your request headers to act as a "real" user visit. The default requests user-agent is python-requests, and websites can tell that the request most likely comes from a script. Check what your user-agent is.
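
A quick way to see what user-agent your script actually sends is to hit a header-echoing service such as httpbin.org (a sketch; any echo endpoint works):

import requests

# httpbin returns the request headers it received as JSON
print(requests.get("https://httpbin.org/headers").json()["headers"]["User-Agent"])
# -> python-requests/2.x.y, the default that identifies the request as a script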

Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml, re

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "Mental Health in Women",
    "hl": "en",  # language
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

for result in soup.select(".gs_ri"):
    title = 'Title: ' + result.select_one(".gs_rt a").text
    link = 'Link: ' + result.select_one("a")["href"]

    # https://regex101.com/r/PdMQU6/1
    authors = 'Authors: ' + re.search(r'^(.*?)-', result.select_one(".gs_a").text).group(1)
    
    # https://regex101.com/r/JoQigB/1
    publisher = 'Publisher: ' + re.search(r'\d+\s?-\s?(.*)', result.select_one(".gs_a").text).group(1)
    
    # https://regex101.com/r/E6KGbS/1
    year = 'Year: ' + re.search(r'\d+', result.select_one(".gs_a").text)[0]
    abstract = 'Abstract: ' + result.select_one(".gs_rs").text

    print(title, link, authors, publisher, year, abstract, sep="\n", end="\n\n")

Output:

Title: Culture and mental health of women in South-East Asia
Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1525125/
Authors: U Niaz, S Hassan 
Publisher: ncbi.nlm.nih.gov
Year: 2006
Abstract: … on mental health of South Asian women. Marked gender discrimination in South Asia has led 
to second class status of women in … Women's lack of empowerment and both financial and …

Title: Violence against women and mental health
Link: https://www.sciencedirect.com/science/article/pii/S2215036616302619
Authors: S Oram, H Khalifeh, LM Howard 
Publisher: Elsevier
Year: 2017
Abstract: … violence experienced and perpetrated by women and men, and provide … how mental health 
services can address violence against women but will also be relevant to how mental health …

... other results
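
Since your original snippet already imports pandas, you could also collect the fields into a list of dicts and build a DataFrame instead of printing; a minimal sketch, reusing the extraction from the loop above:

import pandas as pd

data = []
for result in soup.select(".gs_ri"):
    data.append({
        "title": result.select_one(".gs_rt a").text,
        "link": result.select_one("a")["href"],
        "snippet": result.select_one(".gs_rs").text,
    })

df = pd.DataFrame(data)
df.to_csv("scholar_results.csv", index=False)  # or df.head() to inspect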

If you don't want to fiddle around with finding the proper selectors or regular-expression patterns to extract data, have a look at the Scrape historic Google Scholar results using Python blog post at SerpApi, which shows how to do it with API examples, plus how to extract data from all pages.
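
If you prefer to stay with plain requests, Google Scholar paginates via the start parameter (0, 10, 20, ...), so a sketch for fetching several pages could look like this (note that Scholar rate-limits aggressively and may answer with a CAPTCHA):

for page in range(0, 50, 10):  # first five pages, 10 results each
    params["start"] = page
    html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")
    for result in soup.select(".gs_ri"):
        print(result.select_one(".gs_rt").text)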
