Webscraping on Google Scholar keeps returning an empty list
I am trying to webscrape for uni, but it's hard to do so from Google Scholar. I've tried many things, and apparently it has to do with .json().
I want to make a function that takes brands such as Apple and Samsung as input and returns a list of headers with their respective abstracts.
Could someone please help me out here? Thank you. Below, I've written what I have so far and commented out some other things I've tried.
from bs4 import BeautifulSoup
import requests
import csv
import json
brand = input("Enter Technology: ")
source = requests.get('https://scholar.google.com/scholar?0&q={0}+technology'.format(brand)).text
soup = BeautifulSoup(source, 'lxml')
#script = soup.select_one('[type="application/ld+json"]').text
#data = json.loads(script)
#soup = BeautifulSoup(data['description'], 'lxml')
headers = soup.find_all('div', class_="gs_rt")
print(headers)
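One thing worth checking first: an empty list here usually means Google served a CAPTCHA/consent page rather than results, not that the selector is wrong. As a sanity check, the same parsing logic works offline against a hypothetical snippet mimicking Scholar's result markup (note that the result title actually lives in an h3 with class gs_rt, as the scraping script further down also assumes):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking Google Scholar's result markup (not fetched live).
html = '''
<div class="gs_r gs_or gs_scl">
  <div class="gs_ri">
    <h3 class="gs_rt"><a href="https://example.org/paper">Example paper title</a></h3>
    <div class="gs_a">A Author - Example Journal, 2020</div>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
titles = [h3.get_text(strip=True) for h3 in soup.find_all('h3', class_='gs_rt')]
print(titles)  # ['Example paper title']
```

If this offline check passes but the live request returns nothing, the problem is the response you got back, not the parsing.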
The first thing you can do is to add proxies to your request:
#https://docs.python-requests.org/en/master/user/advanced/#proxies
import os

proxies = {
    'http': os.getenv('HTTP_PROXY')  # or just type your proxy here without os.getenv()
}
Request code will be like this:
html = requests.get('google scholar link', headers=headers, proxies=proxies).text
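A fuller sketch of that request, with a browser-like User-Agent and the proxy wired in (the proxy URL here is a placeholder, not a value from the original answer). Building the request with requests.Request lets you inspect the final URL without actually sending anything:

```python
import os
import requests

# Placeholder proxy; replace with your own, or read it from the environment.
proxies = {'http': os.getenv('HTTP_PROXY', 'http://127.0.0.1:8080')}

# A browser-like User-Agent makes the request look less like a script.
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) '
                         'Gecko/20100101 Firefox/78.0'}

# Prepare (but do not send) the request, to inspect what would go over the wire.
req = requests.Request('GET', 'https://scholar.google.com/scholar',
                       params={'q': 'samsung technology', 'hl': 'en'},
                       headers=headers).prepare()
print(req.url)
# To actually send it: requests.get(req.url, headers=headers, proxies=proxies).text
```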
Alternatively, you can use selenium, requests-html, or pyppeteer to render the page without using proxies, but Google might still block your requests if you send too many at the same time.
'''
If you'll get an empty array, this means you get a CAPTCHA.
Print response text to see what is going on or wait sometime before sending requests again.
'''
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=samsung&btnG=')
# https://requests-html.kennethreitz.org/#javascript-support
response.html.render()
# Container where data we need is located
for result in response.html.find('.gs_ri'):
    title = result.find('.gs_rt', first=True).text
    print(title)
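As the comment above says, an empty result set usually means a CAPTCHA page. One rough heuristic for detecting it (the marker strings below are assumptions based on typical Google block pages, not an official API) is to scan the response text before parsing:

```python
def looks_like_captcha(html_text):
    # Phrases that typically appear on Google's block/CAPTCHA pages (assumed markers).
    markers = ("not a robot", "unusual traffic", "gs_captcha")
    return any(marker in html_text for marker in markers)

print(looks_like_captcha("Our systems have detected unusual traffic ..."))  # True
print(looks_like_captcha("<h3 class='gs_rt'>Some result</h3>"))             # False
```

When this returns True, waiting a while or switching proxies is more useful than retrying immediately.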
Alternatively, you can scrape data from Google Scholar using the Google Scholar API from SerpApi. No need to think about how to bypass Google blocking or render a JavaScript page.
It's a paid API with a free trial of 5,000 searches. A completely free trial is currently under development.
Code to integrate:
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google_scholar",
    "q": "samsung",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    print(f"Title: {result['title']}")
Disclaimer, I work for SerpApi.
Google Scholar links to different sites like sciencedirect, acm, etc. I have added selectors only for sciencedirect and acm; you can add more if you want. Google Scholar paginates using a start index: for page 1, start is 0, and for page 2, start is 10. The following script asks for the brand and the number of pages to crawl. It saves 2 files - one JSON and one CSV.
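The start-index arithmetic can be sketched on its own: with 10 results per page, page p (1-based) begins at start = (p - 1) * 10:

```python
# 10 results per Scholar page; start = (page - 1) * 10 for a 1-based page number.
def start_index(page):
    return (page - 1) * 10

base = "https://scholar.google.com/scholar?start={}&q={}+technology&hl=en&as_sdt=0,5"
urls = [base.format(start_index(p), "apple") for p in (1, 2, 3)]
print(urls[1])  # page 2 -> start=10
```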
from bs4 import BeautifulSoup
import requests, time
import pandas as pd
import json
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
brand = input("Enter Technology: ")
pages = int(input("Number of pages: "))
url = "https://scholar.google.com/scholar?start={}&q={}+technology&hl=en&as_sdt=0,5"
data = []
for i in range(0, pages*10, 10):
    print(url.format(i, brand))
    res = requests.get(url.format(i, brand), headers=headers)
    main_soup = BeautifulSoup(res.text, "html.parser")
    divs = main_soup.find_all("div", class_="gs_r gs_or gs_scl")
    for div in divs:
        temp = {}
        h3 = div.find("h3", class_="gs_rt")
        temp["Link"] = h3.find("a")["href"]
        temp["Heading"] = h3.find("a").get_text(strip=True)
        temp["Authors"] = div.find("div", class_="gs_a").get_text(strip=True)
        print(temp["Link"])
        try:
            res_link = requests.get(temp["Link"], headers=headers)
            soup_link = BeautifulSoup(res_link.text, "html.parser")
            if "sciencedirect" in temp["Link"]:
                temp["Abstract"] = soup_link.find("div", class_="abstract author").find("div").get_text(strip=True)
            elif "acm" in temp["Link"]:
                temp["Abstract"] = soup_link.find("div", class_="abstractSection abstractInFull").get_text(strip=True)
        except Exception:
            pass
        data.append(temp)
    time.sleep(1)

with open("data.json", "w") as f:
    json.dump(data, f)

pd.DataFrame(data).to_csv("data.csv", index=False)
Output:
Link,Heading,Authors,Abstract
https://www.sciencedirect.com/science/article/pii/0149197096000078,Development of pyroprocessingtechnology,"JJ Laidler, JE Battles, WE Miller, JP Ackerman… - Progress in Nuclear …, 1997 - Elsevier","A compact, efficient method for recycling IFR fuel is being developed. This method, known as pyroprocessing, capitalizes on the use of metal fuel in the IFR and provides separation of actinide elements from fission products by means of an electrorefining step. The process of electrorefining is based on well-understood electrochemical concepts, the applications of which are described in this chapter. With only the addition of head-end processing steps, the pyroprocess can be applied with equal success to fuel types other than metal, enabling a symbiotic system wherein the IFR can be used to fission the actinide elements in spent nuclear fuel from other types of reactor."
https://www.sciencedirect.com/science/article/pii/S0041624X97001467,Acoustic wave sensors and theirtechnology,"MJ Vellekoop - Ultrasonics, 1998 - Elsevier","In the past two decades, acoustic-wave devices have gained enormous interest for sensor applications. The delay line device, where a transmitting and a receiving interdigital transducer are realized on a (piezoelectric) substrate is the most common structure used. The sensitive part is the surface between the two transducers. By placing the device in the feedback loop of an amplifier, an acoustic-wave oscillator is formed with properties such as inherent high sensitivity, high resolution, high stability and a frequency output signal which is easy to process.A very interesting development is the large amount of wave types now available for sensor applications. Sensors have been published using Rayleigh waves, Lamb waves, Love waves, acoustic plate modes, and surface transverse waves (STW). Each of these wave types have their special advantages and disadvantages with respect to sensitivity, stability, usability in liquids or gases, and fabrication complexity. For the fabrication of the acoustic-wave devices, planar technologies are used, which will be discussed in the paper. Examples will be given of gas sensors, biochemical sensors in liquids, viscosity and density sensing and high-voltage sensing. A comparison of the usability of the different wave types will be presented."
https://www.sciencedirect.com/science/article/pii/0167268188900558,Technologyand transaction cost economics: a reply,"OE Williamson - Journal of Economic Behavior & Organization, 1988 - Elsevier","I argue here, as I have previously, that technology is neither fully determinative of nor irrelevant to economic organization. Transaction cost economizing occupies a prominent position in any effort to assess the efficacy of alternative forms of economic organization."
https://www.sciencedirect.com/science/article/pii/0048733394900140,Learning by trying: the implementation of configurationaltechnology,"J Fleck- Research policy, 1994 - Elsevier","In this paper some issues concerning the nature of technological development are examined, with particular reference to a case study of the implementation of Computer Aided Production Management (CAPM). CAPM is an example of a configurational technology, built up to meet specific organizational requirements. It is argued that there is scope in the development of configurations for significant innovation to take place during implementation itself, through a distinctive form of learning by ‘struggling to get it to work’, or ‘learning by trying’. Some policy implications are outlined in conclusion: the need to recognize the creative opportunities available in this type of development, and the need to facilitate industrial sector-based learning processes."
...
...
...