
Webscraping on Google Scholar keeps returning an empty list

I am trying to do some web scraping for university, but it's proving hard with Google Scholar. I've tried many things, and apparently the problem has to do with `.json()`.

I want to write a function that takes brands such as Apple and Samsung as input and returns a list of headers with their respective abstracts.

Please could someone help me out here, thank you. Below is what I have so far, with some other things I've tried commented out.

from bs4 import BeautifulSoup
import requests
import csv
import json

brand = input("Enter Technology:  ")
source = requests.get('https://scholar.google.com/scholar?0&q={0}+technology'.format(brand)).text
soup = BeautifulSoup(source, 'lxml')

#script = soup.select_one('[type="application/ld+json"]').text
#data = json.loads(script)
#soup = BeautifulSoup(data['description'], 'lxml')

headers = soup.find_all('div', class_="gs_rt")

print(headers)

The first thing you can do is add proxies to your request:

# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os

proxies = {
  'http': os.getenv('HTTP_PROXY')  # or just type your proxy here without os.getenv()
}

The request code will then look like this:

html = requests.get('google scholar link', headers=headers, proxies=proxies).text
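Putting the pieces together, a minimal request might look like this. The User-Agent string and the proxy are placeholders, and the actual network call is left commented out since it may still hit a CAPTCHA:

```python
import os
import requests

# A browser-like User-Agent makes the request look less like a bot;
# without one, Google Scholar often serves a block page or empty results.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
}

# Placeholder proxy; set HTTP_PROXY in your environment or hard-code one.
proxies = {"http": os.getenv("HTTP_PROXY")}

url = "https://scholar.google.com/scholar?q=apple+technology&hl=en"

# Uncomment to actually fetch:
# html = requests.get(url, headers=headers, proxies=proxies).text
```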

Alternatively, you can use `selenium`, `requests-html`, or `pyppeteer` to render the page without using proxies, but Google may still block your requests if you send too many at the same time.

Note: if you get an empty array, it means you hit a CAPTCHA. Print the response text to see what is going on, or wait some time before sending requests again.

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=samsung&btnG=')

# https://requests-html.kennethreitz.org/#javascript-support
response.html.render()

# Container where the data we need is located
for result in response.html.find('.gs_ri'):
    title = result.find('.gs_rt', first=True).text
    print(title)
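The same selectors can be exercised offline against saved HTML, which is useful for checking parsing logic without triggering blocking. A minimal sketch using BeautifulSoup on an illustrative stub of Scholar-like markup (not a real response):

```python
from bs4 import BeautifulSoup

# Illustrative stub of Google Scholar's result markup (not a real response).
html = """
<div class="gs_ri">
  <h3 class="gs_rt"><a href="#">Example paper title</a></h3>
  <div class="gs_rs">Example abstract snippet</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
results = []
for block in soup.select(".gs_ri"):
    results.append({
        # .gs_rt holds the title, .gs_rs the abstract snippet
        "title": block.select_one(".gs_rt").get_text(strip=True),
        "snippet": block.select_one(".gs_rs").get_text(strip=True),
    })

print(results)
```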

Alternatively, you can scrape data from Google Scholar using the Google Scholar API from SerpApi. There is no need to think about how to bypass Google's blocking or render a JavaScript page.

It's a paid API with a free trial of 5,000 searches. A completely free trial is currently under development.

Code to integrate:

from serpapi import GoogleSearch

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google_scholar",
  "q": "samsung",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    print(f"Title: {result['title']}")

Disclaimer: I work for SerpApi.

Google Scholar links out to different sites such as sciencedirect, acm, etc. I have added selectors only for sciencedirect and acm; you can add more if you want. Google Scholar paginates using an index: for page 1, `start` is 0; for page 2, `start` is 10; and so on. The following script asks for a brand and the number of pages to crawl, and saves two files: one JSON and one CSV.

from bs4 import BeautifulSoup
import requests, time
import pandas as pd
import json

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

brand = input("Enter Technology:  ")
pages = int(input("Number of pages: "))
url = "https://scholar.google.com/scholar?start={}&q={}+technology&hl=en&as_sdt=0,5"

data = []
for i in range(0, pages*10, 10):
    print(url.format(i, brand))
    res = requests.get(url.format(i, brand),headers=headers)
    main_soup = BeautifulSoup(res.text, "html.parser")
    divs = main_soup.find_all("div", class_="gs_r gs_or gs_scl")
    for div in divs:
        temp = {}
        h3 = div.find("h3", class_="gs_rt")
        temp["Link"] = h3.find("a")["href"]
        temp["Heading"] = h3.find("a").get_text(strip=True)
        temp["Authors"] = div.find("div",class_="gs_a").get_text(strip=True)
        print(temp["Link"])
        try:
            res_link = requests.get(temp["Link"], headers=headers)
            soup_link = BeautifulSoup(res_link.text,"html.parser")
            if "sciencedirect" in temp["Link"]:
                temp["Abstract"] = soup_link.find("div", class_="abstract author").find("div").get_text(strip=True)
            elif "acm" in temp["Link"]:
                temp["Abstract"] = soup_link.find("div", class_="abstractSection abstractInFull").get_text(strip=True)
        except Exception: pass
        data.append(temp)
        time.sleep(1)

with open("data.json", "w") as f:
    json.dump(data,f)

pd.DataFrame(data).to_csv("data.csv", index=False)

Output:

Link,Heading,Authors,Abstract
https://www.sciencedirect.com/science/article/pii/0149197096000078,Development of pyroprocessingtechnology,"JJ Laidler, JE Battles, WE Miller, JP Ackerman… - Progress in Nuclear …, 1997 - Elsevier","A compact, efficient method for recycling IFR fuel is being developed. This method, known as pyroprocessing, capitalizes on the use of metal fuel in the IFR and provides separation of actinide elements from fission products by means of an electrorefining step. The process of electrorefining is based on well-understood electrochemical concepts, the applications of which are described in this chapter. With only the addition of head-end processing steps, the pyroprocess can be applied with equal success to fuel types other than metal, enabling a symbiotic system wherein the IFR can be used to fission the actinide elements in spent nuclear fuel from other types of reactor."
https://www.sciencedirect.com/science/article/pii/S0041624X97001467,Acoustic wave sensors and theirtechnology,"MJ Vellekoop - Ultrasonics, 1998 - Elsevier","In the past two decades, acoustic-wave devices have gained enormous interest for sensor applications. The delay line device, where a transmitting and a receiving interdigital transducer are realized on a (piezoelectric) substrate is the most common structure used. The sensitive part is the surface between the two transducers. By placing the device in the feedback loop of an amplifier, an acoustic-wave oscillator is formed with properties such as inherent high sensitivity, high resolution, high stability and a frequency output signal which is easy to process.A very interesting development is the large amount of wave types now available for sensor applications. Sensors have been published using Rayleigh waves, Lamb waves, Love waves, acoustic plate modes, and surface transverse waves (STW). Each of these wave types have their special advantages and disadvantages with respect to sensitivity, stability, usability in liquids or gases, and fabrication complexity. For the fabrication of the acoustic-wave devices, planar technologies are used, which will be discussed in the paper. Examples will be given of gas sensors, biochemical sensors in liquids, viscosity and density sensing and high-voltage sensing. A comparison of the usability of the different wave types will be presented."
https://www.sciencedirect.com/science/article/pii/0167268188900558,Technologyand transaction cost economics: a reply,"OE Williamson - Journal of Economic Behavior & Organization, 1988 - Elsevier","I argue here, as I have previously, that technology is neither fully determinative of nor irrelevant to economic organization. Transaction cost economizing occupies a prominent position in any effort to assess the efficacy of alternative forms of economic organization."
https://www.sciencedirect.com/science/article/pii/0048733394900140,Learning by trying: the implementation of configurationaltechnology,"J Fleck- Research policy, 1994 - Elsevier","In this paper some issues concerning the nature of technological development are examined, with particular reference to a case study of the implementation of Computer Aided Production Management (CAPM). CAPM is an example of a configurational technology, built up to meet specific organizational requirements. It is argued that there is scope in the development of configurations for significant innovation to take place during implementation itself, through a distinctive form of learning by ‘struggling to get it to work’, or ‘learning by trying’. Some policy implications are outlined in conclusion: the need to recognize the creative opportunities available in this type of development, and the need to facilitate industrial sector-based learning processes."
...
...
...

Google Scholar is a JavaScript-enabled website, so using selenium to render and scrape the site is a good solution; for more details, refer here.
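A minimal Selenium sketch of that idea, assuming Chrome and a matching chromedriver are installed. The `.gs_rt` selector is the title container used in the answers above; `build_scholar_url` is a small helper introduced here for illustration:

```python
from urllib.parse import urlencode


def build_scholar_url(query: str, start: int = 0) -> str:
    """Build a Google Scholar results URL for a query and page offset."""
    params = {"q": f"{query} technology", "hl": "en", "start": start}
    return "https://scholar.google.com/scholar?" + urlencode(params)


def scrape_titles(query: str, start: int = 0):
    # Imported here so the URL helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(build_scholar_url(query, start))
        # .gs_rt is the result-title container on the rendered page
        return [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".gs_rt")]
    finally:
        driver.quit()


# Usage (requires a working browser/driver, and may still hit a CAPTCHA):
#   for title in scrape_titles("samsung"):
#       print(title)
```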
