
Google Scholar search results web scraping: only excerpts ending with a triple dot (...) are scraped

I'm using the following code to scrape papers from Google Scholar. I noticed that only the shortened descriptions of the papers are scraped, not the full descriptions. On the Google Scholar search results page, only a short excerpt of the text is shown, ending with a triple dot (...).

The scraper picks up only this excerpt and leaves the rest out. This happens for authors (especially when there are many), journal names, and abstracts.

Do you know a solution to this? If you run the code yourself you will see what I mean.

from bs4 import BeautifulSoup
import requests, json  # the 'lxml' parser must be installed, but it needs no import here


headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "samsung",
  "hl": "en",
}

html = requests.get('https://scholar.google.com/scholar', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# Scrape just PDF links
for pdf_link in soup.select('.gs_or_ggsm a'):
  pdf_file_link = pdf_link['href']
  print(pdf_file_link)

# JSON data will be collected here
data = []

# Container where all needed data is located
for result in soup.select('.gs_ri'):
  title = result.select_one('.gs_rt').text
  title_link = result.select_one('.gs_rt a')['href']
  publication_info = result.select_one('.gs_a').text
  snippet = result.select_one('.gs_rs').text
  cited_by = result.select_one('#gs_res_ccl_mid .gs_nph+ a')['href']
  related_articles = result.select_one('a:nth-child(4)')['href']
  # Not every result has an "All versions" link, so select_one() may return None
  versions_tag = result.select_one('a~ a+ .gs_nph')
  all_article_versions = (
    f"https://scholar.google.com{versions_tag['href']}" if versions_tag else None
  )

  data.append({
    'title': title,
    'title_link': title_link,
    'publication_info': publication_info,
    'snippet': snippet,
    'cited_by': f'https://scholar.google.com{cited_by}',
    'related_articles': f'https://scholar.google.com{related_articles}',
    'all_article_versions': all_article_versions,
  })

print(json.dumps(data, indent = 2, ensure_ascii = False))

I think I saw your code in the "Scrape Google Scholar with Python" blog post.

This is because search results display only part of each page's content. Mostly, this is an excerpt related to your search query or a description written in advance.
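To see which scraped fields were actually cut short, one option is to look for an ellipsis in the text. This is only a minimal sketch: it assumes truncation is marked either by three dots or by the single `…` character, and the helper name `has_ellipsis` and the sample record are hypothetical.

```python
def has_ellipsis(text):
    """Return True if the text contains a three-dot or single-character ellipsis."""
    return "..." in text or "\u2026" in text


# Hypothetical record shaped like the dictionaries built in the question's loop
record = {
    "title": "A complete title",
    "publication_info": "A Author, B Author\u2026 - Some Journal, 2020",
    "snippet": "The first words of the abstract ...",
}

# Fields that look truncated and may need to be fetched from the source page
truncated_fields = [key for key, value in record.items() if has_ellipsis(value)]
print(truncated_fields)
```

A check like this can be used to decide which results are worth a second request to the linked page.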

Therefore it makes no sense for the search results to display the full text. If you still need the full text, you can follow each result's link and scrape the information from the target page. Keep in mind that each site uses its own selectors, so the script will have to be adapted per site.
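As a rough sketch of that follow-the-link step: some publisher pages expose full author lists, journal names, and abstracts in `<meta>` tags, so those can be tried before writing per-site selectors. The tag names below (`citation_author`, `citation_journal_title`, `citation_abstract`, `description`, `og:description`) are assumptions about what a given site provides, and many sites will lack some or all of them. The class and function names are hypothetical, and the standard-library parser is used only to keep the example dependency-free.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if key and content:
            # Keep every value: tags such as citation_author repeat once per author
            self.meta.setdefault(key.lower(), []).append(content.strip())


def extract_full_metadata(page_html):
    """Return authors, journal, and abstract found in the page's meta tags."""
    parser = MetaTagParser()
    parser.feed(page_html)
    meta = parser.meta

    # Try the most specific tag first, then fall back to generic descriptions
    abstract = None
    for key in ("citation_abstract", "description", "og:description"):
        if key in meta:
            abstract = meta[key][0]
            break

    return {
        "authors": meta.get("citation_author", []),
        "journal": (meta.get("citation_journal_title") or [None])[0],
        "abstract": abstract,
    }


# Made-up page content standing in for the HTML behind one title_link
sample_html = """
<html><head>
<meta name="citation_author" content="Ada Example">
<meta name="citation_author" content="Bob Example">
<meta name="citation_journal_title" content="Journal of Examples">
<meta name="description" content="The full abstract text, not cut off.">
</head><body></body></html>
"""

full_record = extract_full_metadata(sample_html)
print(full_record)
```

In the question's loop, the HTML would come from something like `requests.get(title_link, headers=headers, timeout=10).text`. Pages without these tags return empty values and still need site-specific selectors.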
