
Webscraping Scopus with lxml.html

I'm trying to webscrape Scopus with lxml.html (ultimately to create a list of document titles), but it seems no data is being stored from page.content; the resulting list (tr_elements) ends up empty.

import requests
import lxml.html as lh

url = 'https://www.scopus.com/results/citedbyresults.uri?sort=plf-f&cite=2-s2.0-84939544008&src=s&nlo=&nlr=&nls=&imp=t&sid=fdbfeac69ab848bdff16425dc6937ffc&sot=cite&sdt=a&sl=0&origin=resultslist&offset=1&txGid=b63ddae0b71deb5a4615640f49db9904'
page = requests.get(url)           # fetch the cited-by results page
doc = lh.fromstring(page.content)  # parse the returned HTML
tr_elements = doc.xpath('//tr')    # collect every table row

Since inspect element shows that the rows have varying classes ( https://i.stack.imgur.com/6QUvw.png ), I've also tried tr_elements = doc.xpath("//tr[contains(@class, 'searchArea')]") to specify which rows to parse, but this also ends up as an empty list. Any ideas?

I figured it out. The response is actually an "Access denied | www.scopus.com" page: Scopus uses Cloudflare to restrict automated access, so there are no result rows in the HTML to parse.
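A quick way to catch this before parsing is to inspect the status code and response body. The sketch below is only a heuristic of my own; the helper name `looks_blocked` and the marker strings are assumptions, since Cloudflare challenge pages vary:

```python
def looks_blocked(status_code, html_text):
    """Heuristic: does the response look like a Cloudflare block page?
    (Marker strings are assumptions; challenge pages vary.)"""
    text = html_text.lower()
    markers = ("access denied", "cloudflare", "cf-ray")
    return status_code in (403, 503) or any(m in text for m in markers)

# With the response from the original snippet:
# if looks_blocked(page.status_code, page.text):
#     print("Blocked by Cloudflare - no <tr> elements to parse")
```

If the check fires, no XPath expression will help; the content you see in the browser's inspector was never in the response your script received.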
