I am trying to scrape HTML from a site that requires a login but am not getting any data
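I am following this tutorial, but I can't seem to get any data when I run the Python script. I get an HTTP status code of 200, and status.ok returns a true value. Any help would be great. This is the response I get in the terminal: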
[]
200
True
import requests
from lxml import html

USERNAME = "username@email.com"
PASSWORD = "legitpassword"
LOGIN_URL = "https://bitbucket.org/account/signin/?next=/"
URL = "https://bitbucket.org/dashboard/overview"

def main():
    session_requests = requests.session()

    # Get login csrf token
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)
    authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]

    # Create payload
    payload = {
        "username": USERNAME,
        "password": PASSWORD,
        "csrfmiddlewaretoken": authenticity_token
    }

    # Perform login
    result = session_requests.post(LOGIN_URL, data=payload, headers=dict(referer=LOGIN_URL))

    # Scrape url
    result = session_requests.get(URL, headers=dict(referer=URL))
    tree = html.fromstring(result.content)
    bucket_elems = tree.findall(".//span[@class='repo-name']")
    bucket_names = [bucket_elem.text_content().replace("\n", "").strip() for bucket_elem in bucket_elems]

    print(bucket_names)
    print(result.status_code)

if __name__ == '__main__':
    main()
The XPath is wrong: there is no span with the class repo-name. You can get the repo names from the anchor tags with the following:
bucket_elems = tree.xpath("//a[@class='execute repo-list--repo-name']")
bucket_names = [bucket_elem.text_content().strip() for bucket_elem in bucket_elems]
The HTML has evidently changed since the tutorial was written.
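To see why the request can succeed with a 200 while the list still comes back empty, here is a minimal offline sketch that runs both selectors against a small page. The markup in SAMPLE_HTML is invented for illustration and is much simpler than the real dashboard, and extract_names is a hypothetical helper. It uses the standard library's xml.etree.ElementTree instead of lxml so it runs without a login or extra packages; lxml's tree.findall accepts the same path syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the dashboard markup (assumption:
# the real page is far larger, but the repo links carry this class value).
SAMPLE_HTML = """
<html>
  <body>
    <ul>
      <li><a class="execute repo-list--repo-name" href="/user/alpha">
        alpha
      </a></li>
      <li><a class="execute repo-list--repo-name" href="/user/beta">beta</a></li>
    </ul>
  </body>
</html>
"""


def extract_names(markup, path):
    """Return the stripped text of every element matching `path`."""
    tree = ET.fromstring(markup.strip())
    return ["".join(elem.itertext()).strip() for elem in tree.findall(path)]


# Selector from the tutorial: no <span class="repo-name"> exists in the page.
old_names = extract_names(SAMPLE_HTML, ".//span[@class='repo-name']")

# Selector from the answer: matches the anchors by their full class value.
new_names = extract_names(SAMPLE_HTML, ".//a[@class='execute repo-list--repo-name']")

print(old_names)
print(new_names)
```

A selector that matches nothing is not an error: it just returns an empty list, which is why the script printed [] alongside a 200 status. Note that [@class='...'] compares the whole attribute string, so the value has to be copied exactly from the current page source.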