
Not getting data from soup

I created a web-scraping app with Python and bs4. My program returns an empty list for review. The soup itself is created without errors.

from bs4 import BeautifulSoup
import requests
import pandas as pd

data = []
usernames = []
titles = []
comments = []

result = requests.get('https://www.kupujemprodajem.com/review.php?action=list')

soup = BeautifulSoup(result.text, 'html.parser')
review = soup.findAll('div', class_="single-review")
print(review)

for i in review:
    header = i.find('div', class_="single-review__header")
    footer = i.find('div', class_="comment-holder")
    username = header.find('a', class_="single-review__username").text
    title = header.find('div', class_="single-review__related-to").text
    comment = footer.find('div', class_="single-review__comment").text
    usernames.append(username)
    titles.append(title)
    comments.append(comment)

data.append(usernames)
data.append(titles)
data.append(comments)

print(data)

The problem isn't with the class names.

It looks like this doesn't work because the website requires a login to access that page. If you were to visit https://www.kupujemprodajem.com/review.php?action=list in a private browser tab, it would just take you to a login page. requests gets that same login page, which contains no single-review divs, so findAll returns an empty list.
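One way to confirm this from Python rather than from a browser is to look at where the request actually ended up and whether the returned HTML contains a password field. The sketch below is only a heuristic: it assumes the login page has an `<input type="password">`, which I have not verified for this site, and the helper names (`looks_like_login_page`, `diagnose`) are made up for illustration. The detector uses only the standard library; `diagnose` needs requests installed.

```python
from html.parser import HTMLParser


class PasswordFieldFinder(HTMLParser):
    """Sets self.found to True if the HTML has an <input type="password">."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == "input" and (dict(attrs).get("type") or "").lower() == "password":
            self.found = True


def looks_like_login_page(html):
    """Heuristic (assumption): a page with a password field is a login page."""
    finder = PasswordFieldFinder()
    finder.feed(html)
    return finder.found


def diagnose(url):
    """Fetch url and report where the request ended up."""
    import requests  # third-party; imported here so the helpers above work without it

    response = requests.get(url, timeout=10)
    return {
        "status": response.status_code,
        "final_url": response.url,              # differs from url after a redirect
        "redirected": bool(response.history),   # history holds any redirect responses
        "login_form": looks_like_login_page(response.text),
    }
```

Calling `diagnose('https://www.kupujemprodajem.com/review.php?action=list')` and seeing a different `final_url` or `login_form` set to True would point to the login requirement rather than to the selectors.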

There are two paths I can think of that you could take here:

  1. Reverse engineer how the login process works, then use the requests library to send the login request and get (most likely) a session cookie from it, so that you can request pages that require sign-in.

  2. (Much simpler) Use Selenium instead. Selenium is a library that lets you control a full browser instance, so you can easily enter credentials this way. Beautiful Soup, on the other hand, only parses HTML, so things like authenticating often take much more work with Beautiful Soup than they do with Selenium. I'd definitely suggest looking into it if you haven't already.
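Both paths can be sketched roughly as follows. I have not inspected this site's login flow, so everything site-specific here is an assumption: the login URL, the form field names ("email" and "password"), and the idea that a plain form POST is enough (many sites also require a CSRF token or log in through JavaScript). Treat it as a starting point, not a working login for kupujemprodajem.com. Either function returns HTML that you can pass to BeautifulSoup exactly as in the question.

```python
def login_and_fetch(session, login_url, credentials, target_url):
    """Path 1: log in with a requests.Session, then fetch a protected page.

    session     -- a requests.Session; it keeps cookies between requests
    credentials -- dict of form fields; the right keys depend on the site
    """
    response = session.post(login_url, data=credentials, timeout=10)
    response.raise_for_status()
    # The session now carries whatever cookies the login response set.
    page = session.get(target_url, timeout=10)
    page.raise_for_status()
    return page.text


def fetch_with_selenium(login_url, target_url, username, password):
    """Path 2: drive a real browser (needs selenium and Chrome installed)."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(login_url)
        # "email" and "password" are hypothetical field names; inspect the
        # real login form to find the right locators.
        driver.find_element(By.NAME, "email").send_keys(username)
        password_box = driver.find_element(By.NAME, "password")
        password_box.send_keys(password)
        url_before_submit = driver.current_url
        password_box.submit()
        # Assumes a successful login navigates away from the login page.
        WebDriverWait(driver, 10).until(EC.url_changes(url_before_submit))
        driver.get(target_url)
        return driver.page_source
    finally:
        driver.quit()
```

For path 1, usage would look something like `with requests.Session() as s: html = login_and_fetch(s, LOGIN_URL, {"email": "...", "password": "..."}, REVIEW_URL)`, where `LOGIN_URL` and the field names are placeholders you would read from the browser's network tab while logging in manually.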
