Why is my webscraper not detecting any changes?
I wanted to code a web scraper with beautifulsoup4 and requests. It scrapes the data of specific columns of a specific table on a specific page. It scrapes the table once, waits a certain amount of time, scrapes it again, and then compares both scrapes. If there is a difference, it prints "something has changed", and if there isn't, it prints "no changes".

Here is the entire code:
import requests
import time
from bs4 import BeautifulSoup
URL = "https://website.com"
website = requests.get(URL)
soup = BeautifulSoup(website.content, "html.parser")
data = []
table = soup.find("table", class_="table table-bordered table-sm table-responsive")
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')[0]
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values
    cols2 = row.find_all('td')[1]
    cols2 = [ele.text.strip() for ele in cols2]
    data.append([ele for ele in cols2 if ele]) # Get rid of empty values
    cols3 = row.find_all('td')[2]
    cols3 = [ele.text.strip() for ele in cols3]
    data.append([ele for ele in cols3 if ele]) # Get rid of empty values
    cols4 = row.find_all('td')[3]
    cols4 = [ele.text.strip() for ele in cols4]
    data.append([ele for ele in cols4 if ele])
    cols5 = row.find_all('td')[5]
    cols5 = [ele.text.strip() for ele in cols5]
    data.append([ele for ele in cols5 if ele])
    print(cols, cols2, cols3, cols4, cols5)
time.sleep(600)
for row in rows:
    cols11 = row.find_all('td')[0]
    cols11 = [ele.text.strip() for ele in cols11]
    data.append([ele for ele in cols11 if ele]) # Get rid of empty values
    cols22 = row.find_all('td')[1]
    cols22 = [ele.text.strip() for ele in cols22]
    data.append([ele for ele in cols22 if ele]) # Get rid of empty values
    cols33 = row.find_all('td')[2]
    cols33 = [ele.text.strip() for ele in cols33]
    data.append([ele for ele in cols33 if ele]) # Get rid of empty values
    cols44 = row.find_all('td')[3]
    cols44 = [ele.text.strip() for ele in cols44]
    data.append([ele for ele in cols44 if ele])
    cols55 = row.find_all('td')[5]
    cols55 = [ele.text.strip() for ele in cols55]
    data.append([ele for ele in cols55 if ele])
    print(cols11, cols22, cols33, cols44, cols55)
if(cols == cols11, cols2 == cols22, cols5 == cols55):
    print("no changes")
else:
    print("something has changed")
The problem is that it always says "no changes", even though I know that something has changed. How can I fix this?

While lists can be compared in this way, it's not clear how you reached the conclusion that you can use a comma (,) in place of a logical AND operator in your if condition. Note that in Python this operator is spelled and, not &&.
What you're doing here by wrapping your conditions in parentheses () and joining them with commas (inadvertently, it would seem) is creating a tuple; all non-empty tuples evaluate to True in a boolean context. Thus, your script always takes the branch that you intended to be entered only if there are no changes between any of your data structures.
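The difference can be seen in a short, self-contained sketch. The two lists here are made-up stand-ins for the scraped columns, not data from the question:

```python
# Made-up stand-ins for two scrapes of the same column.
before = ["a", "b"]
after = ["a", "CHANGED"]

# The commas build a tuple of booleans; they do not combine them.
condition = (before == after, before == before)
print(condition)        # (False, True)
print(bool(condition))  # True, because the tuple is non-empty

# `and` combines the comparisons into a single boolean.
combined = before == after and before == before
print(combined)         # False
```

An if statement only looks at the truthiness of the whole expression, so the tuple version takes the first branch no matter what the individual comparisons returned.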
Instead, join the comparisons with Python's and keyword rather than packing the truth values into a tuple, as you seem to intend:
if cols == cols11 and cols2 == cols22 and cols5 == cols55:
    print("no changes")
else:
    print("something has changed")
This is tangential to the core of your question, but your code would benefit from (a) naming your variables much more descriptively and (b) using data types that better fit your use case, rather than introducing a brand-new numbered variable for every index and unnecessarily duplicating code.

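One possible shape for such a refactor is sketched below. The names WANTED_COLUMNS, extract_columns and has_changed are hypothetical, and the sketch simplifies the original by taking the stripped text of each whole cell via get_text(strip=True) instead of iterating over the cell's children:

```python
# Column positions taken from the question's code: 0, 1, 2, 3 and 5.
WANTED_COLUMNS = (0, 1, 2, 3, 5)


def extract_columns(rows, wanted=WANTED_COLUMNS):
    """Return one tuple of stripped cell texts per table row.

    `rows` is expected to be what table_body.find_all('tr') returns.
    Rows with too few cells are skipped instead of raising IndexError.
    """
    snapshot = []
    for row in rows:
        cells = row.find_all("td")
        if len(cells) <= max(wanted):
            continue
        snapshot.append(tuple(cells[i].get_text(strip=True) for i in wanted))
    return snapshot


def has_changed(before, after):
    """Lists of tuples compare by value, so one comparison covers every row."""
    return before != after
```

With this, each scrape becomes a single call such as first = extract_columns(rows), and the final check is has_changed(first, second). It covers every row rather than only the values left over from the last loop iteration.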
In addition to what others have said, you must make another GET request to the URL after pausing in order to detect any changes in the data of the webpage.
What you are doing is:
1. Making a GET request and creating a soup object of the response.
2. Extracting data from the soup and storing it in variables.
3. Pausing with time.sleep(600).
4. Extracting the same information from the same soup (which will always be equal) without making any new GET request.

So you need to add this code right after the time.sleep(600) statement to get the modified data (if any) from the webpage:
URL = "https://website.com"
website = requests.get(URL)
soup = BeautifulSoup(website.content, "html.parser")
table = soup.find("table", class_="table table-bordered table-sm table-responsive")
table_body = table.find('tbody')
rows = table_body.find_all('tr')
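
The overall fetch, wait, fetch again, compare flow could be wrapped up roughly as follows. This is only a sketch: detect_change and its fetch, parse and sleep parameters are hypothetical names, and the fetching and parsing steps are passed in as callables so the flow can be tried without a network connection:

```python
import time


def detect_change(fetch, parse, wait_seconds=600, sleep=time.sleep):
    """Fetch and parse twice with a pause in between; True if the data differ.

    fetch: callable returning the page content, e.g. the body of a response
    parse: callable turning that content into comparable data
    """
    before = parse(fetch())
    sleep(wait_seconds)
    after = parse(fetch())  # a fresh request, not the old soup
    return before != after


# Possible wiring with the libraries from the question (assumed, not run here):
#   fetch = lambda: requests.get(URL).content
#   parse = lambda html: BeautifulSoup(html, "html.parser").find("table").get_text()
```

The key point is that fetch is called a second time after the pause, so the second snapshot reflects the current state of the page.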