
Why is my webscraper not detecting any changes?

I wanted to code a web scraper with beautifulsoup4 and requests. It scrapes the data of specific columns of a specific table on a specific page. It scrapes once, waits a certain amount of time, scrapes again, and then compares both "scrapes". If there is a difference, it prints "something has changed", and if there isn't, it prints "no changes".

Here is the entire code:

import requests
import time
from bs4 import BeautifulSoup

URL = "https://website.com"
website = requests.get(URL)
soup = BeautifulSoup(website.content, "html.parser")


data = []
table = soup.find("table", class_="table table-bordered table-sm table-responsive")
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')[0]
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

    cols2 = row.find_all('td')[1]
    cols2 = [ele.text.strip() for ele in cols2]
    data.append([ele for ele in cols2 if ele])  # Get rid of empty values

    cols3 = row.find_all('td')[2]
    cols3 = [ele.text.strip() for ele in cols3]
    data.append([ele for ele in cols3 if ele])  # Get rid of empty values

    cols4 = row.find_all('td')[3]
    cols4 = [ele.text.strip() for ele in cols4]
    data.append([ele for ele in cols4 if ele])

    cols5 = row.find_all('td')[5]
    cols5 = [ele.text.strip() for ele in cols5]
    data.append([ele for ele in cols5 if ele])


    print(cols, cols2, cols3, cols4, cols5)

time.sleep(600)

for row in rows:
    cols11 = row.find_all('td')[0]
    cols11 = [ele.text.strip() for ele in cols11]
    data.append([ele for ele in cols11 if ele])  # Get rid of empty values

    cols22 = row.find_all('td')[1]
    cols22 = [ele.text.strip() for ele in cols22]
    data.append([ele for ele in cols22 if ele])  # Get rid of empty values

    cols33 = row.find_all('td')[2]
    cols33 = [ele.text.strip() for ele in cols33]
    data.append([ele for ele in cols33 if ele])  # Get rid of empty values

    cols44 = row.find_all('td')[3]
    cols44 = [ele.text.strip() for ele in cols44]
    data.append([ele for ele in cols44 if ele])

    cols55 = row.find_all('td')[5]
    cols55 = [ele.text.strip() for ele in cols55]
    data.append([ele for ele in cols55 if ele])


    print(cols11, cols22, cols33, cols44, cols55)


if(cols == cols11, cols2 == cols22, cols5 == cols55):
    print("no changes")
else:
    print("something has changed")

Problem is: it always says "no changes" even though I know that something had changed. How can I fix this?

While lists can be compared in this way, it's not clear how you reached the conclusion that you can use a comma in place of the logical and operator in your if condition.

What you're doing here by wrapping your conditions in parentheses () and joining them with a comma (inadvertently, it would seem) is creating a tuple; all non-empty tuples evaluate to True. Thus, your script always takes the branch you intended to reach only when none of your data structures have changed.
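For illustration (this snippet is not from the original post), a quick REPL session shows why the tuple form always takes the "no changes" branch:

>>> condition = (1 == 2, 3 == 4)   # the comma builds a tuple of two False values
>>> condition
(False, False)
>>> bool(condition)                # any non-empty tuple is truthy
True
>>> bool(())                       # only the empty tuple is falsy
False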

Instead, use the logical and operator properly (and don't collect the truth values into a tuple), as you seem to intend:

if cols == cols11 and cols2 == cols22 and cols5 == cols55:
    print("no changes")
else:
    print("something has changed")

Tangential to the core of your question, but your code would benefit from (a) naming your variables in a much more descriptive manner, and (b) using data types that better fit your use case instead of introducing a brand-new numbered variable for every index and unnecessarily duplicating code.
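A minimal sketch of that idea, assuming the rows variable from the question and keeping the same column indices (the wanted_columns name is just illustrative):

wanted_columns = (0, 1, 2, 3, 5)   # the five <td> indices scraped in the question
data = []
for row in rows:
    cells = row.find_all('td')
    row_values = [cells[i].text.strip() for i in wanted_columns]
    data.append([value for value in row_values if value])  # drop empty values, as before
    print(row_values)

This replaces the five near-identical cols/cols2/.../cols5 blocks with one loop, so adding or removing a column is a one-line change.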

In addition to what others have said, you must make another GET request to the URL after pausing for a while in order to detect any changes in the data of the webpage.

What you are doing is:

  1. Making a GET request to the URL.
  2. Creating a soup object from the response.
  3. Extracting the data from the soup and storing it in variables.
  4. Pausing for a while with time.sleep(600).
  5. Extracting the same information from the same soup again (which will always be equal) without making any new GET request.

So you need to add this code right after the time.sleep(600) statement to get any modified data from the webpage (if any):

URL = "https://website.com"
website = requests.get(URL)
soup = BeautifulSoup(website.content, "html.parser")

table = soup.find("table", class_="table table-bordered table-sm table-responsive")
table_body = table.find('tbody')

rows = table_body.find_all('tr')
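
Putting both answers together, a minimal sketch of the whole flow could look like this (the scrape_rows helper and its column indices are illustrative assumptions, not part of the original code):

import time
import requests
from bs4 import BeautifulSoup

URL = "https://website.com"

def scrape_rows(url):
    """Fetch the page and return the text of the selected cells in every row."""
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    table = soup.find("table", class_="table table-bordered table-sm table-responsive")
    rows = table.find('tbody').find_all('tr')
    wanted_columns = (0, 1, 2, 3, 5)          # same <td> indices as in the question
    return [tuple(row.find_all('td')[i].text.strip() for i in wanted_columns)
            for row in rows]

first_scrape = scrape_rows(URL)
time.sleep(600)                               # wait, then request the page again
second_scrape = scrape_rows(URL)

if first_scrape == second_scrape:
    print("no changes")
else:
    print("something has changed")

Because each scrape triggers a fresh GET request, the second result actually reflects the current state of the page.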
