Scraping Dynamic Data and Avoiding Duplicates with BS4, Selenium in Python

What I'm trying to do here is retrieve data from a dynamic page that constantly reloads with new information. The way I have it set up, the page is re-read every 60 seconds. The issue is that the old data does not get removed from the page, so when the program goes through the data after a refresh, there are duplicates.

Note: The program sleeps at the beginning as initially there are no messages to scrape.

I am looking for a way to use the last record (in this case, messages[-1]) as the starting point for each new search, so as to prevent duplicates.

Appreciate all help! Thank you.

import time
import datetime

from bs4 import BeautifulSoup

# driver is a Selenium WebDriver and URL is the chat page address, both set
# up earlier; the three lists collect the scraped fields
usernames, timestamps, texts = [], [], []

driver.get(URL)
while True:
    time.sleep(60)  # wait for new messages before re-reading the page
    chat_page = driver.page_source
    chat_soup = BeautifulSoup(chat_page, 'lxml')
    messages = chat_soup.findAll('div', attrs={'class': 'message first'})
    for message in messages:
        username = message.div.h2.span.strong.text
        text = message.find('div', attrs={'class': 'markup'}).get_text()
        timestamp = message.find('span', attrs={'class': 'timestamp'}).get_text()
        today = str(datetime.date.today())
        timestamp = timestamp.replace('Today', today)  # expand 'Today' to a real date

        usernames.append(username)
        timestamps.append(timestamp)
        texts.append(text)
        print(timestamp, username, " : ", text)

I have created a temporary solution that checks each record before entering it into my SQLite3 database, using "INSERT OR IGNORE". Unfortunately, the program is constantly re-checking records, as it has no way of filtering out data that has already been scraped. Below is my temporary solution:

driver.get(URL)
while True:
    chat_page = driver.page_source
    chat_soup = BeautifulSoup(chat_page, 'lxml')
    messages = chat_soup.findAll('div', attrs={'class': 'message first'})
    for message in reversed(messages):
        username = message.div.h2.span.strong.text
        text = message.find('div', attrs={'class': 'markup'}).get_text()
        # strip quotes so they cannot break the hand-built SQL string below
        text = text.replace('"', '').replace("'", "")
        username = username.replace('"', '').replace("'", "")
        usernames.append(username)
        timestamp = message.find('span', attrs={'class': 'timestamp'}).get_text()
        today = str(datetime.date.today())
        timestamp = timestamp.replace('Today', today)
        isbot = message.find('span', attrs={'class': 'bot-tag'})
        if isbot:
            username = '(BOT) ' + username
        sql = '''INSERT OR IGNORE INTO 'chats' ('timestamp', 'username', 'text') VALUES ("%s", "%s", "%s")''' % (timestamp, username, text)
        conn.executescript(sql)
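For INSERT OR IGNORE to actually skip a re-scraped row, the chats table needs a uniqueness constraint for the duplicate to violate; without one, every insert succeeds. A minimal sketch of a schema that would make the statement above ignore already-stored messages (the column set is taken from the INSERT, the constraint choice is an assumption):

# assumed schema: the UNIQUE constraint is what lets INSERT OR IGNORE
# silently drop rows that have already been stored
conn.executescript('''
    CREATE TABLE IF NOT EXISTS chats (
        timestamp TEXT,
        username  TEXT,
        text      TEXT,
        UNIQUE (timestamp, username, text)
    );
''')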

I have found a solution using set.difference which works well.

In my problem, there's a set amount of data that exists at one time (let's say 10). We want to get the new values without the old ones.

olddata = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
newdata = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
unique_data = set(newdata).difference(olddata)

which evaluates to:

{11, 12, 13, 14}
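One caveat: sets are unordered, so the values in unique_data can come back in any order. If message order matters, a sketch of an order-preserving variant of the same idea:

seen = set(olddata)
unique_in_order = [x for x in newdata if x not in seen]  # [11, 12, 13, 14], in order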

Final Working Code:

oldlist = []
while True:
    chat_page = driver.page_source
    chat_soup = BeautifulSoup(chat_page, 'lxml')
    messages = chat_soup.findAll('div', attrs={'class': 'message first'})
    # keep only the messages not seen on a previous pass
    messages_dedupe = set(messages).difference(oldlist)
    for message in messages_dedupe:
        username = message.div.h2.span.strong.text
        text = message.find('div', attrs={'class': 'markup'}).get_text()
        timestamp = message.find('span', attrs={'class': 'timestamp'}).get_text()
        today = str(datetime.date.today())
        timestamp = timestamp.replace('Today', today)
        isbot = message.find('span', attrs={'class': 'bot-tag'})
        if isbot:
            username = '(BOT) ' + username
        usernames.append(username)
        timestamps.append(timestamp)
        texts.append(text)
        sqlvalues = (username, timestamp, text)
        c.execute("INSERT OR IGNORE INTO db (username, timestamp, text) VALUES (?, ?, ?)", sqlvalues)
        conn.commit()
        print(timestamp, username, ":", text)
    oldlist = messages  # remember this pass's messages for the next comparison
    time.sleep(20)
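A note on why the set difference works at all: BeautifulSoup Tag objects compare and hash by their rendered HTML, so two messages only count as duplicates when their full markup matches exactly. If that ever proves too strict or too slow, a sketch of keying on the scraped fields instead (message_key is a made-up helper, not part of the original code):

def message_key(message):
    # identify a message by its scraped fields rather than its full HTML
    username = message.div.h2.span.strong.text
    text = message.find('div', attrs={'class': 'markup'}).get_text()
    timestamp = message.find('span', attrs={'class': 'timestamp'}).get_text()
    return (timestamp, username, text)

seen_keys = set()
# inside the polling loop, replacing the set difference:
new_messages = [m for m in messages if message_key(m) not in seen_keys]
for m in new_messages:
    seen_keys.add(message_key(m))
    # ...extract and store m as above...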

So you're looking for a way to avoid checking each record for duplicates? This assumes each timestamp is a unique value and that reversed(messages) runs in order from the newest message to the oldest.

timestamp_array = []
while True:
    chat_page = driver.page_source
    chat_soup = BeautifulSoup(chat_page, 'lxml')
    messages = chat_soup.findAll('div', attrs={'class': 'message first'})
    for message in reversed(messages):  # newest message first
        username = message.div.h2.span.strong.text
        text = message.find('div', attrs={'class': 'markup'}).get_text()
        # strip quotes so they cannot break the hand-built SQL string below
        text = text.replace('"', '').replace("'", "")
        username = username.replace('"', '').replace("'", "")
        timestamp = message.find('span', attrs={'class': 'timestamp'}).get_text()
        today = str(datetime.date.today())
        timestamp = timestamp.replace('Today', today)
        isbot = message.find('span', attrs={'class': 'bot-tag'})
        if isbot:
            username = '(BOT) ' + username
        if timestamp in timestamp_array:
            break  # first duplicate reached; the rest are already stored
        timestamp_array.append(timestamp)
        usernames.append(username)
        sql = '''INSERT OR IGNORE INTO 'chats' ('timestamp', 'username', 'text') VALUES ("%s", "%s", "%s")''' % (timestamp, username, text)
        conn.executescript(sql)

This will break out of the for loop once the first duplicate is reached.
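One possible refinement: timestamp_array grows for as long as the scraper runs, and timestamp in timestamp_array is a linear scan. Since the loop visits messages newest-first, remembering only the newest stored timestamp is enough; a minimal sketch under the same uniqueness assumption:

last_seen = None
while True:
    chat_soup = BeautifulSoup(driver.page_source, 'lxml')
    messages = chat_soup.findAll('div', attrs={'class': 'message first'})
    newest = None
    for message in reversed(messages):  # newest message first
        timestamp = message.find('span', attrs={'class': 'timestamp'}).get_text()
        if timestamp == last_seen:
            break  # everything from here on was stored on a previous pass
        if newest is None:
            newest = timestamp  # newest timestamp seen on this pass
        # ...extract and store the message as above...
    if newest is not None:
        last_seen = newest
    time.sleep(20)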
