
Scraping Dynamic Data and Avoiding Duplicates with BS4, Selenium in Python

What I'm trying to do here is retrieve data from a dynamic page that constantly reloads with new information. The way I have it set up, the page refreshes every 60 seconds. The issue is that the old data does not get removed from the page, so when the program goes through the data after a refresh, it picks up duplicates.

Note: The program sleeps at the beginning as initially there are no messages to scrape.

I am looking for a way to use the last record (in this case `messages[-1]`) as a starting point for searches, so as to prevent duplicates.

Appreciate all help! Thank you.

import datetime
import time

from bs4 import BeautifulSoup

usernames, timestamps, texts = [], [], []

driver.get(URL)
while True:
    time.sleep(60)
    chat_page = driver.page_source
    chat_soup = BeautifulSoup(chat_page, 'lxml')
    messages = chat_soup.find_all('div', attrs={'class': 'message first'})
    for message in messages:
        username = message.div.h2.span.strong.text
        text = message.find('div', attrs={'class': 'markup'}).get_text()
        timestamp = message.find('span', attrs={'class': 'timestamp'}).get_text()
        today = str(datetime.date.today())
        timestamp = timestamp.replace('Today', today)

        usernames.append(username)
        timestamps.append(timestamp)
        texts.append(text)
        print(timestamp, username, ":", text)

I have created a temporary solution that checks each record before inserting it into my SQLite3 database, relying on "INSERT OR IGNORE" to skip duplicates. Unfortunately, the program re-checks every record on every pass, as it has no way of filtering out data that has already been scraped. Below is my temporary solution:

driver.get(URL)
while True:
    chat_page = driver.page_source
    chat_soup = BeautifulSoup(chat_page, 'lxml')
    messages = chat_soup.find_all('div', attrs={'class': 'message first'})
    for message in reversed(messages):
        username = message.div.h2.span.strong.text
        usernames.append(username)
        text = message.find('div', attrs={'class': 'markup'}).get_text()
        text = text.replace('"', '')
        text = text.replace("'", "")
        username = username.replace('"', '')
        username = username.replace("'", "")
        timestamp = message.find('span', attrs={'class': 'timestamp'}).get_text()
        today = str(datetime.date.today())
        timestamp = timestamp.replace('Today', today)
        isbot = message.find('span', attrs={'class': 'bot-tag'})
        if isbot:
            username = '(BOT) ' + username
        # Parameterized query instead of string interpolation, which left
        # the statement open to SQL injection via the scraped text.
        conn.execute("INSERT OR IGNORE INTO chats (timestamp, username, text) VALUES (?, ?, ?)",
                     (timestamp, username, text))
    conn.commit()
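One detail worth noting: `INSERT OR IGNORE` only skips rows that would violate a uniqueness constraint, so the table needs one for this to deduplicate at all. A minimal sketch (the `chats` schema and column names here are assumed, not taken from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: without the UNIQUE constraint, INSERT OR IGNORE
# would happily insert the same row twice.
conn.execute("""
    CREATE TABLE chats (
        timestamp TEXT,
        username  TEXT,
        text      TEXT,
        UNIQUE (timestamp, username, text)
    )
""")
row = ("2020-01-01 12:00", "alice", "hello")
conn.execute("INSERT OR IGNORE INTO chats VALUES (?, ?, ?)", row)
conn.execute("INSERT OR IGNORE INTO chats VALUES (?, ?, ?)", row)  # ignored
count = conn.execute("SELECT COUNT(*) FROM chats").fetchone()[0]
print(count)  # 1
```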

I have found a solution using set.difference which works well.

In my problem, there's a set amount of data that exists at any one time (let's say 10). We want to get the new values without the old.

    olddata = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    newdata = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
    unique_data = set(newdata).difference(olddata)
    # unique_data == {11, 12, 13, 14}
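One caveat: `set.difference` returns an unordered set, so the new items come back in arbitrary order. If the order messages appeared on the page matters, a list comprehension with a set for membership tests keeps the original order (a small sketch, not part of the original answer):

```python
olddata = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
newdata = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

seen = set(olddata)  # set gives O(1) membership tests
ordered_new = [x for x in newdata if x not in seen]  # keeps newdata's order
print(ordered_new)  # [11, 12, 13, 14]
```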

Final Working Code:

oldlist = []  # must exist before the first set difference
while True:
    chat_page = driver.page_source
    chat_soup = BeautifulSoup(chat_page, 'lxml')
    messages = chat_soup.find_all('div', attrs={'class': 'message first'})
    messages_dedupe = set(messages).difference(oldlist)
    for message in messages_dedupe:
        username = message.div.h2.span.strong.text
        text = message.find('div', attrs={'class': 'markup'}).get_text()
        timestamp = message.find('span', attrs={'class': 'timestamp'}).get_text()
        today = str(datetime.date.today())
        timestamp = timestamp.replace('Today', today)
        isbot = message.find('span', attrs={'class': 'bot-tag'})
        if isbot:
            username = '(BOT) ' + username
        usernames.append(username)
        timestamps.append(timestamp)
        texts.append(text)
        c.execute("INSERT OR IGNORE INTO db (username, timestamp, text) VALUES (?, ?, ?)",
                  (username, timestamp, text))
        conn.commit()
        print(timestamp, username, ":", text)
    oldlist = messages  # remember this pass's messages for the next diff
    time.sleep(20)
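A word of caution on this approach: the set difference above compares BeautifulSoup `Tag` objects, and how those hash and compare can vary across library versions and re-parses of the page. A more robust variant is to dedupe on a stable key built from the extracted fields. The sketch below uses plain dicts to stand in for the scraped messages; the field names mirror the code above but the helper itself is hypothetical:

```python
def new_messages(messages, seen_keys):
    """Return messages whose (timestamp, username, text) key has not been
    seen before, in page order, recording each new key in seen_keys."""
    fresh = []
    for m in messages:
        key = (m["timestamp"], m["username"], m["text"])
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(m)
    return fresh

seen = set()
batch1 = [{"timestamp": "10:00", "username": "a", "text": "hi"},
          {"timestamp": "10:01", "username": "b", "text": "yo"}]
# Second scrape: old messages still on the page, plus one new one.
batch2 = batch1 + [{"timestamp": "10:02", "username": "a", "text": "new"}]

first = new_messages(batch1, seen)
second = new_messages(batch2, seen)
print(len(first), len(second))  # 2 1
```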

So you're looking for a way to avoid checking each record for duplicates? This assumes each timestamp is a unique value and that reversed(messages) iterates from the newest message to the oldest.

timestamp_array = []
while True:
    chat_page = driver.page_source
    chat_soup = BeautifulSoup(chat_page, 'lxml')
    messages = chat_soup.find_all('div', attrs={'class': 'message first'})
    for message in reversed(messages):
        username = message.div.h2.span.strong.text
        usernames.append(username)
        text = message.find('div', attrs={'class': 'markup'}).get_text()
        text = text.replace('"', '')
        text = text.replace("'", "")
        username = username.replace('"', '')
        username = username.replace("'", "")
        timestamp = message.find('span', attrs={'class': 'timestamp'}).get_text()
        today = str(datetime.date.today())
        timestamp = timestamp.replace('Today', today)
        isbot = message.find('span', attrs={'class': 'bot-tag'})
        if isbot:
            username = '(BOT) ' + username
        if timestamp in timestamp_array:
            break
        timestamp_array.append(timestamp)
        # Parameterized query: safer than interpolating scraped text into SQL.
        conn.execute("INSERT OR IGNORE INTO chats (timestamp, username, text) VALUES (?, ?, ?)",
                     (timestamp, username, text))
    conn.commit()

This will break out of the for loop once the first duplicate is reached.
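Since `timestamp_array` is a list that grows on every pass, the `in` check is O(n) per message. Keeping the seen timestamps in a set makes each lookup O(1); a minimal sketch of the idea (the helper name is ours, and timestamps are assumed unique as in the answer above):

```python
seen_timestamps = set()  # set membership tests are O(1), list's are O(n)

def is_new(timestamp):
    """Return True (and remember the timestamp) only the first time it is seen."""
    if timestamp in seen_timestamps:
        return False
    seen_timestamps.add(timestamp)
    return True

results = [is_new("12:00"), is_new("12:01"), is_new("12:00")]
print(results)  # [True, True, False]
```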
