Function to remove duplicates from a list of tuples in Python

Question

In the function sqlPull() I pull the most recent 5 entries from a MySQL database every 5 seconds. In the second function, dupCatch(), I am attempting to remove the duplicates that would appear in SQL pull n+1 when compared to pull n. I want to save only the unique tuples, but right now the function prints the same list of tuples 5 times every five seconds.

In English, what I am attempting to do with dupCatch() is take the data from sqlPull(), initialize an empty list, and say: for each tuple in the variable data, if that tuple is not in the empty list, add it to the newData variable; if not, set lastPull equal to the non-unique tuples.

Obviously, my function is wrong, but I'm not sure how to fix it.

import mysql.connector
import datetime
import requests
from operator import itemgetter
import time

run = True

def sqlPull():
    connection = mysql.connector.connect(user='XXX', password='XXX', host='XXXX', database='MeshliumDB')
    cursor = connection.cursor()
    cursor.execute("SELECT TimeStamp, MAC, RSSI FROM wifiscan ORDER BY TimeStamp DESC LIMIT 5;")
    data = cursor.fetchall()
    connection.close()
    time.sleep(5)
    return data

def dupCatch():
    data = sqlPull()
    lastPull = []
    for (TimeStamp, MAC, RSSI) in data:
        if (TimeStamp, MAC, RSSI) not in lastPull:
            newData = data
        else:
            lastPull = data
        print newData

while run == True:
    dupCatch()

This is what the output I am getting now looks like:

[(datetime.datetime(2013, 11, 14, 20, 28, 54), u'E0:CB:1D:36:EE:9D', u' 20'), (datetime.datetime(2013, 11, 14, 20, 28, 53), u'00:1E:8F:75:82:35', u' 21'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'00:1E:4C:03:C0:66', u' 26'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33')]
[(datetime.datetime(2013, 11, 14, 20, 28, 54), u'E0:CB:1D:36:EE:9D', u' 20'), (datetime.datetime(2013, 11, 14, 20, 28, 53), u'00:1E:8F:75:82:35', u' 21'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'00:1E:4C:03:C0:66', u' 26'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33')]
[(datetime.datetime(2013, 11, 14, 20, 28, 54), u'E0:CB:1D:36:EE:9D', u' 20'), (datetime.datetime(2013, 11, 14, 20, 28, 53), u'00:1E:8F:75:82:35', u' 21'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'00:1E:4C:03:C0:66', u' 26'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33')]
[(datetime.datetime(2013, 11, 14, 20, 28, 54), u'E0:CB:1D:36:EE:9D', u' 20'), (datetime.datetime(2013, 11, 14, 20, 28, 53), u'00:1E:8F:75:82:35', u' 21'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'00:1E:4C:03:C0:66', u' 26'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33')]
[(datetime.datetime(2013, 11, 14, 20, 28, 54), u'E0:CB:1D:36:EE:9D', u' 20'), (datetime.datetime(2013, 11, 14, 20, 28, 53), u'00:1E:8F:75:82:35', u' 21'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'00:1E:4C:03:C0:66', u' 26'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33')]
[(datetime.datetime(2013, 11, 14, 20, 28, 54), u'E0:CB:1D:36:EE:9D', u' 20'), (datetime.datetime(2013, 11, 14, 20, 28, 53), u'00:1E:8F:75:82:35', u' 21'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'00:1E:4C:03:C0:66', u' 26'), (datetime.datetime(2013, 11, 14, 20, 28, 52), u'78:E4:00:0C:50:DF', u' 33')]

Answer 1

Assuming you're only trying to filter out adjacent repeats, not repeats ever seen…

First, the first time you find a tuple that's in lastPull, you're going to set lastPull = data. That means all of the subsequent tuples will automatically be in lastPull.

Meanwhile, you're setting either lastPull or newData each time through the loop. So, one of these is going to happen:

If all tuples are new, you will set newData (repeatedly) and not update lastPull.
If the first tuple is new, but at least one tuple is a repeat, you will set newData and also update lastPull.
If the first tuple is a repeat, you will only update lastPull.

This can't be the logic you wanted. I think you want to use any or all, or to put a break in one of the conditions and put the opposite in an else clause on the for, but I'm honestly not sure what you're trying to do here.
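As one possible reading of the any / for-else suggestion, here is a minimal sketch that treats a batch as new only if none of its rows appeared in the previous batch. The function names, the last_pull argument, and the sample rows are hypothetical, and this may not be the logic the asker had in mind.

```python
def batch_is_new(data, last_pull):
    """Return True if no row of data appeared in last_pull (for/else form)."""
    for row in data:
        if row in last_pull:
            break          # found a repeat, so the batch is not entirely new
    else:
        return True        # loop finished without break: every row is new
    return False


def batch_is_new_any(data, last_pull):
    """The same check written with any()."""
    return not any(row in last_pull for row in data)


# Hypothetical rows standing in for the (TimeStamp, MAC, RSSI) tuples.
previous = [("t1", "AA", "20"), ("t2", "BB", "21")]
fresh = [("t3", "CC", "33")]
overlapping = [("t3", "CC", "33"), ("t2", "BB", "21")]
```

The else clause on a for loop runs only when the loop ends without hitting break, which is why it can hold the "opposite" branch.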

Meanwhile, your code always does a print newData each time through the loop. So, for each tuple, you're going to print all of the tuples. As mentioned above, this will always be the new ones if the first tuple is new, otherwise the previous ones. Again, this can't be what you want, but I'm not sure what you do want. Maybe you want to print newData outside the loop, instead of each time through?
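To see why the whole list shows up once per row, the sketch below mirrors the structure of the question's loop on hypothetical sample rows. Instead of printing, it collects what each iteration would have printed, so the function name and the printed list are illustration only.

```python
def dup_catch_as_written(data):
    """Mirror the question's loop; return what each iteration would print."""
    printed = []
    last_pull = []
    new_data = None
    for row in data:
        if row not in last_pull:
            new_data = data       # rebinds new_data to the whole batch
        else:
            last_pull = data
        printed.append(new_data)  # stands in for `print newData` in the loop
    return printed


# Hypothetical rows standing in for the (TimeStamp, MAC, RSSI) tuples.
sample = [("t1", "AA", "20"), ("t2", "BB", "21"), ("t3", "CC", "33")]
output = dup_catch_as_written(sample)
```

In this sketch last_pull starts out empty on every call, so each of the three iterations reports the full three-row batch.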

On top of all that, you say you want to add things to the newData list, but in your code you're just replacing the variable over and over. To add things to a list, you need to call append on it. (Or extend, if you have a list of new things to add all in one go.)
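A short sketch of the difference, using hypothetical rows: append adds one tuple, extend adds every item of another list, and plain assignment throws the old contents away.

```python
row = ("t1", "AA", "20")

new_data = []
new_data.append(row)                     # adds the tuple as a single item
new_data.extend([("t2", "BB", "21"),     # adds each tuple in the given list
                 ("t3", "CC", "33")])

# Pitfall: extend() with a single tuple adds its fields one by one.
flattened = []
flattened.extend(row)

# Assignment rebinds the name instead of adding to the list.
replaced = [("t0", "ZZ", "10")]
replaced = [row]
```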

Answer 2

Rather than try to figure out what your code is trying to do and fix it, let's go back to your English description:

In English, what I am attempting to do with dupCatch() is take the data from sqlPull(), initialize an empty list, and say: for each tuple in the variable data, if that tuple is not in the empty list, add it to the newData variable; if not, set lastPull equal to the non-unique tuples.

So:

seen = set()
def dupCatch():
    data = sqlPull()
    new_data = []
    for (TimeStamp, MAC, RSSI) in data:
        if (TimeStamp, MAC, RSSI) not in seen:
            seen.add((TimeStamp, MAC, RSSI))
            new_data.append((TimeStamp, MAC, RSSI))
    print new_data

Or, more concisely:

seen = set()
def dupCatch():
    data = sqlPull()
    # Keep only the rows that have never been seen before.
    new_data = [row for row in data if row not in seen]
    # Remember them so later pulls skip them.
    seen.update(new_data)
    print new_data

Either way, the trick here is that we have a set which keeps track of every row we've ever seen. So, for each new row, if it's in that set, we've seen it and can ignore it; otherwise, we have to not ignore it, and add it to the set for later.

The second version just simplifies things by filtering all 5 rows at once, and then update-ing the set with all of the new ones at once, instead of doing it row by row.

The reason that seen has to be global is that a global lives forever, across all runs of the function, so we can use it to keep track of every row we've ever seen; if we made it local to the function, it would be new each time, so we'd only be keeping track of rows we've seen in the current batch, which isn't very useful.
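The difference can be sketched without a database by feeding two hypothetical batches to a function that uses a module-level set and to one that creates its set locally. The function names and sample rows are made up for illustration.

```python
seen = set()  # module-level: survives across calls


def filter_new(batch):
    """Return rows never seen in any earlier call."""
    new_rows = [row for row in batch if row not in seen]
    seen.update(new_rows)
    return new_rows


def filter_new_local(batch):
    """Same idea, but the set is recreated on every call."""
    local_seen = set()
    new_rows = []
    for row in batch:
        if row not in local_seen:
            local_seen.add(row)
            new_rows.append(row)
    return new_rows


# Two hypothetical pulls that share one row.
first = [("t1", "AA", "20"), ("t2", "BB", "21")]
second = [("t2", "BB", "21"), ("t3", "CC", "33")]

global_results = [filter_new(first), filter_new(second)]
local_results = [filter_new_local(first), filter_new_local(second)]
```

With the module-level set, the second call returns only the ("t3", "CC", "33") row; with the local set, the shared row is reported again.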

In general, globals are bad. However, things like persistent caches are an exception to the "in general" rule. The whole point of them is that they're not local. If you had an object model in mind that made sense, seen would be much better as a member of whatever object dupCatch was a method on than as a global. If you had a good reason to define the function as a closure inside another function, seen would be better as part of that closure. And so on. But otherwise, a global is the best option.
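For comparison, here is a rough sketch of the two alternatives mentioned, a closure and an object attribute. The names make_dup_catcher and DupCatcher are hypothetical, and both take the batch as an argument rather than calling sqlPull() so they can be tried without a database.

```python
def make_dup_catcher():
    """Closure version: seen lives in the enclosing function's scope."""
    seen = set()

    def catch(batch):
        new_rows = []
        for row in batch:
            if row not in seen:
                seen.add(row)
                new_rows.append(row)
        return new_rows

    return catch


class DupCatcher(object):
    """Object version: seen is an attribute of the instance."""

    def __init__(self):
        self.seen = set()

    def catch(self, batch):
        new_rows = []
        for row in batch:
            if row not in self.seen:
                self.seen.add(row)
                new_rows.append(row)
        return new_rows


# Hypothetical rows standing in for the (TimeStamp, MAC, RSSI) tuples.
a, b, c = ("t1", "AA", "20"), ("t2", "BB", "21"), ("t3", "CC", "33")
catch = make_dup_catcher()
closure_results = [catch([a, b]), catch([b, c])]
```

Each catcher keeps its own set, so two independent catchers do not share history the way callers of a single global would.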

If you reorganized your code a bit, you could make this even simpler:

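# unique_everseen is a recipe from the itertools documentation
# (it also ships in the third-party more_itertools package).
def pull():
    while True:
        for row in sqlPull():
            yield row

for row in unique_everseen(pull()):
    print row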

… or even:

from itertools import chain

for row in unique_everseen(chain.from_iterable(iter(sqlPull, None))):
    print row

See Iterators and the next few tutorial sections, the itertools documentation, and David M. Beazley's presentations to understand what this last version does. But for a novice, you might want to stick with the second version.
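To make the one-liner easier to follow, here is a self-contained sketch. The unique_everseen below is a simplified version of the itertools recipe (no key argument), and fake_sql_pull is a hypothetical stand-in for sqlPull() that returns two canned batches and then None.

```python
from itertools import chain


def unique_everseen(iterable):
    """Yield each element the first time it appears (simplified recipe)."""
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element


# Hypothetical canned batches standing in for successive database pulls.
batches = iter([
    [("t1", "AA", "20"), ("t2", "BB", "21")],
    [("t2", "BB", "21"), ("t3", "CC", "33")],
])


def fake_sql_pull():
    """Return the next canned batch, or None when there are none left."""
    return next(batches, None)


# iter(func, sentinel) calls func repeatedly until it returns the sentinel;
# chain.from_iterable flattens the batches into a single stream of rows.
rows = list(unique_everseen(chain.from_iterable(iter(fake_sql_pull, None))))
```

The real sqlPull() never returns None, so with it the loop simply keeps running; the sentinel only ends the stream in this stub.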

Answer 3

Try this:

def dupCatch():
    data = sqlPull()
    lastPull = []
    for x in data:
        if x not in lastPull:
            print(x)
        lastPull.append(x)

Answer 4

The problem is that lastPull is a local variable, so it gets set to [] every time, and doesn't persist between function calls. For what you're trying to do, you should use a class and store the last pull there:

import mysql.connector
import datetime
import requests
import time

class SqlPuller(object):
    def __init__(self):
        self.last_pull = set()

    def pull(self):
        connection = mysql.connector.connect(user='XXX', password='XXX',
                host='XXXX', database='MeshliumDB')
        cursor = connection.cursor()
        cursor.execute("SELECT TimeStamp, MAC, RSSI FROM wifiscan ORDER BY TimeStamp DESC LIMIT 5;")
        data = cursor.fetchall()
        connection.close()
        return data

    def pull_new(self):
        new_data = []
        data = self.pull()
        for item in data:
            if item not in self.last_pull:
                new_data.append(item)
        self.last_pull = set(data)
        return new_data


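if __name__ == "__main__":
    sql_puller = SqlPuller()
    while True:
        # Print only the rows that were not in the previous pull.
        for item in sql_puller.pull_new():
            print(item)
        # Wait once per pull, not once per row.
        time.sleep(5)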
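Because last_pull is replaced with the latest batch on every call, this class filters only against the immediately preceding pull, not against everything ever seen. The sketch below shows that behaviour with a hypothetical StubPuller whose pull() returns canned batches instead of querying MySQL.

```python
class StubPuller(object):
    """Same pull_new logic as SqlPuller, with canned data (hypothetical)."""

    def __init__(self, batches):
        self.batches = iter(batches)
        self.last_pull = set()

    def pull(self):
        return next(self.batches)

    def pull_new(self):
        data = self.pull()
        new_data = [item for item in data if item not in self.last_pull]
        self.last_pull = set(data)  # remember only the most recent batch
        return new_data


a, b = ("t1", "AA", "20"), ("t2", "BB", "21")
puller = StubPuller([[a], [b], [a]])
results = [puller.pull_new() for _ in range(3)]
```

Row a is reported again on the third pull because it was absent from the second one. If rows should never repeat, a set that only grows would be needed instead.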

solution1
1 2013-11-15 01:46:16

solution2
1 ACCEPTED 2013-11-15 01:50:38

solution3
0 2013-11-15 01:41:54

solution4
0 2013-11-15 01:42:07
