Python: remove duplicates from a list based on a few fields

I know similar questions already have answers, but I think my case is a little different. I have a MySQL database with a big table (40,000+ entries). The table structure is this:

    Field    |  Type       |Null |Key  |Default |   Extra   
    -----------------------------------------------------
    Messaggio|  longtext   |NO   |     |NULL    |
    Id       |  bigint(20) |NO   |     |NULL    |
    Data     |  date       |NO   |     |NULL    |
    Partito  |  text       |NO   |     |NULL    |
    Numero   |  bigint(23) |NO   |PRI  |NULL    |auto_increment

I have to remove duplicate rows that have the same values in 'Messaggio', 'Id' and 'Partito'. For example:

    Messaggio  | Id  | Data     | Partito | Numero
    ----------------------------------------------
    long_text1 | 123 | somedate | M5s     | 1
    long_text1 | 123 | somedate | M5s     | 2
    long_text2 | 123 | somedate | M5s     | 3

In this case I have to delete one of the first two entries.

I've tried this:

import MySQLdb

db = MySQLdb.connect(host="localhost", port=xxxxx, user="xxxxxxx", passwd="xxxxxx", db="xxxxx", charset='utf8', use_unicode=True)
db.ping(True)

cursor = db.cursor()

cursor.execute("SET NAMES utf8;")

cursor.execute("SELECT `Messaggio`, `Id`, `Data`, `Partito`, `Numero` FROM `Statuses` WHERE 1")

data = cursor.fetchall()

# Build a dict keyed on the first column (Messaggio), then take its values
data2 = dict((x[0], x) for x in data).values()

print (data2)
print (len(data))
print (len(data2))

Output:

- a very long list
- 41804
- 39558

It is not clear to me what this code, dict((x[0], x) for x in data).values(), does (I'm pretty new to Python and I still have to figure out how dictionaries work). My first thought was that it deletes identical rows (with the same values in all 5 fields), but this is not possible because the field 'Numero' is auto-increment, so it can't have duplicates (I've checked with a query on MySQL and no duplicates of 'Numero' were found).

My questions:

  1. Why did that code remove about 2,000 items? Does it remove any kind of duplicate?

  2. What is the best way to obtain the result I want?

It removes all rows having the same Messaggio except the very last one. Consider the following code:

>>> {1:2, 1:3}
{1: 3}

You are building a dict with multiple assignments to the same key, and only the very last assignment persists.
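As a small illustration of why about 2,000 rows disappeared, here is a sketch with made-up sample rows (the dates, the second Id and the 'PD' value are hypothetical, not taken from the real table). Keying on x[0] alone drops every row that shares a Messaggio with a later row, even when Id or Partito differ:

```python
# Hypothetical sample rows: (Messaggio, Id, Data, Partito, Numero)
rows = [
    ("long_text1", 123, "2014-01-01", "M5s", 1),
    ("long_text1", 123, "2014-01-01", "M5s", 2),  # true duplicate of Numero 1
    ("long_text1", 456, "2014-01-02", "PD", 3),   # same text, different Id and Partito
    ("long_text2", 123, "2014-01-03", "M5s", 4),
]

# Keying on x[0] alone: each row with the same Messaggio overwrites the previous one
by_message = dict((x[0], x) for x in rows)

# Numero values of the rows that survived
kept = sorted(r[4] for r in by_message.values())
print(kept)  # [3, 4]
```

Numero 1 and 2 are both gone, and only the last long_text1 row (Numero 3) survives, although Numero 3 is not a duplicate under the rule from the question.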

Back to:

dict((x[0], x) for x in data).values()

Starting from the end, .values() lists the values of a dictionary (as a list in Python 2; Python 3 returns a dict_values view of the same items):

>>> {1:'a', 2:'b'}.values()
['a', 'b']

The dict is created from an iterable of (key, value) pairs, for example a tuple of tuples:

>>> dict(((1,'a'),(2,'b')))
{1: 'a', 2: 'b'}

The innermost part works like this:

>>> list((x[0], x) for x in [[1,2,3], ['a','b','c']])
[(1, [1, 2, 3]), ('a', ['a', 'b', 'c'])]

So I think you want to use a tuple of the three relevant fields as the key:

dict(((x[0], x[1], x[3]), x) for x in data).values()
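If the goal is also to know which rows to delete from the table, a variation on the same idea is sketched below. It is only an illustration: the helper name split_duplicates and the sample rows are hypothetical, and unlike the dict approach above it keeps the first occurrence of each (Messaggio, Id, Partito) key rather than the last.

```python
def split_duplicates(rows):
    """Return (kept_rows, duplicate_numeros), using (Messaggio, Id, Partito) as the key."""
    seen = set()
    kept = []
    duplicate_numeros = []
    for row in rows:
        key = (row[0], row[1], row[3])  # Messaggio, Id, Partito
        if key in seen:
            # Same key already kept: remember this row's Numero for deletion
            duplicate_numeros.append(row[4])
        else:
            seen.add(key)
            kept.append(row)
    return kept, duplicate_numeros


# Hypothetical sample rows: (Messaggio, Id, Data, Partito, Numero)
rows = [
    ("long_text1", 123, "2014-01-01", "M5s", 1),
    ("long_text1", 123, "2014-01-01", "M5s", 2),  # duplicate of Numero 1
    ("long_text1", 456, "2014-01-02", "PD", 3),   # same text, different Id and Partito
    ("long_text2", 123, "2014-01-03", "M5s", 4),
]

kept, to_delete = split_duplicates(rows)
print([r[4] for r in kept])  # [1, 3, 4]
print(to_delete)             # [2]
```

The Numero values in to_delete could then be passed to a parameterized query, for example cursor.executemany("DELETE FROM `Statuses` WHERE `Numero` = %s", [(n,) for n in to_delete]) followed by db.commit(). It is worth trying this on a copy of the table first.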
