Comparing data from a CSV in Python

Question

I am using Python 2.7 to clean some data from a CSV file before loading it into a MySQL database.

Each row is a user entry with a timestamp.

Before I send the data to the db, I want to check the CSV for duplicate rows (two rows with the same username), and then use the timers on those rows to decide which one to keep.

import csv

# set up data container
data = []

# read csv file
with open(file, 'rU') as f:
    # create file reader
    reader = csv.reader(f)

    # skip first row (headers)
    next(reader)

    # gather data in a table 
    for row in reader:
        data.append(row)

I think I am getting confused with comparing items in a 2d array... I know that the usernames are in data[][1] and the timer (int) is in data[][52].

I tried to create a new list like this:

usernames = []
cleaner_list = data
for row in data:
    if row[1] in usernames:
        pass  # dupe
    else:
        usernames.append(row[2])

But I keep going out of range when trying to compare the data like this:

if row[1] in usernames:
    if row[52] > usernames[row[2]][52]:
        pass  # delete row[52] from cleaner_data
    else:
        pass  # delete the equivalent row in usernames from cleaner_data

I feel that I am overthinking this, but I can't use a set because I need the data to stay in order. I thought about creating some sort of enum list of the unique usernames and filtering the CSV column with that, but I wouldn't know how to keep the correct references to the row when I find a duplicate and need to check its timer before deleting it. Any help would be really appreciated!

Answer 1

I'd do the following: keep a dictionary of users, each mapped to the row with their latest timestamp. If you find something newer while scanning the CSV, replace the old value.

cleaner_data = {}
for row in data:
    user = row[1]
    if user not in cleaner_data:
        # user name not yet seen: add
        cleaner_data[user] = row
    elif int(row[52]) > int(cleaner_data[user][52]):
        # already seen, but newer timestamp: replace
        # (csv.reader yields strings, so compare as ints)
        cleaner_data[user] = row
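
Two things are worth keeping in mind with this approach: a plain dict in Python 2.7 does not preserve insertion order, and csv.reader returns every field as a string, so timers need converting to int before they are compared. Here is a minimal sketch that handles both, assuming the timer column holds integer text and that the larger timer marks the row to keep. The function name dedupe_rows and its parameters are illustrative, not something from the question.

```python
from collections import OrderedDict


def dedupe_rows(rows, user_col=1, timer_col=52):
    """Keep one row per username: the one with the largest timer.

    Hypothetical helper; the default column indexes are the ones
    mentioned in the question. Output order follows the first
    appearance of each username.
    """
    latest = OrderedDict()
    for row in rows:
        user = row[user_col]
        kept = latest.get(user)
        # csv.reader yields strings, so convert before comparing
        if kept is None or int(row[timer_col]) > int(kept[timer_col]):
            # reassigning an existing key keeps its original position
            latest[user] = row
    return list(latest.values())
```

Calling cleaner_list = dedupe_rows(data) would then give a list of rows, one per username, that can be passed on to the database insert. If the timers are not plain integers, the int() calls would need to be swapped for a suitable parser.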
