How to find duplicates in a Python list that are adjacent to each other and list them with respect to their indices?

I have a program that reads a .csv file, checks each row for a mismatch in column count (by comparing it to the number of header fields), and returns everything it found as a list, which is then written to a file. What I want to do with this list is print the results as follows:

row numbers where the same mismatch is found : the number of columns in those rows

e.g.

rows: n-m : y

where n and m are the first and last row numbers of a run of adjacent rows that share the same column count, and that count (y) does not match the header.

I have looked into these topics, and while the information is useful, they do not answer the question:

Find and list duplicates in a list?

Identify duplicate values in a list in Python

This is where I am right now:

import csv

# `data` is the open input file and `file` the open output file (both opened elsewhere)
r = csv.reader(data, delimiter='\t')
columns = []
for row in r:
    # record the number of columns in each row
    columns.append(len(row))

for a in range(len(columns)):
    # compare each row's column count with the header's (row 0)
    if columns[a] != columns[0]:
        # on a mismatch, write the 1-based row number and that row's column count
        file.write("row  " + str(a + 1) + ": " + str(columns[a]) + " \n")

The file output looks like this:

row  7220: 0 
row  7221: 0 
row  7222: 0 
row  7223: 0 
row  7224: 0 
row  7225: 1 
row  7226: 1 

while the desired end result is:

rows 7220 - 7224 : 0
rows 7225 - 7226 : 1

So what I essentially need, the way I see it, is a dictionary where the key is the range of rows with the same value and the value is the number of columns in that mismatch. What I think I need (in horribly written pseudocode that doesn't make any sense now that I'm reading it years after writing this question) is here:

def pseudoList(originalList):
    # split the list into runs of equal adjacent values;
    # each run holds the indices it covers
    if not originalList:
        return []
    ListOfLists = []
    duplicateList = [0]
    i = 1
    while i < len(originalList):
        if originalList[i] == originalList[i-1]:
            duplicateList.append(i)
        else:
            ListOfLists.append(duplicateList)
            duplicateList = [i]
        i += 1
    ListOfLists.append(duplicateList)
    return ListOfLists


def PseudocreateDict(ListOfLists, originalList):
    pseudoDict = {}
    for x in ListOfLists:
        a = x[0]     # first index of the run
        b = x[-1]    # last index of the run
        # key: the index range, value: the value shared by the run
        pseudoDict['{} - {}'.format(a, b)] = originalList[a]
    return pseudoDict

This, however, seems a very convoluted way of doing what I want, so I was wondering if there is a) a more efficient way or b) an easier way to do this?

You can also try the following code:

check = None   # column count of the run in progress; None means no open run
start = None   # 1-based row number where that run began
for a in range(len(columns) + 1):
    # one extra iteration (current = None) flushes a run that reaches the last row
    current = columns[a] if a < len(columns) else None
    if check is not None and current == check:
        continue
    if check is not None:
        # the run ended at row a (1-based); report it with the run's own count
        if start != a:
            file.write("row  " + str(start) + " - " + str(a) + ": " + str(check) + " \n")
        else:
            file.write("row  " + str(start) + ": " + str(check) + " \n")
        check = None
    if current is not None and current != columns[0]:
        # a new mismatching run starts at this row
        start = a + 1
        check = current

You can use a list comprehension to collect the positions in the columns list where an element differs from the one before it; these are the end-points of your ranges. Then enumerate the ranges and print or write out those whose value differs from the first (header) element. An extra element is appended to the list of ranges to mark the end index of the list, which avoids out-of-range indexing.

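columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]

# [0-based start index, value] for every position where the value changes
ranges = [[i+1, v] for i, v in enumerate(columns[1:]) if columns[i] != columns[i+1]]
ranges.append([len(columns), 0])  # sentinel marking the end of the list
for i, v in enumerate(ranges[:-1]):
    # skip runs that match the header; the next range's start index is this run's last row (1-based)
    if v[1] != columns[0]:
        print("rows", v[0]+1, "-", ranges[i+1][0], ":", v[1])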

Output:

rows 2 - 5 : 1
rows 6 - 9 : 0
rows 10 - 11 : 1
rows 13 - 13 : 1
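
The same run-splitting could also be sketched with `itertools.groupby` from the standard library, which groups consecutive equal values without sorting them first. This is an alternative to the list comprehension above, not the code from this answer; the function name `mismatch_runs` is made up for illustration, and Python 3 is assumed.

```python
from itertools import groupby


def mismatch_runs(columns):
    """Return (first_row, last_row, count) for each run of adjacent rows
    whose column count differs from the header's (columns[0]).
    Row numbers are 1-based. Hypothetical helper, not from the answer above."""
    if not columns:
        return []
    header = columns[0]
    runs = []
    row = 1  # 1-based number of the first row in the current run
    for count, group in groupby(columns):
        length = sum(1 for _ in group)  # number of rows in this run
        if count != header:
            runs.append((row, row + length - 1, count))
        row += length
    return runs


columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]
for first, last, count in mismatch_runs(columns):
    print("rows", first, "-", last, ":", count)
```

For the sample list this prints the same four lines as the output above. If a dictionary keyed by row range is preferred, as the question suggests, something like `{"{} - {}".format(f, l): c for f, l, c in mismatch_runs(columns)}` should build it from the same tuples.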

What you want to do is a map/reduce operation, but without the sorting that is normally done between the mapping and the reducing.

If you output

row  7220: 0 
row  7221: 0 
row  7222: 0 
row  7223: 0 

to stdout, you can pipe this data to another Python program that generates the groups you want.

The second Python program could look something like this:

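import sys
import re

# the first line opens the first group: (row number, column count)
line = sys.stdin.readline()
first_rowid, last_diff = re.findall(r'(\d+)', line)
prev_rowid = first_rowid

for line in sys.stdin:
    rowid, diff = re.findall(r'(\d+)', line)
    # close the group when the count changes or the row numbers stop being consecutive
    if diff != last_diff or int(rowid) != int(prev_rowid) + 1:
        print("rows", first_rowid, "-", prev_rowid, ":", last_diff)
        first_rowid = rowid
        last_diff = diff
    prev_rowid = rowid

# flush the final group
print("rows", first_rowid, "-", prev_rowid, ":", last_diff)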

You would execute them like this in a Unix environment to get the output into a file:

python yourprogram.py | python myprogram.py > youroutputfile.dat

If you cannot run this in a Unix environment, you can still use the algorithm I wrote in your program with a few modifications.
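
One way those modifications might look, as a rough sketch rather than this answer's own code: the grouping loop becomes a generator that takes (row number, column count) pairs directly instead of parsing text from stdin. The name `group_rows` and the sample pairs are assumptions made for illustration, and Python 3 is assumed.

```python
def group_rows(pairs):
    """Yield (first_row, last_row, count) for consecutive rows sharing a count.

    `pairs` is an iterable of (row_number, column_count) tuples in ascending
    row order, e.g. the mismatching rows found by the main program.
    Hypothetical helper, not part of the answer above.
    """
    first = prev = last_count = None
    for row, count in pairs:
        if first is None:
            # the very first pair opens the first group
            first, last_count = row, count
        elif count != last_count or row != prev + 1:
            # the count changed or there is a gap in row numbers: close the group
            yield first, prev, last_count
            first, last_count = row, count
        prev = row
    if first is not None:
        yield first, prev, last_count  # flush the final group


mismatches = [(7220, 0), (7221, 0), (7222, 0), (7223, 0),
              (7224, 0), (7225, 1), (7226, 1)]
for first, last, count in group_rows(mismatches):
    print("rows", first, "-", last, ":", count)
```

Because it is a generator, the pairs can be produced lazily while the .csv file is being read, so the whole list of mismatches does not have to be held in memory.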
